Jagger is a fast, accurate, and space-efficient morphological analyzer [1] inspired by dictionary-based longest matching for tokenization and by the precomputation of machine-learning classifiers. Jagger applies patterns, extracted from morphological dictionaries and training data, to the input from its beginning to perform tokenization, POS tagging, and lemmatization simultaneously and deterministically. Jagger can process more than 3.2 million sentences (83 million words) per second on a single CPU (M4 MacBook Air) with accuracy comparable to that of existing practical implementations of morphological analyzers based on the Viterbi algorithm [2] and pointwise estimation [3].
If you use Jagger for research or commercial purposes, please cite:
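To illustrate the dictionary-based longest matching that inspired Jagger, here is a minimal Python sketch. It is not Jagger's algorithm: Jagger applies patterns learned from dictionaries and training data and also assigns POS tags and lemmas, whereas this sketch only segments text greedily. The function name `longest_match_tokenize` and the toy vocabulary are hypothetical.

```python
def longest_match_tokenize(text, vocab, max_len=None):
    """Greedy left-to-right segmentation: at each position, take the
    longest vocabulary entry that matches; fall back to one character."""
    if max_len is None:
        max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                break
        else:
            j = i + 1  # no entry matched: emit a single unknown character
        tokens.append(text[i:j])
        i = j
    return tokens


# Toy vocabulary (an assumption for illustration, not a real dictionary).
vocab = {"東京", "東京都", "都", "に", "住む"}
print(longest_match_tokenize("東京都に住む", vocab))  # ['東京都', 'に', '住む']
```

Plain longest matching is deterministic and needs only one pass over the input, but it ignores context, which is the gap that patterns extracted from training data are meant to close.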
Naoki Yoshinaga
Back to Patterns: Efficient Japanese Morphological Analysis with Feature-Sequence Trie
The 61st Annual Meeting of the Association for Computational Linguistics (ACL-23). Toronto, Canada. July 2023
Refer to the slide and poster presented at ACL-23 for details of the algorithms.
> wc src/*.{cc,h}
89 608 4475 src/jagger.cc
206 1736 13573 src/train_jagger.cc
215 1608 10691 src/ccedar_core.h
155 1164 7167 src/jagger.h
665 5116 35906 total
License: GNU GPLv2, LGPLv2.1, BSD
> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/jagger-latest.tar.gz
> tar zxvf jagger-latest.tar.gz
> cd jagger-YYYY-MM-DD
# 1) Prepare a dictionary in the format compatible with mecab-jumandic (cf. mecab-jumandic-7.0-20130310.tar.gz)
> tar zxvf mecab-jumandic-7.0-20130310.tar.gz
> patch -p0 < mecab-jumandic-7.0-20130310.patch # correct corrupted characters in AuxV.csv
# 2) Use the Kyoto University Web Document Leads Corpus (default)
> git clone https://github.com/ku-nlp/KWDLC
> configure
# 2') Or use the Kyoto University Text Corpus
> git clone https://github.com/ku-nlp/KyotoCorpus
> cd KyotoCorpus; auto_conv -d PATH_TO_MAINICHI_NEWS_DIR; cd ..
> configure --with-corpus=kyoto
# 3) Train a model from the standard split, evaluate the resulting model, and then install
> make model-benchmark && make install
# 3') To train a model using your own morphological dictionary and training data and then evaluate the resulting model on your test data
> make install
> train_jagger -d DICT_FILE TRAIN_FILE_WITH_POS > PATTERN_DIR/patterns
> jagger -m PATTERN_DIR [-wf] < TEST_FILE > result.JAG
> eval.py result.JAG TEST_FILE_WITH_POS
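The metrics that eval.py reports are not described here. As a rough guide to what a segmentation evaluation usually measures, the following Python sketch computes precision, recall, and F1 over token spans. The function names and the input representation (one list of token strings per sentence) are assumptions made for illustration and do not reflect eval.py's interface.

```python
def spans(tokens):
    """Convert a token sequence into a set of (start, end) character spans."""
    out, start = set(), 0
    for t in tokens:
        out.add((start, start + len(t)))
        start += len(t)
    return out


def segmentation_f1(pred_sents, gold_sents):
    """Precision, recall, and F1 of predicted token spans against gold spans."""
    tp = n_pred = n_gold = 0
    for pred, gold in zip(pred_sents, gold_sents):
        p, g = spans(pred), spans(gold)
        tp += len(p & g)      # spans with identical boundaries in both
        n_pred += len(p)
        n_gold += len(g)
    prec = tp / n_pred if n_pred else 0.0
    rec = tp / n_gold if n_gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


pred = [["東京", "都", "に", "住む"]]
gold = [["東京都", "に", "住む"]]
print(segmentation_f1(pred, gold))
```

In this toy case two of the four predicted tokens match the three gold tokens, so precision is 2/4 and recall is 2/3. Evaluating POS tagging as well would require comparing the tag attached to each span, which this sketch omits.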
Available resources:
--enable-compact-dict option from configure
-f option in jagger (enable full buffering if ::isatty (0) == 0)
Typing jagger -h in the command line shows the following usage. By default, Jagger reads the model trained with the dictionary and training data specified at the installation.
jagger: Pattern-based Japanese Morphological Analyzer
Copyright (c) 2023-present Naoki Yoshinaga. All rights reserved.
Usage: src/jagger [OPTIONS] < input
Options:
-m DIR Directory for compiled patterns (default: JAGGER_DEFAULT_MODEL)
-w Perform only segmentation
If you add the -w option, Jagger performs only tokenization.
Typing train_jagger in the command line will show the following usage.
train_jagger: Extract patterns for Jagger from dictionary and training data
Copyright (c) 2023-present Naoki Yoshinaga. All rights reserved.
Usage: src/train_jagger [OPTIONS] train
Options:
-m DIR Directory to store patterns
-d FILE Dictionary file in CSV format
-u FILE User-defined dictionary file in CSV format
dict and user_dict are dictionaries in a format compatible with MeCab (jumandic). Jagger ignores cost parameters and similar fields, so you may fill them with 0; only the number of fields matters. train should be an annotated corpus in the same format as the output of Jagger (MeCab).
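As a hedged illustration of these formats, the Python sketch below reads MeCab-style output (one `surface<TAB>features` line per morpheme, with `EOS` closing each sentence) and writes a dictionary row whose unused numeric fields are set to 0. It assumes the usual MeCab CSV layout of surface, left context ID, right context ID, and cost followed by the feature fields. The helper names and the sample feature values are hypothetical, so check them against your own dictionary.

```python
import csv
import io


def read_mecab_format(lines):
    """Group 'surface<TAB>features' lines into sentences terminated by 'EOS'."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "EOS":
            sentences.append(current)
            current = []
        elif line:
            surface, _, features = line.partition("\t")
            current.append((surface, features))  # features kept as a raw string
    if current:  # tolerate a missing final EOS
        sentences.append(current)
    return sentences


def dict_row(surface, features):
    """Build one dictionary CSV row; the three numeric fields are set to 0
    (assumed layout: surface, left ID, right ID, cost, then features)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow([surface, 0, 0, 0, *features])
    return buf.getvalue()


# Illustrative feature values only; real entries must follow your dictionary.
sample = ["東京\t名詞,地名,*,*,東京,とうきょう,*\n", "EOS\n"]
print(read_mecab_format(sample))
print(dict_row("東京", ["名詞", "地名", "*", "*", "東京", "とうきょう", "*"]))
```

The features are kept as a single string on the reading side because feature fields may contain quoted commas. The csv module handles that quoting on the writing side.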
Please either use an older version or provide partially annotated training examples that include the target morpheme and its surrounding context to train_jagger. I plan to add functionality for dynamically modifying patterns at inference time in the future.
See reference [1] for a comparison based on the pre-release version.
For those who want to use Jagger in programming languages other than C++, the following third-party contributions to ports and bindings (wrappers) are available.
Except for the algorithm proposed in [1], we have not verified whether the other optimization techniques are free from patent restrictions. Since these techniques are widely implemented in existing open-source software, we believe there should be no practical issues; however, please use this software at your own risk, especially for commercial purposes.
The development of this software is partially supported by JSPS KAKENHI Grant Number JP21H03494 and JST CREST JPMJCR19A4, Japan.