Jagger - C++ implementation of Pattern-based Japanese Morphological Analyzer

developed by Naoki Yoshinaga at Yoshinaga Lab., IIS, University of Tokyo


About

Jagger is a fast, accurate, and space-efficient morphological analyzer [1] inspired by the dictionary-based longest matching for tokenization and the precomputation of machine-learning classifiers. Jagger applies patterns, which are extracted from morphological dictionaries and training data, to input from the beginning to jointly and deterministically perform tokenization, POS tagging, and lemmatization. Jagger can perform morphological analysis at more than 1,000,000 sentences (25,000,000 words) per second on a single CPU (M2 MacBook Air) with an accuracy comparable to existing practical implementations of a morphological analyzer based on the Viterbi algorithm [2] and pointwise estimation [3].
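The idea of applying patterns to the input from the beginning, deterministically, can be illustrated with a minimal longest-matching sketch. This is a toy illustration only: the pattern table `TOY_PATTERNS`, the function name `longest_match_tag`, the `UNK` fallback tag, and the `max_len` limit are all assumptions made for this example. The actual pattern format and data structures used by Jagger are described in [1] and are not reproduced here.

```python
# Toy pattern table (hypothetical): surface string -> POS tag.
TOY_PATTERNS = {
    "東京": "名詞",
    "東京都": "名詞",
    "に": "助詞",
    "住む": "動詞",
}

def longest_match_tag(text, patterns, max_len=8):
    """Scan the text left to right, always taking the longest matching pattern."""
    i, out = 0, []
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            surface = text[i:i + n]
            if surface in patterns:
                out.append((surface, patterns[surface]))
                i += n
                break
        else:
            # No pattern matched: emit a single character as an unknown token.
            out.append((text[i], "UNK"))
            i += 1
    return out

result = longest_match_tag("東京都に住む", TOY_PATTERNS)
for surface, pos in result:
    print(surface, pos)
```

Because each position is decided once and never revisited, the loop performs tokenization and tagging in a single pass, which is the property the paragraph above attributes to the pattern-based approach.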

If you use Jagger for research or commercial purposes, please cite:

Naoki Yoshinaga
Back to Patterns: Efficient Japanese Morphological Analysis with Feature-Sequence Trie
The 61st Annual Meeting of the Association for Computational Linguistics (ACL-23). Toronto, Canada. July 2023

Refer to the slides and poster presented at ACL-23 for details of the algorithms.

Features

Efficient: Jagger can perform morphological analysis 7-16x faster than existing practical implementations such as MeCab, Vibrato, and Vaporetto (for tokenization, Jagger is 3-4x faster than Vaporetto and 21x faster than MeCab). Specifically, Jagger can process 1,500,000 sentences of Web text and 1,000,000 sentences of news articles per second on M2 MacBook Air (for tokenization, 1,900,000 and 1,200,000 sentences per second, respectively).
Accurate: With the same dictionary and training data, Jagger is as accurate as the existing practical implementations of morphological analyzers. When abundant training data is available in the target domain, the accuracy will be Vaporetto > Jagger > MeCab (Kyoto University Text Corpus). Otherwise, MeCab > Jagger ≅ Vaporetto (Kyoto University Web Document Leads Corpus). For out-of-domain text, the accuracy will be MeCab > Jagger > Vaporetto. Note that the accuracy of the individual implementations will depend on the design of features, the quality and size of the dictionary and the training data.
Space-efficient: Jagger runs with 1/2 to 1/20 of the main memory used by the existing practical implementations. Memory consumption is kept low by the learning-free design of the algorithm together with common implementation tricks such as zero-copy of strings, memory-mapped I/O, frequency-based character mapping, and character-based double arrays. Specifically, when Jagger is trained with MeCab-jumandic and the above standard corpora, it requires only 40MiB at run time. If you pass the --compact-dict option to configure, Jagger splits morphological information into lexical and non-lexical portions and thereby reduces memory consumption further, with a tolerable degradation in speed.
Quick training: Training a model for Jagger is fast because it does not rely on machine learning algorithms. Specifically, on M2 MacBook Air, it takes 6 and 3 seconds to train a model from the standard split of the Kyoto University Text Corpus and the Kyoto University Web Document Leads Corpus, respectively.
Customization: You can easily customize the behavior of Jagger by adding not only dictionary entries and training data but also patterns directly.
MeCab-compatible output: Jagger outputs results in a format compatible with the standard morphological analyzer, MeCab.
Portability: Jagger does not depend on any third-party libraries, such as machine learning libraries, and is implemented in a classic C++ standard; it can be compiled in environments with old C++ compilers such as GCC 4.6.
Simple implementation: The C++ implementation of Jagger is concise, at just 864 lines. This count includes an implementation of the dynamic double array, cedar, modified to use characters instead of bytes for transitions; excluding it, the entire program is fewer than 600 lines. You can read it.
```
> wc src/*.{cc,h}
     107     615    4725 src/jagger.cc
     171    1069    6713 src/jagger.h
     266    1968   15854 src/train_jagger.cc
     320    2092   14254 src/ccedar_core.h
     864    5744   41546 total
```
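Among the implementation tricks listed under "Space-efficient", frequency-based character mapping may be the least self-explanatory. The sketch below shows one common way to realize it: characters are ranked by corpus frequency so that frequent characters receive small integer IDs. It is an assumption about the general technique, not Jagger's code; the names `build_char_map` and `encode`, and the choice of ID 0 for unseen characters, are hypothetical.

```python
from collections import Counter

def build_char_map(corpus):
    """Assign small integer IDs to frequent characters (ID 0 is kept for unseen ones)."""
    counts = Counter(ch for line in corpus for ch in line)
    # Sort by descending frequency, then by code point so that ties are deterministic.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ranked)}

def encode(text, char_map):
    """Replace each character with its ID; unseen characters map to 0."""
    return [char_map.get(ch, 0) for ch in text]

char_map = build_char_map(["ああい", "あう"])
print(char_map)
print(encode("あうえ", char_map))
```

Keeping the IDs of frequent characters small is generally expected to help a character-based double array stay dense, since most transitions then use small offsets.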

License: GNU GPLv2, LGPLv2.1, BSD

Download & Setup

> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/jagger-latest.tar.gz
> tar zxvf jagger-latest.tar.gz
> cd jagger-YYYY-MM-DD

# 1) prepare a dictionary in the format compatible with mecab-jumandic (cf. mecab-jumandic-7.0-20130310.tar.gz)
> tar zxvf mecab-jumandic-7.0-20130310.tar.gz
> patch -p0 < mecab-jumandic-7.0-20130310.patch # correct garbled text in AuxV.csv

# 2) Use the Kyoto University Web Document Leads Corpus (default)
> git clone https://github.com/ku-nlp/KWDLC
> configure

# 2') Or use the Kyoto University Text Corpus
> git clone https://github.com/ku-nlp/KyotoCorpus
> cd KyotoCorpus; auto_conv -d PATH_TO_MAINICHI_NEWS_DIR; cd ..
> configure --with-corpus=kyoto

# 3) Train a model from the standard split, evaluate the resulting model, and then install
> make model-benchmark && make install

# 3') To train a model using your own morphological dictionary and training data and then evaluate the resulting model on your test data
> make install
> train_jagger -d DICT_FILE TRAIN_FILE_WITH_POS > PATTERN_DIR/patterns
> jagger -m PATTERN_DIR [-w] < TEST_FILE > result.JAG
> eval.py result.JAG TEST_FILE_WITH_POS
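To make the evaluation step more concrete, here is a minimal sketch of how segmentation precision, recall, and F1 can be computed by comparing character spans. It is not the bundled eval.py, whose exact metric definitions are not stated in this document; the function names `spans` and `segmentation_f1` are hypothetical.

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out

def segmentation_f1(system, gold):
    """Return (precision, recall, F1) over token spans of one sentence."""
    sys_spans, gold_spans = spans(system), spans(gold)
    correct = len(sys_spans & gold_spans)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(sys_spans)
    recall = correct / len(gold_spans)
    return precision, recall, 2 * precision * recall / (precision + recall)

p, r, f = segmentation_f1(["東京", "都", "に", "住む"], ["東京都", "に", "住む"])
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

A token counts as correct only when both of its boundaries agree with the gold segmentation, so an over-split word such as 東京 / 都 contributes no correct spans.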

Available resources:

Dictionary: MeCab downloads
Patch: mecab-jumandic-7.0-20130310.patch
Annotated corpora for morphological analysis:
- Kyoto University Text Corpus (requires Mainichi 1995 CD-ROM)
- Kyoto University Web Document Leads Corpus

ToDo

Support dynamic addition of user-defined patterns
Elaborating the pattern template to make Jagger even faster (depends on my willingness)
Training from partial annotations (easy)
Using a static double array to make Jagger faster and more compact (hard)
Integrating Jagger into an efficient dependency parser, J.DepP

History

March 14th, 2024 (under development; subject to minor bug/typo/comment fixes):
- Add handling of unknown words on symbols (Emoji etc.)
- Support user-defined dictionary
- Support processing of unbounded input (without newline)
- Reduce the memory footprint in training
- Remove -f option in jagger (enable full buffering if ::isatty (0) == 0)
- Move compilation of patterns from testing to training
- Fix a bug in printing lemma for unknown words
- Fix minor bugs related to memory access
- Improve the readability of code by removing bit shifting operators using bit field
February 18th, 2023:
- pre-release.

Usage

Tagging

Typing jagger -h on the command line shows the following usage. By default, Jagger reads the model trained with the dictionary and training data specified at installation.

jagger: Pattern-based Japanese Morphological Analyzer
Copyright (c) 2023- Naoki Yoshinaga, All rights reserved.

Usage: src/jagger [-m dir -w] < input

Options:
 -m dir	directory for compiled patterns (default: JAGGER_DEFAULT_MODEL)
 -w	perform only segmentation

If you add the -w option, Jagger performs only tokenization.
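Since the tagging output is MeCab-compatible, it can be consumed with a few lines of script. The sketch below assumes MeCab's default layout (one `surface<TAB>comma-separated features` line per token, with `EOS` closing each sentence). The sample text is hand-written for illustration and is not actual Jagger output, and the helper name `parse_mecab_output` is hypothetical. Quoted feature fields containing commas are not handled.

```python
def parse_mecab_output(text):
    """Parse MeCab-style output into a list of sentences of (surface, features)."""
    sentences, current = [], []
    for line in text.splitlines():
        if line == "EOS":
            # End of sentence: store the tokens collected so far.
            sentences.append(current)
            current = []
        elif line:
            surface, _, feature = line.partition("\t")
            current.append((surface, feature.split(",")))
    return sentences

# Hand-written sample in the assumed format (not real analyzer output).
SAMPLE = (
    "東京\t名詞,地名,*,*,東京,とうきょう,*\n"
    "に\t助詞,格助詞,*,*,に,に,*\n"
    "EOS\n"
)

parsed = parse_mecab_output(SAMPLE)
for surface, features in parsed[0]:
    print(surface, features[0])
```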

Training

Typing train_jagger in the command line will show the following usage.

train_jagger: Extract patterns for Jagger from dictionary and training data
Copyright (c) 2023- Naoki Yoshinaga, All rights reserved.

Usage: src/train_jagger [-m dir -d dict -u dict] train

Options:
 -m dir 	directory to store patterns
 -d dict	dictionary in CSV format
 -u user_dict	user-defined dictionary in CSV format

dict and user_dict are dictionaries in the format compatible with MeCab (jumandic). Note that Jagger ignores cost parameters and similar fields, so you may fill them with 0; only the number of fields matters. train should be an annotated corpus in the same format as the output of Jagger (MeCab).
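As a sketch of the "fill them with 0" advice, the following script writes user-dictionary rows whose connection IDs and cost are all zero. It assumes the MeCab CSV layout of surface, left ID, right ID, cost, followed by the feature columns; the seven feature values in the sample entry are an assumption about the jumandic layout and should be checked against the dictionary you actually use. The helper names `dict_row` and `write_user_dict` are hypothetical.

```python
import csv
import io

def dict_row(surface, features, left_id=0, right_id=0, cost=0):
    """Build one CSV row: surface, left ID, right ID, cost, then the features."""
    return [surface, left_id, right_id, cost, *features]

def write_user_dict(entries):
    """Serialize (surface, features) pairs as CSV text with zeroed cost fields."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for surface, features in entries:
        writer.writerow(dict_row(surface, features))
    return buf.getvalue()

# Hypothetical entry; the feature columns are assumed, not taken from jumandic.
sample = write_user_dict(
    [("ジャガー", ["名詞", "普通名詞", "*", "*", "ジャガー", "じゃがー", "*"])]
)
print(sample, end="")
```

The resulting text could be saved to a file and passed with -u, provided that its number of fields matches that of the main dictionary.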

How to add user-defined patterns

Please use the previous version, or add a partially annotated example that includes the target morpheme and its surrounding context to the training data. I plan to implement functionality for dynamically editing patterns at inference time in the future.

Performance Comparison

See the reference [1].

Third-Party Contributions

For those who want to use Jagger in programming languages other than C++, the following third-party ports and bindings (wrappers) are available.

RcppJagger (R wrapper by Shusei Eshima)

Disclaimer

We do not guarantee that the implemented algorithms other than those proposed by us are patent-free; we regarded them as patent-free simply because their implementations are available as existing open-source software (or, otherwise, on the basis of a simple patent look-up). Please be careful when you use this software for commercial purposes.

Acknowledgments

The development of this software is partially supported by JSPS KAKENHI Grant Number JP21H03494 and JST CREST JPMJCR19A4, Japan.

References

[1] Naoki Yoshinaga. Back to Patterns: Efficient Japanese Morphological Analysis with Feature-Sequence Trie. ACL-23. 2023
[2] Taku Kudo and Yuji Matsumoto. Applying Conditional Random Fields to Japanese Morphological Analysis. EMNLP-04. 2004
[3] Graham Neubig, Yosuke Nakata, and Shinsuke Mori. Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis. ACL-11. 2011

last-modified: Jun 05 19:22:50 2025