Jagger - C++ implementation of Pattern-based Japanese Morphological Analyzer

developed by Naoki Yoshinaga at Yoshinaga Lab., IIS, University of Tokyo

About

Jagger is a fast, accurate, and space-efficient morphological analyzer [1] inspired by the dictionary-based longest matching for tokenization and the precomputation of machine-learning classifiers. Jagger applies patterns, which are extracted from morphological dictionaries and training data, to input from the beginning to jointly and deterministically perform tokenization, POS tagging, and lemmatization. Jagger can perform morphological analysis at more than 1,000,000 sentences per second on a single CPU (M2 MacBook Air) with an accuracy comparable to existing practical implementations of a morphological analyzer based on the Viterbi algorithm [2] and pointwise estimation [3].
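To give a feel for the "dictionary-based longest matching" that Jagger takes its inspiration from, here is a minimal toy sketch. It is not Jagger's algorithm: it has no learned patterns, no context (preceding POS), and no feature-sequence trie. The function name, the mini-lexicon, and the "UNK" tag are all hypothetical and exist only for illustration.

```python
# Toy illustration only: plain dictionary-based longest matching,
# not Jagger's learned patterns or its feature-sequence trie.

def longest_match_tokenize(text, lexicon, max_len=None):
    """Scan text from the beginning, taking the longest lexicon entry at each position."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append((piece, lexicon.get(piece, "UNK")))
                i += n
                break
    return tokens

# Hypothetical mini-lexicon mapping surface strings to POS tags.
TOY_LEXICON = {"東京": "名詞", "東京都": "名詞", "に": "助詞", "住む": "動詞"}

if __name__ == "__main__":
    for surface, pos in longest_match_tokenize("東京都に住む", TOY_LEXICON):
        print(surface, pos, sep="\t")
```

Because each position is decided once, greedily, and never revisited, a scan of this kind is deterministic and needs no lattice search; Jagger's patterns additionally decide the POS tag and lemma in the same step, as described in [1].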

If you use Jagger for research or commercial purposes, please cite:

Naoki Yoshinaga
Back to Patterns: Efficient Japanese Morphological Analysis with Feature-Sequence Trie
The 61st Annual Meeting of the Association for Computational Linguistics (ACL-23). Toronto, Canada. July 2023

Refer to the slides and poster presented at ACL-23 for details of the algorithms.

Features

License: GNU GPLv2, LGPLv2.1, BSD

Download & Setup

> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jagger/jagger-latest.tar.gz
> tar zxvf jagger-latest.tar.gz
> cd jagger-YYYY-MM-DD

# 1) prepare a dictionary in the format compatible with mecab-jumandic (cf. mecab-jumandic-7.0-20130310.tar.gz)
> tar zxvf mecab-jumandic-7.0-20130310.tar.gz
> patch -p0 < mecab-jumandic-7.0-20130310.patch # correct garbled text in AuxV.csv

# 2) Use the Kyoto University Web Document Leads Corpus (default)
> git clone https://github.com/ku-nlp/KWDLC
> configure

# 2') Or use the Kyoto University Text Corpus
> git clone https://github.com/ku-nlp/KyotoCorpus
> cd KyotoCorpus; auto_conv -d PATH_TO_MAINICHI_NEWS_DIR; cd ..
> configure --with-corpus=kyoto

# 3) Train a model from the standard split, evaluate the resulting model, and then install
> make model-benchmark && make install

# 3') To train a model using your own morphological dictionary and training data and then evaluate the resulting model on your test data
> make install
> train_jagger -d DICT_FILE TRAIN_FILE_WITH_POS > PATTERN_DIR/patterns
> jagger -m PATTERN_DIR [-wf] < TEST_FILE > result.JAG
> eval.py result.JAG TEST_FILE_WITH_POS

Available resources:

ToDo

History

Usage

Tagging

Typing jagger -h in the command line shows the following usage. By default, Jagger reads the model trained with the dictionary and training data specified at installation.

jagger: Pattern-based Japanese Morphological Analyzer
Usage: jagger -m dir [-wf] < input

Options:
 -m dir	pattern directory
 -w	perform only segmentation
 -f	full buffering (fast but not interactive)

If you add the -w option, Jagger performs only tokenization. The -f option enables full buffering (block IO) for faster execution; omit it when you perform morphological analysis interactively in the command line.
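If you want to drive the command-line tool from a script, the options above can be assembled as in the following minimal sketch. It assumes that jagger is installed and on your PATH and that input and output are UTF-8; the helper names jagger_argv and run_jagger are hypothetical and not part of Jagger.

```python
import subprocess

def jagger_argv(model_dir, segment_only=False, full_buffering=False):
    """Build the argument list for `jagger -m dir [-wf]` (flags taken from the usage text)."""
    argv = ["jagger", "-m", model_dir]
    if segment_only:
        argv.append("-w")   # perform only segmentation
    if full_buffering:
        argv.append("-f")   # full buffering: fast but not interactive
    return argv

def run_jagger(text, model_dir, **flags):
    """Feed text to jagger on stdin and return its stdout (requires an installed jagger)."""
    done = subprocess.run(jagger_argv(model_dir, **flags), input=text,
                          capture_output=True, encoding="utf-8", check=True)
    return done.stdout
```

For batch processing, passing full_buffering=True is the natural choice, since nothing is read interactively.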

Training

Typing train_jagger in the command line will show the following usage.

train_jagger:  extract patterns for Jagger from the dictionary and training data
Usage: train_jagger -d dict train > patterns

Options:
 -d dict	dictionary csv

dict is a dictionary in the format compatible with MeCab (jumandic). Jagger ignores the cost parameters and similar numeric fields, so you may fill them with 0; only the number of fields matters. train should be an annotated corpus in the same format as the output of Jagger (MeCab).
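As a minimal sketch of "fill them with 0", the helper below writes one dictionary line in the usual MeCab CSV layout: the surface string, three numeric fields (left context ID, right context ID, cost), and then the feature fields. The helper name make_dict_row is hypothetical, and the seven-field feature layout in the example is an assumption about jumandic-style entries; check it against the dictionary you actually use.

```python
import csv
import io

def make_dict_row(surface, features, n_numeric=3):
    """Return one CSV line: surface, zero-filled numeric fields, then the feature fields."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([surface] + ["0"] * n_numeric + list(features))
    return buf.getvalue().rstrip("\n")

if __name__ == "__main__":
    # Assumed feature layout: POS, sub-POS, conjugation type, conjugation form,
    # base form, reading, semantic information.
    print(make_dict_row("ジャガー", ["名詞", "普通名詞", "*", "*", "ジャガー", "じゃがー", "*"]))
```

Using the csv module rather than a plain join keeps a field intact if it happens to contain a comma or a quote.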

How to add user dictionaries in training

Add your own dictionary items directly to the end of the dictionary file specified by the option -d. When the surface strings of the added items appear in the training data with different part-of-speech tags, the added items will be ignored.

How to add user patterns (dictionaries) in testing

Remove compiled patterns patterns.{c2i,da,fs,p2f} in a pattern directory specified by -m and add your own dictionaries/patterns directly to the pattern file patterns in the same pattern directory. The pattern format is

Pattern Count\tFollowing Surface\tPreceding POS\tPosition to Segment\tChar. Type of Following Surface\tFeatures
(set the pattern count to 0 and the character type of the following surface string to 1 (number), 2 (alphabet), 3 (katakana), or 4 (other)). When there are multiple patterns with the same Following Surface and Preceding POS, the last pattern overrides the earlier ones.
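The six tab-separated fields and the "last pattern wins" rule can be sketched as follows. This is an illustration of the format as described here, not Jagger's own loader: the names Pattern, parse_pattern, and merge_patterns are hypothetical, the position and character-type fields are assumed to be integers, and the sample lines use made-up values.

```python
from typing import NamedTuple

class Pattern(NamedTuple):
    count: int       # pattern count (0 for user-added patterns)
    surface: str     # following surface string
    prev_pos: str    # preceding POS
    seg_pos: int     # position to segment (assumed to be an integer)
    char_type: int   # 1 number, 2 alphabet, 3 katakana, 4 other
    features: str    # remaining feature string, kept as-is

def parse_pattern(line):
    """Split one pattern line into its six tab-separated fields."""
    count, surface, prev_pos, seg_pos, char_type, features = line.rstrip("\n").split("\t", 5)
    return Pattern(int(count), surface, prev_pos, int(seg_pos), int(char_type), features)

def merge_patterns(lines):
    """Keep one pattern per (surface, preceding POS); later lines replace earlier ones."""
    table = {}
    for line in lines:
        p = parse_pattern(line)
        table[(p.surface, p.prev_pos)] = p
    return table

if __name__ == "__main__":
    # Made-up sample lines sharing the same surface and preceding POS.
    sample = ["0\tジャガー\t名詞\t4\t3\tFEATURES_A",
              "0\tジャガー\t名詞\t4\t3\tFEATURES_B"]
    for key, pattern in merge_patterns(sample).items():
        print(key, pattern.features)
```

Since later lines replace earlier ones, appending user patterns to the end of the patterns file is enough for them to take precedence over patterns extracted during training.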

I plan to simplify these methods in the future.

Performance Comparison

See the reference [1].

Third-Party Contributions

For those who want to use Jagger in programming languages other than C++, the following third-party ports and bindings (wrappers) are available.

Disclaimer

We do not guarantee that the implemented algorithms other than those proposed by us are patent-free; we regard them as patent-free simply because their implementations are available as (existing) open-source software (or, otherwise, on the basis of a simple patent look-up). Please be careful when you use this software commercially.

Acknowledgments

The development of this software is partially supported by JSPS KAKENHI Grant Number JP21H03494 and JST CREST JPMJCR19A4, Japan.

References

  1. Naoki Yoshinaga. Back to Patterns: Efficient Japanese Morphological Analysis with Feature-Sequence Trie. ACL-23. 2023
  2. Taku Kudo and Yuji Matsumoto. Applying Conditional Random Fields to Japanese Morphological Analysis. EMNLP-04. 2004
  3. Graham Neubig, Yosuke Nakata, and Shinsuke Mori. Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis. ACL-11. 2011

Copyright © 2023 Naoki Yoshinaga, All rights reserved.
last-modified: Jan 17 15:19:58 2024, written by XHTML 1.1