J.DepP - C++ implementation of Japanese Dependency Parsers

developed by Naoki Yoshinaga at Yoshinaga Lab., IIS, University of Tokyo

About

J.DepP is a C++ implementation of Japanese dependency parsing algorithms [1,2,3,4]. It takes a raw sentence as input and performs word segmentation, POS tagging (thanks to MeCab), bunsetsu chunking, and dependency parsing. Syntactic parsers have long been believed to be (significantly) slower than front-end part-of-speech taggers, and they are rarely used in industrial settings that need to handle massive amounts of text (e.g., microblogs). This inefficiency, however, is simply because researchers have focused mostly on accuracy and have not seriously pursued efficient implementations. J.DepP is meant for those who want to parse massive amounts of text (e.g., entire blog feeds or microblogs); it is even faster than most front-end morphological analyzers (parsing >10000 sentences per second), while achieving state-of-the-art parsing accuracy.
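To give a feel for the pipeline, here is a schematic run on a single sentence. The output follows the usual KNP/CaboCha-style lattice (bunsetsu header lines `* <id> <head>D', morpheme lines, and EOS), but the exact morpheme feature columns depend on the dictionary and build options, so the snippet below is only illustrative (with a build that does not embed the MeCab front end, pipe MeCab output into jdepp instead of raw text).

> echo "太郎が本を読んだ。" | jdepp
* 0 2D
太郎    (morpheme features as given by the POS tagger)
が      ...
* 1 2D
本      ...
を      ...
* 2 -1D
読ん    ...
だ      ...
。      ...
EOS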

If you make use of J.DepP for research or commercial purposes, the (optional) references to cite are:

N. Yoshinaga and M. Kitsuregawa. A Self-adaptive Classifier for Efficient Text-stream Processing. Proc. COLING 2014, pp. 1091--1102. 2014. (used for testing a parser)
N. Yoshinaga and M. Kitsuregawa. Kernel Slicing: Scalable Online Training with Conjunctive Features. Proc. COLING 2010 (oral), pp. 1245--1253. 2010. (used for training a parser)
N. Yoshinaga and M. Kitsuregawa. Polynomial to Linear: Efficient Classification with Conjunctive Features. Proc. EMNLP 2009, pp. 1542--1551. 2009. A longer journal version is also available. (used for testing a parser)

Features

License: GNU GPLv2, LGPLv2.1, and BSD; or e-mail me if you need a different license.

Download & Setup

> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/jdepp-latest.tar.gz
> tar zxvf jdepp-latest.tar.gz
> cd jdepp-YYYY-MM-DD

# 1) train a parser with Kyoto-University and NTT Blog (KNB) Corpus (default)
> configure

# or train a parser with Kyoto University Text Corpus (KyotoCorpus4.0 required)
> configure --with-corpus=kyoto
> ln -s PATH_TO_KYOTO_CORPUS/KyotoCorpus4.0

# or train a parser with Kyoto University Text Corpus (KyotoCorpus4.0 required)
#  and KNB Corpus (CaboCha seems to use these corpora for training)
> configure --with-corpus=kyoto+knbc
> ln -s PATH_TO_KYOTO_CORPUS/KyotoCorpus4.0

# or train a parser with Kyoto University Text Corpus w/o Mainichi news articles
# Caveats: this option changes the feature set, so models trained w/o this option
#          are not compatible with the resulting jdepp binary
> configure --with-corpus=kyoto-partial --disable-autopos-train

# 2) make model using the entire corpus (for slightly better accuracy)
# Caveats: make calls scripts in tools/, which need python3 (2022-03-18 or later)
> make model && make install

# or make model using a part (standard training split, if any) of the corpus,
#   if you want to know the accuracy of the installed parser
> make model-benchmark && make install

# See usage:training for other configuration options in building a model.

## (optional) MaxEnt [-l 2] requires Tsuruoka's MaxEnt implementation.
> wget http://www.logos.ic.i.u-tokyo.ac.jp/~tsuruoka/maxent/maxent-3.0.tar.gz
> tar zxvf maxent-3.0.tar.gz
> cd src && ln -s ../maxent-X.Y/*.{cpp,h} . && cd ..
> configure --enable-maxent

For Mac OS X users: try port jdepp via MacPorts (special thanks to @hjym_u); it will build a standalone parser (--enable-standalone) using the KNB corpus (--with-corpus=knbc) with auto POSs given by MeCab/jumandic (default).
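A minimal sketch of the MacPorts route (assuming the port is named jdepp, as stated above):

> sudo port install jdepp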

Requirement

ToDo

History

Usage

Typing ./jdepp -h shows the following usage information.

J.DepP - Japanese Dependency Parser
Copyright (c) 2008-2012 Naoki Yoshinaga

Usage: jdepp [options] -- [learner options] -- [chunker classifier options] -- [parser classifier options] < test

test    test file

Optional parameters in training / testing:
  -t, --type=TYPE             select running mode of J.DepP
                                0 - learn
                              * 1 - parse
                                2 - both
                                3 - cache
  -e, --encoding=TYPE         select encoding of input
                              * 0 - UTF-8
                                1 - EUC-JP
  -i, --ignore=STR            ignore input line starting with STR
  -c, --corpus=FILE           training corpus in JDEPP format ('train.JDP')
  -m, --model-dir=DIR         model directory ('/Users/ynaga/local/lib/jdepp/model/kyoto')
  -p, --parser=TYPE           select parsing algorithm
                              * 0 - shift reduce
                                1 - cascaded chunking
                                2 - backward
                                3 - tournament
  -I, --input-format=TYPE     select type of input format
                              * 0 - POS-tagged sentences
                                1 - + BUNSETSU annotation
                                2 - + DEPENDENCY annotation

Optional parameters in training:
  -l, --learner=TYPE          select type of learning library
                              * 0 - OPAL
                                1 - SVM    (disabled)
                                2 - MaxEnt (disabled)
  -n, --max-sent=INT          max. # processing sentences (0: all)

Misc.:
  -v, --verbose=INT           verbosity level (0)
  -h, --help                  show this help and exit

Training

Type make model to build a model for J.DepP; by modifying configuration parameters, you can build various models from KyotoCorpus4.0 or KNB corpus.

# with Kyoto-University and NTT Blog corpus (will be automatically downloaded before training a model)
## training a parser with auto POSs given by MeCab/jumandic (this is the default parser configuration)
> configure (--with-corpus=knbc) (--with-postagger=mecab) (--with-mecab-dict=JUMAN)
## build a parser with auto POSs given by MeCab/NAIST-jdic
> configure (--with-corpus=knbc) (--with-postagger=mecab) --with-mecab-dict=NAIST-J
## build a parser with auto POSs given by MeCab/ipadic
> configure (--with-corpus=knbc) (--with-postagger=mecab) --with-mecab-dict=IPA
## build a standalone parser with a model trained using KNB corpus (requires MeCab)
> configure (--with-corpus=knbc) --enable-standalone

# with Kyoto University Text Corpus (put KyotoCorpus4.0 in the top of J.DepP source directory)
## training a parser with gold POSs (to measure the parsing accuracy)
> configure --with-corpus=kyoto --disable-autopos-train
## training a parser with auto POSs given by JUMAN
> configure --with-corpus=kyoto --with-postagger=juman

# the parenthesized configuration options are the defaults (so you can omit them)

Alternatively, you can train a parser with your own corpus in the following way.

# prepare the training data in the JDEPP format (morphological analyzer output + dependency annotation)
# you can convert training data in the KyotoCorpus format into JDEPP format as follows
# to train a parser compatible with JUMAN
> cat train.KNP | \
awk '!/^(#|\*|E)/ {$0 = $1" "$2" "($3 == "*" ? $1 : $3)" "$4" 0 "$5" 0 "$6" 0 "$7" 0 NIL"}; 1' > train.JDP
# to train a parser compatible with MeCab/jumandic
> cat train.KNP | \
awk '!/^(#|\*|E)/ {$0 = $1"\t"$4","$5","$6","$7","($3 == "*" ? $1 : $3)","$2",*"}; 1' > train.JDP
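If you prefer Python (which the tools/ scripts already require) to awk, the sketch below performs the same conversion as the MeCab/jumandic one-liner above; `knp_to_jdp.py' is a hypothetical helper name, not a script shipped with J.DepP.

# knp_to_jdp.py -- rewrite KNP morpheme lines into MeCab/jumandic-style lines
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    # keep empty, comment (#), bunsetsu (*), and EOS lines untouched
    if not line or line.startswith(("#", "*", "E")):
        print(line)
        continue
    f = line.split()  # KNP morpheme line: surface reading base POS ... (same fields the awk one-liner uses)
    base = f[0] if f[2] == "*" else f[2]
    print(f[0] + "\t" + ",".join([f[3], f[4], f[5], f[6], base, f[1], "*"]))

> python3 knp_to_jdp.py < train.KNP > train.JDP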

# You may want to train J.DepP with auto POSs given by the front-end POS tagger
# to avoid the accuracy drop due to POS inconsistency; training with auto POSs
# builds a more accurate parser than training with gold POSs.
# JUMAN
> replace_pos.py juman -b < train.KNP > train.JDP
# MeCab/jumandic
> replace_pos.py mecab -d MECAB_DIC_DIR < train.KNP > train.JDP
# MeCab/ipadic or MeCab/naist-jdic
> replace_pos.py mecab -d MECAB_DIC_DIR < train.KNP > train.JDP

# create a directory to save a model
> mkdir model

# train chunker/parser with opal [-l 0], TinySVM [-l 1], or Tsuruoka's MaxEnt [-l 2]
# you can configure [learner options] delegated to the learner
# to see the learner's options, pass -h in the learner options
> jdepp -t 0 -c train.JDP -I 1 < test.JDP # chunker
> jdepp -t 0 -c train.JDP -I 2 < test.JDP # parser

# typical model hyper-parameters for a chunker
# PA without kernel; extremely fast but less accurate
> jdepp -t 0 -c train.JDP -I 1 -- -t 0 -c 0.05 -i 40
# default parameters; reasonably fast and accurate enough (recommended)
> jdepp -t 0 -c train.JDP -I 1 -- -t 1 -d 2 -c 0.00005 -i 40

# typical model hyper-parameters for training a dependency parser
#  pa_pl0; PA with linear kernel; extremely fast but less accurate
> jdepp -t 0 -c train.JDP -I 2 -- -t 0 -c 0.001 -i 40
#  pa_pl2; default parameters; reasonably fast and accurate enough (recommended)
> jdepp -t 0 -c train.JDP -I 2 -- -t 1 -d 2 -c 0.00005 -i 40
#  pa_pl3; PA1 with -d 3; slow but most accurate
> jdepp -t 0 -c train.JDP -I 2 -- -t 1 -d 3 -c 0.000001 -i 40

NOTE: The default parameters are tuned for training a shift-reduce parser with opal [-l 0 -p 0]; if you want to use the other parsing algorithms [-p 1|2|3] or estimators [-l 1|2], you should at least tune the regularization parameter [-c].
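For example, here is a hedged sketch of retraining with the tournament parser [-p 3] and a different regularization value (the value is purely illustrative, not a tuned recommendation); note that the -c after the first -- is the learner's regularization parameter, whereas the top-level -c names the training corpus:

> jdepp -t 0 -p 3 -c train.JDP -I 2 -- -t 1 -d 2 -c 0.0001 -i 40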

# build a feature sequence trie [to speed up a parser with d≥2 model]
#   note: you will gain a significant speed-up only with d>=3 models
# 1) apply J.DepP to a part of gigantic data (hopefully in the same domain you're going to analyze)
#    [the same format as train.JDP; no correct dependency annotation needed]
# 2) pass the data via [-c] to enumerate common feature sequences
#    (set [classifier option] accordingly)
# example (as a first try, you can simply reuse the training data):
> jdepp -t 3 -c train.JDP -I 1 -- -- -- -t 1 # (for chunker)
> jdepp -t 3 -c train.JDP -I 2 -- -- -- -t 1 -r 0.005 # (for parser)
> jdepp -t 3 -c train.JDP -I 2 -- -- -- -t 1 -s 0.015 # (for parser)

# you can configure [chunker|parser classifier options] to chunk|parse a sentence
# to see the classifier's options, pass -h in the classifier options
# example:
#  pa_pl0 (when you trained a chunker|parser using opal [-t 0],
#  opal is also used for classification)
> jdepp < test.sent
#  pa_pl2 (default PKE classifier; pecco [-t 1 -s 0.015])
> jdepp -- -- -- -t 1 -s 0.015 < test.sent
#  pa_pl2 (slightly slower SPLIT classifier)
> jdepp -- -- -- -t 1 -r 0.005 < test.sent
#  pa_pl2 (slightly faster FST classifier [-t 2])
> jdepp -- -- -- -t 2 -s 0.015 -i 8 < test.sent

# classifier options can be omitted in testing

Profiling

# script `to_tree.py' helps you understand J.DepP's machine-friendly parser output.
> jdepp < test.sent | to_tree.py

# `to_chunk.py' and `to_tree.py' compactly visualize the parser output
# when you input sentences w/ annotations [-I 1,2],
> jdepp -I 1 -v -1 < dev.JDP | to_chunk.py -p | less -R
> jdepp -I 2 -v -1 < dev.JDP | to_tree.py  -p | less -R

# If you have issues rendering wide characters in Terminal on Mac OS X,
# try the SIMBL plugin `TerminalEastAsianAmbiguousClearer'

NOTE: We highly recommend using the (default) passive-aggressive algorithm [-l 0] to train classifiers for parsing, since its training is orders of magnitude faster than SVM/MaxEnt while the accuracy of the resulting models is comparable to SVM.

Performance Comparison

The following table lists the statistics of the models referred to in the usage section. The experiments were conducted on a MacBook Air (Mid 2011) running Mac OS X 10.7, with an Intel Core i7 1.8 GHz CPU and 4 GB of main memory. Note that, for reference purposes, the parser is here configured with --disable-autopos-train (training with gold POSs is disabled by default, because it is not appropriate when you run the parser after an automatic POS tagger).

The parser accuracy is measured on the standard data set (Kyoto University Text Corpus version 4.0; training: 9501<01-11>.KNP and 95<01-08>ED.KNP, testing: 9501<14-17>.KNP and 95<10-12>ED.KNP) [1]. The Testing column shows throughput (the number of POS-tagged/bunsetsu-segmented sentences per second in Mainichi news articles (EUC-JP) parsed by J.DepP).

Model ID | Algorithm [-p] | opal option | pecco option | Dep. Acc. (%) | Sent. Acc. (%) | Training [s] | Testing [sent./s]
---------|----------------|-------------|--------------|---------------|----------------|--------------|-------------------
pa_pl0 | linear [3] | -t 0 -c 1.0e-3 -i 40 -P | n/a | 89.61 | 49.76 | 3.2 | 111385
pa_pl2 | linear [3] | -t 1 -d 2 -c 5.0e-5 -i 40 -p | -t 1 -r 0.005 -i 7 | 92.12 | 57.87 | 35.0 | 33467
pa_pl2 (default) | linear [3] | -t 1 -d 2 -c 5.0e-5 -i 40 -p | -t 1 -s 0.015 -i 7 | 92.09 | 57.73 | 35.0 | 41289
pa_pl3 | linear [3] | -t 1 -d 3 -c 1.0e-6 -i 40 -kp | -t 1 -r 0.05 | 92.29 | 58.80 | 259.8 | 3434
pa_pl3 | linear [3] | -t 1 -d 3 -c 1.0e-6 -i 40 -kp | -t 1 -s 0.001 | 92.22 | 58.43 | 259.8 | 8051
pa_pl2 stacking KNP 4.16 with juman-7.0 | linear [3] | -t 1 -d 2 -c 5.0e-5 -i 40 -p | -t 1 -s 0.015 -i 7 | 92.92 | - | - | -
pa_pl3 stacking KNP 2.0 with juman-5.1 | linear [3] | (sorry, I forgot the exact ones) | (ditto) | 93.19 | - | - | -

You can further speed up a parser with a (d≥3) classifier by building a larger feature sequence trie from the feature vectors generated by the parser itself. The pecco paper might be helpful for tuning hyper-parameters when training models with SVM/MaxEnt (those models are trained with a smaller number of features, though).
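As a hedged sketch of that procedure (file names are illustrative, and the flags simply follow the earlier examples): first parse a sample of in-domain text with the d=3 model, then rebuild the feature sequence trie from the parser's own output via [-t 3].

> jdepp < sample.sent > sample.JDP
> jdepp -t 3 -c sample.JDP -I 2 -- -- -- -t 1 -s 0.001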

Disclaimer

We do not guarantee that the implemented algorithms other than those proposed by us are patent-free; we regarded them as patent-free simply because their implementations are available as existing open-source software (or after only a cursory patent look-up). Please be careful when you use this software for commercial purposes.

If you train J.DepP with the Kyoto University Text Corpus (--with-corpus=kyoto, --with-corpus=kyoto-partial), you can use the trained model only for research purposes (refer to http://www.nichigai.co.jp/sales/mainichi/mainichi-data.html, in Japanese).

How to pronounce `J.DepP'?

Read it as you like; hopefully you will enjoy spelling out the J., wandering among J for Johnny, Juggling, Jeering, JIT or whatever, before settling on the most boring option, `Japanese'.

Acknowledgments

The developer thanks Prof. Daisuke Kawahara for his guidance in converting the KNB corpus into the Kyoto University Text Corpus format.

References

  1. K. Uchimoto, S. Sekine, and H. Isahara. Japanese Dependency Structure Analysis Based on Maximum Entropy Models. Proc. EACL, pp. 196--203, 1999.
  2. T. Kudo and Y. Matsumoto. Japanese Dependency Analysis using Cascaded Chunking. Proc. CoNLL, pp. 63--69. 2002.
  3. M. Sassano. Linear-Time Dependency Analysis for Japanese. Proc. COLING, pp. 8--14. 2004.
  4. M. Iwatate, M. Asahara, and Y. Matsumoto. Japanese Dependency Parsing Using a Tournament Model. Proc. COLING, pp. 361--368. 2008.
  5. Y. Tsuruoka, J. Tsujii, and S. Ananiadou. Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty. Proc. ACL-IJCNLP, pp. 477--485. 2009.
  6. Y. Freund and R. E. Schapire. Large Margin Classification using the Perceptron Algorithm. Machine Learning 37(3):277--296. 1999.
  7. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online Passive-Aggressive Algorithms. JMLR 7(Mar):551--585. 2006.

Copyright © 2009 - 2012 Naoki Yoshinaga, All Rights Reserved.