cedar - C++ implementation of efficiently-updatable double-array trie

developed by Naoki Yoshinaga at Yoshinaga Lab., IIS, University of Tokyo
Skip to [ Features | Download | History | Usage | Performance | Tuning | References ]

About

Cedar implements an updatable double-array trie [1,2,3], which offers fast update/lookup for skewed queries in real-world data, e.g., counting words in text or mining conjunctive features in a classifier. Its lookup speed is significantly faster than C implementations of modern cache-conscious tries [3,4] (see performance comparison); even when keys are uniformly, randomly queried, its update/lookup speed is comparable to std::unordered_map.

This library is meant for those who underestimate a double-array trie as a dynamic container. It's actually fast.

If you make use of cedar for research or commercial purposes, the reference will be:

N. Yoshinaga and M. Kitsuregawa. A Self-adaptive Classifier for Efficient Text-stream Processing. Proc. COLING 2014, pp. 1091--1102. 2014.

Features

License: GNU GPLv2, LGPLv2.1, and BSD; or e-mail me for other licenses you want.

Download & Setup

> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/cedar-latest.tar.gz
> tar zxvf cedar-latest.tar.gz
> cd cedar-YYYY-MM-DD
> configure
> make install

Requirements

ToDo

History

Usage

Cedar provides the following class template:

template <typename value_type,
          const int     NO_VALUE  = nan <value_type>::N1,
          const int     NO_PATH   = nan <value_type>::N2,
          const bool    ORDERED   = true,
          const int     MAX_TRIAL = 1,
          const size_t  NUM_TRACKING_NODES = 0>
class da;

This declares a double array trie with value_type as record type; NO_VALUE and NO_PATH are error codes returned by exactMatchSearch() or traverse() if search fails due to no path or no value, respectively. The keys in a trie is sorted if ORDERED is set to true, otherwise siblings are stored in an inserted order (slightly faster update). MAX_TRIAL is used to control space and time in online trie construction (useful when you want to build a trie from binary keys). If NUM_TRACKING_NODES is set to a positive integer value, the trie keeps node IDs stored in its public data member tracking_node to be valid, even if update() relocates the nodes to different locations.

The default NO_VALUE/NO_PATH values for int record are -1 and -2 (same as darts-variants), while the values for float record are 0x7f800001 and 0x7f800002 (taken from NaN values in IEEE 754 single-precision binary floating-point format); they can be referred to by CEDAR_NO_VALUE and CEDAR_NO_PATH, respectively.

NOTE: value_type must be in less than or equal to four (or precisely sizeof (int)) bytes. This does not mean you cannot associate a key with a value in five or more bytes (e.g., int64_t, double or user-defined struct). You can associate any record with a key by preparing a value array by yourself and store its index in (int) the trie (cedar::da <int>).

Cedar supports legacy trie APIs adopted in darts-clone, while providing new APIs for updating a trie.

value_type& update (const char* key, size_t len = 0, value_type val = value_type (0))
Insert key with length = len and value = val. If len is not given, std::strlen() is used to get the length of key. If key has been already present int the trie, val is added to the current value by using operator+=. When you want to override the value, omit val and write a value onto the reference to the value returned by the function.
int erase (const char* key, size_t len = 0, size_t from = 0)
Erase key (suffix) of length = len at from (root node in default) in the trie if exists. If len is not given, std::strlen() is used to get the length of key. erase() returns -1 if the trie does not include the given key. Currently, erase() does not try to refill the resulting empty elements with the tail elements for compaction.
template <typename T>
size_t commonPrefixPredict (const char* key, T* result, size_t result_len, size_t len = 0, size_t from = 0)
Predict suffixes following given key of length = len from a node at from, and stores at most result_len elements in result. result must be allocated with enough memory by a user. To recover keys, supply result in type cedar::result_triple_type (members are value, length, and id) and supply id and length to the following function, suffix(). The function returns the total number of suffixes (including those not stored in result).
template <typename T>
void dump (T* result, const size_t result_len)
Recover all the keys from the trie. Use suffix() to obtain actual key strings (this function works as commonPrefixPredict() from the root). To get all the results, result must be allocated with enough memory (result_len = num_keys()) by a user.

NOTE: The above two functions are implemented by the following two tree-traversal functions, begin() and next(), which enumerate the leaf nodes of a given tree by a pre-order walk; to predict one key by one, use these functions directly.
int begin (size_t& from, size_t& len)
Traverse a (sub)tree rooted by a node at from and return a value associated with the first (left-most) leaf node of the subtree. If the trie has no leaf node, it returns CEDAR_NO_PATH. Upon successful completion, from will point to the leaf node, while len will be the depth of the node. If you specify some internal node of a trie as from (in other words from != 0), remember to specify the depth of that node in the trie as len.
int next (size_t& from, size_t& len, const size_t root = 0)
Traverse a (sub)tree rooted by a node at root from a leaf node of depth len at from and return a value of the next (right) leaf node. If there is no leaf node at right-hand side of the subtree, it returns CEDAR_NO_PATH. Upon successful completion, from will point to the next leaf node, while len will be the depth of the node. This function is assumed to be called after calling begin() or next().
void suffix (char* key, const size_t len, size_t to)
Recover a (sub)string key of length = len in a trie that reaches node to. key must be allocated with enough memory by a user (to store a terminal character, len + 1 bytes are needed). Users may want to call some node-search function to obtain a valid node address and the (maximum) value of the length of the suffix.
void restore()
When you load an immutable double array, extra data needed to do predict(), dump() and update() are on-demand recovered when the function executed. This will incur some overhead (at the first execution). To avoid this, a user can explicitly run restore() just after loading the trie. This command is not defined when you configure --enable-fast-load since the configuration allows you to directly save/load a mutable double array.

There are three major updates to the legacy APIs.

Performance Comparison

Implementations

We compared cedar (2014-06-24) with the following in-memory containers that support sequential insertions.

In the containers implemented in libdict, we store an integer value directly in a pointer variable to value so in 64bit addressing it wastes extra space of 4 (= sizeof (void*) - sizeof (int)) * # keys bytes. In crit-bit tree, scapegoat tree and AA tree, we store an integer value with a key by appending it to the key as char* (we slightly modified critbit0_insert() to take a value for the key).

We also compared cedar with the trie implementations that support serializations; the softwares other than cedar, doar and libtrie implement only static data structures that do not support update. We used a more space- and time-efficient batch insertion implemented in doar.

Succinct tries* and doar do not store records; it needs extra space to store records (sizeof (int) * #keys bytes) and additional time to lookup it.

Settings

Having a set of keys and a set of query strings, we first build a container that maps a key to a unique value, and then look up the queries to the container.

The following datasets are used for experiments.

The statistics of keyset are summarized as follow (a terminal symbol included):

KeysetTextBinary
Size [bytes]304,562,809144,703,811
Size (prefix) [bytes]263,848,821 144,417,733
Size (tail) [bytes]51,932,403570,117
# keys28,772,16926,194,354
Ave. length [bytes]10.585.52
Ave. length (prefix) [bytes]9.175.51
Ave. length (tail) [bytes]1.800.02
# nodes in trie78,252,61726,725,019
# nodes in trie (prefix)37,538,62926,439,275

From these statistics, we can compute the minimum size of double-array trie (cedar). It requires 8 * #nodes for internal nodes plus 8 * #keys for additional terminals that store 4-byte records. Concretely speaking, cedar requires at least 816.53 MiB (78,252,617 * 8 + 28,772,169 * 8 = 856,198,288) for text keyset, while it requires at least 403.74 MiB (26,725,019 * 8 + 26,194,354 * 8 = 423,354,984 bytes) for binary keyset.

Similarly, if a container stores raw keys and values, it requires at least 400.21MiB (304,562,809 + 4 * 28,772,169 = 419,651,485 bytes) for text keyset while it requires 237.92MiB (144,703,811 + 4 * 26,194,354 = 249,481,227) for binary keyset. Remember that a pointer (or offset) variable is additionally needed to access variable-length keys, while the memory alignment (e.g., 16 bytes in Mac OS X) and std::string incur severe overheads for short-length keys (if they are memory-allocated individually).

The statistics of queryset are summarized as follow (a terminal symbol included):

QuerysetTextBinary
Size [bytes]1,079,467,570915,918,218
# queries177,999,203216,784,450
# keys612,21916,839,036
Ave. length [bytes]6.064.23
Ave. key count290.7412.87

For the in-memory containers we measured the insertion and the lookup time per key. For the (static) tries we measured time to build the trie from the keyset (including time needed to save the trie into a hard disk; we alphabetically sort the keys in advance since some (static) tries require the sorted keys as input) and the lookup time per key.

The experiments were conducted on Mac OS X 10.8 Server over Intel Core i7-3720QM 2.6Ghz CPU with 16GB main memory. The benchmark codes are compiled for each container with gcc 4.9.2.

Text dataset

The following table lists the timings needed in inserting keys (Insert), looking up the keys (Lookup), and maximum resident set size (memory) occupied by the process in inserting a trie (Space).

SoftwareData StructureSpace
[MiB]
Insert
[ns/key]
Lookup
[ns/key]
loglog
cedarDouble-array trie1173.02631.0650.40
cedar ORDERED=falseDouble-array trie1174.09615.0050.39
cedarDouble-array reduced trie963.02636.2554.16
cedar ORDERED=falseDouble-array reduced trie963.00626.2554.93
cedarDouble-array prefix trie741.78810.7250.48
cedar ORDERED=falseDouble-array prefix trie671.66786.0249.99
libdatrie 0.2.8Double-array prefix trien/a**n/an/a
libtrie 0.1.1Double-array two-trie2756.308116.16185.85
daryDouble-array trie1119.041786.9379.96
doar 0.0.13Compacted double-array trie2285.2117687.6083.41
critbitCrit-bit (patricia) tree1457.021713.69752.49
libdictSplay tree1823.121541.48229.34
libdictTreap1823.131682.26902.43
libdictSkip list1852.861907.251265.79
Andersson tree libraryAA tree1457.022100.03337.14
C Containers libraryScapegoat tree1891.742380.65254.34
tst_vanillaternary search tree3318.751109.25129.12
Judy 1.0.5Judy trie SL897.59580.67142.64
hat-trie 0.1.0HAT-trie695.49916.0275.51
std::mapRed-black tree2506.271617.60851.33
std::unordered_mapHash table2471.60615.30170.41
array hashArray Hash1725.5617273.22330.76
CMPH 2.0Hash table2741.032744.92285.11
cpp-btree 1.0.1B-tree1744.961749.961080.04
sparsetable 2.0.2Sparse hash table1685.412635.32157.63
sparsetable 2.0.2 (dense)Hash table2335.04502.66123.33

* libdatrie takes 1 or more days to build a trie for this dataset.

The following table lists the timings needed in building a trie from sorted keyset (Build), looking up the keys (Lookup), maximum resident set size (memory) occupied by the process in inserting a trie (Space), and the size of trie saved in hard disk (Size).

SoftwareData StructureSpace
[MiB]
Size
[MiB]
Build
[ns/key]
Lookup
[ns/key]
loglog
cedarDouble-array trie832.82816.54183.5738.95
cedar ORDERED=falseDouble-array trie832.81816.54171.3839.07
cedarDouble-array reduced trie782.46642.38176.8643.48
cedar ORDERED=falseDouble-array reduced trie773.89642.38166.4643.51
cedarDouble-array prefix trie490.61488.38211.3338.88
cedar ORDERED=falseDouble-array prefix trie490.59488.35221.8739.07
libdatrie 0.2.8Double-array prefix trie1229.12644.97209955.04124.66
libtrie 0.1.1Double-array two-trie2312.11654.395401.59181.95
daryDouble-array trie897.75895.5451144.9257.90
doar 0.0.13Compacted double-array trie1937.25*334.59990.5148.00
Darts 0.32Double-array trie4306.02858.932387.8740.89
Darts-clone 0.32gDirected-acyclic word graph2311.39409.171339.1436.39
Darts-clone 0.32e5Compacted double-array trie2779.10309.311011.9259.42
DASTrie 1.0Compacted double-array trie2626.16383.3792634.8885.02
tx-trie 0.18LOUDS trie1791.10*113.11626.90972.32
ux-trie 0.1.9LOUDS two-trie2223.80*92.391229.111975.28
marisa-trie 0.2.4LOUDS nested patricia trie2036.49*87.27698.76194.87

* Doar and succinct tries need additional space for record: 28,772,169 * sizeof (int) = 109.76 MiB (115,088,676 bytes).

Binary dataset

The following table lists the timings needed in inserting keys (Insert), looking up the keys (Lookup), and maximum resident set size (memory) occupied by the process in inserting a trie (Space).

SoftwareData StructureSpace
[MiB]
Insert
[ns/key]
Lookup
[ns/key]
loglog
cedarDouble-array trie570.63187.3011.20
cedar ORDERED=falseDouble-array trie541.51159.3911.05
cedarDouble-array reduced trie495.72203.788.61
cedar ORDERED=falseDouble-array reduced trie497.11173.628.34
cedarDouble-array prefix trie515.62215.249.73
cedar ORDERED=falseDouble-array prefix trie516.01198.689.79
libdatrie 0.2.8Double-array prefix trien/a*n/an/a
libtrie 0.1.1Double-array two-trie2239.853116.3072.26
critbitCrit-bit (patricia) tree1219.07346.23130.28
libdictSplay tree1625.07159.7865.96
libdictTreap1625.07396.28165.44
libdictSkip list1652.16367.61200.53
daryDouble-array trien/a*n/an/a
Andersson tree libraryAA tree1219.07686.21328.73
C Containers libraryScapegoat tree1614.86689.91274.30
tst_vanillaternary search tree1641.52497.11172.15
Judy 1.0.5Judy trie SL558.62145.9440.63
hat-trie 0.1.0HAT-trie527.40623.7597.10
std::mapRed-black tree2031.37377.68244.07
std::unordered_mapHash table2036.66555.86286.45
array hashArray Hash1368.6717611.598048.47
CMPH 2.0Hash table2523.232664.79654.75
cpp-btree 1.0.1B-tree1308.98316.70266.43
sparsetable 2.0.2Sparse hash table1273.642042.93300.68
sparsetable 2.0.2 (dense)Hash table2185.34431.10145.07

* libdatrie and dary take 1 or more days to build a trie for this dataset.

The following table lists the timings needed in building a trie from sorted keyset (Build), looking up the keys (Lookup), maximum resident set size (memory) occupied by the process in inserting a trie (Space), and the size of trie saved in hard disk (Size).

SoftwareData StructureSpace
[MiB]
Size
[MiB]
Build
[ns/key]
Lookup
[ns/key]
loglog
cedarDouble-array trie473.25403.75186.378.14
cedar ORDERED=falseDouble-array trie473.25403.75157.778.17
cedarDouble-array reduced trie428.43368.14199.007.91
cedar ORDERED=falseDouble-array reduced trie430.52369.80168.107.84
cedarDouble-array prefix trie538.90484.64214.298.16
cedar ORDERED=falseDouble-array prefix trie540.95486.42199.548.12
libdatrie 0.2.8Double-array prefix trien/an/a**n/an/a
libtrie 0.1.1Double-array two-trie1865.07558.613327.9668.69
daryDouble-array trien/an/a**n/an/a
Darts 0.32Double-array trie1574.24406.03297.337.96
Darts-clone 0.32gDirected-acyclic word graph1620.09201.91668.058.47
Darts-clone 0.32e5Compacted double-array trie2527.04272.071168.0616.50
DASTrie 1.0Compacted double-array trien/an/a**n/an/a
tx-trie 0.18LOUDS trie1442.20*38.63432.56309.60
ux-trie 0.1.9LOUDS two-trie1550.13*38.23431.37990.65
marisa-trie 0.2.4LOUDS nested patricia trie1440.59*41.62278.92422.42

* Succinct tries need additional space for record: 26,194,354 * sizeof (int) = 99.92 MiB (104,777,416 bytes).

** libdatrie, dary and DASTrie take 1 or more days to build a trie for this dataset.

Disclaimer

We do not guarantee that the implemented algorithms other than those proposed by us are patent-free; we regarded them to be patent-free simply because their implementations are available as (existing) open-source softwares (otherwise a simple patent lookup). Please be careful when you use this software for commercial use.

How to pronounce `cedar'

The library is named after cedar, the most popular kind of tree in Japan; so just pronounce the same as cedar in English.

Acknowledgments

This software cannot be implemented without help of Dr. Yata, guru of double-array / succinct tries. The developer deeply appreciates his guidance on the implementation of the mechanism that controls space and time [2].

References

  1. J. Aoe. An efficient digital search algorithm by using a double-array structure. IEEE Transactions on Software Engineering 15(9):1066--1077. 1989.
  2. S. Yata and M. Tamura and K. Morita and M. Fuketa and J. Aoe. Sequential Insertions and Performance Evaluations for Double-arrays. Proc. the 71st National Convention of IPSJ, pp. 1263--1264. 2009.
  3. N. Yoshinaga and M. Kitsuregawa. A Self-addaptive Classifier for Efficient Text-stream Processing. Proc. COLING 2014, pp. 1091--1102. 2014.
  4. D. R. Morrison. PATRICIA -- practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM. 15(4): 514--534. 1968.
  5. D. Sleator and R. Tarjan. Self-adjusting binary search trees. Journal of the ACM, Vol 32(3), pp 652--686. July 1985. See demo in A demonstration of top-down splaying
  6. C. R Aragon, R. Seidel. Randomized Search Trees. Proc. of FOCS, pp. 540--545. 1989.
  7. W. Pugh. Skip lists: a probabilistic alternative to balanced trees. Communications of the ACM 33: 668--676. 1990.
  8. I. Galperin and R. L. Rivest. Scapegoat trees. Proc. of SODA, pp. 165--174. 1993.
  9. A. Andersson. Balanced search trees made simple. Proc. Workshop on Algorithms and Data Structures, pp 60--71. 1993.
  10. J. L. Bentley and R. Sedgewick. Fast Algorithms for Sorting and Searching Strings. Proc. of SODA. pp. 360--369. 1997.
  11. D. Baskins. A 10-minute description of how Judy arrays work and why they are so fast. http://judy.sourceforge.net/. 2004.
  12. N. Askitis and R. Sinha. HAT-trie: a cache-conscious trie-based data structure for strings. Proc. of the 30th Australasian conference on Computer science. Vol 62, pp. 97--105. 2007.
  13. N. Askitis and J. Zobel. Cache-conscious Collision Resolution in String Hash Tables. Proc. of SPIRE. pp. 91--102. 2005.
  14. F. C. Botelho, R. Pagh, N. Ziviani. Simple and space-efficient minimal perfect hash functions. Proc. of WADs. pp. 139--150. 2007.
  15. S. Yata. A compact static double-array keeping character codes. Information Processing & Management, 43(1), pp. 237--247. 2007
  16. G. Jacobson. Space-efficient static trees and graphs. Proc. SFCS, pp. 549--554. 1989.
  17. S. Yata. Dictionary Compression by Nesting Prefix/Patricia Tries. Proc. JNLP, pp. 576--578. 2011.

Copyright © 2012 Naoki Yoshinaga, All right Reserved.
XHTML 1.1 $ Last modified at Thu May 12 01:44:36 2016 $