cedar - C++ implementation of efficiently-updatable double-array trie

developed by Naoki Yoshinaga at Yoshinaga Lab., IIS, University of Tokyo

About

Cedar implements an updatable double-array trie [1,2,3], which offers fast update/lookup for skewed queries in real-world data, e.g., counting words in text or mining conjunctive features in a classifier. Its update and lookup speed is comparable to (hard-to-serialize) hash-based containers (std::unordered_map) or modern cache-conscious tries [3,4] and even faster when the queries are short and skewed (see performance comparison).

Cedar is meant for those who still underestimate the double-array trie as a mutable container. Double-array tries are no longer slow to update; they are actually fast.

If you make use of cedar for research or commercial purposes, please cite:

N. Yoshinaga and M. Kitsuregawa. A Self-adaptive Classifier for Efficient Text-stream Processing. Proc. COLING 2014, pp. 1091--1102. 2014.

Features

License: GNU GPLv2, LGPLv2.1, and BSD; or e-mail me for other licenses you want.

Download & Setup

> wget http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/cedar-latest.tar.gz
> tar zxvf cedar-latest.tar.gz
> cd cedar-YYYY-MM-DD
> ./configure
> make install

Requirements

ToDo

History

Usage

Cedar provides the following class template:

template <typename value_type,
          const int     NO_VALUE  = nan <value_type>::N1,
          const int     NO_PATH   = nan <value_type>::N2,
          const bool    ORDERED   = true,
          const int     MAX_TRIAL = 1,
          const size_t  NUM_TRACKING_NODES = 0>
class da;

This declares a double-array trie with value_type as the record type; NO_VALUE and NO_PATH are error codes returned by exactMatchSearch() or traverse() when a search fails due to no value or no path, respectively. The keys in the trie are sorted if ORDERED is set to true; otherwise siblings are stored in insertion order (slightly faster updates). MAX_TRIAL controls the space/time trade-off in online trie construction (useful when you want to build a trie from binary keys). If NUM_TRACKING_NODES is set to a positive integer, the trie keeps the node IDs stored in its public data member tracking_node valid even if update() relocates the nodes to different locations.

The default NO_VALUE/NO_PATH values for int record are -1 and -2 (same as darts-variants), while the values for float record are 0x7f800001 and 0x7f800002 (taken from NaN values in IEEE 754 single-precision binary floating-point format); they can be referred to by CEDAR_NO_VALUE and CEDAR_NO_PATH, respectively.

NOTE: value_type must fit in four (precisely, sizeof (int)) bytes. This does not mean you cannot associate a key with a value of five or more bytes (e.g., int64_t, double, or a user-defined struct). You can associate any record with a key by preparing a value array yourself and storing its index (an int) in the trie (cedar::da <int>).
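The indirection the note describes can be sketched as follows; to keep the sketch self-contained, a std::map stands in for cedar::da <int>, and the comments indicate the corresponding (hypothetical) cedar calls:

```cpp
#include <map>
#include <string>
#include <vector>

// The trie stores only a 4-byte int; real records of any size live in a
// user-managed side array.
std::vector <double> values;       // the value array (8-byte records)
std::map <std::string, int> trie;  // stand-in for cedar::da <int>

// Store a record and put its index into the "trie"
// (with cedar, roughly: trie.update (key.c_str (), key.size ()) = id).
int put (const std::string& key, double record) {
  values.push_back (record);
  const int id = static_cast <int> (values.size ()) - 1;
  trie[key] = id;
  return id;
}

// Look up the index and fetch the record from the value array
// (with cedar, roughly: trie.exactMatchSearch <int> (key.c_str ())).
double get (const std::string& key) {
  return values[trie.at (key)];
}
```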

Cedar supports legacy trie APIs adopted in darts-clone, while providing new APIs for updating a trie.

value_type& update (const char* key, size_t len = 0, value_type val = value_type (0))
Insert key with length = len and value = val. If len is not given, std::strlen() is used to get the length of key. If key is already present in the trie, val is added to the current value using operator+=. To overwrite the value instead, omit val and assign through the reference returned by the function.
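A minimal sketch of update()'s accumulate-by-default semantics, using std::map as a stand-in for cedar::da <int> so the example is self-contained (the function name mirrors cedar's, but this is not cedar code):

```cpp
#include <map>
#include <string>

std::map <std::string, int> counts;  // stand-in for cedar::da <int>

// Mimics cedar's update (): an absent key starts from value_type (0),
// and val is accumulated with operator+= on every call; the returned
// reference lets the caller overwrite the value directly.
int& update (const std::string& key, int val = 0) {
  int& ref = counts[key];  // value-initialized to 0 on first insertion
  ref += val;
  return ref;
}
```

Typical use: calling update ("apple", 1) twice leaves the value at 2, while update ("apple") = 10 overwrites it.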
int erase (const char* key, size_t len = 0, size_t from = 0)
Erase the key (suffix) of length = len at from (the root node by default) in the trie, if it exists. If len is not given, std::strlen() is used to get the length of key. erase() returns -1 if the trie does not include the given key. Currently, erase() does not try to refill the resulting empty elements with the tail elements for compaction.
template <typename T>
size_t commonPrefixPredict (const char* key, T* result, size_t result_len, size_t len = 0, size_t from = 0)
Predict the suffixes following the given key of length = len from the node at from, and store at most result_len elements in result. result must be allocated with enough memory by the user. To recover the keys, supply a result of type cedar::result_triple_type (members: value, length, and id) and pass id and length to the function suffix() below. The function returns the total number of matching suffixes (including those not stored in result).
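The contract of commonPrefixPredict() (store at most result_len matches, but return the total count) can be sketched over a sorted std::map standing in for the trie; the function name and types below are ours, not cedar's:

```cpp
#include <map>
#include <string>
#include <vector>

// Enumerate keys that extend a given prefix in a sorted map: store at most
// result_len of them in result, but return the total number of matches,
// mirroring commonPrefixPredict ()'s return-value contract.
size_t common_prefix_predict (const std::map <std::string, int>& dict,
                              const std::string& prefix,
                              std::vector <std::string>& result,
                              size_t result_len) {
  size_t n = 0;
  // lower_bound jumps to the first key >= prefix; matches are contiguous.
  for (auto it = dict.lower_bound (prefix);
       it != dict.end () &&
       it->first.compare (0, prefix.size (), prefix) == 0;
       ++it, ++n)
    if (n < result_len) result.push_back (it->first);
  return n;  // total count, including matches beyond result_len
}
```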
template <typename T>
void dump (T* result, const size_t result_len)
Recover all the keys from the trie. Use suffix() to obtain the actual key strings (this function works like commonPrefixPredict() from the root). To get all the results, result must be allocated with enough memory (result_len = num_keys()) by the user.

NOTE: The above two functions are implemented on top of the following two tree-traversal functions, begin() and next(), which enumerate the leaf nodes of a given subtree by a pre-order walk; to predict keys one by one, use these functions directly.
int begin (size_t& from, size_t& len)
Traverse the (sub)tree rooted at the node at from and return the value associated with the first (left-most) leaf node of the subtree. If the trie has no leaf node, it returns CEDAR_NO_PATH. Upon successful completion, from will point to the leaf node, while len will be the depth of that node. If you specify some internal node of the trie as from (in other words, from != 0), remember to pass the depth of that node in the trie as len.
int next (size_t& from, size_t& len, const size_t root = 0)
Traverse the (sub)tree rooted at the node at root, starting from a leaf node of depth len at from, and return the value of the next (right) leaf node. If there is no leaf node to the right in the subtree, it returns CEDAR_NO_PATH. Upon successful completion, from will point to the next leaf node, while len will be the depth of that node. This function is meant to be called after begin() or next().
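The begin()/next() pair is essentially an iterator protocol over the leaves; it can be sketched over a sorted std::map standing in for the trie (the names mirror cedar's, and a boolean return plays the role of CEDAR_NO_PATH):

```cpp
#include <map>
#include <string>

using Dict   = std::map <std::string, int>;
using Cursor = Dict::const_iterator;

// Position the cursor on the first (left-most) leaf; false means there is
// no leaf at all (cedar's begin () would return CEDAR_NO_PATH instead).
bool begin (const Dict& d, Cursor& from) {
  from = d.begin ();
  return from != d.end ();
}

// Advance the cursor to the next leaf in pre-order (i.e., key order);
// false once the right-most leaf has been passed.
bool next (const Dict& d, Cursor& from) {
  ++from;
  return from != d.end ();
}
```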
void suffix (char* key, const size_t len, size_t to)
Recover the (sub)string key of length = len that leads to node to in the trie. key must be allocated with enough memory by the user (len + 1 bytes, to store a terminating character). Users may want to call some node-search function first to obtain a valid node address and the (maximum) length of the suffix.
void restore()
When you load an immutable double array, the extra data needed by predict(), dump(), and update() is recovered on demand when such a function is executed. This incurs some overhead at the first execution. To avoid it, a user can explicitly run restore() just after loading the trie. This function is not defined when you configure --enable-fast-load, since that configuration allows you to directly save/load a mutable double array.

There are three major updates to the legacy APIs.

Performance Comparison

Implementations

We compared cedar (2014-06-24) with the following in-memory and mutable containers that support sequential insertions.

These containers do not support serialization. We also compared cedar with trie implementations that support serialization; the software packages other than cedar implement only static or immutable data structures that do not support update.

The succinct tries (*) do not store records; they need extra space to store records (sizeof (int) * #keys bytes) and additional time to look them up.

Unless versions are stated otherwise, the code in the latest snapshot of each git or svn repository was used.

Settings (updating now...)

Given a set of keys and a set of query strings, we first build a container that maps each key to a unique value, and then look up the queries in the container. The code used to benchmark the containers (bench.cc and bench_static.cc) has been updated from the versions included in the latest cedar package. The slow-to-build or slow-to-lookup comparison-based containers and the immutable double arrays are excluded from the benchmark (the previous results are still available from here). The entire keys (and queries) are loaded into memory before insertion (and look-up) to exclude I/O overheads. All the hash-based containers use const char* instead of the expensive std::string to minimize the memory footprint and to exclude any overheads, and adopt CityHash64 from Google's cityhash instead of the default hash functions for a fair comparison. To compute the maximum memory footprint required to insert the keys into each data structure, we subtracted the memory usage after loading the keys or queries into memory from the maximum memory usage of the process, monitored by run.cc (for Mac) (or run_linux.cc).

Since we are interested in lookup performance in practical situations where a small number of frequent keys are queried significantly more often than a large number of rare ones (a Zipfian distribution), the following datasets are used for the experiments.

The statistics of the keysets are summarized as follows (a terminal symbol included):

Keyset                        Text         Binary
Size [bytes]                  304,562,809  144,703,811
Size (prefix) [bytes]         263,848,821  144,417,733
Size (tail) [bytes]           51,932,403   570,117
# keys                        28,772,169   26,194,354
Ave. length [bytes]           10.58        5.52
Ave. length (prefix) [bytes]  9.17         5.51
Ave. length (tail) [bytes]    1.80         0.02
# nodes in trie               78,252,617   26,725,019
# nodes in trie (prefix)      37,538,629   26,439,275

From these statistics, we can compute the minimum size of a double-array trie (cedar). It requires 8 bytes per node for the internal nodes plus 8 bytes per key for the additional terminals that store 4-byte records. Concretely, cedar requires at least 816.53 MiB (78,252,617 * 8 + 28,772,169 * 8 = 856,198,288 bytes) for the text keyset, and at least 403.74 MiB (26,725,019 * 8 + 26,194,354 * 8 = 423,354,984 bytes) for the binary keyset.

Similarly, if a container stores raw keys and values, it requires at least 400.21 MiB (304,562,809 + 4 * 28,772,169 = 419,651,485 bytes) for the text keyset and 237.92 MiB (144,703,811 + 4 * 26,194,354 = 249,481,227 bytes) for the binary keyset. Remember that a pointer (or offset) variable is additionally needed to access variable-length keys, while memory alignment (e.g., 16 bytes on Mac OS X) incurs severe overheads for short keys if they are allocated individually. Naively adopting std::string in the hash-based containers adds substantial space overheads (cf. the previous results using std::string).

The statistics of the querysets are summarized as follows (a terminal symbol included):

Queryset             Text           Binary
Size [bytes]         1,079,467,570  915,918,218
# queries            177,999,203    216,784,450
# keys               612,219        16,839,036
Ave. length [bytes]  6.06           4.23
Ave. key count       290.74         12.87

For the in-memory containers, we measured the insertion and lookup time per key. For the (static) tries, we measured the time to build the trie from the keyset (including the time needed to save the trie to disk; we sort the keys alphabetically in advance since some (static) tries require sorted keys as input) and the lookup time per key.

The experiments were conducted on Mac OS X 10.11 on an Intel Core i7-3720QM 2.6 GHz CPU with 16 GB main memory. The benchmark code was compiled for each container with gcc-7.1 -O2 -g -std=c++17. The best of five consecutive trials is reported for each container.

Text dataset

The following table lists the time needed to insert the keys (Insert), the time to look up the keys (Lookup), and the maximum resident set size occupied by the process while inserting the keys (Space).

Software                   Data Structure             Space [MiB]  Insert [ns/key]  Lookup [ns/key]
cedar                      Double-array trie              1171.74           651.47            54.59
cedar ORDERED=false        Double-array trie              1173.05           638.62            53.98
cedar                      Double-array reduced trie       961.65           656.25            56.78
cedar ORDERED=false        Double-array reduced trie       961.55           635.98            56.49
cedar                      Double-array prefix trie        680.63           845.29            52.35
cedar ORDERED=false        Double-array prefix trie        678.29           817.85            53.95
Judy 1.0.5                 Judy trie SL                    928.95           764.22           188.75
hat-trie                   HAT-trie                        570.00           620.40            64.27
Array hash                 Array Hash                     1575.01           691.87            37.85
Hopscotch hash             Hopscotch Hash                 1581.38           395.80            48.25
sparsepp                   Hash table (sparse)            1104.29           686.62            67.74
sparsetable 2.0.2 (dense)  Hash table (dense)             1941.53           372.65            43.74
std::unordered_map         Hash table                     1843.65           546.84           103.38

The following table lists the time needed to build a trie from the sorted keyset (Build), the time to look up the keys (Lookup), the maximum resident set size occupied by the process while building the trie (Space), and the size of the trie saved to disk (Size).

Software             Data Structure               Space [MiB]  Size [MiB]  Build [ns/key]  Lookup [ns/key]
cedar                Double-array trie                1006.57      816.54          205.39           42.95
cedar ORDERED=false  Double-array trie                1006.57      816.54          182.35           42.23
cedar                Double-array reduced trie         781.30      642.38          188.00           44.58
cedar ORDERED=false  Double-array reduced trie         787.31      642.38          178.22           44.38
cedar                Double-array prefix trie          489.49      488.38          214.58           39.98
cedar ORDERED=false  Double-array prefix trie          489.48      488.35          204.29           40.96
Darts 0.32           Double-array trie                4305.09      858.93         2522.85           40.54
Darts-clone 0.32g    Directed-acyclic word graph      2310.43      409.17         1359.94           34.55
Darts-clone 0.32e5   Compacted double-array trie      2778.06      309.31         1033.19           59.12
tx-trie 0.18         LOUDS trie                       1345.68     *113.11          603.93          944.18
ux-trie 0.1.9        LOUDS two-trie                   1729.32      *92.39          848.49         2059.98
marisa-trie 0.2.4    LOUDS nested patricia trie       2034.10      *87.27          682.64          190.12

* The succinct tries need additional space for record: 28,772,169 * sizeof (int) = 109.76 MiB (115,088,676 bytes).

Binary dataset

The following table lists the time needed to insert the keys (Insert), the time to look up the keys (Lookup), and the maximum resident set size occupied by the process while inserting the keys (Space).

Software                   Data Structure             Space [MiB]  Insert [ns/key]  Lookup [ns/key]
cedar                      Double-array trie               569.63           188.90            12.08
cedar ORDERED=false        Double-array trie               539.02           155.40            12.17
cedar                      Double-array reduced trie       494.02           204.94             8.15
cedar ORDERED=false        Double-array reduced trie       495.35           172.47             8.08
cedar                      Double-array prefix trie        537.62           224.46             9.72
cedar ORDERED=false        Double-array prefix trie        539.35           190.07             9.67
Judy 1.0.5                 Judy trie SL                    562.99           231.46            81.51
hat-trie                   HAT-trie                        449.61           223.02            50.09
Array hash                 Array Hash                     1395.09           642.51            75.27
Hopscotch hash             Hopscotch Hash                 1568.71           395.42            61.19
sparsepp                   Hash table (sparse)             982.46           700.29           109.09
sparsetable 2.0.2 (dense)  Hash table (dense)             1924.07           370.93            67.52
std::unordered_map         Hash table                     1685.14           542.04           212.21

The following table lists the time needed to build a trie from the sorted keyset (Build), the time to look up the keys (Lookup), the maximum resident set size occupied by the process while building the trie (Space), and the size of the trie saved to disk (Size).

Software             Data Structure               Space [MiB]  Size [MiB]  Build [ns/key]  Lookup [ns/key]
cedar                Double-array trie                 486.53      403.75          190.48            8.42
cedar ORDERED=false  Double-array trie                 486.73      403.75          162.69            8.66
cedar                Double-array reduced trie         441.57      368.14          206.56            7.60
cedar ORDERED=false  Double-array reduced trie         443.65      369.80          173.49            7.31
cedar                Double-array prefix trie          514.73      484.64          232.12            8.16
cedar ORDERED=false  Double-array prefix trie          516.20      486.42          193.07            8.22
Darts 0.32           Double-array trie                1573.27      406.03          338.32            7.70
Darts-clone 0.32g    Directed-acyclic word graph      1618.92      201.91          657.87            8.32
Darts-clone 0.32e5   Compacted double-array trie      2525.98      272.07         1180.55           15.43
tx-trie 0.18         LOUDS trie                       1228.44      *38.63          372.62          298.06
ux-trie 0.1.9        LOUDS two-trie                   1152.00      *38.23          322.44         1036.02
marisa-trie 0.2.4    LOUDS nested patricia trie       1438.70      *41.62          261.31          430.24

* The succinct tries need additional space for record: 26,194,354 * sizeof (int) = 99.92 MiB (104,777,416 bytes).

Conclusions: Overall, the cedar variants are comparable to the cache-conscious or optimized hash containers in update speed. They are also comparable to those containers in lookup speed on the text dataset (keys with low branching factors and long tails) and are significantly faster on the binary dataset (keys with high branching factors). Although the cedar variants require more space to de/serialize the resulting tries than the other immutable double-array or succinct tries, they require significantly less space while building a trie. Double-array tries are no longer slow to update; they are actually fast.

Third-Party Contributions

For those who want to use cedar in programming languages other than C++, the following third-party ports and bindings (wrappers) are available.

Disclaimer

We do not guarantee that the implemented algorithms other than those proposed by us are patent-free; we regarded them as patent-free simply because their implementations are available as existing open-source software (or after a simple patent search). Please be careful when you use this software for commercial purposes.

How to pronounce `cedar'

The library is named after the cedar, the most popular kind of tree in Japan; pronounce it the same as cedar in English.

Acknowledgments

This software could not have been implemented without the help of Dr. Yata, a guru of double-array and succinct tries. The developer deeply appreciates his guidance on the implementation of the mechanism that controls space and time [2].

References

  1. J. Aoe. An efficient digital search algorithm by using a double-array structure. IEEE Transactions on Software Engineering 15(9):1066--1077. 1989.
  2. S. Yata and M. Tamura and K. Morita and M. Fuketa and J. Aoe. Sequential Insertions and Performance Evaluations for Double-arrays. Proc. the 71st National Convention of IPSJ, pp. 1263--1264. 2009.
  3. N. Yoshinaga and M. Kitsuregawa. A Self-adaptive Classifier for Efficient Text-stream Processing. Proc. COLING 2014, pp. 1091--1102. 2014.
  4. D. R. Morrison. PATRICIA -- practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM. 15(4): 514--534. 1968.
  5. D. Sleator and R. Tarjan. Self-adjusting binary search trees. Journal of the ACM, Vol 32(3), pp 652--686. July 1985. See demo in A demonstration of top-down splaying
  6. C. R Aragon, R. Seidel. Randomized Search Trees. Proc. of FOCS, pp. 540--545. 1989.
  7. W. Pugh. Skip lists: a probabilistic alternative to balanced trees. Communications of the ACM 33: 668--676. 1990.
  8. I. Galperin and R. L. Rivest. Scapegoat trees. Proc. of SODA, pp. 165--174. 1993.
  9. A. Andersson. Balanced search trees made simple. Proc. Workshop on Algorithms and Data Structures, pp 60--71. 1993.
  10. J. L. Bentley and R. Sedgewick. Fast Algorithms for Sorting and Searching Strings. Proc. of SODA. pp. 360--369. 1997.
  11. D. Baskins. A 10-minute description of how Judy arrays work and why they are so fast. http://judy.sourceforge.net/. 2004.
  12. N. Askitis and R. Sinha. HAT-trie: a cache-conscious trie-based data structure for strings. Proc. of the 30th Australasian conference on Computer science. Vol 62, pp. 97--105. 2007.
  13. N. Askitis and J. Zobel. Cache-conscious Collision Resolution in String Hash Tables. Proc. of SPIRE. pp. 91--102. 2005.
  14. M. Herlihy and N. Shavit and M. Tzafrir. Hopscotch Hashing. Proc. of DISC. pp. 350--364. 2008.
  15. S. Yata. A compact static double-array keeping character codes. Information Processing & Management, 43(1), pp. 237--247. 2007
  16. G. Jacobson. Space-efficient static trees and graphs. Proc. SFCS, pp. 549--554. 1989.
  17. S. Yata. Dictionary Compression by Nesting Prefix/Patricia Tries. Proc. JNLP, pp. 576--578. 2011.

Copyright © 2012 Naoki Yoshinaga. All rights reserved.
Last modified: Thu May 12 01:44:36 2016