Natural Language Processing Resources

Randomized Language Models

Minimal Perfect Hashing

Dr. Amjad M Daoud

Executive Director
Center of Innovations in Engineering and Computing Machinery
Voice MAP Search
BigD8TA Predictive Analytics

Combined Order Preserving Perfect Hashing and Bloom Filter

Presentation

 CICEM2013 Minimal Perfect Hashing Tutorial (powerpoint presentation)(HTML).

Minimal Perfect Hashing incredibly high impact applications:

  1. Used in the Disco DB project; a lightweight, open-source framework for distributed computing based on the MapReduce paradigm. Used to handle massive amounts of data at Nokia Research.

  2. Used in RAMPS (can preprocess the human genome in mere minutes) to align billions of short sequences of DNA (called reads), to a long reference genome many times faster than SOAP2 or Bowtie, and about 1000 times faster than GPU implementations such as SOAP3.

  3. Parts of the algorithm that was developed and published from 1989-1994 appeared in the 2009 Microsoft patent Perfect Multidimensional Spatial Hashing, Hugues H. Hoppe et al as Exemplary Hashing clause [0028], and claims 1,5,6,7,11,16.

  4. Used in Moses Statistical Machine Translation System http://www.statmt.org/moses/; see (Machine Translation course http://www.cicem.org/machine-translation/)

  5. Used in Google Translate and voice search (Speech recognition) to build huge randomized language models (http://www.google.com/patents/US8359201 Publication date: Jan 22, 2013).

  6. Fast and Scalable Packet Classification: The Perfect Hash Function performs exactly two external memory accesses to classify a packet. Using FPGA and one commodity SRAM chip, a throughput of 150 million packets per second and throughput of 100 Gbps for the shortest packets. Scales with faster SRAM chips.

  7. Used in linkless octrees to encode the storage locations of subdivided nodes using compact multi-level perfect hashing; combining coarse-to-fine hierarchical representation and optimal random accessibility.

  8. The algorithm is used to compute Google page rank Google Page Rank in C#;
  9. Fuzzy Tolerant Search with DWAGs and MPHF
  10. CMPH Library cmph.sourceforge.com
  11. MPHF in C#
  12. http://burtleburtle.net/bob/hash/perfect.html (splits keys into buckets by a first h1, sorts buckets by size, maps them in decreasing order so table[hash1(key)^hash2(key)] causes no collision).

Selected Perfect Hashing Publications

C/C++ Code

Python Code

Javascript Code