Perfect Bloom Structures (PBS)

Dr. Amjad Daoud

Abstract

Perfect Bloom (PBS) is a probabilistic structure that uses a perfect hash function of a static set S of keys to map all keys in S to distict locations with no collision. The PBS uses an additional bitvector to tell if a key is inserted so far or not. When all input keys are restricted to S, the PBS will return "Definitely Not Found" quickly. Otherwise if the key is in S, at least one entry will have a count of 1, returns "Definitely Found", and the perfect hash function will return its associated value. Order Preserving Perfect Bloom Structures can maintain any prior order as well.

Perfect Bloom Structure Algorithm

The steps of the algorithm for constructing a PBS are:

You have K keys, that you want to store perfectly
Choose a number N larger than K. This is the number of vertices in a random graph G (must be 1-orientable or acyclic), and also the size of the resulting Bloom array.
Pick two simple random hash functions f1, f2, that return values from 0 .. N-1.
Now, for all keys, you draw an edge between vertices f1(key) and f2(key) of the graph G, and associate the desired hash value with that edge.
G must be 1-orientable or acyclic; if not, go back to step 2.
Assign values to each vertex such that, for each key (edge), you can add the values for the two vertices and get the desired (hash) value for that key. If the graph is acyclic: pick a vertex, and assigne to it a value of 0. Then do a depth-first search, assigning values to new vertices so that they sum up properly.
f1, f2, and the vertex values of G now make up a perfect hash function and the Bloom array is a perfect Bloom.

For example, let S be the keys that are the abbreviations for the 50 states in uppercase, and their associated values chosen to be from 0 to 49.

  
  S(Key, Value) = {
     (AL, 0), (AK, 1), (AZ, 2), (AR, 3), (CA, 4),
     (CO, 5), (CT, 6), (DE, 7), (FL, 8), (GA, 9),
     (HI,10), (ID,11), (IL,12), (IN,13), (IA,14),
     (KS,15), (KY,16), (LA,17), (ME,18), (MD,19),
     (MA,20), (MI,21), (MN,22), (MS,23), (MO,24),
     (MT,25), (NE,26), (NV,27), (NH,28), (NJ,29),
     (NM,30), (NY,31), (NC,32), (ND,33), (OH,34),
     (OK,35), (OR,36), (PA,37), (RI,38), (SC,39),
     (SD,40), (TN,41), (TX,42), (UT,43), (VT,44),
     (VA,45), (WA,46), (WV,47), (WI,48), (WY,49)
     }

To construct a perfect Bloom PBS which is equivalent to an acyclic random graph, we set N = 66. Next, we choose two simple random hash functions f1 and f2:

  f1(key) = 15 * key[0] + 29 * key[1]   (mod N)

  f2(key) = 50 * key[0] + 61 * key[1]   (mod N)

Where key[0] and key[1] represent the integer (ASCII) value of the first and second letter in the key. For example:

  
  key      f1   f2
  ----------------
  AL  (0)  11   32
  CO  (5)  62   51
  SD (40)  49   48
  TN (41)  24   48
  WY (49)  58   11

The (mod N) is used to return values in the range 0 to 65 for all keys in S. Note also that even for the set of keys, the functions are not unique, e.g. f2(SD) = F2(TN) = 48.

Final distinct location for the Key K in S (note that values are actually chosen so that so keys are kept in sorted order; however values can be anything):

    
    perfect_hash(key) = G[f1(key)] + G[f2(key)]    (mod N)

The Perfect Bloom structure will use the same hash functions to build a counting bloom filter (CBF). Each key in S will have at least one distinct location. So when checking for the key K if we find that an entry has a count of 1, then the BPS will report "definitely there" at location computed by perfect_hash(key) instead of "probably there". If the key is not in the Perfect Bloom Structure then at least one entry will have a count of 0 and the PBS will report "definitely not there".

You can add keys from S to the Perfect Bloom PBS with an array of size 66:

Key:

Definitely not there.

Drawing all the edges between the vertices f1(key) and f2(key) using SVG, we have: