Hash tables

!

Dynamic sets

Definition

A dynamic set, $S$, is a data structure that stores distinct elements.

We are interested to support the following functions:

Direct Addressing

Consider the case of dynamic sets:

!

!

A direct-address table is just an array $T$ of length $N$.

!

Implementation of direct address tables

def DIRECT_ADDRESS_SEARCH(T, k):
    return T[k]
def DIRECT_ADDRESS_INSERT(T, x):
    index = x.key
    return T[index] = x
def DIRECT_ADDRESS_DELETE(T, x):
    index = x.key
    T[index] = None

Properties of direct address tables

Hashing

Hash function

Hash Tables


!

PROBLEM: collision

A collision occurs if we have two $x, y\in S$ such that $x \not= y$, but $h(\mathrm{KEY}[x]) = h(\mathrm{KEY}[y])$.

!

Collision resolution

Before:

Each slot in the table stores a single element in the dynamic set.

After:

Each slot in the table stores a list of elements in the dynamic set. This is called chaining.

INSERT with chaining

def HASHTABLE_INSERT(T, x):
    index = h(x.key)
    if T[index] == None:
        T[index] = []
    T[index].append(x)

SEARCH with chaining

def HASHTABLE_SEARCH(T, key):
    index = h(key)
    if T[index] == None:
        return False
    else:
        for x in T[index]:
            if x.key == key:
                return x
        return False

DELETE with chaining

def HASHTABLE_DELETE(T, key):
    index = h(key)
    if T[i] == None:
        return
    else:
        remove(T[i], key)
def remove(array, key):
    for (i, x) in enumerate(array):
        if x.key == key:
            array[i:i+1] = []
            return

Analysis

!

def HASHTABLE_INSERT(T, x):
    index = h(x.key)
    if T[index] == None:
        T[index] = []
    T[index].append(x)

!

HASHTABLE_INSERT is done in $\Theta(1)$.

Analysis

!

def HASHTABLE_SEARCH(T, key):
    index = h(key)
    if T[i] == None:
        return False
    else:
        for x in T[i]:
            if x.key == key:
                return x
        return False

!

HASHTABLE_SEARCH is done in $\Theta(L)$ where $L$ is the average length of the lists.

If we assume uniform hash value distribution of the keys, we get:

$$ L = \frac{|S|}{|T|} $$

This is known as the load factor of the hash table.

Typically, we want to keep the load factor as 0.75:

$$ |T| \simeq 1.3 |S| $$

Properties of a good integer hash function

Assumption: a hash function $h$ maps all natural numbers $[0, 1, \dots]$ to a finite interval $[0 \dots m-1]$.


Uniformity:

Given any two keys, there is no relationship between $|a - b|$ and $|h(a) - h(b)|$.

So, if two keys are close to each other, there hash values are still randomly distributed in $[0, m]$.

The division method

def h_div(key, m):
    return key % m

The choice of $m$ can makes h_div rather poor in uniformity:

The multiplication method

import math

def h_mul(key, m, A):
    assert(0 < A < 1)
    prod = key * A
    frac = prod - math.floor(prod)
    return math.floor(m * frac)

This hash function is more versatile because we can set m freely. A needs to be selected wisely.

A good choice is:

$$ A = \frac{\sqrt{5} - 1}{2} \simeq 0.618034 $$

Hashing general byte arrays.

What if the key is not an integer, but rather a byte array?


Many solutions:

Universal hashing


Premise:

The source code of the hash table implementation is public and open sourced.

Universal hashing

What if we have a family of hash functions to choose from: H?

import random

class RandomHashTable:
  def __init__(self, m, H):
    self.h = random.sample(H, 1)
    self.T = [[] for i in range(m)]

  def insert(self, x):
    index = self.h(x.key)
    self.T[index].append(x)

  def search(self, key):
    index = self.h(key)
    for x in self.T[index]:
      if x.key == key:
        return x
    return None

! How does this stop the adversary?

Universal hashing

The choice of H must be good.

Consider the following bad choices of H:

Property of universal hashing

Definition:

The collection of hash functions H is universal if:

for all distinct keys $a, b$, the number of hash functions $h\in$ H with $h(a) = h(b)$ is at most $|H| / m$.

!

Theorem

If H is universal, then the implementation of RandomHashTable has collision with probability of 1/$m$.

Proof

Let `$H = {h_1, h_2, \dots h_n}$.

Let $a, b$ be two distinct keys: $a\not= b$.

What is the probability that RandomHashTable has a collision?

Let $h\in H$ be randomly selected.

What is the probability that $h(a) = h(b)$?

$$\begin{eqnarray} \mathbf{E}\{p(h(a) = p(h(b)): h\in H\} &=& p(h(a) = h(b) | h = h_1)p(h=h_1) \\ && + p(h(a) = h(b) | h = h_2)p(h=h_2) \\ && + p(h(a) = h(b) | h = h_3)p(h=h_3) + \dots \\ &=& |H| \times (1/m) \times (1/|H|) \\ &=& 1/m \end{eqnarray} $$

Q.E.D.

Designing Universal Hashing

Summary

!

back
© KEN PU, 2016