# Hashing

I watched this on Saturday night, didn't go to this lecture or the last 2 recitations live. Here's basically an incomplete copy of the slides' content:

• Hashing: lets us implement sets more efficiently and search in constant time
• Application of sorting: search for things more efficiently
• Direct access array

• (Unsigned) integers to index into an array
• Implement a set whose possible elements are 0...N-1 with an array of size N
• Ideally size of our set should scale like number of elements in set, not size of universe the elements come from
• Hash function maps elements in universe U to numbers from 0...m-1
• Initialize: allocate an array T (hash table) of size m: O(m) time
• Assuming h(x) is computable in O(1) time
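A minimal sketch of the direct access array set described above. All names here are illustrative; the point is that space is O(N) (the size of the universe) while each operation is O(1):

```python
class DirectAccessSet:
    """Set over the universe {0, ..., N-1}, backed by an array of size N."""

    def __init__(self, N):
        self.table = [False] * N   # O(N) space, regardless of how few elements we store

    def insert(self, x):           # O(1)
        self.table[x] = True

    def find(self, x):             # O(1)
        return self.table[x]

    def delete(self, x):           # O(1)
        self.table[x] = False
```

This is exactly the structure whose space a hash function lets us shrink: instead of indexing by x directly, we index by h(x) into a table of size m.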

• Insert(x): put x in T[h(x)]: O(1) time
• Find(x): check if x in T[h(x)]: O(1) time
• Delete(x): set position T[h(x)] to None: O(1) time
• Problem: hash collision; what if two x's have the same h(x)?
• By pigeonhole, any hash function will have collisions
• Simple uniform hashing function (SUHA): Each key x maps to a uniformly random h(x) in 0...m-1, independent of any other keys
• Load factor: $\alpha = n/m$, n = number of elements already in hash table, m = table size

• If $\alpha = 1$, then every cell is full, and if we try to add something, we'll have a collision
• Even if $\alpha$ is small and SUHA holds, we should expect collisions

• Birthday paradox: $m=365, n=23$, 50% chance of collision
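The birthday paradox number can be checked directly: under SUHA, the chance that $n$ keys land in $m$ slots with no collision is $\prod_{i=0}^{n-1} \frac{m-i}{m}$, so the collision probability is its complement:

```python
def collision_probability(n, m):
    """P(at least one collision) when n keys hash uniformly into m slots."""
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m   # i-th key must avoid the i slots taken so far
    return 1 - p_no_collision

print(round(collision_probability(23, 365), 3))  # → 0.507, already past 50%
```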
• Dealing with collisions: chaining

• Use a linked list (or other sequence) to store collisions
• Average
• Number of elements per bucket is $\alpha$
• Insert: $\Theta(1+\alpha)$
• Find: $\Theta(1+\alpha)$
• Delete: $\Theta(1+\alpha)$
• Worst case
• $x_1, ..., x_n, y$ all hash to same value
• After we insert $x_1,...,x_n$

• Insert($y$) takes $\Theta(n)$ time
• Find($y$) takes $\Theta(n)$ time
• Delete($x_n$) takes $\Theta(n)$ time
• No better than unsorted array (in fact, we have to allocate extra memory); hash tables are good in the average case
• Speed up insertions by inserting at head of list: $\Theta(1)$
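A sketch of chaining, using Python's built-in `hash` as a stand-in hash function and a plain list per slot as the chain (class and method names are illustrative):

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # one chain per slot

    def _h(self, x):
        return hash(x) % self.m               # stand-in for a good hash function

    def insert(self, x):
        chain = self.table[self._h(x)]
        if x not in chain:      # this scan is what makes insert Theta(1 + alpha);
            chain.insert(0, x)  # skipping it and inserting at the head is Theta(1)

    def find(self, x):          # Theta(1 + alpha): scan one chain
        return x in self.table[self._h(x)]

    def delete(self, x):        # Theta(1 + alpha): scan, then unlink
        chain = self.table[self._h(x)]
        if x in chain:
            chain.remove(x)
```

In the worst case all n keys share one slot and the single chain degenerates to a length-n list, matching the $\Theta(n)$ bounds above.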
• Choosing a hash function

• Very simple division hash (don't do this!): $h(x) = x \mod 2^k$
• Returns last $k$ bits in binary representation
• All $x$'s with the same last $k$ bits will collide
• Prime division hash (often good enough in practice): Pick prime $p$ larger than table size: $h(x) = (x \mod p) \mod m$ ($\mod m$ so we always get something between $0$ and $m-1$)
• Heuristic: In real data, patterns are unlikely to be correlated with $p$
• Any fixed simple hash function can be attacked
• 6.046 for provably good hash function using randomness
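A sketch of the prime division hash; the particular prime below is just an example choice, not from the lecture:

```python
p = 2**31 - 1   # an example prime, comfortably larger than any reasonable table size

def prime_division_hash(x, m):
    """h(x) = (x mod p) mod m, so the result is always in 0..m-1."""
    return (x % p) % m

# 160 and 96 share their last 5 bits (both end in 00000), so a division hash
# mod 2^5 would collide them; mod a prime they land in different slots:
print(prime_division_hash(0b1010_0000, 7), prime_division_hash(0b0110_0000, 7))  # → 6 5
```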
• Dealing with collisions: open addressing

• If T[h(x)] is full, probe again at a different location until you find an empty spot
• Probe sequence: $h(x,i)$
• Hmm review this
• Deletions
• Can't just make the cell empty; that would break the probe sequence for values stored further along it
• Put a "deleted" flag
• Can re-use a deleted cell to insert new item
• Linear probing
• Probing sequence: $h(x,i) = (h(x) + i) \mod m$
• Keep hopping by 1, loop around because of $\mod m$
• Pros: simple and efficient (looking at nearby elements is faster than far)
• Cons: very non-random! occupied cells build up in clusters, so probe sequences get long and the table slows down well before it's full
• Double hashing
• $h(x,i) = (h_1(x) + i \cdot h_2(x)) \mod m$
• Hop by multiples of $h_2$ (which must never return 0)
• Pros: more random than linear probing
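A sketch of open addressing with linear probing and "deleted" flags, assuming the probe sequence $h(x,i) = (h(x) + i) \mod m$ and using Python's built-in `hash` (all names are illustrative):

```python
EMPTY, DELETED = object(), object()   # sentinel markers for table cells

class LinearProbingTable:
    def __init__(self, m):
        self.m = m
        self.table = [EMPTY] * m

    def _probe(self, x):
        start = hash(x) % self.m
        for i in range(self.m):
            yield (start + i) % self.m     # h(x, i) = (h(x) + i) mod m

    def insert(self, x):
        for j in self._probe(x):
            if self.table[j] in (EMPTY, DELETED):  # re-use deleted cells
                self.table[j] = x
                return
        raise RuntimeError("table is full")        # open addressing's failure mode

    def find(self, x):
        for j in self._probe(x):
            if self.table[j] is EMPTY:     # truly empty: x can't be further along
                return False
            if self.table[j] == x:
                return True
        return False

    def delete(self, x):
        for j in self._probe(x):
            if self.table[j] is EMPTY:
                return
            if self.table[j] == x:
                self.table[j] = DELETED    # flag, don't empty: keeps later probes valid
                return
```

Note that `find` skips over DELETED cells but stops at EMPTY ones, which is exactly why deletion must leave a flag rather than an empty cell.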
• Table doubling and amortized complexity

• What do you do when table fills up?
• Chaining: chains get longer, but table still works
• Open addressing: once you fill up, you're stuck
• Solution: when $\alpha$ too big, double the table
• Just like dynamic arrays (R03)
• Allocate new table of size $2m$, reinsert all elements
• In the long run, we have constant time, but individual operations might be slow
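A sketch of table doubling on top of chaining; the threshold 0.75 is a common choice, not from the lecture:

```python
class DoublingHashTable:
    def __init__(self):
        self.m = 8
        self.n = 0
        self.table = [[] for _ in range(self.m)]

    def insert(self, x):
        if self.n / self.m > 0.75:       # load factor alpha too big?
            self._resize(2 * self.m)     # O(m) now, but O(1) amortized per insert
        self.table[hash(x) % self.m].append(x)
        self.n += 1

    def _resize(self, new_m):
        old = self.table
        self.m = new_m
        self.table = [[] for _ in range(new_m)]
        for chain in old:
            for x in chain:              # reinsert every element into the new table
                self.table[hash(x) % new_m].append(x)

    def find(self, x):
        return x in self.table[hash(x) % self.m]
```

Each resize costs O(m), but the table doubles, so the total rebuilding work over n inserts is O(n): constant time per operation in the amortized sense.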
• Worst case vs average case vs amortized

• Worst: for a given hash function and input data, an insert/find/delete can take $O(n)$ time in the worst case
• Average: for random data, an insert/find/delete takes $O(1 + \alpha)$ time on average
• Amortized: over a sequence of many hash table operations, the average time per operation is $O(1)$, but individual operations may take $O(n)$ time
• Applications of hashing

• Efficient data structures (R04)
• Cryptography: message authentication, checksums
• Algorithms for searching (pset 2)