Hashing

· Read 02/25/2021

I watched this on Saturday night; I didn't attend this lecture or the last two recitations live. Here's basically an incomplete copy of the slides' content:

  • Hashing: lets us implement sets more efficiently and search in constant time
  • Application of sorting: search for things more efficiently
  • Direct access array

    • (Unsigned) integers to index into an array
    • Implement a set whose possible elements are 0...N-1 using an array of size N
  • Ideally size of our set should scale like number of elements in set, not size of universe the elements come from
  • Hash function maps elements in universe U to numbers from 0...m-1
  • Initialize: allocate an array T (hash table) of size m: O(m) time
  • Assuming h(x) is computable in O(1) time

    • Insert(x): put x in T[h(x)]: O(1) time
    • Find(x): check if x in T[h(x)]: O(1) time
    • Delete(x): set position T[h(x)] to None: O(1) time
  • Problem: hash collision; what if two x's have the same h(x)?
  • By pigeonhole, any hash function will have collisions
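Ignoring collisions for the moment, the basic table operations can be sketched in Python; the names (h, insert, find, delete) and the prime p = 31 are illustrative choices, not from the slides:

```python
m = 8                # table size (arbitrary for this sketch)
T = [None] * m       # initialize: allocate table T, O(m) time

def h(x):
    # placeholder hash function, assumed computable in O(1) time
    return (x % 31) % m   # prime division hash with p = 31

def insert(x):
    T[h(x)] = x           # O(1); silently overwrites on collision!

def find(x):
    return T[h(x)] == x   # O(1)

def delete(x):
    if T[h(x)] == x:
        T[h(x)] = None    # O(1)
```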
  • Simple uniform hashing function (SUHA): Each key x maps to a uniformly random h(x) in 0...m-1, independent of any other keys
  • Load factor: α = n/m, where n = number of elements already in the hash table and m = table size

    • If α = 1, then every cell is full, and any further insertion will cause a collision
  • Even if α is small and SUHA holds, we should expect collisions

    • Birthday paradox: m = 365, n = 23 gives a 50% chance of collision
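The birthday-paradox numbers are easy to verify directly: the probability of no collision when n keys hash uniformly into m slots is the product of (m − i)/m for i = 0, ..., n − 1.

```python
def collision_prob(m, n):
    # probability of at least one collision among n uniform keys in m slots
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m
    return 1 - p_no_collision

print(collision_prob(365, 23))  # ≈ 0.507, just over 50%
```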
  • Dealing with collisions: chaining

    • Use a linked list (or other sequence) to store collisions
    • Average case
    • Expected number of elements per bucket is α
    • Insert: Θ(1 + α)
    • Find: Θ(1 + α)
    • Delete: Θ(1 + α)
    • Worst case
    • x_1, ..., x_n, y all hash to the same value
    • After we insert x_1, ..., x_n

      • Insert(y) takes Θ(n) time
      • Find(y) takes Θ(n) time
      • Delete(x_n) takes Θ(n) time
    • No better than unsorted array (in fact, we have to allocate extra memory); hash tables are good in the average case
    • Speed up insertions by inserting at the head of the list: Θ(1)
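A minimal chaining sketch, using Python lists as the per-bucket chains. Note that inserting at the head is Θ(1) on an actual linked list; `list.insert(0, x)` on a Python list is O(chain length), so treat this purely as an illustration of the idea:

```python
m = 8
T = [[] for _ in range(m)]  # each bucket holds a chain of colliding keys

def h(x):
    return x % m            # placeholder hash function

def insert(x):
    T[h(x)].insert(0, x)    # insert at head of chain

def find(x):
    return x in T[h(x)]     # scan the chain: Θ(1 + α) on average

def delete(x):
    T[h(x)].remove(x)       # scan the chain: Θ(1 + α) on average
```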
  • Choosing a hash function

    • Very simple division hash (don't do this!): h(x) = x mod 2^k
    • Returns the last k bits of the binary representation
    • All x's with the same last k bits will collide
    • Prime division hash (often good enough in practice): pick a prime p larger than the table size: h(x) = (x mod p) mod m (the mod m ensures we always get something between 0 and m−1)
    • Heuristic: in real data, patterns are unlikely to be correlated with p
    • Any fixed simple hash function can be attacked
    • 6.046 for provably good hash function using randomness
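The weakness of the mod-2^k hash is easy to see on patterned data: keys sharing their last k bits all collide, while the prime division hash spreads the same keys out. Here p = 13 and m = 8 are arbitrary choices for the sketch:

```python
k, m, p = 3, 8, 13
keys = [8, 16, 24, 32, 40]  # all multiples of 8: same last 3 bits (000)

bad  = [x % 2**k for x in keys]   # every key lands in bucket 0
good = [(x % p) % m for x in keys]

print(bad)   # [0, 0, 0, 0, 0]
print(good)  # [0, 3, 3, 6, 1]
```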
  • Dealing with collisions: open addressing

    • If T[h(x)] is full, probe again at a different location until you find an empty spot
    • Probe sequence: h(x, i)
    • Hmm review this
    • Deletions
    • Can't just make the cell empty; that would break the probe sequence for values further along in it
    • Put a "deleted" flag
    • Can re-use a deleted cell to insert new item
    • Linear probing
    • Probing sequence: h(x, i) = (h(x) + i) mod m
    • Keep hopping by 1, wrapping around because of the mod m
    • Pros: simple and efficient (looking at nearby elements is faster than far)
    • Cons: very non-random! Will build clusters of occupied cells, which slow the hash table down as the load factor increases
    • Double hashing
    • h(x, i) = (h_1(x) + i · h_2(x)) mod m
    • Hop by multiples of h_2(x) (which must never return 0)
    • Pros: more random than linear probing
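A linear-probing sketch with the "deleted" flag, using sentinel objects so the probe sequence stays intact after deletions (EMPTY ends a probe; DELETED does not, and can be reused on insert):

```python
m = 8
EMPTY, DELETED = object(), object()
T = [EMPTY] * m

def h(x):
    return x % m                     # placeholder base hash

def insert(x):
    for i in range(m):               # probe h(x, i) = (h(x) + i) mod m
        j = (h(x) + i) % m
        if T[j] is EMPTY or T[j] is DELETED:  # deleted cells are reusable
            T[j] = x
            return
    raise RuntimeError("table full")

def find(x):
    for i in range(m):
        j = (h(x) + i) % m
        if T[j] is EMPTY:            # empty cell ends the probe sequence
            return False
        if T[j] == x:
            return True
    return False

def delete(x):
    for i in range(m):
        j = (h(x) + i) % m
        if T[j] is EMPTY:
            return
        if T[j] == x:
            T[j] = DELETED           # flag it; emptying would break probes
            return
```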
  • Table doubling and amortized complexity

    • What do you do when table fills up?
    • Chaining: chains get longer, but table still works
    • Open addressing: once you fill up, you're stuck
    • Solution: when α gets too big, double the table
    • Just like dynamic arrays (R03)
    • Allocate a new table of size 2m, reinsert all elements
    • In the long run, we have constant time, but individual operations might be slow
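Table doubling can be sketched with a chaining table that doubles whenever the load factor would exceed 1; the class name and threshold are illustrative choices:

```python
class GrowingTable:
    def __init__(self):
        self.m = 2
        self.T = [[] for _ in range(self.m)]
        self.n = 0

    def _h(self, x):
        return x % self.m            # placeholder hash; depends on m

    def insert(self, x):
        if self.n + 1 > self.m:      # load factor would exceed 1
            old = [y for chain in self.T for y in chain]
            self.m *= 2              # allocate table of size 2m
            self.T = [[] for _ in range(self.m)]
            for y in old:            # reinsert all: O(n), but rare
                self.T[self._h(y)].append(y)
        self.T[self._h(x)].append(x)
        self.n += 1

    def find(self, x):
        return x in self.T[self._h(x)]
```

A single insert that triggers doubling costs O(n), but since a doubling at size m pays for the m cheap inserts before it, the amortized cost per insert is O(1).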
  • Worst case vs average case vs amortized

    • Worst: for a given hash function and input data, an insert/find/delete can take O(n) time in the worst case
    • Average: for random data, an insert/find/delete takes O(1 + α) time on average
    • Amortized: over a sequence of many hash table operations, the average time per operation is O(1), though individual operations may take O(n) time
  • Applications of hashing

    • Efficient data structures (R04)
    • Cryptography: message authentication, checksums
    • Algorithms for searching (pset 2)