I watched this on Saturday night, didn't go to this lecture or the last 2 recitations live. Here's basically an incomplete copy of the slides' content:
 Hashing: lets us implement sets more efficiently and search in constant time
 Application of sorting: search for things more efficiently

Direct access array
 (Unsigned) integers to index into an array
 Implement a set whose possible elements are 0...N-1 with an array of size N
 Ideally size of our set should scale like number of elements in set, not size of universe the elements come from
 Hash function maps elements in universe U to numbers from 0...m-1
 Initialize: allocate an array T (hash table) of size m: O(m) time
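The direct-access idea can be sketched in a few lines of Python (illustrative names; assumes keys are unsigned integers below N):

```python
class DirectAccessSet:
    """Set over integer keys 0..N-1, backed by an array of size N."""
    def __init__(self, N):
        self.T = [False] * N   # O(N) space, even if the set stays small

    def insert(self, x):
        self.T[x] = True       # O(1): index directly by the key

    def find(self, x):
        return self.T[x]       # O(1)

    def delete(self, x):
        self.T[x] = False      # O(1)
```

The cost is space: the array scales with the size of the universe N, not with the number of stored elements, which is the problem hashing fixes.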

Assuming h(x) is computable in O(1) time
 Insert(x): put x in T[h(x)]: O(1) time
 Find(x): check if x is in T[h(x)]: O(1) time
 Delete(x): set T[h(x)] to None: O(1) time
 Problem: hash collision; what if two x's have the same h(x)?
 By pigeonhole, any hash function will have collisions
 Simple uniform hashing assumption (SUHA): Each key x maps to a uniformly random h(x) in 0...m-1, independent of any other keys

Load factor: $\alpha = n/m$, n = number of elements already in hash table, m = table size
 If $\alpha = 1$, then every cell is full and if we try to add something, we'll have a collision

Even if $\alpha$ is small and SUHA holds, we should expect collisions
 Birthday paradox: $m=365, n=23$, 50% chance of collision
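The birthday-paradox number can be checked directly by multiplying out the probability that all $n$ keys land in distinct slots (a quick sketch, not from the lecture):

```python
def collision_probability(n, m):
    """P(at least one collision) when n keys hash uniformly into m slots."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (m - i) / m   # i-th key must avoid i occupied slots
    return 1 - p_all_distinct

print(collision_probability(23, 365))  # ≈ 0.507
```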

Dealing with collisions: chaining
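A minimal chaining sketch in Python (illustrative; reuses Python's built-in hash): each of the m slots holds a list, and colliding keys just extend the chain at their slot.

```python
class ChainedHashSet:
    def __init__(self, m=8):
        self.m = m
        self.table = [[] for _ in range(m)]  # one chain (list) per slot

    def _h(self, x):
        return hash(x) % self.m

    def insert(self, x):
        chain = self.table[self._h(x)]
        if x not in chain:                   # avoid duplicates
            chain.append(x)

    def find(self, x):
        return x in self.table[self._h(x)]

    def delete(self, x):
        chain = self.table[self._h(x)]
        if x in chain:
            chain.remove(x)
```

Under SUHA the expected chain length is $\alpha = n/m$, which is where the $O(1 + \alpha)$ average-case bound below comes from.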

Choosing a hash function
 Very simple division hash (don't do this!): $h(x) = x \mod 2^k$
 Returns last $k$ bits in binary representation
 All $x$'s with the same last $k$ bits will collide
 Prime division hash (often good enough in practice): Pick prime $p$ larger than table size: $h(x) = (x \mod p) \mod m$ ($\mod m$ so we always get something between $0$ and $m-1$)
 Heuristic: In real data, patterns are unlikely to be correlated with $p$
 Any fixed simple hash function can be attacked
 6.046 for provably good hash function using randomness
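A sketch of the prime division hash ($p = 2^{31} - 1$ is one convenient Mersenne prime used here for illustration; any prime larger than the table size works):

```python
def prime_division_hash(x, m, p=2**31 - 1):
    """h(x) = (x mod p) mod m, with p prime and larger than m."""
    return (x % p) % m

# Unlike h(x) = x mod 2^k, keys that share their low-order bits
# don't all collide, because p shares no factors with powers of 2.
```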

Dealing with collisions: open addressing
 If T[h(x)] is full, probe again at a different location until you find an empty spot
 Probe sequence: $h(x,i)$ gives the $i$-th slot to try for key $x$, with $h(x,0) = h(x)$
 Deletions
 Can't just make the cell empty, will mess up the probe sequence for values further along in probe sequence
 Put a "deleted" flag
 Can reuse a deleted cell to insert new item
 Linear probing
 Probing sequence: $h(x,i) = (h(x) + i) \mod m$
 Keep hopping by 1, loop around because of $\mod m$
 Pros: simple and efficient (looking at nearby elements is faster than far)
 Cons: very nonrandom! occupied slots build up into clusters, which grow as the load factor increases and slow the table down
 Double hashing
 $h(x,i) = (h_1(x) + i \cdot h_2(x)) \mod m$
 Hop by multiples of $h_2(x)$ (which must never return 0, or $x$ would never move)
 Pros: more random than linear probing
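The open-addressing ideas above (linear probing plus "deleted" flags, often called tombstones) can be sketched as follows; names and the fixed table size are illustrative:

```python
EMPTY, DELETED = object(), object()   # sentinel markers

class LinearProbingSet:
    def __init__(self, m=16):
        self.m = m
        self.T = [EMPTY] * m

    def _probe(self, x):
        h = hash(x) % self.m
        for i in range(self.m):
            yield (h + i) % self.m    # h(x, i) = (h(x) + i) mod m

    def insert(self, x):
        tombstone = None
        for j in self._probe(x):
            slot = self.T[j]
            if slot is EMPTY:
                # End of x's probe sequence; reuse an earlier tombstone if any
                self.T[tombstone if tombstone is not None else j] = x
                return
            if slot is DELETED:
                if tombstone is None:
                    tombstone = j     # remember first reusable slot
            elif slot == x:
                return                # already present
        if tombstone is not None:
            self.T[tombstone] = x
        else:
            raise RuntimeError("table full")

    def find(self, x):
        for j in self._probe(x):
            slot = self.T[j]
            if slot is EMPTY:
                return False          # a truly empty slot ends the search
            if slot is not DELETED and slot == x:
                return True
        return False

    def delete(self, x):
        for j in self._probe(x):
            slot = self.T[j]
            if slot is EMPTY:
                return
            if slot is not DELETED and slot == x:
                self.T[j] = DELETED   # tombstone keeps probe sequences intact
                return
```

Note that find stops at the first truly empty slot but walks past tombstones, which is exactly why delete cannot simply clear a cell.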

Table doubling and amortized complexity
 What do you do when table fills up?
 Chaining: chains get longer, but table still works
 Open addressing: once you fill up, you're stuck
 Solution: when $\alpha$ too big, double the table
 Just like dynamic arrays (R03)
 Allocate new table of size $2m$, reinsert all elements
 In the long run, operations take constant time on average (amortized), but individual operations might be slow
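Table doubling on a chaining table, as a sketch (the resize threshold $\alpha > 1$ is illustrative; any constant threshold gives the same amortized bound):

```python
class ResizingChainedSet:
    """Chaining hash set that doubles its table when the load factor exceeds 1."""
    def __init__(self):
        self.m = 8
        self.n = 0
        self.table = [[] for _ in range(self.m)]

    def insert(self, x):
        if x in self.table[hash(x) % self.m]:
            return
        if self.n + 1 > self.m:          # load factor alpha would exceed 1
            self._resize(2 * self.m)     # O(n) rebuild, but amortized O(1)
        self.table[hash(x) % self.m].append(x)
        self.n += 1

    def find(self, x):
        return x in self.table[hash(x) % self.m]

    def _resize(self, new_m):
        old_table = self.table
        self.m = new_m
        self.table = [[] for _ in range(new_m)]
        for chain in old_table:          # reinsert every element under new m
            for x in chain:
                self.table[hash(x) % new_m].append(x)
```

Each element is reinserted at most once per doubling, and the doublings sum to O(n) total work over n inserts, which is the amortized argument below.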

Worst case vs average case vs amortized
 Worst: for a given hash function and input data, an insert/find/delete can take $O(n)$ time in the worst case
 Average: for random data, an insert/find/delete takes $O(1 + \alpha)$ time on average
 Amortized: for a sequence of many hash table operations, the average time per operation is $O(1)$, but individual operations may take $O(n)$ time

Applications of hashing
 Efficient data structures (R04)
 Cryptography: message authentication, checksums
 Algorithms for searching (pset 2)