Hashing

· Read 02/25/2021

I watched this on Saturday night; I didn't attend this lecture or the last two recitations live. Here's basically an incomplete copy of the slides' content:

  • Hashing: lets us implement sets more efficiently and search in constant time
  • Application of sorting: search for things more efficiently
  • Direct access array

    • (Unsigned) integers to index into an array
    • Implement a set whose possible elements are 0...N-1 using an array of size N
  • Ideally size of our set should scale like number of elements in set, not size of universe the elements come from
  • Hash function maps elements in universe U to numbers from 0...m-1
  • Initialize: allocate an array T (hash table) of size m: O(m) time
  • Assuming h(x) is computable in O(1) time

    • Insert(x): put x in T[h(x)]: O(1) time
    • Find(x): check if x in T[h(x)]: O(1) time
    • Delete(x): set position T[h(x)] to None: O(1) time
  • Problem: hash collision; what if two x's have the same h(x)?
  • By pigeonhole, any hash function will have collisions
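Ignoring collisions for the moment, the basic table operations can be sketched in Python; the names (h, insert, find, delete) and the prime p = 31 are illustrative choices, not from the slides:

```python
m = 8                # table size (arbitrary for this sketch)
T = [None] * m       # initialize: allocate table T, O(m) time

def h(x):
    # placeholder hash function, assumed computable in O(1) time
    return (x % 31) % m   # prime division hash with p = 31

def insert(x):
    T[h(x)] = x           # O(1); silently overwrites on collision!

def find(x):
    return T[h(x)] == x   # O(1)

def delete(x):
    if T[h(x)] == x:
        T[h(x)] = None    # O(1)
```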
  • Simple uniform hashing function (SUHA): Each key x maps to a uniformly random h(x) in 0...m-1, independent of any other keys
  • Load factor: α = n/m, where n = number of elements already in the hash table and m = table size

    • If α = 1, then every cell is full, and any further insertion will cause a collision
  • Even if α is small and SUHA holds, we should expect collisions

    • Birthday paradox: m = 365, n = 23 gives a 50% chance of collision
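The birthday-paradox numbers are easy to verify directly: the probability of no collision when n keys hash uniformly into m slots is the product of (m − i)/m for i = 0, ..., n − 1.

```python
def collision_prob(m, n):
    # probability of at least one collision among n uniform keys in m slots
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m
    return 1 - p_no_collision

print(collision_prob(365, 23))  # ≈ 0.507, just over 50%
```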
  • Dealing with collisions: chaining

    • Use a linked list (or other sequence) to store collisions
    • Average case
    • Expected number of elements per bucket is α
    • Insert: Θ(1 + α)
    • Find: Θ(1 + α)
    • Delete: Θ(1 + α)
    • Worst case
    • x_1, ..., x_n, y all hash to the same value
    • After we insert x_1, ..., x_n

      • Insert(y) takes Θ(n) time
      • Find(y) takes Θ(n) time
      • Delete(x_n) takes Θ(n) time
    • No better than unsorted array (in fact, we have to allocate extra memory); hash tables are good in the average case
    • Speed up insertions by inserting at the head of the list: Θ(1)
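A minimal chaining sketch, using Python lists as the per-bucket chains. Note that inserting at the head is Θ(1) on an actual linked list; `list.insert(0, x)` on a Python list is O(chain length), so treat this purely as an illustration of the idea:

```python
m = 8
T = [[] for _ in range(m)]  # each bucket holds a chain of colliding keys

def h(x):
    return x % m            # placeholder hash function

def insert(x):
    T[h(x)].insert(0, x)    # insert at head of chain

def find(x):
    return x in T[h(x)]     # scan the chain: Θ(1 + α) on average

def delete(x):
    T[h(x)].remove(x)       # scan the chain: Θ(1 + α) on average
```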
  • Choosing a hash function

    • Very simple division hash (don't do this!): h(x) = x mod 2^k
    • Returns the last k bits of the binary representation
    • All x's with the same last k bits will collide
    • Prime division hash (often good enough in practice): pick a prime p larger than the table size: h(x) = (x mod p) mod m (the mod m ensures we always get something between 0 and m−1)
    • Heuristic: in real data, patterns are unlikely to be correlated with p
    • Any fixed simple hash function can be attacked
    • 6.046 for provably good hash function using randomness
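The weakness of the mod-2^k hash is easy to see on patterned data: keys sharing their last k bits all collide, while the prime division hash spreads the same keys out. Here p = 13 and m = 8 are arbitrary choices for the sketch:

```python
k, m, p = 3, 8, 13
keys = [8, 16, 24, 32, 40]  # all multiples of 8: same last 3 bits (000)

bad  = [x % 2**k for x in keys]   # every key lands in bucket 0
good = [(x % p) % m for x in keys]

print(bad)   # [0, 0, 0, 0, 0]
print(good)  # [0, 3, 3, 6, 1]
```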
  • Dealing with collisions: open addressing

    • If T[h(x)] is full, probe again at a different location until you find an empty spot
    • Probe sequence: h(x, i)
    • Hmm review this
    • Deletions
    • Can't just make the cell empty; that would break the probe sequence for values further along in it
    • Put a "deleted" flag
    • Can re-use a deleted cell to insert new item
    • Linear probing
    • Probing sequence: h(x, i) = (h(x) + i) mod m
    • Keep hopping by 1, wrapping around because of the mod m
    • Pros: simple and efficient (looking at nearby elements is faster than far)
    • Cons: very non-random! Will build clusters of occupied cells, which slow the hash table down as the load factor increases
    • Double hashing
    • h(x, i) = (h_1(x) + i · h_2(x)) mod m
    • Hop by multiples of h_2(x) (which must never return 0)
    • Pros: more random than linear probing
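A linear-probing sketch with the "deleted" flag, using sentinel objects so the probe sequence stays intact after deletions (EMPTY ends a probe; DELETED does not, and can be reused on insert):

```python
m = 8
EMPTY, DELETED = object(), object()
T = [EMPTY] * m

def h(x):
    return x % m                     # placeholder base hash

def insert(x):
    for i in range(m):               # probe h(x, i) = (h(x) + i) mod m
        j = (h(x) + i) % m
        if T[j] is EMPTY or T[j] is DELETED:  # deleted cells are reusable
            T[j] = x
            return
    raise RuntimeError("table full")

def find(x):
    for i in range(m):
        j = (h(x) + i) % m
        if T[j] is EMPTY:            # empty cell ends the probe sequence
            return False
        if T[j] == x:
            return True
    return False

def delete(x):
    for i in range(m):
        j = (h(x) + i) % m
        if T[j] is EMPTY:
            return
        if T[j] == x:
            T[j] = DELETED           # flag it; emptying would break probes
            return
```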
  • Table doubling and amortized complexity

    • What do you do when table fills up?
    • Chaining: chains get longer, but table still works
    • Open addressing: once you fill up, you're stuck
    • Solution: when α gets too big, double the table
    • Just like dynamic arrays (R03)
    • Allocate a new table of size 2m, reinsert all elements
    • In the long run, we have constant time, but individual operations might be slow
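Table doubling can be sketched with a chaining table that doubles whenever the load factor would exceed 1; the class name and threshold are illustrative choices:

```python
class GrowingTable:
    def __init__(self):
        self.m = 2
        self.T = [[] for _ in range(self.m)]
        self.n = 0

    def _h(self, x):
        return x % self.m            # placeholder hash; depends on m

    def insert(self, x):
        if self.n + 1 > self.m:      # load factor would exceed 1
            old = [y for chain in self.T for y in chain]
            self.m *= 2              # allocate table of size 2m
            self.T = [[] for _ in range(self.m)]
            for y in old:            # reinsert all: O(n), but rare
                self.T[self._h(y)].append(y)
        self.T[self._h(x)].append(x)
        self.n += 1

    def find(self, x):
        return x in self.T[self._h(x)]
```

A single insert that triggers doubling costs O(n), but since a doubling at size m pays for the m cheap inserts before it, the amortized cost per insert is O(1).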
  • Worst case vs average case vs amortized

    • Worst: for a given hash function and input data, an insert/find/delete can take O(n) time in the worst case
    • Average: for random data, an insert/find/delete takes O(1 + α) time on average
    • Amortized: over a sequence of many hash table operations, the average time per operation is O(1), though individual operations may take O(n) time
  • Applications of hashing

    • Efficient data structures (R04)
    • Cryptography: message authentication, checksums
    • Algorithms for searching (pset 2)