
# Direct Access Arrays

• Most operations in a computer only allow for constant logical branching, like if statements
• But one operation allows for non-constant branching factor, specifically the ability to randomly access any memory address in constant time
• This operation allows an algorithm's decision tree to branch with large branching factor, as large as there is space in your computer
• To exploit this operation, we define a data structure called a direct access array, which is a normal static array that associates a semantic meaning with each array index location: any item $x$ with key $k$ will be stored at array index $k$

---

• Suppose we want to store a set of $n$ items, each with a unique integer key in the bounded range 0 to $u-1$ (some large number)
• We can store the items in a direct access array of length $u$
• Order operations will be slow: there's no guarantee on where the first, last, or next element is stored, so we may have to spend $O(u)$ time scanning the array
• To find an item having integer key $i$, a search algorithm can simply look in array slot $i$, worst-case constant time
• The cost is storage space: a direct access array must have a slot available for every possible key in range, and when $u$ is very large compared to the number of items stored, this can be wasteful or impossible
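
A minimal sketch of this structure in Python, assuming each item is an object with an integer `key` attribute in $\{0,...,u-1\}$ (class and method names are illustrative):

```python
class DirectAccessArray:
    def __init__(self, u):
        self.A = [None] * u  # one slot per possible key: Theta(u) space

    def find(self, k):       # worst-case O(1): index directly by key
        return self.A[k]

    def insert(self, x):     # worst-case O(1)
        self.A[x.key] = x

    def delete(self, k):     # worst-case O(1)
        self.A[k] = None

    def find_next(self, k):  # order operation: may scan up to u slots, O(u)
        for i in range(k + 1, len(self.A)):
            if self.A[i] is not None:
                return self.A[i]
        return None
```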

# Hashing

• Can we get the performance benefits of a direct access array while using $O(n)$ space when $n \ll u$?
• Smaller dynamic direct access array with $m = O(n)$ slots (instead of $u$), called a hash table
• Hash function (or hash map): maps item keys to different slots of the direct access array, $h(k): \{0, ..., u-1\} \rightarrow \{0,...,m-1\}$; $h(k)$ is the hash of integer key $k$
• If the hash function happens to be injective over the $n$ keys we're storing (no two keys map to the same direct access array index), then we can support worst-case constant time search, as the hash table simply acts as a direct access array over the smaller domain of size $m$
• But by pigeonhole principle, if the space of possible keys is larger than the number of array indices, i.e. $m < u$, then any hash function mapping $u$ possible keys to $m$ indices must map multiple keys to the same array index
• If $h(k_1) = h(k_2)$, then the hashes of $k_1$ and $k_2$ collide
• When collisions occur, we can either store the colliding item at a different location in the same direct access array (open addressing), or store it outside the array entirely (chaining)
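
A two-line illustration of the pigeonhole point, using the division-method hash $h(k) = k \bmod m$ (covered under Hash Functions below); the specific keys are arbitrary:

```python
m = 10
h = lambda k: k % m
# 3 and 13 are distinct keys, but any function into 10 slots must collide somewhere
assert h(3) == h(13) == 3
```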

# Chaining

• Colliding keys are stored separately from the original hash table
• Each hash table index holds a pointer to a chain, a separate data structure that supports the dynamic set interface, often a linked list or dynamic array (as long as find, insert, and delete take no more than linear time)
• Insert $x$ into the chain at index $h(x.key)$
• Find or delete $k$ from the chain at index $h(k)$
• If our chains only hold a constant number of items, then dynamic set operations will run in (amortized) constant time
• Simple Uniform Hashing Assumption (SUHA): A random key will be equally likely to hash to any of the $m$ buckets $\{0,...,m-1\}$, independently of any other keys
• All of the set operations take $O(1+\alpha)$ time, where $\alpha = n/m$ is the load factor of the hash table (number of elements divided by number of buckets)
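
A sketch of a chained hash table in Python, using Python lists (dynamic arrays) as the chains and the division method $h(k) = k \bmod m$ as the hash; names are illustrative:

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.chains = [[] for _ in range(m)]  # one chain per bucket

    def _h(self, k):
        return k % self.m  # division method (see Hash Functions below)

    def insert(self, k, v):  # expected O(1 + alpha) under SUHA
        chain = self.chains[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:          # key already present: overwrite
                chain[i] = (k, v)
                return
        chain.append((k, v))

    def find(self, k):       # expected O(1 + alpha) under SUHA
        for key, v in self.chains[self._h(k)]:
            if key == k:
                return v
        return None

    def delete(self, k):     # expected O(1 + alpha) under SUHA
        chain = self.chains[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                chain.pop(i)
                return
```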

# Open Addressing

• Colliding keys are stored within the hash table array itself, at most one item per slot
• Different open addressing schemes are defined by different probing strategies, which dictate which slot to try next if the particular slot being looked at is occupied with a different key
• For a given key $k$, we first try to store $k$ in $h(k,0)$. If that's occupied, try $h(k,1)$, then $h(k,2)$, and so on
• When deleting keys, use lazy deletion; otherwise removing a key can break probe sequences and prevent existing keys from being found. Replace the deleted key with a special sentinel that's treated as "occupied" when searching, but as "empty" when inserting
• Assuming $h$ can be evaluated in $O(1)$ time, open addressing has similar performance to chaining, with some efficiencies for small load factors
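
A sketch of open addressing in Python, using linear probing (defined in the next section) as the probe strategy and a sentinel object for lazy deletion; names are illustrative:

```python
class OpenAddressingTable:
    _DELETED = object()  # sentinel: "occupied" when searching, "empty" when inserting

    def __init__(self, m):
        self.m = m
        self.slots = [None] * m

    def _probe(self, k, i):
        # linear probing: h(k, i) = (h'(k) + i) mod m, with h'(k) = k mod m
        return (k + i) % self.m

    def insert(self, k, v):
        # note: a full implementation would first check whether k already
        # sits later in the probe sequence, past a deleted slot
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None or s is self._DELETED or s[0] == k:
                self.slots[j] = (k, v)
                return
        raise RuntimeError("table full")

    def find(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None:        # truly empty slot: k can't be further along
                return None
            if s is not self._DELETED and s[0] == k:
                return s[1]
        return None

    def delete(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None:
                return
            if s is not self._DELETED and s[0] == k:
                self.slots[j] = self._DELETED  # lazy delete: probe chains stay intact
                return
```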

# Probing Schemes

• Analog of SUHA: Uniform Hashing Assumption: The probe sequence of a random key is equally likely to be any of the $m!$ permutations of $\{0,...,m-1\}$

• Quite strong! Every subsequent probe is chosen uniformly among the cells not yet probed

## Linear Probing

• $h(k,i) = (h'(k) + i) \mod m$, where $h'$ is a regular hash function
• Doesn't satisfy the uniform hashing assumption: if two keys are mapped to the same initial slot by $h'$, they'll follow the same probing sequence, so there are $m$ possible permutations instead of the required $m!$

## Double Hashing

• $h(k,i) = (h_1(k) + i h_2(k)) \mod m$

• Ensure that $h_2$ doesn't map any key to 0
• Ensure that $h_2(k)$ is relatively prime to $m$, otherwise the probe sequence for $k$ won't reach all slots; easily done by choosing a prime $m$
• Example: $h_1(k) = k \mod m$ and $h_2(k) = 1 + (k \mod (m-1))$
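
A quick check of this example in Python with a prime $m = 7$; the probe sequence for any key visits every slot exactly once:

```python
m = 7                              # prime table size
h1 = lambda k: k % m
h2 = lambda k: 1 + (k % (m - 1))   # always in {1,...,m-1}, never 0

def probe_sequence(k):
    return [(h1(k) + i * h2(k)) % m for i in range(m)]

print(probe_sequence(10))  # [3, 1, 6, 4, 2, 0, 5]: all 7 slots, no repeats
```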

# Hash Functions

• Division Method: $h(k) = (k \mod m)$
• Ideally, the performance of our data structure is independent of the keys we store
• One heuristic is to choose $m$ to be a large prime
• Assume SUHA holds: if the size of the hash table is $\Omega(n)$, then the average size of any chain will be $1 + (n-1)/\Omega(n) = O(1)$, thus enabling dynamic set operations in average-case constant time
• Note that to maintain $m = O(n)$, insertion and deletion may require us to rebuild the direct access array to a different size, choose a new hash function, and reinsert all the items. This can be done the same way as in dynamic arrays, leading to amortized bounds for dynamic operations
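
A sketch of the rebuild step, reusing the illustrative `ChainedHashTable` from the chaining sketch above:

```python
def rebuild(table, new_m):
    old_chains = table.chains
    table.m = new_m                  # new size, hence a new division-method hash
    table.chains = [[] for _ in range(new_m)]
    for chain in old_chains:
        for k, v in chain:           # reinsert every item under the new hash
            table.chains[k % new_m].append((k, v))
```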

# Exercises

• Given unsorted array $A = [a_0,...,a_{n-1}]$, the DUPLICATES problem asks whether there exist two integers in the array with the same value
• Average-case $O(n)$ solution

• Hash each of the $n$ integers into the table, insertion taking average-case $O(1)$
• When inserting an integer into a chain, check it against the other integers already in the chain, and return True if another integer in the chain has the same value
• Expected size of chains is $O(1)$, so check takes average $O(1)$, so algorithm runs in average-case $O(n)$
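
A sketch of this algorithm in Python; the built-in `set` is a hash table, so insertion and membership tests take average-case $O(1)$:

```python
def has_duplicates(A):
    seen = set()
    for a in A:
        if a in seen:   # a was inserted earlier, so two equal values exist
            return True
        seen.add(a)
    return False        # average-case O(n) total over n insertions/checks

assert has_duplicates([4, 7, 1, 7]) is True
assert has_duplicates([4, 7, 1, 2]) is False
```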