
# Direct Access Arrays

• Most operations in a computer only allow for constant logical branching, like if statements
• But one operation allows for non-constant branching factor, specifically the ability to randomly access any memory address in constant time
• This operation allows an algorithm's decision tree to branch with large branching factor, as large as there is space in your computer
• To exploit this operation, we define a data structure called a direct access array, which is a normal static array that associates a semantic meaning with each array index location: any item $x$ with key $k$ will be stored at array index $k$

---

• Suppose we want to store a set of $n$ items, each with a unique integer key in the bounded range 0 to $u-1$ (some large number)
• We can store the items in a direct access array of length $u$
• Order operations will be slow: there's no guarantee on where the first, last, or next element is stored, so we may have to spend $O(u)$ time scanning the array
• To find an item having integer key $i$, a search algorithm can simply look in array slot $i$, worst-case constant time
• The cost is storage space: a direct access array must have a slot available for every possible key in range, and when $u$ is very large compared to the number of items stored, this can be wasteful or impossible
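
A minimal sketch of this structure in Python, assuming each item is an object with an integer `key` attribute in $\{0,...,u-1\}$ (class and method names are illustrative):

```python
class DirectAccessArray:
    def __init__(self, u):
        self.A = [None] * u  # one slot per possible key: Theta(u) space

    def find(self, k):       # worst-case O(1): index directly by key
        return self.A[k]

    def insert(self, x):     # worst-case O(1)
        self.A[x.key] = x

    def delete(self, k):     # worst-case O(1)
        self.A[k] = None

    def find_next(self, k):  # order operation: may scan up to u slots, O(u)
        for i in range(k + 1, len(self.A)):
            if self.A[i] is not None:
                return self.A[i]
        return None
```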

# Hashing

• Can we get the performance benefits of a direct access array while using $O(n)$ space when $n \ll u$?
• Smaller dynamic direct access array with $m = O(n)$ slots (instead of $u$), called a hash table
• Hash function (or hash map): maps item keys to different slots of the direct access array, $h(k): \{0, ..., u-1\} \rightarrow \{0,...,m-1\}$; $h(k)$ is the hash of integer key $k$
• If the hash function happens to be injective over the $n$ keys we're storing (no two keys map to the same direct access array index), then we can support worst-case constant time search, as the hash table simply acts as a direct access array over the smaller domain of size $m$
• But by pigeonhole principle, if the space of possible keys is larger than the number of array indices, i.e. $m < u$, then any hash function mapping $u$ possible keys to $m$ indices must map multiple keys to the same array index
• If $h(k_1) = h(k_2)$, then the hashes of $k_1$ and $k_2$ collide
• When collisions occur, we can either store the colliding item at a different location in the same direct access array (open addressing), or store it outside the array entirely (chaining)
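
A two-line illustration of the pigeonhole point, using the division-method hash $h(k) = k \bmod m$ (covered under Hash Functions below); the specific keys are arbitrary:

```python
m = 10
h = lambda k: k % m
# 3 and 13 are distinct keys, but any function into 10 slots must collide somewhere
assert h(3) == h(13) == 3
```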

# Chaining

• Colliding keys are stored separately from the original hash table
• Each hash table index holds a pointer to a chain, a separate data structure that supports the dynamic set interface, often a linked list or dynamic array (as long as find, insert, and delete take no more than linear time)
• Insert $x$ into the chain at index $h(x.key)$
• Find or delete $k$ from the chain at index $h(k)$
• If our chains only hold a constant number of items, then dynamic set operations will run in (amortized) constant time
• Simple Uniform Hashing Assumption (SUHA): A random key will be equally likely to hash to any of the $m$ buckets $\{0,...,m-1\}$, independently of any other keys
• All of the set operations take $O(1+\alpha)$ time, where $\alpha = n/m$ is the load factor of the hash table (number of elements divided by number of buckets)
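
A sketch of a chained hash table in Python, using Python lists (dynamic arrays) as the chains and the division method $h(k) = k \bmod m$ as the hash; names are illustrative:

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.chains = [[] for _ in range(m)]  # one chain per bucket

    def _h(self, k):
        return k % self.m  # division method (see Hash Functions below)

    def insert(self, k, v):  # expected O(1 + alpha) under SUHA
        chain = self.chains[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:          # key already present: overwrite
                chain[i] = (k, v)
                return
        chain.append((k, v))

    def find(self, k):       # expected O(1 + alpha) under SUHA
        for key, v in self.chains[self._h(k)]:
            if key == k:
                return v
        return None

    def delete(self, k):     # expected O(1 + alpha) under SUHA
        chain = self.chains[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                chain.pop(i)
                return
```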

# Open Addressing

• Colliding keys are stored within the hash table array itself, at most one item per slot
• Different open addressing schemes are defined by different probing strategies, which dictate which slot to try next if the particular slot being looked at is occupied with a different key
• For a given key $k$, we first try to store $k$ in $h(k,0)$. If that's occupied, try $h(k,1)$, then $h(k,2)$, and so on
• When deleting keys, use lazy deletion; otherwise removing a key can break probe sequences and prevent existing keys from being found. Replace the deleted key with a special sentinel that's treated as "occupied" when searching, but as "empty" when inserting
• Assuming $h$ can be evaluated in $O(1)$ time, open addressing has similar performance to chaining, with some efficiencies for small load factors
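
A sketch of open addressing in Python, using linear probing (defined in the next section) as the probe strategy and a sentinel object for lazy deletion; names are illustrative:

```python
class OpenAddressingTable:
    _DELETED = object()  # sentinel: "occupied" when searching, "empty" when inserting

    def __init__(self, m):
        self.m = m
        self.slots = [None] * m

    def _probe(self, k, i):
        # linear probing: h(k, i) = (h'(k) + i) mod m, with h'(k) = k mod m
        return (k + i) % self.m

    def insert(self, k, v):
        # note: a full implementation would first check whether k already
        # sits later in the probe sequence, past a deleted slot
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None or s is self._DELETED or s[0] == k:
                self.slots[j] = (k, v)
                return
        raise RuntimeError("table full")

    def find(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None:        # truly empty slot: k can't be further along
                return None
            if s is not self._DELETED and s[0] == k:
                return s[1]
        return None

    def delete(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None:
                return
            if s is not self._DELETED and s[0] == k:
                self.slots[j] = self._DELETED  # lazy delete: probe chains stay intact
                return
```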

# Probing Schemes

• Analog of SUHA: Uniform Hashing Assumption: The probe sequence of a random key is equally likely to be any of the $m!$ permutations of $\{0,...,m-1\}$

• Quite strong! Every subsequent probe is chosen uniformly among the cells not yet probed

## Linear Probing

• $h(k,i) = (h'(k) + i) \mod m$, where $h'$ is a regular hash function
• Doesn't satisfy the uniform hashing assumption: if two keys are mapped to the same initial slot by $h'$, they'll follow the same probing sequence, so there are $m$ possible permutations instead of the required $m!$

## Double Hashing

• $h(k,i) = (h_1(k) + i h_2(k)) \mod m$

• Ensure that $h_2$ doesn't map any key to 0
• Ensure that $h_2(k)$ is relatively prime to $m$, otherwise the probe sequence for $k$ won't reach all slots; easily done by choosing a prime $m$
• Example: $h_1(k) = k \mod m$ and $h_2(k) = 1 + (k \mod (m-1))$
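
A quick check of this example in Python with a prime $m = 7$; the probe sequence for any key visits every slot exactly once:

```python
m = 7                              # prime table size
h1 = lambda k: k % m
h2 = lambda k: 1 + (k % (m - 1))   # always in {1,...,m-1}, never 0

def probe_sequence(k):
    return [(h1(k) + i * h2(k)) % m for i in range(m)]

print(probe_sequence(10))  # [3, 1, 6, 4, 2, 0, 5]: all 7 slots, no repeats
```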

# Hash Functions

• Division Method: $h(k) = (k \mod m)$
• Ideally, the performance of our data structure is independent of the keys we store
• One heuristic is to choose $m$ to be a large prime
• Assume SUHA holds: if the size of the hash table is $\Omega(n)$, then the average size of any chain will be $1 + (n-1)/\Omega(n) = O(1)$, thus enabling dynamic set operations in average-case constant time
• Note that to maintain $m = O(n)$, insertion and deletion may require us to rebuild the direct access array to a different size, choose a new hash function, and reinsert all the items. This can be done the same way as in dynamic arrays, leading to amortized bounds for dynamic operations
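
A sketch of the rebuild step, reusing the illustrative `ChainedHashTable` from the chaining sketch above:

```python
def rebuild(table, new_m):
    old_chains = table.chains
    table.m = new_m                  # new size, hence a new division-method hash
    table.chains = [[] for _ in range(new_m)]
    for chain in old_chains:
        for k, v in chain:           # reinsert every item under the new hash
            table.chains[k % new_m].append((k, v))
```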

# Exercises

• Given unsorted array $A = [a_0,...,a_{n-1}]$, the DUPLICATES problem asks whether there exist two integers in the array with the same value
• Average-case $O(n)$ solution

• Hash each of the $n$ integers into the table, insertion taking average-case $O(1)$
• When inserting an integer into a chain, check it against the other integers already in the chain, and return True if another integer in the chain has the same value
• Expected size of chains is $O(1)$, so check takes average $O(1)$, so algorithm runs in average-case $O(n)$
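
A sketch of this algorithm in Python; the built-in `set` is a hash table, so insertion and membership tests take average-case $O(1)$:

```python
def has_duplicates(A):
    seen = set()
    for a in A:
        if a in seen:   # a was inserted earlier, so two equal values exist
            return True
        seen.add(a)
    return False        # average-case O(n) total over n insertions/checks

assert has_duplicates([4, 7, 1, 7]) is True
assert has_duplicates([4, 7, 1, 2]) is False
```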