
# Direct Access Arrays

• Most operations in a computer only allow for constant logical branching, like if statements
• But one operation allows for non-constant branching factor, specifically the ability to randomly access any memory address in constant time
• This operation allows an algorithm's decision tree to branch with large branching factor, as large as there is space in your computer
• To exploit this operation, we define a data structure called a direct access array, which is a normal static array that associates a semantic meaning with each array index location: any item $x$ with key $k$ will be stored at array index $k$

---

• Suppose we want to store a set of $n$ items, each with a unique integer key in the bounded range 0 to $u-1$ (some large number)
• We can store the items in a direct access array of length $u$
• Order operations will be slow: there's no guarantee on where the first, last, or next element is stored, so we may have to spend $O(u)$ time scanning the array
• To find an item having integer key $i$, a search algorithm can simply look in array slot $i$, worst-case constant time
• The cost is storage space: a direct access array must have a slot available for every possible key in range, and when $u$ is very large compared to the number of items stored, this can be wasteful or impossible
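
A minimal sketch of this structure in Python, assuming each item is an object with an integer `key` attribute in $\{0,...,u-1\}$ (class and method names are illustrative):

```python
class DirectAccessArray:
    def __init__(self, u):
        self.A = [None] * u  # one slot per possible key: Theta(u) space

    def find(self, k):       # worst-case O(1): index directly by key
        return self.A[k]

    def insert(self, x):     # worst-case O(1)
        self.A[x.key] = x

    def delete(self, k):     # worst-case O(1)
        self.A[k] = None

    def find_next(self, k):  # order operation: may scan up to u slots, O(u)
        for i in range(k + 1, len(self.A)):
            if self.A[i] is not None:
                return self.A[i]
        return None
```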

# Hashing

• Can we get the performance benefits of a direct access array while using $O(n)$ space when $n \ll u$?
• Smaller dynamic direct access array with $m = O(n)$ slots (instead of $u$), called a hash table
• Hash function (or hash map): maps item keys to different slots of the direct access array, $h(k): \{0, ..., u-1\} \rightarrow \{0,...,m-1\}$; $h(k)$ is the hash of integer key $k$
• If the hash function happens to be injective over the $n$ keys we're storing (no two keys map to the same direct access array index), then we can support worst-case constant time search, as the hash table simply acts as a direct access array over the smaller domain of size $m$
• But by pigeonhole principle, if the space of possible keys is larger than the number of array indices, i.e. $m < u$, then any hash function mapping $u$ possible keys to $m$ indices must map multiple keys to the same array index
• If $h(k_1) = h(k_2)$, then the hashes of $k_1$ and $k_2$ collide
• When collisions occur, we can either store the colliding item at a different location in the same direct access array (open addressing), or store it outside the array entirely (chaining)
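
A two-line illustration of the pigeonhole point, using the division-method hash $h(k) = k \bmod m$ (covered under Hash Functions below); the specific keys are arbitrary:

```python
m = 10
h = lambda k: k % m
# 3 and 13 are distinct keys, but any function into 10 slots must collide somewhere
assert h(3) == h(13) == 3
```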

# Chaining

• Colliding keys are stored separately from the original hash table
• Each hash table index holds a pointer to a chain, a separate data structure that supports the dynamic set interface, often a linked list or dynamic array (as long as find, insert, and delete take no more than linear time)
• Insert $x$ into the chain at index $h(x.key)$
• Find or delete $k$ from the chain at index $h(k)$
• If our chains only hold a constant number of items, then dynamic set operations will run in (amortized) constant time
• Simple Uniform Hashing Assumption (SUHA): A random key will be equally likely to hash to any of the $m$ buckets $\{0,...,m-1\}$, independently of any other keys
• All of the set operations take $O(1+\alpha)$ time, where $\alpha = n/m$ is the load factor of the hash table (number of elements divided by number of buckets)
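
A sketch of a chained hash table in Python, using Python lists (dynamic arrays) as the chains and the division method $h(k) = k \bmod m$ as the hash; names are illustrative:

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.chains = [[] for _ in range(m)]  # one chain per bucket

    def _h(self, k):
        return k % self.m  # division method (see Hash Functions below)

    def insert(self, k, v):  # expected O(1 + alpha) under SUHA
        chain = self.chains[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:          # key already present: overwrite
                chain[i] = (k, v)
                return
        chain.append((k, v))

    def find(self, k):       # expected O(1 + alpha) under SUHA
        for key, v in self.chains[self._h(k)]:
            if key == k:
                return v
        return None

    def delete(self, k):     # expected O(1 + alpha) under SUHA
        chain = self.chains[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                chain.pop(i)
                return
```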

# Open Addressing

• Colliding keys are stored within the hash table array itself, at most one item per slot
• Different open addressing schemes are defined by different probing strategies, which dictate which slot to try next if the particular slot being looked at is occupied with a different key
• For a given key $k$, we first try to store $k$ in $h(k,0)$. If that's occupied, try $h(k,1)$, then $h(k,2)$, and so on
• When deleting keys, use lazy deletion; otherwise removing a key can break probe sequences and prevent existing keys from being found. Replace the deleted key with a special sentinel that's treated as "occupied" when searching, but as "empty" when inserting
• Assuming $h$ can be evaluated in $O(1)$ time, open addressing has similar performance to chaining, with some efficiencies for small load factors
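
A sketch of open addressing in Python, using linear probing (defined in the next section) as the probe strategy and a sentinel object for lazy deletion; names are illustrative:

```python
class OpenAddressingTable:
    _DELETED = object()  # sentinel: "occupied" when searching, "empty" when inserting

    def __init__(self, m):
        self.m = m
        self.slots = [None] * m

    def _probe(self, k, i):
        # linear probing: h(k, i) = (h'(k) + i) mod m, with h'(k) = k mod m
        return (k + i) % self.m

    def insert(self, k, v):
        # note: a full implementation would first check whether k already
        # sits later in the probe sequence, past a deleted slot
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None or s is self._DELETED or s[0] == k:
                self.slots[j] = (k, v)
                return
        raise RuntimeError("table full")

    def find(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None:        # truly empty slot: k can't be further along
                return None
            if s is not self._DELETED and s[0] == k:
                return s[1]
        return None

    def delete(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            s = self.slots[j]
            if s is None:
                return
            if s is not self._DELETED and s[0] == k:
                self.slots[j] = self._DELETED  # lazy delete: probe chains stay intact
                return
```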

# Probing Schemes

• Analog of SUHA: Uniform Hashing Assumption: The probe sequence of a random key is equally likely to be any of the $m!$ permutations of $\{0,...,m-1\}$

• Quite strong! Every subsequent probe is chosen uniformly among the cells not yet probed

## Linear Probing

• $h(k,i) = (h'(k) + i) \mod m$, where $h'$ is a regular hash function
• Doesn't satisfy the uniform hashing assumption: if two keys are mapped to the same initial slot by $h'$, they'll follow the same probing sequence, so there are $m$ possible permutations instead of the required $m!$

## Double Hashing

• $h(k,i) = (h_1(k) + i h_2(k)) \mod m$

• Ensure that $h_2$ doesn't map any key to 0
• Ensure that $h_2(k)$ is relatively prime to $m$, otherwise the probe sequence for $k$ won't reach all slots; easily done by choosing a prime $m$
• Example: $h_1(k) = k \mod m$ and $h_2(k) = 1 + (k \mod (m-1))$
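
A quick check of this example in Python with a prime $m = 7$; the probe sequence for any key visits every slot exactly once:

```python
m = 7                              # prime table size
h1 = lambda k: k % m
h2 = lambda k: 1 + (k % (m - 1))   # always in {1,...,m-1}, never 0

def probe_sequence(k):
    return [(h1(k) + i * h2(k)) % m for i in range(m)]

print(probe_sequence(10))  # [3, 1, 6, 4, 2, 0, 5]: all 7 slots, no repeats
```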

# Hash Functions

• Division Method: $h(k) = (k \mod m)$
• Ideally, the performance of our data structure is independent of the keys we store
• One heuristic is to choose $m$ to be a large prime
• Assume SUHA holds: if the size of the hash table is $\Omega(n)$, then the average size of any chain will be $1 + (n-1)/\Omega(n) = O(1)$, thus enabling dynamic set operations in average-case constant time
• Note that to maintain $m = O(n)$, insertion and deletion may require us to rebuild the direct access array to a different size, choose a new hash function, and reinsert all the items. This can be done the same way as in dynamic arrays, leading to amortized bounds for dynamic operations
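
A sketch of the rebuild step, reusing the illustrative `ChainedHashTable` from the chaining sketch above:

```python
def rebuild(table, new_m):
    old_chains = table.chains
    table.m = new_m                  # new size, hence a new division-method hash
    table.chains = [[] for _ in range(new_m)]
    for chain in old_chains:
        for k, v in chain:           # reinsert every item under the new hash
            table.chains[k % new_m].append((k, v))
```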

# Exercises

• Given unsorted array $A = [a_0,...,a_{n-1}]$, the DUPLICATES problem asks whether there exist two integers in the array with the same value
• Average-case $O(n)$ solution

• Hash each of the $n$ integers into the table, insertion taking average-case $O(1)$
• When inserting an integer into a chain, check it against the other integers already in the chain, and return True if another integer in the chain has the same value
• Expected size of chains is $O(1)$, so check takes average $O(1)$, so algorithm runs in average-case $O(n)$
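
A sketch of this algorithm in Python; the built-in `set` is a hash table, so insertion and membership tests take average-case $O(1)$:

```python
def has_duplicates(A):
    seen = set()
    for a in A:
        if a in seen:   # a was inserted earlier, so two equal values exist
            return True
        seen.add(a)
    return False        # average-case O(n) total over n insertions/checks

assert has_duplicates([4, 7, 1, 7]) is True
assert has_duplicates([4, 7, 1, 2]) is False
```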