Direct Access Arrays
- Most operations in a computer only allow for constant logical branching, like if statements
- But one operation allows for non-constant branching factor, specifically the ability to randomly access any memory address in constant time
- This operation allows an algorithm's decision tree to branch with large branching factor, as large as there is space in your computer
- To exploit this operation, we define a data structure called a direct access array, which is a normal static array that associates a semantic meaning with each array index location: any item x with key k will be stored at array index k
--
- Suppose we want to store a set of n items, each with a unique integer key in the bounded range 0 to u−1 (some large number)
- We can store the items in a direct access array of length u
- Order operations will be slow: no guarantee onwhere the first, last, or next element is, may have to spend u time
- To find an item having integer key i, a search algorithm can simply look in array slot i, worst-case constant time
- The cost is storage space: a direct access array must have a slot available for every possible key in range, and when u is very large ccompared to the number of items stored, this can be wasteful or impossible
Hashing
- Can we get the performance benefits of a direct access array while using O(n) space when n≪u?
- Smaller dynamic direct access array with m=O(n) slots (instead of u), called a hash table
- Hash function/hash map: maps item keys to differen slos of the direct access array, h(k):{0,...,u−1}→{0,...,m−1}, h(k) is the hash of integer key k
- If the hash function happens to be injective over the n keys we're storing (no two keys map o the same direct access array index), then we can support worst-case constant time search, as the hash table simply acts as a direct access array over the smaller domain m
- But by pigeonhole principle, if the space of possible keys is larger than the number of array indices, i.e. m<u, then any hash function mapping u possible keys to m indices must map multiple keys to the same array index
- If h(k1)=h(k2), then the hashes of k1 and k2 collide
- When collisions occur, we can store collisions somewhere else in the same direct access array (open addressing), or store them somewhere else (chaining)
Chaining
- Colliding keys are stored separately from the original hash table
- Each hash table index holds a pointer to a chain, a separate data structure that supports the dynamic set interface, often a linked list or dynamic array (as long as find, insert, and delete take no more than linear time)
- Insert x into the chain at index h(x.key)
- Find or delete k from the chain at index h(k)
- If our chains only hold a constant number of items, then dynamic set operations will run in (amortized) constant time
- Simple Uniform Hashing Assumption (SUHA): A random key will be equally likely to hash to any of the m buckets {0,...,m−1}, independently of any other keys
- All of the set operations take O(1+α) time, where α=n/m is the load factor of the hash table (number of elements divided by number of buckets)
Open Addressing
- Different open addressing schemes are defined by different probing strategies, which dictate which slot to try next if the particular slot being looked at is occupied with a different key
- For a given key k, we first try to store k in h(k,0). If that's occupied, try h(k,1), then h(k,2), and so on
- When deleting keys, use lazy-deleting. Otherwise, one may prevent existing keys from being found. Use a special key that's treated as "occupied" when searching, but as "empty" when inserting.
- Assuming h can be evaluated in O(1) time, open addressing has similar performance to chaining, with some efficiencies for small load factors
Probing Schemes
-
Analog of SUHA: Uniform Hashing Assumption: The probe sequence of a random key is equally likely to be any of the m! permutations of {0,...,m−1}
- Quite strong! Every subsequent probe is chosen uniformly among the cells not yet probed
Linear Probing
- h(k,i)=(h′(k)+i)modm, where h′ is a regular hash function
- Doesn't satisfy the uniform hashing assumption: if two keys are mapped to the same initial slot by h′, they'll follow the same probing sequence, so there are m possible permutations instead of the required m!
Double Hashing
Hash Functions
- Division Method: h(k)=(kmodm)
- Ideally, the performance of our data structure is independent of the keys we store
- One heuristic is to choose m to be a large prime
- Assume SUHA holds: if the size of the hash table is Ω(n), then the average size of any chain will be 1+(n−1)/Ω(n)=O(1), thus enabling dynamic set operations in average-case constant time
- Note that to maintain m=O(n), insertion and deletion may require us to rebuild the direct access array to a different size, choose a new hash function, and reinsert all the items. This can be done the same way as in dynamic arrays, leading to amortized bounds for dynamic operations
Exercises