Hash table = most common implementation for set/map
Techniques for Storing Data: Bushy BST
Limitations:
Items/keys must be comparable
Maintaining bushiness non-trivial
Using Data as an Index
Rather than using contents of array to store keys → use indices themselves as keys
Use data itself as array index
Downsides:
Extremely wasteful of memory (to support checking presence of all positive integers, need 2 billion booleans)
Need some way to generalize beyond integers
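The "data as index" idea above can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name `DataIndexedIntegerSet` and the `capacity` parameter are assumptions made here to keep the array small enough to allocate.

```java
// Hypothetical class: a set of non-negative ints where the key itself is the array index.
class DataIndexedIntegerSet {
    private final boolean[] present;

    // capacity is an assumed exclusive upper bound on the keys we accept.
    DataIndexedIntegerSet(int capacity) {
        present = new boolean[capacity];
    }

    // Marks x as present; x must satisfy 0 <= x < capacity.
    void add(int x) {
        present[x] = true;
    }

    // Out-of-range keys are simply reported as absent.
    boolean contains(int x) {
        return x >= 0 && x < present.length && present[x];
    }
}
```

Both operations touch a single array slot, but the array must be as large as the largest possible key, which is the memory downside listed above.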
Refined Approach
Treat string as n-digit base 27 number
c: 3rd letter of alphabet, a: 1st letter, t: 20th letter
Index of "cat" is $$3 \cdot 27^{2} + 1 \cdot 27 + 20 = 2234$$
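The base 27 calculation above can be reproduced with a short sketch. The class and method names (`Base27`, `index`) are hypothetical, and the code assumes the word contains only lowercase letters `a` to `z`.

```java
// Hypothetical helper: treats a lowercase word as a base-27 number (a = 1, ..., z = 26).
class Base27 {
    static int index(String word) {
        int result = 0;
        for (int i = 0; i < word.length(); i++) {
            int digit = word.charAt(i) - 'a' + 1; // position of the letter in the alphabet
            result = result * 27 + digit;         // shift previous digits up one base-27 place
        }
        return result;
    }
}
```

`Base27.index("cat")` evaluates to 2234, matching the arithmetic above. Because the result is an `int`, words longer than about six letters overflow, which is one reason collisions have to be handled.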
Generalizing to Words
Convert each word to unique integer representation
Using 5 bits per letter equivalent to treating like base 32 number
Java requires that EVERY object provide a method that converts itself into an integer: hashCode()
Handling Collisions
Instead of a single item, store a list of the items that map to position $$h$$
External/Separate Chaining
Storing all items that map to $$h$$ in a linked list
External Chaining Performance
Depends on # of items in "bucket"
If $$N$$ items distributed across $$M$$ buckets, average time grows with $$\frac{N}{M} = L$$ (load factor)
Average length of list = $$L$$
Average runtime = $$\Theta(L)$$
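A minimal sketch of external chaining with a fixed number of buckets, assuming String keys and a bucket count $$M > 0$$. The class name `ChainedStringSet` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical fixed-size hash set: each bucket holds a linked list of the keys that map to it.
class ChainedStringSet {
    private final List<LinkedList<String>> buckets = new ArrayList<>();
    private int size = 0; // N, the number of stored keys

    // m is the number of buckets M; assumed to be positive.
    ChainedStringSet(int m) {
        for (int i = 0; i < m; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // floorMod keeps the index non-negative even when hashCode() is negative.
    private LinkedList<String> bucketFor(String key) {
        int h = Math.floorMod(key.hashCode(), buckets.size());
        return buckets.get(h);
    }

    // Scans only the one bucket the key maps to.
    boolean contains(String key) {
        return bucketFor(key).contains(key);
    }

    void add(String key) {
        LinkedList<String> bucket = bucketFor(key);
        if (!bucket.contains(key)) {
            bucket.add(key);
            size++;
        }
    }

    // L = N / M, the average bucket length.
    double loadFactor() {
        return (double) size / buckets.size();
    }
}
```

Since $$M$$ never changes here, the load factor grows linearly with the number of keys, so `contains` slows down as the set fills.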
Performance depends on # of items in each "bucket"
| Type | Load Factor $$L$$ |
| --- | --- |
| External Chaining, Fixed Size | $$\Theta(N)$$ |
| External Chaining w/ Resizing | $$\Theta(1) \text{ (amortized)}$$ |
Negative hashCodes in Java
In Java, -1 % 4 == -1 → use Math.floorMod instead
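A short sketch of the difference, using the standard `Math.floorMod`. The wrapper class and method names (`BucketIndex`, `remainderIndex`, `floorModIndex`) are hypothetical.

```java
// Hypothetical helpers comparing the two ways of reducing a hashCode to a bucket index.
class BucketIndex {
    // The % operator keeps the sign of the dividend, so this can return a negative index.
    static int remainderIndex(int hashCode, int m) {
        return hashCode % m;
    }

    // Math.floorMod returns a value in [0, m) for any positive m.
    static int floorModIndex(int hashCode, int m) {
        return Math.floorMod(hashCode, m);
    }
}
```

With a hashCode of -1 and 4 buckets, `remainderIndex` gives -1 (an invalid array index) while `floorModIndex` gives 3.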
Hash Table
Every item mapped to bucket # using hash function
Computing hash function consists of 2 steps:
Compute hashCode (integer between $$-2^{31}$$ & $$2^{31} - 1$$)
Compute index = hashCode $$\bmod M$$
If $$L = \frac{N}{M}$$ gets too large, increase $$M$$
If multiple items map to same bucket, have to resolve ambiguity
Two common techniques:
External Chaining (create list for each bucket)
Open Addressing
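The two-step hash computation and the rule "increase $$M$$ when $$L$$ gets too large" can be sketched together, here with external chaining. The class name `ResizingHashSet`, the starting size of 4 buckets, the threshold `MAX_LOAD = 1.5`, and the doubling policy are all assumptions chosen for illustration; the notes do not fix any of them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical resizing hash set using external chaining.
class ResizingHashSet<T> {
    private static final double MAX_LOAD = 1.5; // assumed threshold for L = N / M
    private List<List<T>> buckets = newBuckets(4); // assumed initial M
    private int size = 0;

    private static <E> List<List<E>> newBuckets(int m) {
        List<List<E>> b = new ArrayList<>(m);
        for (int i = 0; i < m; i++) {
            b.add(new ArrayList<>());
        }
        return b;
    }

    // Step 1: hashCode(). Step 2: reduce it to an index in [0, m).
    private static int indexOf(Object item, int m) {
        return Math.floorMod(item.hashCode(), m);
    }

    boolean contains(T item) {
        return buckets.get(indexOf(item, buckets.size())).contains(item);
    }

    void add(T item) {
        if (contains(item)) {
            return;
        }
        // Grow before inserting if the new item would push L past the threshold.
        if ((double) (size + 1) / buckets.size() > MAX_LOAD) {
            resize(buckets.size() * 2);
        }
        buckets.get(indexOf(item, buckets.size())).add(item);
        size++;
    }

    // Every item must be re-indexed because its bucket depends on M.
    private void resize(int newM) {
        List<List<T>> bigger = newBuckets(newM);
        for (List<T> bucket : buckets) {
            for (T item : bucket) {
                bigger.get(indexOf(item, newM)).add(item);
            }
        }
        buckets = bigger;
    }

    int size() {
        return size;
    }

    int bucketCount() {
        return buckets.size();
    }
}
```

Doubling keeps $$L$$ bounded by a constant, so the occasional expensive resize averages out over many inserts.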
Hash Functions
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
Bit shifting left introduces 0 on right → loses bit on left of number's binary representation → clear information loss
Repeated bit shifting → gradually lose all info accumulated from earlier computation
More fields entering hashCode calculation → early fields have less effect on final result
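The information loss from an even multiplier can be seen in a small sketch. The class and method names (`PolyHash`, `hash`) are hypothetical; with a multiplier of 31 the loop computes the same polynomial that the documentation of `String.hashCode()` specifies.

```java
// Hypothetical polynomial hash with a configurable multiplier.
class PolyHash {
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            // int arithmetic wraps around on overflow, dropping the high bits
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }
}
```

With multiplier 32, each step is a left shift by 5 bits, so after seven more characters the first character has been shifted out of the 32-bit result entirely: `hash("axxxxxxx", 32)` equals `hash("bxxxxxxx", 32)`. With multiplier 31 the two hashes differ, because an odd multiplier never shifts the earlier bits away completely.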
Hash a Collection
Lists a lot like Strings: collection of items, each w/ own hashCode
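A minimal sketch of hashing a list by combining its elements' hashCodes. The class and method names (`ListHash`, `hash`) are hypothetical; the starting value of 1, the multiplier 31, and treating `null` as 0 follow the formula documented for `java.util.List.hashCode()`.

```java
import java.util.List;

// Hypothetical helper: combines element hashCodes the way a String combines its characters.
class ListHash {
    static int hash(List<?> items) {
        int h = 1;
        for (Object o : items) {
            // earlier elements get multiplied by 31 again on every later step
            h = 31 * h + (o == null ? 0 : o.hashCode());
        }
        return h;
    }
}
```

Because each element is folded in by position, lists with the same elements in a different order generally produce different hashCodes.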