CS253 Hashing
Leonardo da Vinci’s Mona Lisa and John the Baptist
Hashing in General
To hash an object:
- Combine the bits of the object into a single number
(the hash value).
- Use that number (mod N) as an index into an array
of buckets.
- Each bucket is a collection of data with the same hash value.
- If N is large enough, each bucket only contains a few values.
- A good hash adjusts the number of buckets dynamically.
- It can take a lot of space, but it’s fast: an O(1) index to the
bucket, then an O(n) search in the bucket, where n is the number
of items in that bucket. A good hash will keep the bucket size small.
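The first two steps can be sketched in a few lines. This is only an illustration: FNV-1a is an assumption, chosen here because it is short, and the names fnv1a and bucket_index are made up for this sketch. The standard library is free to use something else entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Step 1: combine the bytes of the object into a single number.
// FNV-1a (64-bit) is an assumption here, not the lecture's hash function.
std::uint64_t fnv1a(const std::string &s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;                                // mix in one byte
        h *= 0x100000001b3ULL;                 // FNV prime; wraps mod 2^64
    }
    return h;
}

// Step 2: reduce the hash value mod N to get an index into the bucket array.
std::size_t bucket_index(std::uint64_t hash_value, std::size_t n_buckets) {
    return static_cast<std::size_t>(hash_value % n_buckets);
}
```

For example, bucket_index(fnv1a("animal"), 5) always yields a value from 0 to 4, whatever the hash value turns out to be.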
Typical Hash Table
A hash table starts like this, an array of five (for instance) pointers,
all initially null.
   0     1     2     3     4
┌─────┬─────┬─────┬─────┬─────┐
│  ●  │  ●  │  ●  │  ●  │  ●  │
└─────┴─────┴─────┴─────┴─────┘
- Why five? A prime number gives us the best chance of
spreading out input data with a pattern.
- If our array size were even, and the hash values were all multiples
of 10, then at least half of our buckets would be unused.
- Five is ludicrously small for actual code, but fits on a screen.
A real hash table might have thousands of buckets.
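A small experiment makes the effect visible. The sketch below counts how many distinct buckets are hit when every value is a multiple of some stride; the helper name buckets_used is made up for illustration. Note that a prime bucket count only helps when the stride is not a multiple of that prime: with five buckets, multiples of 10 would all land in bucket 0.

```cpp
#include <cstddef>
#include <set>

// Count how many distinct buckets are hit by the values
// 0, stride, 2*stride, ... (count values in all), using value % n_buckets.
std::size_t buckets_used(std::size_t stride, std::size_t count,
                         std::size_t n_buckets) {
    std::set<std::size_t> used;
    for (std::size_t i = 0; i < count; ++i)
        used.insert((i * stride) % n_buckets);
    return used.size();
}
```

With a stride of 10, eight buckets leave half of the table unused (4 of 8), seven buckets use all seven, and five buckets use only one.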
Typical Hash Table
After adding "animal" and "vegetable":
   0     1     2     3     4
┌─────┬─────┬─────┬─────┬─────┐
│  ●  │     │  ●  │     │  ●  │
└─────┴──┼──┴─────┴──┼──┴─────┘
         │           │
         ∨           ∨
    ┌────────┐ ┌───────────┐
    │ animal │ │ vegetable │
    └────────┘ └───────────┘
- "animal" hashed to 21, which is 1 (mod 5), so it’s in bucket 1.
- "vegetable" hashed to 9823438, which is 3 (mod 5), so it’s in bucket 3.
- We’re not delving into the details of the
string ⇒ unsigned hash function being used here.
Typical Hash Table
After adding "mineral":
   0     1     2     3     4
┌─────┬─────┬─────┬─────┬─────┐
│  ●  │     │  ●  │     │  ●  │
└─────┴──┼──┴─────┴──┼──┴─────┘
         │           │
         ∨           ∨
    ┌────────┐  ┌─────────┐   ┌───────────┐
    │ animal │  │ mineral │──>│ vegetable │
    └────────┘  └─────────┘   └───────────┘
- "mineral" hashed to 3673, which is 3 (mod 5), so it also goes into bucket 3.
- Since that bucket was non-empty, it was added to the list
for that bucket.
- It doesn’t matter where in the linked list we add the new item,
so we added it at the start, which is easy.
Typical Hash Table
   0     1     2     3     4
┌─────┬─────┬─────┬─────┬─────┐
│  ●  │     │  ●  │     │  ●  │
└─────┴──┼──┴─────┴──┼──┴─────┘
         │           │
         ∨           ∨
    ┌────────┐  ┌─────────┐   ┌───────────┐
    │ animal │  │ mineral │──>│ vegetable │
    └────────┘  └─────────┘   └───────────┘
- To traverse the table:
    for each pointer in the array:
        for each node in that linked list:
            process that item
- The input order (animal/vegetable/mineral) may not resemble the
output order (animal/mineral/vegetable).
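The insert-at-front and traversal steps can be sketched with a toy table. This is a simplification for illustration, not how the standard containers are written: ToyTable is a made-up name, and the caller passes the hash value in so that the numbers from the slides (21, 9823438, 3673) can be replayed exactly.

```cpp
#include <cstddef>
#include <forward_list>
#include <string>
#include <vector>

// A toy chained hash table: an array of singly-linked lists.
struct ToyTable {
    std::vector<std::forward_list<std::string>> buckets;

    explicit ToyTable(std::size_t n) : buckets(n) {}

    // Add at the front of the bucket's list, which is O(1).
    void insert(const std::string &item, std::size_t hash_value) {
        buckets[hash_value % buckets.size()].push_front(item);
    }

    // Visit every bucket in index order, then every node in that bucket.
    std::vector<std::string> traverse() const {
        std::vector<std::string> out;
        for (const auto &bucket : buckets)
            for (const auto &item : bucket)
                out.push_back(item);
        return out;
    }
};
```

Inserting animal, vegetable, and mineral with those hash values into a five-bucket ToyTable and then calling traverse() yields animal, mineral, vegetable.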
Expanding the Table
- Of course, if our five-pointer table gets too many items, then the
linked lists will get too long for an efficient linear search.
- When that happens, rehash: expand the table to eleven
(another prime) pointers and rearrange everything.
- Prime numbers are practical! Who’d’ve thought?
- Increasing the table by big jumps makes rehashing occur less often,
but wastes more space. It’s a trade-off!
- The scheme of roughly doubling the container size should remind
you of vector’s memory allocation technique.
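Rehashing does not have to copy the stored items. The sketch below assumes each bucket is a std::list and relinks the existing nodes into a new, larger table with splice. The names Bucket, Table, and rehash are hypothetical, and the buckets hold bare hash values to keep the example short.

```cpp
#include <cstddef>
#include <list>
#include <vector>

using Bucket = std::list<std::size_t>;   // stores hash values directly, for brevity
using Table  = std::vector<Bucket>;

// Move every node from old_table into a new table with new_count buckets.
Table rehash(Table &old_table, std::size_t new_count) {
    Table fresh(new_count);
    for (Bucket &b : old_table)
        while (!b.empty()) {
            Bucket &dest = fresh[b.front() % new_count];
            dest.splice(dest.begin(), b, b.begin());   // relink the node; no copy
        }
    return fresh;
}
```

Going from five buckets to eleven, the values 21 and 3673 both end up in bucket 10, and 9823438 ends up in bucket 9; the old table is left empty.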
So What?
- This is all very nice, and good for several assignments and a quiz
in Data Structures.
- It’s tricky, and easy to get wrong:
- What if a list is empty?
- How do we rehash without completely duplicating all the data?
- How do we traverse the container?
- How do we compute the next largest prime number, anyway?
- Fortunately, the C++ unordered_set, unordered_multiset,
unordered_map, and unordered_multimap containers do all of the
heavy lifting for you.
- Their semantics are similar to set, multiset,
map, and multimap, except for ordering.
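As a small illustration of how close the interfaces are, here is a word counter written with unordered_map. The function names count_words and sorted are made up for this sketch. Copying into a map is one simple way to get ordered output back when it is wanted.

```cpp
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Count occurrences with a hash table: same interface as std::map,
// but the iteration order is unspecified.
std::unordered_map<std::string, int>
count_words(const std::vector<std::string> &words) {
    std::unordered_map<std::string, int> counts;
    for (const auto &w : words)
        ++counts[w];      // operator[] inserts 0 on first sight, as with map
    return counts;
}

// Copy into a std::map when sorted output is wanted.
std::map<std::string, int>
sorted(const std::unordered_map<std::string, int> &m) {
    return std::map<std::string, int>(m.begin(), m.end());
}
```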
Hashing in C++
unordered_set<int> p = {2, 3, 5, 7, 11, 13, 17, 19};
for (auto n : p)
    cout << n << ' ';
19 17 11 7 5 3 13 2
- How many buckets were used? Who cares?
- What was the hash function used? Who cares!
- When does it rehash? Who cares?
- These all have default implementation-dependent answers.
- These can be queried and changed.
I Care
OK, let’s say that we care. We can find out:
unordered_set<int> p = {2, 3, 5, 7, 11, 13, 17, 19};
cout << "Buckets: " << p.bucket_count() << '\n'
     << "Size: " << p.size() << '\n'
     << "Load: " << p.load_factor() << " of "
     << p.max_load_factor() << '\n';
for (size_t b = 0; b<p.bucket_count(); b++)
    if (p.bucket_size(b))
        cout << "Bucket " << b << ": "
             << p.bucket_size(b) << " items\n";
for (auto n : p)
    cout << n << ' ';
Buckets: 11
Size: 8
Load: 0.727273 of 1
Bucket 0: 1 items
Bucket 2: 2 items
Bucket 3: 1 items
Bucket 5: 1 items
Bucket 6: 1 items
Bucket 7: 1 items
Bucket 8: 1 items
19 17 11 7 5 3 13 2
Variable Number of Buckets
The number of buckets (usually prime) increases,
based on how much data the hash contains:
unordered_set<int> us;
for (int r = 1; r <= 1e6; r*=10) {
    us.reserve(r);
    cout << r << ' ' << us.bucket_count() << '\n';
}
1 2
10 11
100 103
1000 1031
10000 10273
100000 107897
1000000 1056323
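The same growth can be watched during ordinary insertion, without calling reserve(). This sketch records bucket_count() every time it changes; bucket_growth is a made-up helper name, and the actual sequence of counts is implementation-dependent.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Insert 0..n-1 and record bucket_count() each time it changes.
std::vector<std::size_t> bucket_growth(int n) {
    std::unordered_set<int> us;
    std::vector<std::size_t> counts = {us.bucket_count()};
    for (int i = 0; i < n; ++i) {
        us.insert(i);
        if (us.bucket_count() != counts.back())
            counts.push_back(us.bucket_count());   // a rehash just happened
    }
    return counts;
}
```

With the default maximum load factor of 1, the last recorded count should be at least n.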
Load Factor
- A hash table has a load factor: average items per bucket.
- If this gets too large, the hash table rehashes
(allocates more buckets, puts everything in the new proper buckets).
- Of course, any particular bucket may contain many items,
due to a poor hash function, or unlucky data.
- unordered_set::load_factor()
    Returns the current load factor for this hash table, defined as
    unordered_set::size()/unordered_set::bucket_count().
- unordered_set::max_load_factor()
    Returns the maximum load factor tolerated before rehashing.
    Optional argument: change the maximum load factor.
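Here is a hedged sketch of changing the limit. The helper name tighten is made up; max_load_factor(float), rehash(), and load_factor() are the standard member functions. Calling rehash(0) after lowering the limit asks the container to re-bucket right away, so that bucket_count() is at least size()/max_load_factor().

```cpp
#include <unordered_set>

// Lower the maximum load factor, then request a rehash so that the new
// limit takes effect immediately. Returns the resulting load factor.
float tighten(std::unordered_set<int> &s, float new_max) {
    s.max_load_factor(new_max);   // with an argument: set the limit
    s.rehash(0);                  // re-bucket to satisfy the new limit
    return s.load_factor();       // size() / bucket_count()
}
```

For the eight primes used earlier and a limit of 0.25, the table must end up with at least 32 buckets.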
Load Factor Demo
unordered_multiset<double> us;
for (int i=0; i<1e6; i++)
    us.insert(drand48());
cout << us.size() << '\n'
     << us.bucket_count() << '\n'
     << us.load_factor() << '\n'
     << us.max_load_factor() << '\n';
1000000
1447153
0.691012
1
Once we study random numbers, we’ll see better ways
of generating such things.
What are the Hash Values?
The process of hashing is converting any value
(integer, floating-point, vector, set, struct MyData, etc.)
to an unsigned number.
We can find out the hash values, if we care:
cout << hash<int>()(253) << '\n'
     << hash<int>()(-253) << '\n'
     << hash<double>()(253.0) << '\n'
     << hash<float>()(253.0F) << '\n'
     << hash<long>()(253L) << '\n'
     << hash<unsigned>()(253U) << '\n'
     << hash<char>()('a') << '\n'
     << hash<bool>()(true) << '\n'
     << hash<string>()("253") << '\n'
     << hash<string>()("") << '\n'
     << hash<int *>()(new int) << '\n';
253
18446744073709551363
12026514335406308073
3703063408979182286
253
253
97
1
1899958766268164750
6142509188972423790
24306368
Not everything
Not every type is hashable, not even every built-in or standard library type:
cout << hash<ostream>()(cout);
c.cc:1: error: use of deleted function 'std::hash<std::basic_ostream<char> >::hash()'
cout << hash<nullptr_t>()(nullptr);
c.cc:1: error: use of deleted function 'std::hash<std::nullptr_t>::hash()'
(std::hash<nullptr_t> was added in C++17, so this error appears only when compiling for an earlier standard.)
int a[] = {11,22};
cout << hash<int[]>()(a);
c.cc:2: error: use of deleted function 'std::hash<int []>::hash()'
User-defined Types
std::hash doesn’t know how to hash your types:
struct Point { float x, y; } p = {1.2, 3.4};
int main() {
    cout << hash<Point>()(p);
}
c.cc:4: error: use of deleted function 'std::hash<Point>::hash()'
However, it can be taught.
User-defined Types
- Well, fine.
- What does unordered_set need to work with a type?
- a hash functor (to tell which bucket to go into)
- an equality comparison functor (to see if two values are the same)
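Both requirements can be met without touching std at all, by passing the two functors as template arguments. The sketch below assumes the same Point as the other slides; the names PointHash, PointEqual, and PointSet are made up for illustration, and the XOR combination is the one the lecture uses.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>

struct Point { float x, y; };

// Hash functor: tells the container which bucket a Point belongs in.
struct PointHash {
    std::size_t operator()(const Point &p) const {
        return std::hash<float>()(p.x) ^ std::hash<float>()(p.y);
    }
};

// Equality functor: decides whether two Points in one bucket are the same.
struct PointEqual {
    bool operator()(const Point &a, const Point &b) const {
        return a.x == b.x && a.y == b.y;
    }
};

// Both functors are passed as template arguments; nothing is added to std.
using PointSet = std::unordered_set<Point, PointHash, PointEqual>;
```

Inserting the same Point twice into a PointSet leaves one element, because the hash picks the bucket and the equality functor detects the duplicate.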
User-defined Types
We can create a template specialization for std::hash<Point>:
struct Point { float x, y; } p = {1.2, 3.4};
template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
        return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};
int main() {
    cout << hash<Point>()(p);
}
11708950365973905104
User-defined Types
Still fails; unordered_set also needs operator==:
struct Point { float x, y; } p = {1.2, 3.4};
template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
        return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};
int main() {
    unordered_set<Point> us;
    us.insert(p);
}
In file included from /usr/include/c++/8/string:48,
from c.cc:1:
/usr/include/c++/8/bits/stl_function.h: In instantiation of 'constexpr bool std::equal_to<_Tp>::operator()(const _Tp&, const _Tp&) const [with _Tp = Point]':
/usr/include/c++/8/bits/hashtable_policy.h:1460: required from 'static bool std::__detail::_Equal_helper<_Key, _Value, _ExtractKey, _Equal, _HashCodeType, true>::_S_equals(const _Equal&, const _ExtractKey&, const _Key&, _HashCodeType, std::__detail::_Hash_node<_Value, true>*) [with _Key = Point; _Value = Point; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _HashCodeType = long unsigned int]'
/usr/include/c++/8/bits/hashtable_policy.h:1844: required from 'bool std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _H1, _H2, _Hash, _Traits>::_M_equals(const _Key&, std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _H1, _H2, _Hash, _Traits>::__hash_code, std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _H1, _H2, _Hash, _Traits>::__node_type*) const [with _Key = Point; _Value = Point; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _H1 = std::hash<Point>; _H2 = std::__detail::_Mod_range_hashing; _Hash = std::__detail::_Default_ranged_hash; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _H1, _H2, _Hash, _Traits>::__hash_code = long unsigned int; std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _H1, _H2, _Hash, _Traits>::__node_type = std::__detail::_Hash_node<Point, true>]'
/usr/include/c++/8/bits/hashtable.h:1562: required from 'std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__node_base* std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::_M_find_before_node(std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::size_type, const key_type&, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__hash_code) const [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _H1 = std::hash<Point>; _H2 = std::__detail::_Mod_range_hashing; _Hash = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__node_base = std::__detail::_Hash_node_base; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::size_type = long unsigned int; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::key_type = Point; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__hash_code = long unsigned int]'
/usr/include/c++/8/bits/hashtable.h:649: required from 'std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__node_type* std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::_M_find_node(std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::size_type, const key_type&, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__hash_code) const [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _H1 = std::hash<Point>; _H2 = std::__detail::_Mod_range_hashing; _Hash = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__node_type = std::__detail::_Hash_node<Point, true>; typename _Traits::__hash_cached = std::integral_constant<bool, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::size_type = long unsigned int; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::key_type = Point; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__hash_code = long unsigned int]'
/usr/include/c++/8/bits/hashtable.h:1830: required from 'std::pair<typename std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _H1, _H2, _Hash, _Traits>::iterator, bool> std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::_M_insert(_Arg&&, const _NodeGenerator&, std::true_type, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::size_type) [with _Arg = const Point&; _NodeGenerator = std::__detail::_AllocNode<std::allocator<std::__detail::_Hash_node<Point, true> > >; _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _H1 = std::hash<Point>; _H2 = std::__detail::_Mod_range_hashing; _Hash = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; typename std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _H1, _H2, _Hash, _Traits>::iterator = std::__detail::_Node_iterator<Point, true, true>; std::true_type = std::integral_constant<bool, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::size_type = long unsigned int]'
/usr/include/c++/8/bits/hashtable_policy.h:834: required from 'std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__ireturn_type std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::insert(const value_type&) [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _H1 = std::hash<Point>; _H2 = std::__detail::_Mod_range_hashing; _Hash = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__ireturn_type = std::pair<std::__detail::_Node_iterator<Point, true, true>, bool>; std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::value_type = Point]'
/usr/include/c++/8/bits/unordered_set.h:421: required from 'std::pair<typename std::_Hashtable<_Value, _Value, _Alloc, std::__detail::_Identity, _Pred, _Hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<std::__not_<std::__and_<std::__is_fast_hash<_Hash>, std::__is_nothrow_invocable<const _Hash&, const _Tp&> > >::value, true, true> >::iterator, bool> std::unordered_set<_Value, _Hash, _Pred, _Alloc>::insert(const value_type&) [with _Value = Point; _Hash = std::hash<Point>; _Pred = std::equal_to<Point>; _Alloc = std::allocator<Point>; typename std::_Hashtable<_Value, _Value, _Alloc, std::__detail::_Identity, _Pred, _Hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<std::__not_<std::__and_<std::__is_fast_hash<_Hash>, std::__is_nothrow_invocable<const _Hash&, const _Tp&> > >::value, true, true> >::iterator = std::__detail::_Node_iterator<Point, true, true>; std::unordered_set<_Value, _Hash, _Pred, _Alloc>::value_type = Point]'
c.cc:12: required from here
/usr/include/c++/8/bits/stl_function.h:356: error: no match for 'operator==' in
'__x == __y' (operand types are 'const Point' and 'const Point')
User-defined Types
Now, unordered_set works with a Point:
struct Point { float x, y; } p = {1.2, 3.4};
template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
        return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};
bool operator==(const Point &a, const Point &b) {
    return a.x==b.x && a.y==b.y;
}
int main() {
    unordered_set<Point> us;
    us.insert(p);
}
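XOR is easy to write, but it is a weak way to combine member hashes: any Point with x == y hashes to 0, and (a, b) always collides with (b, a). The sketch below shows one order-sensitive alternative. It is not part of the lecture; it borrows the mixing step popularized by Boost's hash_combine, and the names xor_combine, mix_combine, and MixedPointHash are made up for illustration.

```cpp
#include <cstddef>
#include <functional>

// The lecture's approach: XOR the two member hashes.
std::size_t xor_combine(std::size_t a, std::size_t b) {
    return a ^ b;
}

// An order-sensitive alternative. The constant and shifts follow the
// mixing step used by Boost's hash_combine.
std::size_t mix_combine(std::size_t a, std::size_t b) {
    std::size_t seed = 0;
    seed ^= a + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    seed ^= b + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
}

struct Point { float x, y; };

// A hash functor for Point built on mix_combine.
struct MixedPointHash {
    std::size_t operator()(const Point &p) const {
        return mix_combine(std::hash<float>()(p.x), std::hash<float>()(p.y));
    }
};
```

MixedPointHash could be passed as the second template argument of unordered_set, or its body could replace the XOR line in the std::hash<Point> specialization.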
The Rules
- Usually, messing around in the std:: namespace is forbidden.
- However, you may specialize standard class templates, such as
std::hash, for your own types.