CS253 Hashing
Leonardo da Vinci’s Mona Lisa and John the Baptist
Inclusion
To use unordered_set or unordered_multiset, you need to:
#include <unordered_set>
To use unordered_map or unordered_multimap, you need to:
#include <unordered_map>
To use the class hash:
#include <functional>
Hashing in General
To hash an object:
- Combine the bits of the object into a single number,
the hash value.
- The bits that make up the real value, e.g., in a string,
the chars, not the pointer.
- Use that number (mod N) as an index into an array
of N buckets.
- Each bucket is a collection of data with the same hash value.
- If N is large enough, each bucket only contains a few values.
- A good hash adjusts the number of buckets dynamically.
- It can take a lot of space, but it’s fast: O(1) to index
to the bucket, then an O(n) search in the bucket,
where n is the number of items in that bucket.
A good hash will keep the bucket size small.
Typical Hash Table
A hash table starts like this, an array of seven (for instance)
pointers, all initially null.
   0     1     2     3     4     5     6
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│  ●  │  ●  │  ●  │  ●  │  ●  │  ●  │  ●  │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┘
- Why seven? A prime number gives us the best chance of
spreading out input data with a pattern.
- If our array size were even, and data were all multiples of 10,
then half of our buckets would be unused.
- Seven is ludicrously small for actual code, but fits on a screen.
A real hash table might have thousands of buckets.
Typical Hash Table
After adding "animal" and "vegetable":
   0     1     2     3     4     5     6
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│  ●  │     │  ●  │     │  ●  │  ●  │  ●  │
└─────┴──┼──┴─────┴──┼──┴─────┴─────┴─────┘
         │           │
         ∨           ∨
    ┌────────┐ ┌───────────┐
    │ animal │ │ vegetable │
    └────────┘ └───────────┘
- "animal" hashed to 22, which is 1 (mod 7), so it’s in bucket 1.
- "vegetable" hashed to 9823439, which is 3 (mod 7), so it’s in bucket 3.
- We’re not delving into the details of the
string ⇒ unsigned hash function being used here.
Typical Hash Table
After adding "mineral":
   0     1     2     3     4     5     6
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│  ●  │     │  ●  │     │  ●  │  ●  │  ●  │
└─────┴──┼──┴─────┴──┼──┴─────┴─────┴─────┘
         │           │
         ∨           ∨
    ┌────────┐ ┌─────────┐   ┌───────────┐
    │ animal │ │ mineral │──>│ vegetable │
    └────────┘ └─────────┘   └───────────┘
- "mineral" hashed to 3671, which is 3 (mod 7), so it also goes into bucket 3.
- Since that bucket was non-empty, it was added to the list
for that bucket.
- It doesn’t matter where in the linked list we add the new item,
so we added it at the start, which is easy.
Typical Hash Table
   0     1     2     3     4     5     6
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│  ●  │     │  ●  │     │  ●  │  ●  │  ●  │
└─────┴──┼──┴─────┴──┼──┴─────┴─────┴─────┘
         │           │
         ∨           ∨
    ┌────────┐ ┌─────────┐   ┌───────────┐
    │ animal │ │ mineral │──>│ vegetable │
    └────────┘ └─────────┘   └───────────┘
- To traverse the table:
      for each pointer in the array:
          for each node in that linked list:
              process that item
- The input order (animal/vegetable/mineral) may not resemble the
output order (animal/mineral/vegetable).
Expanding the Table
- Of course, if our seven-pointer table gets too many items, then the
linked lists will get too long for an efficient linear search.
- When that happens, rehash : expand the table to seventeen
(another prime) pointers and rearrange everything.
- Prime numbers are practical! Who’d’ve thought?
- Increasing the table by big jumps makes rehashing occur less often,
but wastes more space. It’s a trade-off!
- The scheme of roughly doubling the container size should remind
you of vector’s memory allocation technique.
So What?
- This is all very nice, and good for several assignments and a quiz
in Data Structures.
- It’s tricky, and easy to get wrong:
- What’s a good hash function for strings?
- What if a list is empty?
- How do we rehash without completely duplicating all the data?
- How do we traverse the container?
- How do we compute the next larger prime number?
- Fortunately, the C++ unordered_set, unordered_multiset,
unordered_map, and unordered_multimap containers do all of the
heavy lifting for you.
- Their semantics are similar to set, multiset,
map, and multimap, except for ordering.
Hashing in C++
unordered_set<int> p = {2, 3, 5, 7, 11, 13, 17, 19};
for (auto n : p)
    cout << n << ' ';
19 17 13 11 7 5 3 2
- How many buckets were used? Who cares?
- What was the hash function used? Who cares!
- When does it rehash? Who cares?
- These all have default implementation-dependent answers,
which can be queried & changed:
- Might set a large initial number of buckets, if you know that
lots of data is coming.
- Your data might not hash well with the default hash function,
so you write your own.
I Care
OK, let’s say that we care. We can find out:
unordered_set<int> p = {2, 3, 5, 7, 11, 13, 17, 19};
cout << "Buckets: " << p.bucket_count() << '\n'
     << "Size: " << p.size() << '\n'
     << "Load: " << p.load_factor() << " of "
     << p.max_load_factor() << '\n';
for (size_t b = 0; b<p.bucket_count(); b++)
    if (p.bucket_size(b))
        cout << "Bucket " << b << ": "
             << p.bucket_size(b) << " items\n";
for (auto n : p)
    cout << n << ' ';
Buckets: 13
Size: 8
Load: 0.615385 of 1
Bucket 0: 1 items
Bucket 2: 1 items
Bucket 3: 1 items
Bucket 4: 1 items
Bucket 5: 1 items
Bucket 6: 1 items
Bucket 7: 1 items
Bucket 11: 1 items
19 17 13 11 7 5 3 2
Variable Number of Buckets
The number of buckets (usually prime) increases,
based on how much data the hash contains:
unordered_set<int> us;
for (int r = 1; r <= 1e6; r*=10) {
    us.reserve(r);
    cout << setw(8) << r << ' '
         << setw(8) << us.bucket_count() << '\n';
}
       1        2
      10       11
     100      103
    1000     1031
   10000    10273
  100000   107897
 1000000  1056323
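The unordered_set::reserve(n) method asks for enough buckets to hold
at least n items without exceeding the maximum load factor (with the
default maximum of 1, that means at least n buckets), but the
implementation is free to allocate more.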
Load Factor
- A hash table has a load factor,
defined as the average number of items per bucket.
- If this gets too large, the hash table rehashes
(allocates more buckets, puts everything in the new proper buckets).
- Any bucket may contain many items, due to a poor hash function,
or unlucky data.
- unordered_set::load_factor():
  Returns the current load factor for this hash table, defined as
  unordered_set::size()/unordered_set::bucket_count().
- unordered_set::max_load_factor():
  Returns/sets the maximum load factor tolerated before rehashing.
Load Factor Demo
unordered_multiset<double> us;
for (int i=0; i<1e6; i++)
    us.insert(drand48());
cout << us.size() << '\n'
     << us.bucket_count() << '\n'
     << us.load_factor() << '\n'
     << us.max_load_factor() << '\n';
1000000
1447153
0.691012
1
Real time: 626 ms
unordered_multiset<double> us;
us.max_load_factor(10);
for (int i=0; i<1e6; i++)
    us.insert(drand48());
cout << us.size() << '\n'
     << us.bucket_count() << '\n'
     << us.load_factor() << '\n'
     << us.max_load_factor() << '\n';
1000000
126271
7.91947
10
Real time: 1.25 seconds
Once we study random numbers, we’ll see better ways
of generating such things.
What are the Hash Values?
The process of hashing is converting any value
(integer, floating-point, vector, set, struct MyData, etc.)
to an unsigned number, as uniquely as we can.
We can find out the hash values, if we care:
cout << hex << setfill('0')
     << setw(16) << hash<int>()(253) << '\n'
     << setw(16) << hash<int>()(-253) << '\n'
     << setw(16) << hash<double>()(253.0) << '\n'
     << setw(16) << hash<float>()(253.0F) << '\n'
     << setw(16) << hash<long>()(253L) << '\n'
     << setw(16) << hash<unsigned>()(253U) << '\n'
     << setw(16) << hash<char>()('a') << '\n'
     << setw(16) << hash<bool>()(true) << '\n'
     << setw(16) << hash<string>()("253") << '\n'
     << setw(16) << hash<string>()("") << '\n'
     << setw(16) << hash<int *>()(new int) << '\n';
00000000000000fd
ffffffffffffff03
a6e6c311a0093ae9
3363ec8d00f382ce
00000000000000fd
00000000000000fd
0000000000000061
0000000000000001
1a5e026e774daa8e
553e93901e462a6e
00000000015942c0
Not everything
Not all standard types are hashable:
cout << hash<ostream>()(cout) << '\n'; // 🦡
c.cc:1: error: use of deleted function ‘std::hash<std::basic_ostream<char> >::hash()’
int a[] = {11,22};
cout << hash<int[]>()(a) << '\n'; // 🦡
c.cc:2: error: use of deleted function ‘std::hash<int []>::hash()’
cout << hash<nullptr_t>()(nullptr) << '\n';
0
User-defined Types
It doesn’t know how to hash your types:
struct Point { float x, y; } p = {1.2, 3.4};
int main() {
    cout << hash<Point>()(p); // 🦡
}
c.cc:4: error: use of deleted function ‘std::hash<Point>::hash()’
However, it can be taught.
User-defined Types
- Well, fine.
- What does unordered_set need to work with a type?
- a hash functor (to tell which bucket to go into)
- an equality comparison functor (to see if two values are the same)
User-defined Types
We can create a template specialization for std::hash<Point>:
struct Point { float x, y; } p = {1.2, 3.4};
template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
        return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};
int main() {
    cout << hash<Point>()(p);
}
11708950365973905104
User-defined Types
Still fails; needs ==:
struct Point { float x, y; } p = {1.2, 3.4};
template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
        return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};
int main() {
    unordered_set<Point> us;
    us.insert(p); // 🦡
}
In file included from /usr/local/gcc/11.2.0/include/c++/11.2.0/string:48,
from /usr/local/gcc/11.2.0/include/c++/11.2.0/bits/locale_classes.h:40,
from /usr/local/gcc/11.2.0/include/c++/11.2.0/bits/ios_base.h:41,
from /usr/local/gcc/11.2.0/include/c++/11.2.0/ios:42,
from /s/bach/a/class/cs000/public_html/pmwiki/cookbook/c++-includes.h:5,
from <command-line>:
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/stl_function.h: In instantiation of ‘constexpr bool std::equal_to<_Tp>::operator()(const _Tp&, const _Tp&) const [with _Tp = Point]’:
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable_policy.h:1614: required from ‘bool std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::_M_equals(const _Key&, std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::__hash_code, const std::__detail::_Hash_node_value<_Value, typename _Traits::__hash_cached::value>&) const [with _Key = Point; _Value = Point; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::__hash_code = long unsigned int; typename _Traits::__hash_cached = std::__detail::_Hashtable_traits<true, true, true>::__hash_cached]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:1819: required from ‘std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_base_ptr std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_find_before_node(std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type, const key_type&, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code) const [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_base_ptr = std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<Point, true> > >::__node_base*; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type = long unsigned int; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::key_type = Point; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code = long unsigned int]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:793: required from ‘std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_ptr std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_find_node(std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type, const key_type&, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code) const [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_ptr = std::allocator<std::__detail::_Hash_node<Point, true> >::value_type*; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type = long unsigned int; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::key_type = Point; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code = long unsigned int]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:2084: required from ‘std::pair<typename std::__detail::_Insert<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::iterator, bool> std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_insert(_Arg&&, const _NodeGenerator&, std::true_type) [with _Arg = const Point&; _NodeGenerator = std::__detail::_AllocNode<std::allocator<std::__detail::_Hash_node<Point, true> > >; _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; typename std::__detail::_Insert<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::iterator = std::__detail::_Insert_base<Point, Point, std::allocator<Point>, std::__detail::_Identity, std::equal_to<Point>, std::hash<Point>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::iterator; typename _Traits::__constant_iterators = std::__detail::_Hashtable_traits<true, true, true>::__constant_iterators; std::true_type = std::integral_constant<bool, true>]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable_policy.h:843: required from ‘std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__ireturn_type std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::insert(const value_type&) [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__ireturn_type = std::pair<std::__detail::_Node_iterator<Point, true, true>, bool>; std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::value_type = Point]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/unordered_set.h:422: required from ‘std::pair<typename std::_Hashtable<_Value, _Value, _Alloc, std::__detail::_Identity, _Pred, _Hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<std::__not_<std::__and_<std::__is_fast_hash<_Hash>, std::__is_nothrow_invocable<const _Hash&, const _Tp&> > >::value, true, true> >::iterator, bool> std::unordered_set<_Value, _Hash, _Pred, _Alloc>::insert(const value_type&) [with _Value = Point; _Hash = std::hash<Point>; _Pred = std::equal_to<Point>; _Alloc = std::allocator<Point>; typename std::_Hashtable<_Value, _Value, _Alloc, std::__detail::_Identity, _Pred, _Hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<std::__not_<std::__and_<std::__is_fast_hash<_Hash>, std::__is_nothrow_invocable<const _Hash&, const _Tp&> > >::value, true, true> >::iterator = std::__detail::_Insert_base<Point, Point, std::allocator<Point>, std::__detail::_Identity, std::equal_to<Point>, std::hash<Point>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::iterator; std::unordered_set<_Value, _Hash, _Pred, _Alloc>::value_type = Point]’
c.cc:12: required from here
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/stl_function.h:356: error: no match for ‘operator==’ in ‘__x == __y’ (operand types are ‘const Point’ and ‘const Point’)
User-defined Types
Now, unordered_set works with a Point:
struct Point { float x, y; } p = {1.2, 3.4};
template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
        return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};
bool operator==(const Point &a, const Point &b) {
    return a.x==b.x && a.y==b.y;
}
// or could’ve specialized std::equal_to<Point>
int main() {
    unordered_set<Point> us;
    us.insert(p);
}
The Rules
- Usually, messing around in the std:: namespace is forbidden.
- However, you may specialize templates in the std:: namespace
for your own types.