CS253: Software Development with C++

Spring 2022

Hashing

Leonardo da Vinci’s Mona Lisa and John the Baptist

Inclusion

To use unordered_set or unordered_multiset, you need to:

    
#include <unordered_set>

To use unordered_map or unordered_multimap, you need to:

    
#include <unordered_map>

To use the class hash:

    
#include <functional>

Hashing in General

To hash an object:

Typical Hash Table

A hash table starts like this, an array of seven (for instance) pointers, all initially null.

         0     1     2     3     4     5     6
      ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
      │  ●  │  ●  │  ●  │  ●  │  ●  │  ●  │  ●  │
      └─────┴─────┴─────┴─────┴─────┴─────┴─────┘

Typical Hash Table

After adding "animal" and "vegetable":

         0     1     2     3     4     5     6
      ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
      │  ●  │     │  ●  │     │  ●  │  ●  │  ●  │
      └─────┴──┼──┴─────┴──┼──┴─────┴─────┴─────┘
               │           │
               ∨           ∨
         ┌────────┐     ┌───────────┐
         │ animal │     │ vegetable │
         └────────┘     └───────────┘

Typical Hash Table

After adding "mineral":

         0     1     2     3     4     5     6
      ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
      │  ●  │     │  ●  │     │  ●  │  ●  │  ●  │
      └─────┴──┼──┴─────┴──┼──┴─────┴─────┴─────┘
               │           │
               ∨           ∨
         ┌────────┐     ┌─────────┐   ┌───────────┐
         │ animal │     │ mineral │──>│ vegetable │
         └────────┘     └─────────┘   └───────────┘

Expanding the Table

So What?

Hashing in C++

unordered_set<int> p = {2, 3, 5, 7, 11, 13, 17, 19};
for (auto n : p)
    cout << n << ' ';
19 17 13 11 7 5 3 2 

I Care

OK, let’s say that we care. We can find out:

unordered_set<int> p = {2, 3, 5, 7, 11, 13, 17, 19};
cout << "Buckets: " << p.bucket_count() << '\n'
     << "Size: " << p.size() << '\n'
     << "Load: " << p.load_factor() << " of "
     << p.max_load_factor() << '\n';
for (size_t b = 0; b<p.bucket_count(); b++)
    if (p.bucket_size(b))
        cout << "Bucket " << b << ": "
             << p.bucket_size(b) << " items\n";
for (auto n : p)
    cout << n << ' ';
Buckets: 13
Size: 8
Load: 0.615385 of 1
Bucket 0: 1 items
Bucket 2: 1 items
Bucket 3: 1 items
Bucket 4: 1 items
Bucket 5: 1 items
Bucket 6: 1 items
Bucket 7: 1 items
Bucket 11: 1 items
19 17 13 11 7 5 3 2 

Variable Number of Buckets

The number of buckets (usually prime) increases, based on how much data the hash table is asked to hold:

unordered_set<int> us;
for (int r = 1; r <= 1e6; r*=10) {
    us.reserve(r);
    cout << setw(8) << r << ' '
         << setw(8) << us.bucket_count() << '\n';
}
       1        2
      10       11
     100      103
    1000     1031
   10000    10273
  100000   107897
 1000000  1056323

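The unordered_set::reserve(n) method asks for enough buckets to hold at least n elements without exceeding the maximum load factor. With the default maximum load factor of 1, that means at least n buckets, but the implementation is free to allocate more.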

Load Factor

unordered_set::load_factor()
Returns the current load factor for this hash table, defined as unordered_set::size()/unordered_set::bucket_count().
unordered_set::max_load_factor()
Returns/sets maximum load factor tolerated before rehashing.

Load Factor Demo

unordered_multiset<double> us;
for (int i=0; i<1e6; i++)
    us.insert(drand48());
cout << us.size()            << '\n'
     << us.bucket_count()    << '\n'
     << us.load_factor()     << '\n'
     << us.max_load_factor() << '\n';
1000000
1447153
0.691012
1

Real time: 550 ms

unordered_multiset<double> us;
us.max_load_factor(10);
for (int i=0; i<1e6; i++)
    us.insert(drand48());
cout << us.size()            << '\n'
     << us.bucket_count()    << '\n'
     << us.load_factor()     << '\n'
     << us.max_load_factor() << '\n';
1000000
126271
7.91947
10

Real time: 1.06 seconds

Once we study random numbers, we’ll see better ways of generating such things.

What are the Hash Values?

Hashing converts any value (integer, floating-point, vector, set, struct MyData, etc.) to an unsigned number (a size_t), as uniquely as we can.

We can find out the hash values, if we care:

cout << hex << setfill('0')
     << setw(16) << hash<int>()(253)       << '\n'
     << setw(16) << hash<int>()(-253)      << '\n'
     << setw(16) << hash<double>()(253.0)  << '\n'
     << setw(16) << hash<float>()(253.0F)  << '\n'
     << setw(16) << hash<long>()(253L)     << '\n'
     << setw(16) << hash<unsigned>()(253U) << '\n'
     << setw(16) << hash<char>()('a')      << '\n'
     << setw(16) << hash<bool>()(true)     << '\n'
     << setw(16) << hash<string>()("253")  << '\n'
     << setw(16) << hash<string>()("")     << '\n'
     << setw(16) << hash<int *>()(new int) << '\n';
00000000000000fd
ffffffffffffff03
a6e6c311a0093ae9
3363ec8d00f382ce
00000000000000fd
00000000000000fd
0000000000000061
0000000000000001
1a5e026e774daa8e
553e93901e462a6e
0000000000eb42c0

Not everything

Not all standard types are hashable:

cout << hash<ostream>()(cout) << '\n';  // 🦡
c.cc:1: error: use of deleted function ‘std::hash<std::basic_ostream<char> 
   >::hash()’
int a[] = {11,22};
cout << hash<int[]>()(a) << '\n';  // 🦡
c.cc:2: error: use of deleted function ‘std::hash<int []>::hash()’
cout << hash<nullptr_t>()(nullptr) << '\n';
0

User-defined Types

std::hash doesn’t know how to hash your types:

struct Point { float x, y; } p = {1.2, 3.4};

int main() {
    cout << hash<Point>()(p);  // 🦡
}
c.cc:4: error: use of deleted function ‘std::hash<Point>::hash()’

However, it can be taught.

User-defined Types

We can create a template specialization for std::hash<Point>:

struct Point { float x, y; } p = {1.2, 3.4};

template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
       return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};

int main() {
    cout << hash<Point>()(p);
}
11708950365973905104

User-defined Types

Using a Point in an unordered_set still fails; the container also needs == to compare keys:

struct Point { float x, y; } p = {1.2, 3.4};

template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
       return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};

int main() {
    unordered_set<Point> us;
    us.insert(p);  // 🦡
}
In file included from /usr/local/gcc/11.2.0/include/c++/11.2.0/string:48,
                 from /usr/local/gcc/11.2.0/include/c++/11.2.0/bits/locale_classes.h:40,
                 from /usr/local/gcc/11.2.0/include/c++/11.2.0/bits/ios_base.h:41,
                 from /usr/local/gcc/11.2.0/include/c++/11.2.0/ios:42,
                 from /s/bach/a/class/cs000/public_html/pmwiki/cookbook/c++-includes.h:5,
                 from <command-line>:
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/stl_function.h: In instantiation of ‘constexpr bool std::equal_to<_Tp>::operator()(const _Tp&, const _Tp&) const [with _Tp = Point]’:
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable_policy.h:1614:   required from ‘bool std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::_M_equals(const _Key&, std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::__hash_code, const std::__detail::_Hash_node_value<_Value, typename _Traits::__hash_cached::value>&) const [with _Key = Point; _Value = Point; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::__hash_code = long unsigned int; typename _Traits::__hash_cached = std::__detail::_Hashtable_traits<true, true, true>::__hash_cached]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:1819:   required from ‘std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_base_ptr std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_find_before_node(std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type, const key_type&, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code) const [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_base_ptr = std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<Point, true> > >::__node_base*; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type = long unsigned int; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::key_type = Point; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code = long unsigned int]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:793:   required from ‘std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_ptr std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_find_node(std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type, const key_type&, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code) const [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_ptr = std::allocator<std::__detail::_Hash_node<Point, true> >::value_type*; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type = long unsigned int; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::key_type = Point; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code = long unsigned int]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:2084:   required from ‘std::pair<typename std::__detail::_Insert<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::iterator, bool> std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_insert(_Arg&&, const _NodeGenerator&, std::true_type) [with _Arg = const Point&; _NodeGenerator = std::__detail::_AllocNode<std::allocator<std::__detail::_Hash_node<Point, true> > >; _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; typename std::__detail::_Insert<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::iterator = std::__detail::_Insert_base<Point, Point, std::allocator<Point>, std::__detail::_Identity, std::equal_to<Point>, std::hash<Point>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::iterator; typename _Traits::__constant_iterators = std::__detail::_Hashtable_traits<true, true, true>::__constant_iterators; std::true_type = std::integral_constant<bool, true>]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable_policy.h:843:   required from ‘std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__ireturn_type std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::insert(const value_type&) [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__ireturn_type = std::pair<std::__detail::_Node_iterator<Point, true, true>, bool>; std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::value_type = Point]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/unordered_set.h:422:   required from ‘std::pair<typename std::_Hashtable<_Value, _Value, _Alloc, std::__detail::_Identity, _Pred, _Hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<std::__not_<std::__and_<std::__is_fast_hash<_Hash>, std::__is_nothrow_invocable<const _Hash&, const _Tp&> > >::value, true, true> >::iterator, bool> std::unordered_set<_Value, _Hash, _Pred, _Alloc>::insert(const value_type&) [with _Value = Point; _Hash = std::hash<Point>; _Pred = std::equal_to<Point>; _Alloc = std::allocator<Point>; typename std::_Hashtable<_Value, _Value, _Alloc, std::__detail::_Identity, _Pred, _Hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<std::__not_<std::__and_<std::__is_fast_hash<_Hash>, std::__is_nothrow_invocable<const _Hash&, const _Tp&> > >::value, true, true> >::iterator = std::__detail::_Insert_base<Point, Point, std::allocator<Point>, std::__detail::_Identity, std::equal_to<Point>, std::hash<Point>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::iterator; std::unordered_set<_Value, _Hash, _Pred, _Alloc>::value_type = Point]’
c.cc:12:   required from here
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/stl_function.h:356: error: no 
   match for ‘operator==’ in ‘__x == __y’ (operand types are ‘const 
   Point’ and ‘const Point’)

User-defined Types

Now, unordered_set works with a Point:

struct Point { float x, y; } p = {1.2, 3.4};

template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
       return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};

bool operator==(const Point &a, const Point &b) {
    return a.x==b.x && a.y==b.y;
}
// or could’ve specialized std::equal_to<Point>

int main() {
    unordered_set<Point> us;
    us.insert(p);
}

The Rules