Hash Functions: the modulo prime myth?

calambrac · on June 10, 2009

For a uniformly distributed input set, it doesn't matter how many buckets you have. When you take your modulus, you're going to mix well, regardless of whether its prime or not. This is what the article (accidentally, apparently) confirms.

The win of using a prime modulus comes when you have a non-uniformly distributed input set (though of course you could craft an especially evil input set). Mod prime is just more likely to produce a more uniform mix across a larger set of input distributions.

Since we're talking about a general-purpose hash table, we can't assume we know anything about our inputs (i.e., that they're uniformly distributed), so we use a prime number of buckets just to up the odds that we come out ahead.

edw519 · on June 10, 2009

"For a uniformly distributed input set..."

A little off-subject, but another good reason to make your primary ids sequential whenever possible. This issue just goes away.

barrkel · on June 10, 2009

The article writer (Nathan) used uniformly distributed cryptographically secure random numbers as his test data. Any decent hash for integers that doesn't throw bits away is going to produce a nice even distribution for that kind of sequence no matter what the divisor.

The problem with hash tables is bad hash functions, or hash functions which perform badly on certain data, where "perform badly" means recurrence of certain patterns, usually "latent periods", i.e. too many function results that are divisible by some number or one of some set of numbers. If those numbers have factors in common with the bucket count, then there's going to be trouble.

For example, a poor hash function might turn up even numbers twice as often as odd numbers; if the bucket count is also even, then the even buckets will have 50% more collisions than the odd buckets, on average. A prime number of buckets guarantees that there's only one factor that can clash, rather than a bunch of them.

codexon · on June 10, 2009

Hello, this is the author. (btw Nathan is a commenter, not me)

It was exactly my conclusion that the hash function and data is more important than using mod prime.

Try to think of it this way. The ideal hash function takes your not so random data and outputs random data.

I = low-medium entropy data

H(I) = high entropy output equivalent to random numbers

bucket = (H(I) or random number) % m <--- experiment is here.

Now to simulate this, I used the previous model, as it is the definition of a good hash function. If you have repetitive data in, and repetitive data out, no modulus is going to help you.

So yes, the conclusion of your comment and mine are the same: Worry about your data and hash function. But many people came out of the woodwork trying to poke holes in the experiment.

barrkel · on June 10, 2009

Except for the last bit - a prime modulus ameliorates weaknesses in the hash function.

Since most hash tables are written to support user-supplied hash functions, and users need to write hash functions to support value-based equality when storing values of custom types in hash tables, it makes a lot of sense for the hash table implementer to take this into account.

Finally, there's a trade-off: sometimes a fast hash function with some weaknesses, but using a prime modulus, can be a better choice than a slower, more thorough hash function with e.g. a power of two modulus (bitwise and).

calambrac · on June 10, 2009

If you have repetitive data in, and repetitive data out, no modulus is going to help you.

Um, no. A prime modulus will often help in this case. That's the whole point.

codexon · on June 10, 2009

Do you have any statistical evidence?

calambrac · on June 10, 2009

In your test code, change your loop to:

    for r in [3, 5, 17, 257]:
        for i in xrange(n):
            num = r * i
            table_p[num%prime] += 1
            table_c[num%composite] += 1

On here and on your blog, you keep falling back to the observation that modulo prime makes no difference when you have a uniform distribution of inputs, like you get after you apply a good hash function.

Nobody who is telling you that you're wrong would dispute that. The fundamental point that you keep ignoring is that doesn't matter, because as a library writer providing a hashtable, you don't control the hash function. Having a simple way to mitigate the ill effects of users choosing bad hash functions is a good thing. Why do you keep ignoring this point?

lallysingh · on June 10, 2009

Quick point: the reason to use primes is to avoid a nasty interaction between a common fault of hash functions and the modulus. Specifically, if the hash function puts out a lot (e.g. more than appropriate for a uniform distribution) of hash values that have a common factor F, and if F also a factor of your modulus, then you've divided your number of buckets by F for those many hash values.

It seems a little obscure until we look at common factors like 2 or 4. It's pretty easy to write a hash function, that when combined with its input (say ASCII text, or addresses returned by malloc()), puts out a lot of values that divide by a power of two.

To avoid it, without doing any analysis of your hash function or input, a prime factor adds safety. Thanks to this: http://primes.utm.edu/notes/proofs/Theorem2.html you know that if n is prime, then so's 2^n-1. So, for a prime n, a cheap prime modulus for computers is to do a bitwise and with ((1 << n) -1).

vomjom · on June 11, 2009

I'm a little late to the conversation, but I'll explain the reasoning. You don't want the hash and the bucket size to have a common divisor. A prime bucket size is the best way to ensure that.

For example, let's suppose you have a bucket size of 8 and a flawed hash function that produces multiples of 12 more often than other hashes (it can even be only slightly more common, but will still have an effect). Then the 0 and 4 spots will be used more often than any other.

Obviously, if you have a super awesome hash function, then it isn't a problem. Most of the time, making a hash function with good distributive properties is hard and makes the hash function slower.

There's also a simple universal hashing solution using prime bucket sizes:

http://en.wikipedia.org/wiki/Universal_hashing

http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Compute...

sfphotoarts · on June 10, 2009

very interesting, and worth reading all the way up to and including the comments. Nathan appears (in the comments) to counter the conclusions, but merely sketches out why, rather than give anything concrete.

Then again, it would seem odd that Knuth was not only wrong, but that nobody in 30 years of CS has spotted the error.

I'm going to continue to use primes I think, just in case.

kylec · on June 10, 2009

It's been a few years since abstract algebra, but might it have something to do with finite fields? I seem to recall that modulo primes held special significance (i.e. you can't have a finite field size other than a power of a prime).

stratomorph · on June 10, 2009

I don't understand what that graph is showing. Which variable is independent, or supposed to be? It can't be buckets as a function of collisions. At first I thought he had his axes reversed (from the customary y as a function of x), but it's not a graph of collisions as a function of buckets, either, because reading it that way implies that a very small number of buckets produces a small number of collisions, which is obviously false with a hundred million inputs being distributed over ten or less buckets.

It looks like a graph of the moral equivalent of 1=1, by numeric methods. What am I supposed to see there?

calambrac · on June 10, 2009

It's a histogram. It's counting the number of buckets (y axis) for each number of collisions (x axis). He's showing that the distribution is the same for both methods; if one had a clear advantage, it should have a relatively more uniform distribution.

The problem is that the input distribution is uniform, so you'd expect a pretty decent distribution of collisions regardless of what your modulus was. A prime modulus only helps for non-uniformly distributed inputs.

tybris · on June 10, 2009

You use a prime to be able to reach all slots in open addressing if you use a linear probe bigger than 1. Can't think of any other reason.

illumen · on June 10, 2009

Use a critbits/radix tree instead of hash tables. They are just better.

http://cr.yp.to/critbit.html http://en.wikipedia.org/wiki/Radix_tree