Primality Testing

WARNING: All over the internet, you'll find example code that determines if a number is prime. Nearly all the examples I've seen loop through divisors while "i×i ≤ n" (i.e., while the square of the next divisor to be tested is less than or equal to the number). Squaring the divisor is fine if you're using arbitrary-precision arithmetic, but it can fail at the high end if you're using fixed-size integers. Take unsigned 32-bit numbers for example:

       ~65535.9^2 = 4,294,967,295 (maximum unsigned 32-bit number)
        65535.0^2 = 4,294,836,225

There are 5,853 prime numbers between 65535.0² and ~65535.9². When checking the next divisor after 65,535 for any of these primes, the i×i computation will overflow, leading to meaningless, possibly infinite loops. This is discussed in more detail in the "Limit Checking" section below.

NOTE: This prime number package is GROSSLY OVERENGINEERED. I began with an optimization that skipped testing potential divisors divisible by 2 or by 3 in a small, tight loop. I then got to wondering about skipping potential divisors divisible by an even wider range of prime divisors; e.g., skip divisors divisible by 2, 3, 5, 7, or 11. I generalized the code to handle an arbitrary range, [2..], of prime divisors. When skipping divisors divisible by 2 or by 3, the generalized code is virtually as fast as the hard-coded loop.

NOTE: I was REINVENTING THE WHEEL. I basically had and have a grade-school knowledge of prime numbers. After writing the code, I looked up on the internet fast ways of determining primes and I learned that "my" optimization was widely known (see the mathematics section below). So perhaps this was all a waste of time, but the open-ended exploration of the subject on my own was very intellectually satisfying for me.

NOTE: If a SMALL MEMORY FOOTPRINT is important, this package can be compiled to produce compact object code that uses the original hard-coded loop and omits the sequence-of-differences functions and data related to the generalized version of the code. To accomplish this, add

    -DNBR_SKIP_2_3_HARDCODED=1

to the compilation command for "nbr_util.c". If an application only calls nbrIsPrime() and/or nbrNextPrime(), its code does not need to be compiled with the CPP definition. If, however, the application calls any of the sequence-of-differences functions, the application should be compiled with the CPP definition. Macros in "nbr_util.h" substitute meaningful return values for the omitted functions.

NOTE: To jog your memory, if necessary, a NUMBER IS PRIME if it's positive and if its only divisors are 1 and itself; zero and one by definition are not prime. To determine if a number n is prime, you basically test if the number is evenly divisible (remainder 0) by any divisor in the range 2..√n. If the number is evenly divisible by some divisor, then the number is not prime. If none of the divisors evenly divide the number, then the number is prime.

I wrote my HASH_UTIL hash table package in the late 1980s, as I recall. It has long been a nuisance that a program using the package has had to be explicitly linked to the C math library ("-lm"). So, about 30 years later, I decided to fix that! Thinking that the math library was required because of the use of sqrt(3) in (i) determining the next prime number larger than the expected table size and (ii) computing the standard deviation of the bucket lengths, I dove into the code. My memory hadn't served me well, because I discovered that the prime number function was not using sqrt(), but was instead iterating until:

I don't know if I was considering the possibility of overflow at the time I wrote the hash table package, but it was a fortuitous choice for the loop test.

So far, so good, but I was red-faced to see that for 3 decades my code had been checking each and every possible divisor from 2 on up: 2, 3, 4, 5, 6, 7, ... Skipping all the even numbers after 2 was a painfully obvious optimization. And then I began thinking about also skipping odd numbers divisible by 3: check divisors 5, 7, ... skip 9 ..., 11, 13, ... skip 15 ..., 17, 19, etc. I noticed the differences between divisors to be tested was 2, 4, 2, 4, etc. In other words, a repeating cycle of two divisors. I implemented a hard-coded loop ("divisible" means evenly divisible with a remainder of 0):

which checks the following divisors: [2, 3,] 5, 7, 11, 13, 17, 19, 23, 25, 29, 31, 35, 37, ... Skipping odd divisors divisible by 3 is almost twice as fast as checking all of the odd numbers.

I was pleased by my independent "discovery" of this heretofore unthought-of optimization (/snark) and my thoughts naturally turned to skipping even more divisors.

Terminology

Skip-Through Prime (STP) - is the maximum prime number for which divisors divisible by one or more of primes 2 through STP are skipped. In the pseudocode above, the skip-through prime is 3. As another example, if the skip-through prime is 11, nbrIsPrime() will skip divisors divisible by any of the primes 2 through 11: 2, 3, 5, 7, 11.

Start Divisor - is the divisor at which to begin the first cycle of testing. This is always the next prime after the skip-through prime. Using the examples above, STP 3's start divisor is 5 and STP 11's start divisor is 13.

Sequence of Differences - is an array of the differences between successive divisors to be tested. The differences skip over divisors that do not need to be tested; i.e., divisors evenly divisible by primes 2 through the skip-through prime.

Beginning with the start divisor, nbrIsPrime() tests if the target number is evenly divisible by the start divisor. If so, the number is not prime and nbrIsPrime() returns. If not, the first difference in the sequence is added to the start divisor to get the next divisor to test. The function continues testing successive divisors; when the end of the sequence is reached, nbrIsPrime() jumps back to the beginning of the sequence to continue with the next pass through the sequence.

For example, in the pseudocode above, the sequence of differences for STP 3 is { 2, 4 }. Beginning with start divisor 5, the loop iterations unfold as follows:

    (1) Test  5, Add 2, Test  7, Add 4
    (2) Test 11, Add 2, Test 13, Add 4
    (3) Test 17, Add 2, Test 19, Add 4
    (4) Test 23, Add 2, Test 25, Add 4
    (5) Test 29, ...

Limit Check - refers to checking if a divisor about to be tested is less than or equal to the square root of the target number. When testing all odd divisors (skip-through prime 2), there is only one test in the loop and the limit check is performed as part of the loop control. For skip-through primes 3 and above, there are multiple tests per loop iteration, corresponding to each difference in the prime's sequence. STP 5 has a sequence of 8 differences, for example, but adding 7 limit checks inside the loop incurs a performance hit. (The pseudocode above doesn't limit check the second test's divisor in the loop.) I discuss this issue in more detail below, "Limit Checking".

Generalized Algorithm

The pseudocode for the generalized implementation inserts an inner loop that steps through the sequence of differences (again, "divisible" means evenly divisible with a remainder of 0):

Figuring out the sequence of differences for skip-through prime 3 was easy enough to do in my head. I wrote a throw-away program to generate differences for skip-through primes 5, 7, and 11; examined the listings to determine when the sequences began repeating (8, 48, and 480, respectively); and noticed the first of a few relations I would find:

seq-length (next-prime) = seq-length (current-prime) × (next-prime - 1)

For example:

    Skip-through prime 2's sequence of differences, { 2 }, is of length 1.
    STP 3 has a sequence of length 2 = 1 × (3 - 1).
    STP 5 has a sequence of length 8 = 2 × (5 - 1).
    STP 7 has a sequence of length 48 = 8 × (7 - 1).
    STP 11 has a sequence of length 480 = 48 × (11 - 1).
    And so on.

Sequence length, for the skip-through primes under consideration (2 through 53 for 64-bit unsigned long integers), increases by one or two orders of magnitude for each new prime. STP 59's sequence length exceeds the maximum representable, unsigned 64-bit value.

range (next-prime) = range (current-prime) × next-prime

The range of a sequence of differences is the range of divisors covered by each pass over the sequence, found by summing all the differences in the sequence. STP 3's sequence is { 2, 4 }, whose range is 2 + 4 = 6; as an example above shows, the first divisor tested in each loop iteration is 5, 11, 17, 23, 29, ... STP 5's 8-element sequence has a range of 30, so the divisors at the beginning of each loop iteration are 7, 37, 67, 97, ... I wasn't clever enough to realize that the formula above is more simply expressed as the product of the preceding primes and the skip-through prime; e.g., the range of STP 7 is 2 × 3 × 5 × 7 = 210. Although STP 53's sequence length fits in an unsigned 64-bit integer, its range doesn't.

density (prime) = seq-length (prime) / range (prime)

Density, in this case, is a measure of how many numbers in a range are actually tested as divisors. The lower the density the better. If even numbers are skipped and odd numbers are tested (STP 2), the density is 50%. Higher skip-through primes lower the density even further at the expense of increasingly lengthy sequences of differences.

max-difference (next-prime) = current-prime × 2

Sometimes! The maximum differences in the sequences for skip-through primes [2,] 3, 5, 7, 11, 13, 17, 19, and 29, 31 obey this relation. For example, the maximum difference in STP 31's sequence is 29 × 2 = 58. However, skip-through prime 23's maximum difference is 40 (not the expected 19×2 = 38) and STP 37's is 66 (not the expected 31×2 = 62). Using nbrPrimeSeqDiffsMax() to generate the one trillion (10¹²) differences in STP 37's sequence and determine the maximum difference originally consumed 99%-100% of the CPU for 8.5 days. I subsequently optimized the function by skipping both even divisors and odd divisors divisible by 3, reducing the time to 6.4 days. I'm not a glutton for punishment, so I did not give serious consideration to optimizing nbrPrimeSeqDiffs() and nbrPrimeSeqDiffsMax() further by using higher skip-through primes than 3. In any case, finding out STP 41's maximum among its 44 trillion differences would take three-quarters of a year or more. Maybe there's a mathematical relation that gives the correct maximum difference for all skip-through primes, but nothing jumped out at me.

Knowing a sequence's maximum difference--and using differences in the first place--can significantly reduce memory usage. For unsigned 64-bit integers, the largest skip-through prime, 47, has an estimated maximum difference of 43 × 2 = 86, which can be represented in an unsigned 8-bit integer. STP 47's sequence of differences requires only one-eighth of the memory that the sequence of offsets in the math section below requires. (Actually, STP 53 is the largest skip-through prime in the differences form of the algorithm. It can't be used in the math section's offset form of the algorithm. The last offset for any STP is always one less than the STP's range and STP 53's range exceeds the maximum representable, unsigned 64-bit integer by 14×10¹⁸.) Looking at unsigned 128-bit integers, the largest skip-through prime is 101 with an expected maximum difference of 202, which, if correct, can still be represented in an unsigned 8-bit integer. (This is all theoretical for the most part; the largest skip-through prime I've actually tested was 29 with its one-billion-element sequence.)

Time for some actual numbers to get a sense of what we are dealing with. The following numbers are for unsigned 64-bit integers (L/R is the length divided by the range; i.e., density):

(The primate program computes the range as both an integer and a floating-point number. If the integer overflows, the floating-point number is substituted in the density calculation. So the density for skip-through prime 53 is correctly listed as 13%.)

For unsigned 128-bit integers, the following figures pick up with skip-through prime 53 (whose range overflowed 64 bits) and run through skip-through prime 101. The density at this point has only dropped down to 11.9%. (The sequence length and range both overflow for STP 103.)

Knowing the exploding lengths and ranges for the higher skip-through prime, whether for 64-bit or 128-bit numbers, was of intellectual interest to me, but these higher skip-through primes are obviously not of any practical value. The Prime Page's Glossary entry for wheel factorization says, in different terminology than mine, that skip-through prime 251 reduces the density to 10% and 75,037 reduces it to 5%. Skip-through prime 134,253,593, with a sequence length of approximately 6.4×10⁹⁹, drives the density down to 3%!

Limit Checking

"Limit checking" refers to not testing divisors greater than the square root of the number when determining if the number is prime.

First of all, heed my warning at the very beginning about the danger of overflow using the widely seen i×i ≤ n check with fixed-precision integers. The following explanation uses 64-bit unsigned longs, but the warning applies to any bit width.

The integer square root of the maximum representable, 64-bit value, ULONG_MAX, is 0x0FFFFFFFF — all of the lower 32 bits set. (The real square root of ULONG_MAX is 0x0FFFFFFFF plus about 0.9 decimal.) If you square the integer square root, the result is a number that is 8.6 billion less than ULONG_MAX. In other words, the 8.6 billion numbers in the range [root²..ULONG_MAX] all have the same integer square root: 0x0FFFFFFFF. There are 193 million primes in that range. Consequently, for each of the 193 million primes tested with divisor i:

So 193 million primes are excluded from being tested because of the i×i multiplication. The steps above will occur when testing every possible divisor (i++) or only odd divisors (i += 2).

Additional optimizations in the algorithm make the situation even worse. If the algorithm is extended to also skip divisors divisible by 3 (my first optimization), the divisor is incremented alternately by 2 and by 4. Since 0x0FFFFFFFF is divisible by 3 and will be skipped, i will only reach the preceding odd number, 0x0FFFFFFFD, and then be incremented by 4 to 0x100000001, causing i×i to overflow. The square of the preceding odd number is 25.8 billion less than ULONG_MAX. Testing the primality of the 579 million primes in that range — those whose integer square root is 0x0FFFFFFFD, 0x0FFFFFFFE, or 0x0FFFFFFFF — is impossible.

To avoid overflow with fixed-precision integers, divide both sides of the relational operator by i:

Now that there is no danger of overflow, there are some other aspects of limit checking in relation to my skip-through prime optimization that need to be examined. Remember that the algorithm is basically two nested loops, where the inner loop runs through the elements in the sequence of differences and the outer loop repeats the inner loop as often as necessary to reach and pass the square root of the number (if the number is prime):

In the Wikipedia pseudocode for the related c#k+i algorithm (see the mathematics section below), the "inner loop" corresponding to my inner loop tests the same number of divisors as mine does, but without limit checking the divisors. (I put "inner loop" in quotes because the pseudocode is hard-coded for the STP 3 optimization and the "loop" is a single IF-statement that checks both the 6(k-1)+5 and "6k+1" divisors; again, the c#k+i version of my STP 3 optimization is explained in the mathematics section.) It is a good idea, however, to limit check each divisor inside the inner loop as well as the outer loop:

(Whether the check is at the beginning or end of the inner loop, there will always be one check that duplicates the check in the outer loop on each pass through the sequence.)

Because the length of the sequence and range increase rapidly with higher skip-through primes, there are two cases that are of concern when there is no limit check in the inner loop:

The first case above is a matter of efficiency; the second case requires flagging a number as composite if "n%div == 0" and if "n != div".

Adding the limit check, "div > n/div", to the inner loop caused a big performance hit. My first attempt at ameliorating the slow-down was to add a global boolean variable that the calling program could set to enable or disable limit checking in the inner loop:

Testing a boolean flag takes no time at all and, when limit checking is disabled, the algorithm is as fast as the original no-limit-checking algorithm. However, I was not satisfied with burdening the calling program with the responsibility of deciding whether to turn limit checking on or off.

My second attempt took advantage of a sequence's range to put off limit checking until the tested divisor got "close" to the square root of the input number. First, I computed a rough estimate of the square root by repeatedly dividing the number by 2 until "root ≤ n/root". Then, in the outer loop, before beginning the inner loop each time, figure out if the rough root estimate falls within the range of numbers covered by the coming pass through the sequence of differences. If not, clear a local boolean variable to disable limit checking in the inner loop. If the estimated root does fall in the range, set the boolean variable to enable checking.

A more accurate estimate of the square root was needed, so I searched for "fast integer square root" on the web and, not unsurprisingly, up popped numerous references to a technical bulletin, "Fast Integer Square Root", by Ross M. Fosler. (I added an "s" to the title for my own description of the very fast algorithm and how I implemented it, "Fast Integer Square Roots".) My is-prime function now knows exactly when the tested divisors pass the actual square root of the number and the function no longer needs to perform the time-consuming "div > n/div" check in the inner loop:

Performance

Performance tests were conducted using my primate test program. The command lines are shown below. The "output" following the command lines are not the output of the program, but me tabulating the results of multiple runs. When primate is in "-time" mode, most output is suppressed so that I/O is not a significant factor in the times. Also, the generation of sequences of differences is done before the timer starts counting.

My original prime number function was in my HASH_UTIL package and the largest hash table I ever needed was for about 80,000 telemetry points. (Most of those points were variable names tied to memory locations in the onboard computer and would only be downloaded, selectively, when trying to diagnose a problem.) So my first suite of tests listed all the prime numbers in the range 80,000-200,000. Since listing those numbers is very quick, I had each run repeat the test 1000 times to get a large enough timing figure for comparison's sake.

The second suite of tests determines the 1,000 largest prime numbers representable in unsigned, 64-bit integers. These numbers are verified by matching them against a pre-computed list of the primes generated by a different program, Achim Flammenkamp's prime_sieve, found at his web page:

[Modifications to his program: (i) define LONG as "unsigned long", (ii) undefine _ANSI_EXTENSION, (iii) uncomment the commented-out code in the definition of "use(x)", and (iv) hard-code the initial sieve size.]

The most time-consuming aspect of my algorithm is generating the sequence of differences for the higer skip-through primes. My computer was able to generate a physical, 1-billion-byte sequence of differences for STP 29, but malloc(3) failed trying to allocate the 31-billion-byte sequence for STP 31. The nbrPrimeSeqDiffsMax() function generates the same sequences of differences, but without requiring any memory for the sequence: each difference is generated, compared to the running maximum value, and then discarded. So I was able to time sequence generation for STPs 31 and 37; the last took 6.4 days and I didn't have the patience to wait 280+ days for the generation of STP 41's 44-trillion-element sequence to complete.

Mathematically Speaking

When the code was pretty much in its final form, bar tweaking of the limit checking, I researched ways to speed up generating sequences of differences (e.g., originally 8.5 days for skip-through prime 37's one trillion differences, a 10¹²-length sequence). This research was unsuccessful, but I came upon many queries on programming forums about determining if a number is prime and many responses about an optimized method called "6k +/- 1". I then searched for information about 6k +/- 1 and landed on the Wikipedia page about primality testing using "trial division", i.e., looping through all possible divisors to find a divisor other than 1 for the number. From Wikipedia (Wikipedia: Primality test - Simple methods):

So all primes are of the form c#k+i for i meeting the conditions above. (Note that the reverse is not true; numbers of the form c#k+i are not necessarily prime.) Relating this to my code:

c is my skip-through prime. (Strictly speaking, c can be any number in the range STP ... next-prime - 1; in other words, STP is the largest prime less than or equal to c.)

c# is "c primorial", where primorial is simply the product of all the primes less than or equal to c; i.e., 2x3x5x7x... This value is identical to my "range" of a sequence of differences. (Wikipedia: Primorial)

k indexes the blocks of range c# of divisors to be checked. Incrementing k is like reaching the end of a sequence of differences in my code and looping back to start at the beginning of the sequence for the next higher block of divisors.

i is drawn from the set of numbers in the effective range 1..c#-1 such that each i is coprime with c#. Two numbers are coprime with respect to each other if they share no divisors other than 1. Note that "coprime" is a relation between numbers, not an attribute (like "prime") of a single number. (Wikipedia: Coprime integers)

So the c#k+i optimization is the foundation for my code, unbeknownst to me when I wrote the code! The difference between the c#k+i method and my code is that the former method works with explicit i offsets from the base of a c#k block and my code works with the sequence of differences between successive i offsets.

Now let's consider the set of numbers coprime with c# from which i is drawn. The numbers are in the range 0..c#-1. Zero (0) drops out because the first multiplier in c# is 2, so c#k+0 is always even (divisible by 2). The very first number in the set is always 1 since the only divisor it shares with c# is 1.

The second number in the set is always the next prime after c: the numbers between 1 and the next prime, exclusive, are either equal to or evenly divisible by one of the primes less than or equal to c (that's why the next prime is prime!), so these numbers drop through the "sieve". NOTE that the next prime, the second number in the set, is the start divisor in my algorithm. My sequence of differences begins with the difference between the second and third elements of the set and ends with the difference between the first and second elements.

The remaining numbers in the set are determined by checking the divisors of the numbers after the next prime. If a number is evenly divisible by any of the primes less than or equal to c, the number drops through the "sieve" and out of the set. Since the number of elements in a set rises rapidly with high c's, this is a time-consuming process. This process can itself be optimized by using the c#k+i method. My comparable generation of a sequence of differences uses a hard-coded skip-through prime of 3; as I mentioned above, it didn't seem worth the effort to generalize that bit of code for higher STPs.

Skip-Through Primes

Here are some example skip-through primes in their c#k+i form. (In the first example, 1 is not prime, but the algorithm will work with the given parameters to check every possible divisor, odd and even.)

Note that the algorithm has to ignore the very first potential divisor: when k is 0 and i is 1, c#k+i = 1 and any number being divisible by 1 does not mean the number is composite. Effectively, the first divisor checked is the next prime after c, my start divisor. (Of course, this is after the initial check if the target number is equal to or divisible by the primes less than or equal to c).

I just noticed that the difference between the first two elements in a set equals the difference between the last two elements, at least up to c=29. I don't know if there's a mathematical reason for this or if it holds true for all possible sets.

You might be wondering how the "6k +/- 1" form widely seen on the internet relates to the c#k+i scheme of 6k+i for i in { 1, 5 }. Swap the arithmetic operations and "6k -/+ 1" becomes shorthand for the +5 term of the previous k block and the +1 term of the current k block:

A graphical representation, I guess you would call it, of the c#k+i optimization is called wheel factorization (Wikipedia: Wheel factorization). Like my algorithm, the algorithm presented on the Wikipedia page (i) explicitly begins checking divisors with the next prime after c and (ii) uses the differences between divisors rather than the i offsets.

Sequence Length

As mentioned before, the sequence length is the count (or totient) of the numbers in 1..c#-1 that are coprime with c#. Euler's product formula (Wikipedia: Euler's product formula) can be used to compute the totient, phi(n), for a given number n:

which means the product of successive (1-¹/_p)s for all primes p that are divisors of n:

(Note that Euler's formula can be used for arbitrary positive integers, but what is of interest in the primality testing context is specifically when n is an instance of a c#.)

With p[k] being a prime number and p[k+1] the next larger prime number, my sequence-length formula can be derived from Euler's formula as follows:

I am not a mathematician and my heart sank after reading the first two paragraphs on totient's Wikipedia page. The rest of the page is not for the mathematically faint of heart. Fortunately, I really only needed to understand Euler's product formula; some ancillary Wikipedia pages on the notation helped me figure out how to apply the formula. I tried it on three skip-through primes:

The pattern of the calculations became obvious and I could then work out the derivation above of my formula after the fact.

Primality Testing

2018

Introduction

Terminology

Generalized Algorithm

Limit Checking

Performance

Mathematically Speaking

Introduction

Terminology

Generalized Algorithm

Limit Checking

Performance

Mathematically Speaking

Skip-Through Primes

Sequence Length