nbrFactorsOf()
I was recently looking for the Wikipedia entry,
"Primality test",
that I referenced in the documentation for my nbrIsPrime()
function. It's been about 3 years, so I couldn't remember the name of
the entry off the top of my head. I searched the internet for "prime
wikipedia", which took me to the main page,
"Prime number".
There is a link to the "Primality test" entry in the
"Computational
methods" section's introductory paragraphs — repeat 3 times
quickly! — but it's overshadowed by the "Trial division"
subsection which, naturally, links to Wikipedia's "Trial division" entry.
"Trial division" looked nothing at all like the page I remembered and it took me a while to realize the entry was about integer factorization, not primality testing. I was intrigued by the C++ code and wondered if it is vulnerable to possible overflows. No surprise, it is.
The C++ TrialDivision()
function was introduced by a single phrase:
Below is version C++ (without squaring f)
The workings of the function are not documented and are not obvious.
To figure out how it works, I had to figure out the meaning of
ac
, temp
, and the magic number 8. Here's
a version of the function with meaningful variable names and some
meaningful comments:
template <class T, class U>
vector<T> TrialDivision(U n)
{
    // The difference between the squares of two consecutive
    // odd factors, f and f+2, is (f+2)^2 - f^2 = 4f + 4.
    // The differences, 4f+4, increase as the factors
    // increase, but the difference between any two
    // consecutive differences, 4f+4 and 4(f+2)+4, is constant:
    // 4(f+2) + 4 - (4f + 4) = 4f + 8 + 4 - 4f - 4 = 8.
    const T DIFF_4FP4 = 8;

    vector<T> v;

    while (n % 2 == 0)          // Check for factors of 2.
    {
        v.push_back(2);
        n /= 2;
    }
    while (n % 3 == 0)          // Check for factors of 3.
    {
        v.push_back(3);
        n /= 3;
    }

    T factor = 3;
    T square = factor * factor;
    T diff = 4*factor + 4;      // Difference between 3^2 and 5^2.
    factor = 5;

    do
    {
        square += diff;         // Overflow possible!!!
        if (square > n) break;  // All factors tested?
        if (n % factor == 0)    // Found a factor?
        {
            v.push_back(factor);
            n /= factor;
            square -= diff;     // Try factor again,
                                // so revert to factor^2,
                                // which might have overflowed!!!
        }
        else                    // Not a factor, so advance
        {
            factor += 2;        // to the next odd number.
            diff += DIFF_4FP4;  // Overflow possible!!!
        }
    } while (1);

    if (n != 1) v.push_back(n);

    return v;
}
To avoid multiplying the factor by itself on every factor tested,
the function keeps track of the squares of the odd factors
(ac
/square
)
and the differences between the squares of consecutive odd factors
(temp
/diff
, incremented by 8 —
see the DIFF_4FP4
comment — on each iteration).
For example, assuming no repeating factors, here are the variable values in the first few DO-WHILE iterations:
FACTOR   SQUARE   DIFF
------   ------   ----
[Implicit previous factor:
     3        9     16   -- 3 is f; 9 is f^2; 16 is 4f+4 or 4(3)+4.]
     5       25     24   -- 5 is f; 25 is f^2; 24 is 4f+4 or 4(5)+4.
     7       49     32
     9       81     40
    11      121     48
   ...
The phrase, "in the ... iterations", is purposeful and the table
above is for illustrative purposes. In the actual code, when unsuccessfully
testing consecutive odd factors, factor
is the only up-to-date
variable at the very beginning of the loop; square
and
diff
lag behind, still based on the previous factor. For example,
at the top of the loop on the very first iteration, factor
is the
correct 5, but square
and diff
are based on the
previous factor, 3, and thus have the values 9 and 16, respectively. The
latter two get updated within the loop. If a test is successful, the same
factor must be tried again; in this case, the loop preserves the lagging
values of square
and diff
and starts the next
iteration in the same state as that at the beginning of the current iteration.
For example, if a loop begins with 11/81/40 and 11 is found to be a factor,
the next iteration also begins with 11/81/40 in order to test 11 again.
(The implementation of the "diff" algorithm in my factorization test and
compare program, frize,
moves the updating of both square
and diff
to the
bottom of the loop, so all 3 variables are up-to-date at the top of the loop,
as seen in the table above.)
There are two sources of possible overflow: (i) adding
diff
to square
and (ii) incrementing
diff
by 8. And the two sources interact badly. The
overflows begin occurring when factoring very large prime numbers.
A prime number has exactly one factor, itself, so the testing loop will
continue until the square root of the prime number is tested as a factor.
To break out of the loop, the factor being tested must have a square
greater than the prime number (square > n
) and that's when
overflows become possible.
To illustrate, let's use unsigned 32-bit integers. First, some preliminaries:
Largest 32-bit value:   4,294,967,295 (0xFFFFFFFF)
Square root of above:   65,535.999...
Integer square root:    65,535 (0xFFFF)
Square of 65,535:       4,294,836,225
Note that the real square root of the maximum representable value is a shade under 65,536 and the integer square root is 65,535. The square of the latter is 4,294,836,225, which is 131,070 less than the largest value. The 5,853 prime numbers in that top range cannot be factored without overflow happening because the square of the factor being tested must exceed the prime number. When 65,535 is tested as a factor of the prime number, its square is less than the prime number, so the loop must compute the square of the next factor, 65,537. This square overflows the maximum representable 32-bit value.
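To make the wrap-around concrete, here's a small stand-alone snippet (my illustration, not Wikipedia's code) that squares 65,537 in unsigned 32-bit arithmetic:

#include <cstdint>
#include <iostream>

int main ()
{
    uint32_t factor = 65537;
    uint32_t square = factor * factor;  // True value 4,295,098,369
                                        // wraps modulo 2^32 ...
    std::cout << square << std::endl;   // ... and prints 131073.
    return 0;
}

The wrapped value, 131,073, is far below the prime being factored, which is what sends the loop off the rails, as described next.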
As a result, although the odd factors steadily increase by 2, the square suddenly drops to a low number and must begin anew its Sisyphean climb to the top. Square now has no relation to factor or diff.
Square
is again less than n
, so the function keeps
testing new factors even though they're past the square root mark. Since
diff
was not reset, square
will move through the
32-bit integers even faster on the second pass until it overflows again.
The cycle will repeat itself over and over. Meanwhile, factor
keeps incrementing by 2 and diff
by 8. As diff
gets very large, it begins interacting with even small values of
square
, so that adding diff
to square
causes square
to overflow. It gets to the point where
square
is overflowing on every iteration of the testing loop.
At times, the overflows will produce descending values of square
.
We're not done yet. Eventually diff
reaches 4,294,967,288,
seven less than the maximum representable 32-bit value. When 8 is added
to diff
, diff
conveniently overflows to zero.
Square
can now take its time running through the integers
before overflowing.
At most, the function should test about 32,767 (64K/2) odd factors for
any 32-bit number. If data type T
is wider than data type
U
(the TrialDivision()
template parameters),
the function will not exceed the test limit (32K if U
is
an unsigned 32-bit integer). Type T
is used for internal
calculations; type U
is the type of the number to be factored.
For example, if T
is 64 bits and U
is 32 bits,
no overflows will occur. Even declaring T
as
uint32_t
(32 bits of precision) and U
as
int32_t
(31 bits of positive precision) will prevent
overflows. The trade-off is that narrowing type U
narrows the range of numbers that can be factored.
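For example, assuming the TrialDivision() template above is in scope, the safe instantiation is simply:

#include <cstdint>
#include <vector>
using namespace std;

// ... TrialDivision<T, U>() as defined above ...

int main ()
{
    // T (uint64_t) is wider than U (uint32_t), so factor squares up
    // to 65,537^2 = 4,295,098,369 fit without wrapping.
    vector<uint64_t> factors =
        TrialDivision<uint64_t, uint32_t> (4294967291u);
    // factors == { 4294967291 }, the largest 32-bit prime.
    return 0;
}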
When types T
and U
are equal in width, overflows
are possible and can lead to pathological behavior of the function. To wit:
Largest 32-bit prime:      4,294,967,291 (4 less than UINT32_MAX)
# of odd factors tested:   2,147,483,644 (not including test of 2)
# of SQUARE overflows:     1,073,741,824
# of DIFF overflows:       3
In this case, the test loop is terminated when factor
equals
n
(i.e., 4,294,967,291) and n % factor is zero,
is zero,
not when square > n
. Thus, the number of factors tested
(2 and odd) is n/2
. Diff
overflows occur when
advancing to factors 1G-1, 2G-1, and 3G-1, where G is in the sense of
powers of 1024. The diff
overflows are expected on multiples
of 1G because the number of odd factors tested in a range of 1G is 1G/2 and
diff
is incremented by 8 on each test, so diff
reaches the overflow value at (1G/2)×8 = 4G, one more than UINT32_MAX.
(Because the input prime is so large in the example above, when the loop
terminates, the loop is only several tests shy of a fourth diff
overflow.)
The first improvement to TrialDivision()
was to get rid of
the temp
/diff
variable. The formula for the
next square
is easy enough and fast enough to compute that the second source of overflow can be eliminated entirely:
template <class T, class U>
vector<T> TrialDivision(U n)
{
    vector<T> v;

    while (n % 2 == 0)          // Check for factors of 2.
    {
        v.push_back(2);
        n /= 2;
    }

    T factor = 3;
    T square = factor * factor;

    while (square <= n)         // Until all factors are tested.
    {
        if (n % factor == 0)    // Found a factor?
        {
            v.push_back(factor);
            n /= factor;        // Leave square alone and
                                // try factor again.
        }
        else                    // Not a factor, so advance
        {                       // to the next odd number.
                                //     (f + 2)^2 =
                                //         f^2 + 4f + 4 =
                                //         f^2 + 4(f + 1)
            factor++ ;          // Overflow possible next!!!
            square += factor * 4 ;
            factor++ ;
        }
    }

    if (n != 1) v.push_back(n);

    return v;
}
The code is cleaner now. The next square is calculated simply by adding
(factor + 1) * 4
to the current square; an optimizing compiler
will replace the multiplication with a left shift of two bits. There is
still the same possibility of square
overflowing under the
same circumstances. Assuming that overflow causes the value of a variable
to "wrap around", it is easy enough to detect just by comparing the old
and new values of square
. If the new value is less than the
old value, then overflow occurred.
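A sketch of that detection, assuming T is an unsigned type (for which wrap-around is guaranteed in C++), could replace the advancing branch of the loop:

        else                    // Not a factor, so advance.
        {
            T oldSquare = square;
            factor++ ;
            square += factor * 4 ;  // (f + 2)^2 = f^2 + 4(f + 1)
            factor++ ;
            if (square < oldSquare) // Wrapped around? Then no untested
                break;              // factor can be <= sqrt(n).
        }

The break falls through to the final if (n != 1) test, which correctly reports the remaining target as its own (prime) factor.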
With this code in hand, an optimization immediately comes to mind.
TrialDivision()
checks every odd number from 3 on up
as a factor. Borrowing the 6k±1 optimization from
primality testing and skipping odd
factors divisible by 3, the time to factor a number is reduced by up
to a third. Doing this also decreases the range of numbers that can
be factored which, again, is attributable to overflow. In the example
presented earlier, the top 5,853 32-bit unsigned prime numbers cannot
be factored without overflow; that's with the algorithm that tests 2
and all odd factors. The 6k±1 optimization increases
that number to the largest 17,565 primes. (This increase is because
the maximum square root is 65,535, which happens to be divisible by 3.
When checking all odd factors, the algorithm will test factors up to
and including 65,535. With the 6k±1 optimization, the
highest factor tested drops back to 65,533 and the next factor tested,
as it must be, is 65,537, whose square overflows. Consequently, the
17,565 primes greater than or equal to 65,533² [4,294,574,089]
cannot be factored without the square overflowing.)
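The alternating increments are simple to code; here's a quick sketch of the stepping pattern (the final function later in this article uses a boolean flag to the same effect):

#include <iostream>

int main ()
{
    // Candidates 5, 7, 11, 13, 17, 19, 23, 25, ... -- every number of
    // the form 6k-1 or 6k+1.  (Composites such as 25 still get tested;
    // they're merely wasted tests, not wrong answers, because their
    // prime factors were already divided out.)
    unsigned factor = 5;
    unsigned step = 2;                  // Alternates between 2 and 4.
    for (int i = 0 ; i < 10 ; i++)
    {
        std::cout << factor << ' ';
        factor += step;
        step = 6 - step;                // 2 -> 4 -> 2 -> ...
    }
    std::cout << std::endl;
    return 0;
}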
Eliminating the possibility of overflow and restoring the range of
numbers that can be factored is accomplished by eliminating the use
of squares. A fast integer square root
function can very quickly calculate the square root (and highest
possible factor) of the input number if the number is (or can be
converted to) an unsigned integer with an even number of bits.
My C implementation of TrialDivision()
calls
nbrUnsignedSQRT()
,
which is in "nbr_util_roots.c
".
There is an implicit limit on the number of factors tested in the squares
version of TrialDivision()
that I did not realize at first when
I wrote up the square root version of TrialDivision()
. Initially,
the target number is the original input number. Whenever a factor
is found (including 2s and 3s), the target number is reduced (divided) by
the factor. Consequently, the square <= n
loop test in the
function above is actually testing two limit conditions:
1. If the original input number is composite, the target number is
   reduced by each factor until the target is reduced to 1 after the
   last factor. At this point, there is no need to test factors larger
   than the last factor, so the loop can terminate. And in the squares
   version of TrialDivision(), the loop does so because square > 1.

2. If the original input number is prime, no factors will be found and
   the target will remain equal to the original input number. To
   ascertain that there are no factors and that the number is prime,
   it is necessary to check all the possible factors up to the square
   root of the number (i.e., square <= n).
Limit #2 is a hard limit needed for prime numbers. Limit #1 is just an optimization. When the target number is reduced to 1, it is perfectly acceptable, albeit time-consuming, to continue unsuccessfully testing factors up to the square root.
Here's the square root version of TrialDivision()
that
(i) eliminates the possibility of overflow, (ii)
implements the 6k±1 optimization, and (iii)
includes both limit tests:
template <class T, class U>
vector<T> TrialDivision(U n)
{
    vector<T> v;

    while (n % 2 == 0)          // Check for factors of 2.
    {
        v.push_back(2);
        n /= 2;
    }
    while (n % 3 == 0)          // Check for factors of 3.
    {
        v.push_back(3);
        n /= 3;
    }

    T factor = 5;
    T root = (T) UnsignedSQRT (n);
    bool byTwo = true;          // Increment factor by 2 or by 4.

    while ((n > 1) &&           // Until n is reduced to 1 ...
           (factor <= root))    // ... or all factors are tested.
    {
        if (n % factor == 0)    // Found a factor?
        {
            v.push_back(factor);
            n /= factor;        // Try factor again.
        }
        else                    // Not a factor.
        {
            if (byTwo)
                factor += 2;    // Advance to next odd factor.
            else
                factor += 4;    // Skip odd factor divisible by 3.
            byTwo = !byTwo;     // Alternate between 2 and 4.
        }
    }

    if (n != 1) v.push_back(n); // N not reduced to 1 means N is prime.

    return v;
}
Not being a big fan of C++, I implemented this last version in C and
called it nbrFactorsOf()
. The function uses a single type
for both the input number and internal calculations, factor_t,
which is defined in "nbr_util.h
". Factors of the input
number are stored in an array of factor_t
s supplied by the
caller. (The maximum number of factors that may be returned is w-1, where w is the bit width of factor_t. In other words, the number of factors increases as the magnitudes of the factors decrease. The smallest factor is 2, so the largest power of 2 representable in factor_t will have the maximum number of factors: 2 × 2 × ... × 2, w-1 times. If factor_t is 32 bits wide, the largest power of 2 is 2^31 (0x80000000) and there can be at most 31 factors; 64 bits will have at most 63 factors.)
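The worst case is easy to check using the squares version of TrialDivision() above as a stand-in for the C function:

#include <cstdint>
#include <vector>
using namespace std;

// ... TrialDivision<T, U>() (squares version) as defined above ...

int main ()
{
    // 0x80000000 = 2^31 factors into thirty-one 2s, so a caller
    // supplying a 32-bit factor_t needs room for 31 factors.
    vector<uint32_t> f = TrialDivision<uint32_t, uint32_t> (0x80000000u);
    // f.size() == 31
    return 0;
}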
Two additional optimizations were added to the C function:
1. A list of the first M prime numbers was added: 2, 3, 5, 7, 11, ...
   These numbers are tested as factors at the very beginning of
   nbrFactorsOf(), thus giving the function a headstart by not having
   to test non-prime odd factors in the range covered by the list.
   M is a compile-time constant, NBR_EARLY_PRIMES, and there is the
   obvious trade-off between speed and the memory footprint of the list.

2. When a factor is found and the target number is reduced, the square
   root of the target number is recalculated. (A sketch combining both
   optimizations follows the list.)
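Here's a compact C++ sketch of how the two optimizations fit together. The tiny prime table, the name FactorsOf(), and the starting candidate are illustrative stand-ins for the real C code and its NBR_EARLY_PRIMES table:

#include <cstdint>
#include <vector>
using namespace std;

typedef uint64_t factor_t;

// Illustrative table; the real one holds NBR_EARLY_PRIMES entries.
// The starting candidate below must be aligned with where the table
// ends in the 6k±1 cycle (13 is 6k+1, so the loop resumes at 17).
static const factor_t earlyPrimes[] = { 2, 3, 5, 7, 11, 13 };

factor_t UnsignedSQRT (factor_t n)      // floor(sqrt(n)), as sketched earlier.
{
    if (n < 4) return (n > 0) ? 1 : 0;
    factor_t x = n, y = x / 2 + 1;
    while (y < x) { x = y; y = (x + n / x) / 2; }
    return x;
}

vector<factor_t> FactorsOf (factor_t n) // Assumes n >= 1.
{
    vector<factor_t> v;
    for (factor_t p : earlyPrimes)      // Optimization #1: divide out
    {                                   // the precomputed primes first.
        while (n % p == 0)
        {
            v.push_back (p);
            n /= p;
        }
    }
    factor_t factor = 17;
    factor_t root = UnsignedSQRT (n);
    bool byTwo = true;
    while ((n > 1) && (factor <= root))
    {
        if (n % factor == 0)            // Found a factor?
        {
            v.push_back (factor);
            n /= factor;
            root = UnsignedSQRT (n);    // Optimization #2: a smaller
        }                               // target has a smaller root.
        else
        {
            factor += byTwo ? 2 : 4;    // 6k±1 stepping.
            byTwo = !byTwo;
        }
    }
    if (n != 1) v.push_back (n);        // Remaining target is prime.
    return v;
}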
To save you the trouble of reading the analysis below, here's the short summary:
- The timings of the different algorithms I implemented were comparable.
- The 6k±1 optimization had the biggest impact on all the algorithms,
  reducing their times by one-third.
- Computing the square root at the outset offers a slight improvement
  in timing, but the big gain is that there are no overflows and,
  therefore, all of the primes which fit into the chosen integer data
  type can be factored.
18,446,744,065,119,617,025 is a large, 64-bit number with 10 factors:
3×3 × 5×5 × 17×17 × 257×257 × 65537×65537
LIBGPL function nbrFactorsOf()
quickly tests the first 512 precomputed prime numbers as factors, up through 3,671. The function then drops into the middle of the 6k±1 cycle and, beginning with 3,673, tests odd numbers not divisible by 3 as factors. Testing the early primes as factors catches 8 out of the 10 factors of 18,446,744,065,119,617,025, leaving only 65,537² to be factored.
The algorithms implemented in and tested by the frize program all have this early-prime optimization available to them, but, in the timing runs I present below, the optimization is disabled on the command line.
In the timed tests that follow, frize factors the big number above 15,000 times in order to get large enough timing measurements. The command line used for each run was:
% frize -early 0 -time -repeat 15000 18446744065119617025,algo
The different algorithms performed as follows on this factorization (the "root" and "root3" figures include the number of fast integer square roots calculated during the factorization):
"diff": 21.3 seconds 32,777 factors tested "square": 20.9 seconds 32,777 factors tested "square3": 14.1 seconds 21,855 factors tested "root": 19.6 seconds 32,777 factors tested 10 roots "root3": 13.4 seconds 21,855 factors tested 8 roots "nbr": 12.6 seconds (with 512 early primes) |
The elapsed times for the "diff", "square", and "root" algorithms are close, as are the times for the "square3" and "root3" algorithms. The 6k±1 optimization obviously has the biggest effect since it reduces the number of potential factors tested by a third.
Testing the first 512 primes, nbrFactorsOf()
is slightly
faster than "square3" and "root3"; however, it only takes about 130
tests at most for the other algorithms to reach the 7th and 8th factor,
257. If the "-early 0
" option is removed from the command
line, frize will test up to 6,542 prime numbers as factors,
ranging from 2 to 65,521. The last number is the largest unsigned
16-bit prime and is 16 less than the last two factors, 65,537.
Consequently, only a few extra odd numbers have to be tested as factors
and all the algorithms factor the input number 15,000 times in 4 seconds
or less! (Except for the "nbr" algorithm which only tests 512 early
primes anyway.)
18,446,744,047,939,747,781 is the largest prime number that the "square3" algorithm can factor without 64-bit overflow. The number has only one factor: itself. It takes about a minute or more to factor the number once, so the frize command line for only one factorization instead of 15,000 is:
% frize -early 0 -time 18446744047939747781,algo
And the timing results are:
"diff": 93.2 seconds 2,147,483,646 factors tested "square": 91.0 seconds 2,147,483,646 factors tested "square3": 61.4 seconds 1,431,655,765 factors tested "root": 85.6 seconds 2,147,483,646 factors tested 1 root "root3": 58.5 seconds 1,431,655,765 factors tested 1 root "nbr": 55.0 seconds (with 512 early primes) |
Again, the times are comparable among the algorithms, the 6k±1 optimization had the biggest effect on performance, and the root methods aren't subject to overflow.