Related functions and programs:

- `NBR_UTIL` - integer factorization function; i.e., `nbrFactorsOf()`
- **frize** - integer factorization test program.

I was recently looking for the Wikipedia entry, "Primality test", that I
referenced in the documentation for my `nbrIsPrime()` function. It's been
about 3 years, so I couldn't remember the name of the entry off the top of
my head. I searched the internet for "prime wikipedia", which took me to
the main page, "Prime number". There is a link to the "Primality test"
entry in the "Computational methods" section's introductory paragraphs —
repeat 3 times quickly! — but it's overshadowed by the "Trial division"
subsection which, naturally, links to Wikipedia's "Trial division" entry.

"Trial division" looked nothing at all like the page I remembered and it
took me a while to realize the entry was about *integer factorization*,
not primality testing. I was intrigued by the C++ code and wondered whether
it was vulnerable to overflow. No surprise, it is.

The C++ `TrialDivision()` function was introduced by a single phrase:

> Below is version C++ (without squaring f)

The workings of the function are not documented and are not obvious.
To understand how it works, I had to figure out the meaning of `ac`,
`temp`, and the magic number 8. Here's a version of the function with
meaningful variable names and some meaningful comments:

```cpp
template <class T, class U>
vector<T> TrialDivision(U n)
{
    // The difference between the squares of two consecutive
    // odd factors, f and f+2, is (f+2)^2 - f^2 = 4f + 4.
    // The differences, 4f+4, increase as the factors
    // increase, but the difference between any two
    // consecutive differences, 4f+4 and 4(f+2)+4, is constant:
    //     4(f+2) + 4 - (4f + 4) = 4f + 8 + 4 - 4f - 4 = 8.
    const T DIFF_4FP4 = 8;
    vector<T> v;
    while (n % 2 == 0)          // Check for factors of 2.
    {
        v.push_back(2);
        n /= 2;
    }
    while (n % 3 == 0)          // Check for factors of 3.
    {
        v.push_back(3);
        n /= 3;
    }
    T factor = 3;
    T square = factor * factor;
    T diff = 4*factor + 4;      // Difference between 3^2 and 5^2.
    factor = 5;
    do
    {
        square += diff;         // Overflow possible!!!
        if (square > n) break;  // All factors tested?
        if (n % factor == 0)    // Found a factor?
        {
            v.push_back(factor);
            n /= factor;
            square -= diff;     // Try factor again,
                                // so revert to factor^2,
                                // which might have overflowed!!!
        }
        else                    // Not a factor, so advance
        {
            factor += 2;        // to the next odd number.
            diff += DIFF_4FP4;  // Overflow possible!!!
        }
    } while (1);
    if (n != 1) v.push_back(n);
    return v;
}
```

To avoid multiplying the factor by itself on every factor tested,
the function keeps track of the squares of the odd factors
(`ac`/`square`) and the differences between the squares of consecutive
odd factors (`temp`/`diff`, incremented by 8 — see the `DIFF_4FP4`
comment — on each iteration).

For example, assuming no repeating factors, here are the variable values in the first few DO-WHILE iterations:

```
FACTOR   SQUARE   DIFF
------   ------   ----
[ Implicit previous factor:
     3        9     16   -- 3 is f; 9 is f^2; 16 is 4f+4 or 4(3)+4. ]
     5       25     24   -- 5 is f; 25 is f^2; 24 is 4f+4 or 4(5)+4.
     7       49     32
     9       81     40
    11      121     48
   ...
```

The phrase, "*in* the ... iterations", is purposeful and the table
above is for illustrative purposes. In the actual code, when unsuccessfully
testing consecutive odd factors, `factor` is the only up-to-date
variable at the very *beginning* of the loop; `square` and `diff` lag
behind, still based on the previous factor. For example, at the top of
the loop on the very first iteration, `factor` is the correct 5, but
`square` and `diff` are based on the previous factor, 3, and thus have
the values 9 and 16, respectively. The latter two get updated within
the loop. If a test is successful, the same factor must be tried again;
in this case, the loop preserves the lagging values of `square` and
`diff` and starts the next iteration in the same state as that at the
beginning of the current iteration. For example, if a loop iteration
begins with 11/81/40 and 11 is found to be a factor, the next iteration
also begins with 11/81/40 in order to test 11 again. (The implementation
of the "diff" algorithm in my factorization test and compare program,
**frize**, moves the updating of both `square` and `diff` to the bottom
of the loop, so all 3 variables are up-to-date at the top of the loop,
as seen in the table above.)

There are two sources of possible overflow: *(i)* adding `diff` to
`square` and *(ii)* incrementing `diff` by 8. And the two sources
interact badly. The overflows begin occurring when factoring very large
prime numbers. A prime number has exactly one factor, itself, so the
testing loop will continue until the square root of the prime number is
tested as a factor. To break out of the loop, the factor being tested
must have a square greater than the prime number (`square > n`) and
that's when overflows become possible.

To illustrate, let's use unsigned 32-bit integers. First, some preliminaries:

```
Largest 32-bit value:   4,294,967,295 (0xFFFFFFFF)
Square root of above:   65,535.999...
Integer square root:    65,535 (0xFFFF)
Square of 65,535:       4,294,836,225
```

Note that the real square root of the maximum representable value is a shade under 65,536 and the integer square root is 65,535. The square of the latter is 4,294,836,225, which is 131,070 less than the largest value. The 5,853 prime numbers in that top range cannot be factored without overflow happening because the square of the factor being tested must exceed the prime number. When 65,535 is tested as a factor of the prime number, its square is less than the prime number, so the loop must compute the square of the next factor, 65,537. This square overflows the maximum representable 32-bit value.

As a result, and although the odd factors steadily increase by 2, the
square has suddenly dropped down to a low number and must begin anew the
Sisyphean climb to the top. At this point, `square` has no relation to
`factor` or `diff`.

Now `square` is again less than `n`, so the function keeps testing new
factors even though they're past the square-root mark. Since `diff` was
not reset, `square` will move through the 32-bit integers even faster on
the second pass until it overflows again. The cycle repeats itself over
and over. Meanwhile, `factor` keeps incrementing by 2 and `diff` by 8.
As `diff` gets very large, it begins interacting with even small values
of `square`, so that adding `diff` to `square` causes `square` to
overflow. It gets to the point where `square` overflows on every
iteration of the testing loop. At times, the overflows will produce
descending values of `square`.

We're not done yet. Eventually `diff` reaches 4,294,967,288, seven less
than the maximum representable 32-bit value. When 8 is added to `diff`,
it conveniently overflows to zero. `square` can now take its time
running through the integers before overflowing.

At most, the function should test about 32,767 (64K/2) odd factors for
any 32-bit number. If data type `T` is wider than data type `U` (the
`TrialDivision()` template parameters), the function will not exceed the
test limit (32K if `U` is an unsigned 32-bit integer). Type `T` is used
for internal calculations; type `U` is the type of the number to be
factored. For example, if `T` is 64 bits and `U` is 32 bits, no
overflows will occur. Even declaring `T` as `uint32_t` (32 bits of
precision) and `U` as `int32_t` (31 bits of positive precision) will
prevent overflows. The trade-off is that narrowing type `U` narrows the
range of numbers that can be factored.

When types `T` and `U` are equal in width, overflows are possible and
can lead to pathological behavior of the function. To wit:

```
Largest 32-bit prime:     4,294,967,291 (4 less than UINT32_MAX)
# of odd factors tested:  2,147,483,644 (not including test of 2)
# of SQUARE overflows:    1,073,741,824
# of DIFF overflows:      3
```

In this case, the test loop is terminated when `factor` equals `n`
(i.e., 4,294,967,291) and `n % factor` is zero, not when `square > n`.
Thus, the number of factors tested (2 and odd) is `n/2`. The `diff`
overflows occur when advancing to factors 1G-1, 2G-1, and 3G-1, where G
is in the sense of powers of 1024. The `diff` overflows are expected on
multiples of 1G because the number of odd factors tested in a range of
1G is 1G/2 and `diff` is incremented by 8 on each test, so `diff`
reaches the overflow value at (1G/2)×8 = 4G, one more than UINT32_MAX.
(Because the input prime is so large in the example above, when the loop
terminates, the loop is only several tests shy of a fourth `diff`
overflow.)

The first improvement to `TrialDivision()` was to get rid of the
`temp`/`diff` variable. The formula for the next `square` is easy enough
and fast enough to eliminate the need for, and presence of, a second
source of overflow:

```cpp
template <class T, class U>
vector<T> TrialDivision(U n)
{
    vector<T> v;
    while (n % 2 == 0)          // Check for factors of 2.
    {
        v.push_back(2);
        n /= 2;
    }
    T factor = 3;
    T square = factor * factor;
    while (square <= n)         // Until all factors are tested.
    {
        if (n % factor == 0)    // Found a factor?
        {
            v.push_back(factor);
            n /= factor;        // Leave square alone and
                                // try factor again.
        }
        else                    // Not a factor, so advance
        {                       // to the next odd number.
                                //     (f + 2)^2 =
                                //     f^2 + 4f + 4 =
                                //     f^2 + 4(f + 1)
            factor++;           // Overflow possible next!!!
            square += factor * 4;
            factor++;
        }
    }
    if (n != 1) v.push_back(n);
    return v;
}
```

The code is cleaner now. The next square is calculated simply by adding
`(factor + 1) * 4` to the current square; an optimizing compiler will
replace the multiplication with a left shift of two bits. There is still
the same possibility of `square` overflowing under the same
circumstances. Assuming that overflow causes the value of a variable to
"wrap around", it is easy enough to detect just by comparing the old and
new values of `square`. If the new value is less than the old value,
then overflow occurred.

With this code in hand, an optimization immediately comes to mind.
`TrialDivision()` checks every odd number from 3 on up as a factor.
Borrowing the *6k±1* optimization from primality testing and skipping
odd factors divisible by 3, the time to factor a number is reduced by up
to a third. Doing this also decreases the range of numbers that can be
factored which, again, is attributable to overflow. In the example
presented earlier, the top 5,853 32-bit unsigned prime numbers cannot be
factored without overflow; that's with the algorithm that tests 2 and
all odd factors. The *6k±1* optimization increases that number to the
largest 17,565 primes. (This increase is because the maximum square root
is 65,535, which happens to be divisible by 3. When checking all odd
factors, the algorithm will test factors up to and including 65,535.
With the *6k±1* optimization, the highest factor tested drops back to
65,533 and the next factor tested, as it must be, is 65,537, whose
square overflows. Consequently, the 17,565 primes greater than or equal
to 65,533^2 [4,294,574,089] cannot be factored without the square
overflowing.)

Eliminating the possibility of overflow and restoring the range of
numbers that can be factored is accomplished by eliminating the use of
squares. A "fast integer square root" function can very quickly
calculate the square root (and highest possible factor) of the input
number if the number is (or can be converted to) an unsigned integer
with an even number of bits. My C implementation of `TrialDivision()`
calls `nbrUnsignedSQRT()`, which is in "`nbr_util_roots.c`".

There is an implicit limit on the number of factors tested in the
squares version of `TrialDivision()` that I did not realize at first
when I wrote up the square root version of `TrialDivision()`. Initially,
the *target* number is the original input number. Whenever a factor is
found (including 2s and 3s), the target number is reduced (divided) by
the factor. Consequently, the `square <= n` loop test in the function
above is actually testing two limit conditions:

1. If the original input number is composite, the target number is
reduced by each factor until the target is reduced to 1 after the last
factor. At this point, there is no need to test factors larger than the
last factor, so the loop can terminate. And in the squares version of
`TrialDivision()`, the loop does so because `square > 1`.

2. If the original input number is prime, no factors will be found and
the target will remain equal to the original input number. To ascertain
that there are no factors and that the number is prime, it is necessary
to check all the possible factors up to the square root of the number
(i.e., `square <= n`).

Limit #2 is a hard limit needed for prime numbers. Limit #1 is just an optimization. When the target number is reduced to 1, it is perfectly acceptable, albeit time-consuming, to continue unsuccessfully testing factors up to the square root.

Here's the square root version of `TrialDivision()` that *(i)*
eliminates the possibility of overflow, *(ii)* implements the *6k±1*
optimization, and *(iii)* includes both limit tests:

```cpp
template <class T, class U>
vector<T> TrialDivision(U n)
{
    vector<T> v;
    while (n % 2 == 0)          // Check for factors of 2.
    {
        v.push_back(2);
        n /= 2;
    }
    while (n % 3 == 0)          // Check for factors of 3.
    {
        v.push_back(3);
        n /= 3;
    }
    T factor = 5;
    T root = (T) UnsignedSQRT (n);
    bool byTwo = true;          // Increment factor by 2 or by 4.
    while ((n > 1) &&           // Until n is reduced to 1 ...
           (factor <= root))    // ... or all factors are tested.
    {
        if (n % factor == 0)    // Found a factor?
        {
            v.push_back(factor);
            n /= factor;        // Try factor again.
        }
        else                    // Not a factor.
        {
            if (byTwo)
                factor += 2;    // Advance to next odd factor.
            else
                factor += 4;    // Skip odd factor divisible by 3.
            byTwo = !byTwo;     // Alternate between 2 and 4.
        }
    }
    if (n != 1)
        v.push_back(n);         // N not reduced to 1 means N is prime.
    return v;
}
```

Not being a big fan of C++, I implemented this last version in C and
called it `nbrFactorsOf()`. The function uses a single type for both the
input number and internal calculations, *factor_t*, which is defined in
"`nbr_util.h`". Factors of the input number are stored in an array of
`factor_t`s supplied by the caller. (The maximum number of factors that
may be returned is *w-1*, where *w* is the bit width of `factor_t`. In
other words, the number of factors increases as the magnitudes of the
factors decrease. The smallest factor is 2, so the largest power of 2
representable in `factor_t` will have the maximum number of factors:
2×2×...×2, with *w-1* twos. If `factor_t` is 32 bits wide, the largest
power of 2 is 2^31 (0x80000000) and there can be at most 31 factors;
64 bits will have at most 63 factors.)

Two additional optimizations were added to the C function:

1. A list of the first M prime numbers was added: 2, 3, 5, 7, 11, ...
These numbers are tested as factors at the very beginning of
`nbrFactorsOf()`, thus giving the function a head start by not having to
test non-prime odd factors in the range covered by the list. M is a
compile-time constant, `NBR_EARLY_PRIMES`, and there is the obvious
trade-off between speed and the memory footprint of the list.

2. When a factor is found and the target number is reduced, the square
root of the target number is recalculated.

To save you the trouble of reading the analysis below, here's the short
summary:

- The timings of the different algorithms I implemented were comparable.
- The *6k±1* optimization had the biggest impact on all the algorithms,
reducing their times by one-third.
- Computing the square root at the outset offers a slight improvement in
timing, but the big gain is that there are no overflows and, therefore,
all of the primes which fit into the chosen integer data type can be
factored.

18,446,744,065,119,617,025 is a large, 64-bit number with 10 factors:

3×3 × 5×5 × 17×17 × 257×257 × 65537×65537

LIBGPL function `nbrFactorsOf()` quickly tests the first 512 precomputed
prime numbers as factors, up through 3,671. The function then drops into
the middle of the *6k±1* cycle and, beginning with 3,673, tests odd
numbers not divisible by 3 as factors. Testing the early primes as
factors catches 8 out of the 10 factors of 18,446,744,065,119,617,025,
leaving only 65,537^2 to be factored.

The algorithms implemented in and tested by the **frize** program all
have this early-prime optimization available to them, but, in the timing
runs I present below, the optimization is disabled on the command line.

In the timed tests that follow, **frize** factors the big number above
15,000 times in order to get large enough timing measurements. The command
line used for each run was:

```
% frize -early 0 -time -repeat 15000 18446744065119617025,algo
```

The different algorithms performed as follows on this factorization (the "root" and "root3" figures include the number of fast integer square roots calculated during the factorization):

```
"diff":     21.3 seconds   32,777 factors tested
"square":   20.9 seconds   32,777 factors tested
"square3":  14.1 seconds   21,855 factors tested
"root":     19.6 seconds   32,777 factors tested   10 roots
"root3":    13.4 seconds   21,855 factors tested    8 roots
"nbr":      12.6 seconds   (with 512 early primes)
```

The elapsed times for the "diff", "square", and "root" algorithms are
close, as are the times for the "square3" and "root3" algorithms. The
*6k±1* optimization obviously has the biggest effect since
it reduces the number of potential factors tested by a third.

Testing the first 512 primes, `nbrFactorsOf()` is slightly faster than
"square3" and "root3"; however, it only takes about 130 tests at most
for the other algorithms to reach the 7th and 8th factor, 257. If the
"`-early 0`" option is removed from the command line, **frize** will
test up to 6,542 prime numbers as factors, ranging from 2 to 65,521.
The last number is the largest unsigned 16-bit prime and is 16 less than
the last two factors, 65,537. Consequently, only a few extra odd numbers
have to be tested as factors and all the algorithms factor the input
number 15,000 times in 4 seconds or less! (Except for the "nbr"
algorithm, which only tests 512 early primes anyway.)

18,446,744,047,939,747,781 is the largest prime number that the "square3"
algorithm can factor without 64-bit overflow. The number has only one
factor: itself. It takes about a minute or more to factor the number
once, so the **frize** command line for only one factorization instead of
15,000 is:

```
% frize -early 0 -time 18446744047939747781,algo
```

And the timing results are:

```
"diff":     93.2 seconds   2,147,483,646 factors tested
"square":   91.0 seconds   2,147,483,646 factors tested
"square3":  61.4 seconds   1,431,655,765 factors tested
"root":     85.6 seconds   2,147,483,646 factors tested   1 root
"root3":    58.5 seconds   1,431,655,765 factors tested   1 root
"nbr":      55.0 seconds   (with 512 early primes)
```

Again, the times are comparable among the algorithms, the *6k±1*
optimization had the biggest effect on performance, and the root methods
aren't subject to overflow.