GEONius.com

20-Nov-2024

Integer/Prime Factorization

2021

Performance

Related functions and programs:: NBR_UTIL - integer factorization function; i.e., nbrFactorsOf().; frize - integer factorization test program.

I was recently looking for the Wikipedia entry, "Primality test", that I referenced in the documentation for my nbrIsPrime() function. It's been about 3 years, so I couldn't remember the name of the entry off the top of my head. I searched the internet for "prime wikipedia", which took me to the main page, "Prime number". There is a link to the "Primality test" entry in the "Computational methods" section's introductory paragraphs — repeat 3 times quickly! — but it's overshadowed by the "Trial division" subsection which, naturally, links to Wikipedia's "Trial division" entry.

"Trial division" looked nothing at all like the page I remembered and it took me a while to realize the entry was about integer factorization, not primality testing. I was intrigued by the C++ code and wondered if it is vulnerable to possible overflows. No surprise, it is.

I wrote this page in 2021. In June 2024, the C++ code was removed from the Wikipedia entry. This archived copy of the Wikipedia page includes the C++ code that I talk about below.

The Algorithm

The C++ TrialDivision() function was introduced by a single phrase:

Below is version C++ (without squaring f)

The workings of the function are not documented and are not obvious. To figure out how it works, I had to figure out the meaning of ac, temp, and the magic number 8. Here's a version of the function with meaningful variable names and some meaningful comments:

    template <class T, class U>
    vector<T> TrialDivision(U n)
    {
            // The difference between the squares of two consecutive
            // odd factors, f and f+2, is (f+2)^2 - f^2 = 4f + 4.
            // The differences, 4f+4, increase as the factors
            // increase, but the difference between any two
            // consecutive differences, 4f+4 and 4(f+2)+4, is constant:
            // 4(f+2) + 4 - (4f + 4) = 4f + 8 + 4 - 4f - 4 = 8.
        const T DIFF_4FP4 = 8;
        vector<T> v;
        while (n % 2 == 0)          // Check for factors of 2.
            { v.push_back(2); n /= 2; }
        while (n % 3 == 0)          // Check for factors of 3.
            { v.push_back(3); n /= 3; }
        T factor = 3;
        T square = factor * factor;
        T diff = 4*factor + 4;      // Difference between 3^2 and 5^2.
        factor = 5;
        do {
            square += diff;         // Overflow possible!!!
            if (square > n) break;  // All factors tested?
            if (n % factor == 0) {  // Found a factor?
                v.push_back(factor);
                n /= factor;
                square -= diff;     // Try factor again,
                                    // so revert to factor^2,
                                    // which might have overflowed!!!
            }
            else {                  // Not a factor, so advance
                factor += 2;        // to the next odd number.
                diff += DIFF_4FP4;  // Overflow possible!!!
            }
        } while (1);
        if (n != 1) v.push_back(n);
        return v;
    }

To avoid multiplying the factor by itself on every factor tested, the function keeps track of the squares of the odd factors (ac/square) and the differences between the squares of consecutive odd factors (temp/diff, incremented by 8 — see the DIFF_4FP4 comment — on each iteration).

For example, assuming no repeating factors, here are the variable values in the first few DO-WHILE iterations:

FACTOR    SQUARE    DIFF
------    ------    ----
[Implicit previous factor:
   3         9       16    -- 3 is f; 9 is f^2; 16 is 4f+4 or 4(3)+4.
]
   5        25       24    -- 5 is f; 25 is f^2; 24 is 4f+4 or 4(5)+4.
   7        49       32
   9        81       40
  11       121       48
 ...

The phrase, "in the ... iterations", is purposeful and the table above is for illustrative purposes. In the actual code, when unsuccessfully testing consecutive odd factors, factor is the only up-to-date variable at the very beginning of the loop; square and diff lag behind, still based on the previous factor. For example, at the top of the loop on the very first iteration, factor is the correct 5, but square and diff are based on the previous factor, 3, and thus have the values 9 and 16, respectively. The latter two get updated within the loop. If a test is successful, the same factor must be tried again; in this case, the loop preserves the lagging values of square and diff and starts the next iteration in the same state as that at the beginning of the current iteration. For example, if a loop begins with 11/81/40 and 11 is found to be a factor, the next iteration also begins with 11/81/40 in order to test 11 again. (The implementation of the "diff" algorithm in my factorization test and compare program, frize, moves the updating of both square and diff to the bottom of the loop, so all 3 variables are up-to-date at the top of the loop, as seen in the table above.)

Overflows

There are two sources of possible overflow: (i) adding diff to square and (ii) incrementing diff by 8. And the two sources interact badly. The overflows begin occurring when factoring very large prime numbers. A prime number has exactly one factor, itself, so the testing loop will continue until the square root of the prime number is tested as a factor. To break out of the loop, the factor being tested must have a square greater than the prime number (square > n) and that's when overflows become possible.

To illustrate, let's use unsigned 32-bit integers. First, some preliminaries:

Largest 32-bit value:	4,294,967,295 (0xFFFFFFFF)
Square root of above:	       65,535.999...
Integer square root:	       65,535 (0xFFFF)
Square of 65,535:	4,294,836,225

Note that the real square root of the maximum representable value is a shade under 65,536 and the integer square root is 65,535. The square of the latter is 4,294,836,225, which is 131,070 less than the largest value. The 5,853 prime numbers in that top range cannot be factored without overflow happening because the square of the factor being tested must exceed the prime number. When 65,535 is tested as a factor of the prime number, its square is less than the prime number, so the loop must compute the square of the next factor, 65,537. This square overflows the maximum representable 32-bit value.

As a result and although the odd factors steadily increase by 2, the square has suddenly dropped down to a low number and must begin anew the Sisyphean climb to the top. Square now has no relation to factor or diff.

Square is again less than n, so the function keeps testing new factors even though they're past the square root mark. Since diff was not reset, square will move through the 32-bit integers even faster on the second pass until it overflows again. The cycle will repeat itself over and over. Meanwhile, factor keeps incrementing by 2 and diff by 8. As diff gets very large, it begins interacting with even small values of square, so that adding diff to square causes square to overflow. It gets to the point where square is overflowing on every iteration of the testing loop. At times, the overflows will produce descending values of square.

We're not done yet. Eventually diff reaches 4,294,967,288, seven less than the maximum representable 32-bit value. When 8 is added to diff, diff conveniently overflows to zero. Square can now take its time running through the integers before overflowing.

At most, the function should test about 32,767 (64K/2) odd factors for any 32-bit number. If data type T is wider than data type U (the TrialDivision() template parameters), the function will not exceed the test limit (32K if U is an unsigned 32-bit integer). Type T is used for internal calculations; type U is the type of the number to be factored. For example, if T is 64 bits and U is 32 bits, no overflows will occur. Even declaring T as uint32_t (32 bits of precision) and U as int32_t (31 bits of positive precision) will prevent overflows. The trade-off is that narrowing type U narrows the range of numbers that can be factored.

When types T and U are equal in width, overflows are possible and can lead to pathological behavior of the function. To wit:

Largest 32-bit prime:      4,294,967,291 (4 less than UINT32_MAX)
# of odd factors tested:   2,147,483,644 (not including test of 2)
# of SQUARE overflows:     1,073,741,824
# of DIFF overflows:       3

In this case, the test loop is terminated when factor equals n (i.e., 4,294,967,291) and factor % n is zero, not when square > n. Thus, the number of factors tested (2 and odd) is n/2. Diff overflows occur when advancing to factors 1G-1, 2G-1, and 3G-1, where G is in the sense of powers of 1024. The diff overflows are expected on multiples of 1G because the number of odd factors tested in a range of 1G is 1G/2 and diff is incremented by 8 on each test, so diff reaches the overflow value at (1G/2)×8 = 4G, one more than UINT32_MAX. (Because the input prime is so large in the example above, when the loop terminates, the loop is only several tests shy of a fourth diff overflow.)

Looking Good

The first improvement to TrialDivision() was to get rid of the temp/diff variable. The formula for the next square is easy enough and fast enough to eliminate the need for and presence of a second source of overflow:

    template <class T, class U>
    vector<T> TrialDivision(U n)
    {
        vector<T> v;
        while (n % 2 == 0)          // Check for factors of 2.
            { v.push_back(2); n /= 2; }
        T factor = 3;
        T square = factor * factor;
        while (square <= n) {       // Until all factors are tested.
            if (n % factor == 0) {  // Found a factor?
                v.push_back(factor);
                n /= factor;        // Leave square alone and
                                    // try factor again.
            }
            else {                  // Not a factor, so advance
                                    // to the next odd number.
                    // (f + 2)^2 =
                    // f^2 + 4f + 4 =
                    // f^2 + 4(f + 1)
                factor++ ;          // Overflow possible next!!!
                square += factor * 4 ;
                factor++ ;
            }
        }
        if (n != 1) v.push_back(n);
        return v;
    }

The code is cleaner now. The next square is calculated simply by adding (factor + 1) * 4 to the current square; an optimizing compiler will replace the multiplicaton with a left shift of two bits. There is still the same possibility of square overflowing under the same circumstances. Assuming that overflow causes the value of a variable to "wrap around", it is easy enough to detect just by comparing the old and new values of square. If the new value is less than the old value, then overflow occurred.

An Optimization

With this code in hand, an optimization immediately comes to mind. TrialDivision() checks every odd number from 3 on up as a factor. Borrowing the 6k±1 optimization from primality testing and skipping odd factors divisible by 3, the time to factor a number is reduced by up to a third. Doing this also decreases the range of numbers that can be factored which, again, is attributable to overflow. In the example presented earlier, the top 5,853 32-bit unsigned prime numbers cannot be factored without overflow; that's with the algorithm that tests 2 and all odd factors. The 6k±1 optimization increases that number to the largest 17,565 primes. (This increase is because the maximum square root is 65,535, which happens to be divisible by 3. When checking all odd factors, the algorithm will test factors up to and including 65,535. With the 6k±1 optimization, the highest factor tested drops back to 65,533 and the next factor tested, as it must be, is 65,537, whose square overflows. Consequently, the 17,565 primes greater than or equal to 65,533² [4,294,574,089] cannot be factored without the square overflowing.)

Roots, Not Squares

Eliminating the possibility of overflow and restoring the range of numbers that can be factored is acomplished by eliminating the use of squares. A fast integer square root function can very quickly calculate the square root (and highest possible factor) of the input number if the number is (or can be converted to) an unsigned integer with an even number of bits. My C implementation of TrialDivision() calls nbrUnsignedSQRT(), which is in "nbr_util_roots.c".

There is an implicit limit on the number of factors tested in the squares version of TrialDivision() that I did not realize at first when I wrote up the square root version of TrialDivision(). Initially, the target number is the original input number. Whenever a factor is found (including 2s and 3s), the target number is reduced (divided) by the factor. Consequently, the square <= n loop test in the function above is actually testing two limit conditions:

If the original input number is composite, the target number is reduced by each factor until the target is reduced to 1 after the last factor. At this point, there is no need to test factors larger than the last factor, so the loop can terminate. And in the squares version of TrialDivision(), the loop does so because square > 1.
If the original input number is prime, no factors will be found and the target will remain equal to the original input number. To ascertain that there are no factors and that the number is prime, it is necessary to check all the possible factors up to the square root of the number (i.e., square <= n).

Limit #2 is a hard limit needed for prime numbers. Limit #1 is just an optimization. When the target number is reduced to 1, it is perfectly acceptable, albeit time-consuming, to continue unsuccessfully testing factors up to the square root.

Here's the square root version of TrialDivision() that (i) eliminates the possibility of overflow, (ii) implements the 6k±1 optimization, and (iii) includes both limit tests:

    template <class T, class U>
    vector<T> TrialDivision(U n)
    {
        vector<T> v;
        while (n % 2 == 0)          // Check for factors of 2.
            { v.push_back(2); n /= 2; }
        while (n % 3 == 0)          // Check for factors of 3.
            { v.push_back(3); n /= 3; }
        T factor = 5;
        T root = (T) UnsignedSQRT (n);
        bool byTwo = true;          // Increment factor by 2 or by 4.
        while ((n > 1) &&           // Until n is reduced to 1 ...
            (factor <= root)) {  // ... or all factors are tested.
            if (n % factor == 0) {  // Found a factor?
                v.push_back(factor);
                n /= factor;        // Try factor again.
            }
            else {                  // Not a factor.
                if (byTwo)
                    factor += 2;    // Advance to next odd factor.
                else
                    factor += 4;    // Skip odd factor divisible by 3.
                byTwo = !byTwo;     // Alternate between 2 and 4.
            }
        }
        if (n != 1) v.push_back(n); // N not reduced to 1 means N is prime.
        return v;
    }

Not being a big fan of C++, I implemented this last version in C and called it nbrFactorsOf(). The function uses a single type for both the input number and internal calculations, factor_t, which is defined in "nbr_util.h". Factors of the input number are stored in an array of factor_ts supplied by the caller. (The maximum number of factors that may be returned is w-1, where w is the bit width of factor_t. In other words, the number of factors increases as the magnitudes of the factors decrease. The smallest factor is 2, so the largest power of 2 representable in factor_t will have the maximum number of factors: 2₁×2₂×...×2_w-1. If factor_t is 32-bits wide, the largest power of 2 is 2³¹ (0x80000000) and there can be at most 31 factors; 64 bits will have at most 63 factors.

Two additional optimizations were added to the C function:

A list of the first M prime numbers was added: 2, 3, 5, 7, 11, ... These numbers are tested as factors at the very beginning of nbrFactorsOf(), thus giving the function a headstart by not having to test non-prime odd factors in the range covered by the list. M is a compile-time constant, NBR_EARLY_PRIMES, and there is the obvious trade-off between speed and the memory footprint of the list.
When a factor is found and the target number is reduced, the square root of the target number is recalculated.

Performance

To save you the trouble of reading the analysis below, here's the short summary:

The timings of the different algorithms I implemented were comparable.
The 6k±1 optimization had the biggest impact on all the algorithms, reducing their times by one-third.
Computing the square root at the outset offers a slight improvement in timing, but the big gain is that there are no overflows and, therefore, all of the primes which fit into the chosen integer data type can be factored.

18,446,744,065,119,617,025 is a large, 64-bit number with 10 factors:

3×3 × 5×5 × 17×17 × 257×257 × 65537×65537

LIBGPL function nbrFactorsOf() quickly tests the first 512, precomputed prime numbers as factors, up through 3,671. The function then drops into the middle of the 6k±1 cycle and begins with 3,673 testing odd numbers not divisible by 3 as factors. Testing the early primes as factors catches 8 out of the 10 factors of 18,446,744,065,119,617,025, leaving only 65,537² to be factored.

The algorithms implemented in and tested by the frize program all have this early-prime optimization available to them, but, in the timing runs I present below, the optimization is disabled on the command line.

In the timed tests that follow, frize factors the big number above 15,000 times in order to get large enough timing measurements. The command line used for each run was:

    % frize -early 0 -time -repeat 15000 18446744065119617025,algo

The different algorithms performed as follows on this factorization (the "root" and "root3" figures include the number of fast integer square roots calculated during the factorization):

"diff":      21.3 seconds   32,777 factors tested
"square":    20.9 seconds   32,777 factors tested
"square3":   14.1 seconds   21,855 factors tested
"root":      19.6 seconds   32,777 factors tested   10 roots
"root3":     13.4 seconds   21,855 factors tested    8 roots

"nbr":       12.6 seconds (with 512 early primes)

The elapsed times for the "diff", "square", and "root" algorithms are close, as are the times for the "square3" and "root3" algorithms. The 6k±1 optimization obviously has the biggest effect since it reduces the number of potential factors tested by a third.

Testing the first 512 primes, nbrFactorsOf() is slightly faster than "square3" and "root3"; however, it only takes about 130 tests at most for the other algorithms to reach the 7th and 8th factor, 257. If the "-early 0" option is removed from the command line, frize will test up to 6,542 prime numbers as factors, ranging from 2 to 65,521. The last number is the largest, unsigned 16-bit prime and is 16 less than the last two factors, 65,537. Consequently, only a few extra odd numbers have to be tested as factors and all the algorithms factor the input number 15,000 times in 4 seconds or less! (Except for the "nbr" algorithm which only tests 512 early primes anyway.)

18,446,744,047,939,747,781 is the largest prime number that the "square3" algorithm can factor without 64-bit overflow. The number has only one factor: itself. It takes about a minute or more to factor the number once, so the frize command line for only one factorization instead of 15,000 is:

    % frize -early 0 -time 18446744047939747781,algo

And the timing results are:

"diff":      93.2 seconds   2,147,483,646 factors tested
"square":    91.0 seconds   2,147,483,646 factors tested
"square3":   61.4 seconds   1,431,655,765 factors tested
"root":      85.6 seconds   2,147,483,646 factors tested   1 root
"root3":     58.5 seconds   1,431,655,765 factors tested   1 root

"nbr":       55.0 seconds (with 512 early primes)

Again, the times are comparable among the algorithms, the 6k±1 optimization had the biggest effect on performance, and the root methods aren't subject to overflow.

Alex Measday / E-mail