GEONius.com 11-Aug-2019

# Fast Integer Square Roots

### Understanding Ross Fosler's Assembly Code

Related functions and programs:
NBR_UTIL - prime number and integer square root functions; e.g., `nbrUnsignedSQRT()`.
squint - integer square root algorithms test program.

## Introduction

NOTE: The algorithm works and achieves its speed by examining and manipulating bits in the standard, binary representation of unsigned integers. In the rare case of a target CPU using a different encoding or representation of integers, then another algorithm must be found. Examples of different representations include the various forms of binary-coded decimal and Gray code numbers. I vaguely recall, from the 1970s or 1980s, attempts to create 4-level logic elements that would have made possible quarternary representation of numbers.

Function `nbrUnsignedSQRT()` quickly computes the floor of the square root of an unsigned integer: floor[√n]; in other words, the largest integer equal to or less than the number's real square root. In the file prolog, I give the example of 34, whose real square root is about 5.8, but `nbrUnsignedSQRT()` returns 5. (The function returns a square root of 5 for 25, 26, ..., 34, and 35, only stepping up to a root of 6 for numbers 36 through 48.)

A widely cited algorithm developed by Ross M. Fosler is used:

"Fast Integer Square Root" by Ross M. Fosler
Application Note TB040, DS91040A, © 2000 Microchip Technology Inc.

combined with an ingenious tweak from Tristan Muntsinger:

http://www.codecodex.com/wiki/Calculate_an_integer_square_root#C (fourth code window under C)

## The Algorithm

NOTE: I found Ross Fosler's flowchart confusing and I wasn't familiar with the PIC18 instruction set used in his code, so I reverse-engineered the algorithm from Example 1 of his application note and I implemented that algorithm in C. The reverse-engineered algorithm is described in the following paragraphs. I later came across documentation for the PIC18 instruction set and I was able to study his assembly language code. I was surprised to find that Fosler's algorithm was slightly different than "my" algorithm. Specifically, he handled the "...10..." to "...01..." transformation (or "Shift bit right" as it's called in the application note) in an unusual, functionally equivalent way which might have been more advantageous, performance-wise, on the PIC controller. So the description below applies equally to his and my variations on his algorithm, including the limitations and risks of overflow I bring up. (I discuss the assembly language algorithm in further detail down at the end when I introduce Tristan Muntsinger's optimization.)

The algorithm begins with an initial estimate of the square root of the target number. The estimate is based on the integer square root of the maximum representable value for the unsigned integer type. Assume this type is m bits wide (where m is even).

• The maximum representable value — let's call it UM_MAX — is the value when all m bits are set. (C's ULONG_MAX is used in the code.)

• The square root of UM_MAX — let's call it UM_MAX_SQRT — is the value when all the bits are set in the least significant half of the integer type; i.e., the m/2 least significant bits are all set.

• Take the most significant 1 bit in UM_MAX_SQRT and set the corresponding bit in the estimated square root. This is bit (m/2)-1 (zero-based bit numbering) and the value of this initial estimate is (UM_MAX_SQRT+1)/2.

The algorithm then exams progressively less significant bits in the intermediate square root, bouncing above and below the actual square root on its way as bits are set or cleared, until the algorithm finally zeroes in (pun sort of intended) on the absolute least significant bit (bit 0) of the actual integer square root.

For example, let unsigned integers be 16 bits wide. UM_MAX is 0xFFFF (65,535 decimal); UM_MAX_SQRT is then the lower half of the word with all bits set: 0x00FF (255 decimal). Taking the most significant 1 bit (bit 16/2-1 = 7) from UM_MAX_SQRT gives an initial estimate of 0x0080 (128, or half of 256). In binary, the algorithm begins with an estimated square root of:

`0000 0000 1000 0000`

The adjacent 0 bit (bit 6) to the right of the 1 bit is examined first. If the square of the intermediate root (0x0080) is less than the target number (i.e., the intermediate root is less than the actual root), then set that bit in order to increase the value (0x00C0, 192 decimal) of the intermediate root, possibly overshooting the actual root:

`0000 0000 1100 0000`

For illustrative purposes, let's assume that was the case. Now move to the next 0 bit, bit 5. If the intermediate root is less than the actual root, the algorithm would set bit 5 (0x00E0 — 3 bits set) and loop again. (I'm using shorthand here for what is actually a comparison of the square of the intermediate root and the target number.) Ross Fosler calls setting a bit under this condition "Start new bit" in his flowchart.

If, however, the intermediate root is greater than the actual root, lower the intermediate root by clearing the previous bit (bit 6) and setting the current bit (bit 5), giving a value of 0x00A0 and possibly undershooting the actual root:

`0000 0000 1010 0000`

The pattern here is that if the previous and current bits are 10 (binary) and the intermediate root is too large, then substitute 01 (binary) for the two bits. This is essentially picking the value, 01, halfway between 00 (tested when the previous bit was examined) and 10 (the result of the previous bit being examined). Ross Fosler calls this "...10..." to "...01..." transformation "Shift bit right" in his flowchart.

Perhaps more clearly, let the number in curly braces be the number of most significant bits of the root determined so far. Following the example above, the initial estimate is:

`estimate{0} = 0000 0000 1000 0000`

The estimate is less than the actual square root, so the most significant bit has been determined and we next try setting the adjacent bit:

`estimate{1} = 0000 0000 1100 0000`

(Although the first two bits are set, it's only `estimate{1}`. We know the left bit is correct, but the right bit is, as yet, only a trial bit.) This estimate is greater than the square root and we now know that the root must fall between the previous estimate and the current estimate:

`estimate{0} < root < estimate{1}`

So we pick a new estimate halfway between the two estimates. This requires clearing the bit we set in `estimate{1}`, so we know for certain the two most significant bits ("10" instead of "11") of the root and we add a third trial bit ("1"):

 ```estimate{2} = (estimate{0} + estimate{1}) / 2 = 0000 0000 1010 0000 ```

Although my example happened to fall at the very beginning of the estimates, the same logic is applied at an arbitrary bit location in the estimates:

`estimate{b+2} = (estimate{b} + estimate{b+1}) / 2`

Finally, it's time to consider how the algorithm terminates. If an intermediate root is equal to the actual square root, the algorithm can exit the loop immediately and return the square root to the calling program. Otherwise, the algorithm will examine all the bits in the root down to the least significant bit. A bit in a bit mask keeps track of the current bit under examination; the bit is shifted right on each iteration of the algorithm. After the algorithm examines the least significant bit of the root, the tracking bit is shifted completely out of the bit mask, leaving the bit mask equal to zero and terminating the algorithm.

Note that the "...10..." to "...01..." transformation can take place when the very first estimate, the initial estimate, is greater than the square root. In the 16-bit example above, the initial estimate can be shifted right bit by bit until a value less than or equal to the root is found. Suppose we're finding the square root of 529, which is 23 (0x17, 10111 binary). The square root will be refined as follows at each step:

 ```Precomputed initial estimate is 128 0000 0000 1000 0000 Estimate 128 > actual root 23, Shift bit right 0000 0000 0100 0000 Estimate 64 > actual root 23, Shift bit right 0000 0000 0010 0000 Estimate 32 > actual root 23, Shift bit right 0000 0000 0001 0000 Estimate 16 < actual root 23, Start new bit 0000 0000 0001 1000 Estimate 24 > actual root 23, Shift bit right 0000 0000 0001 0100 Estimate 20 < actual root 23, Start new bit 0000 0000 0001 0110 Estimate 22 < actual root 23, Start new bit 0000 0000 0001 0111 Estimate 23 == actual root 23, return! ```

## Risk of Overflow

When the width, m, of an unsigned integer is even, only the lower half of the integer is manipulated and the maximum value of that lower half is UM_MAX_SQRT — all m/2 bits set. Consequently, there is no danger of overflow in the loop if the algorithm computes the square of an intermediate root for the purpose of comparing it to the target number; e.g., "(root × root) < number".

However, if m is odd, the square root of the maximum representable value, UM_MAX, occupies (m+1)/2 bits and not all bits in that lower "half" are set. Consequently, computing "root × root" could overflow. For example, imagine a CPU that represents all integers, signed and unsigned, as 16-bit, signed-magnitude numbers where the most significant bit is the sign. The precision of integers is effectively 15 bits and:

```UM_MAX      = 0x7FFF (32767 decimal)    0111 1111 1111 1111
UM_MAX_SQRT = 0x00B5 (181 decimal)      0000 0000 1011 0101```

The most significant bit of UM_MAX_SQRT is, as with a full 16-bit word, bit 7:

`0000 0000 1000 0000`

This is the initial estimate (128) of the square root and, if the estimate is less than the actual root (i.e., the target number is between 129 and UM_MAX_SQRT), the algorithm will set bit 6 (0x00C0, 192 decimal):

`0000 0000 1100 0000`

192 is greater than UM_MAX_SQRT. In the next iteration for bit 5, multiplying 192 times 192 will overflow UM_MAX. My code avoids this by comparing the intermediate root to the target number divided by the intermediate root; e.g., the less-than comparison is transformed thus, which removes the possibility of overflow:

`(root * root) < number    ==>    root < (number / root)`

(I didn't need to look as far afield as signed-magnitude numbers! I later came across someone else's function that computes the square root of a signed, two's-complement, 32-bit integer. The effective precision in this case is 31 bits and, when multiplying "root × root", the function would regard a negative result as a sign of overflow.)

## Differences

My implementation of the algorithm differs from Ross Fosler's in the following ways:

• Fosler's and others' integer square root functions hard-code the initial estimate. Consequently, there may be separate functions for computing the square root of, for example, 16- and 32-bit numbers, each function with its own, appropiate, hard-coded initial estimate. Separate functions were necessary in Fosler's case because he was writing in assembly language for a specific class of microcontrollers and the handling of 32-bit numbers at that level is more involved than the handling of 16-bit numbers.

My C function accepts a generic, "unsigned long" integer and returns the unsigned long square root of the integer. On the first call to `nbrUnsignedSQRT()`, the function precomputes and caches (i) how many bits wide an unsigned long is and (ii) the initial estimate, which has a single bit set corresponding to the most significant bit of the maximum square root. Thus, `nbrUnsignedSQRT()` will work whatever the size of an unsigned long in the target compiler/CPU environment.

The precomputed parameters are normally stored internally in `nbrUnsignedSQRT()`. Tasks running in a shared address space can each allocate their own memory for the parameters; see `nbrUnsignedSQRT()`'s context argument.

• Because of the danger of overflow when using integers with an odd number of bits, I at first used the non-overflowing "root relop number/root" comparisons in `nbrUnsignedSQRT()`. As Ross Fosler knew before me, division operations are expensive, even on an Intel-based Linux PC. I mean, I knew that division operations are expensive, but I'll admit I was surprised when I saw how much of an impact the division operations had on `nbrUnsignedSQRT()`'s performance.

To increase the performance in the more common case, my function chooses the correct form of comparisons based on the bit width of unsigned longs. For even-precision unsigned integers, `nbrUnsignedSQRT()` can safely use these fast comparisons:

`root*root relop number`

without fear of overflow. For odd-precision unsigned integers, `nbrUnsignedSQRT()` reverts to the slow comparisons:

`root relop number/root`
• Ross Fosler's algorithm calculates the nearest integer square root of a number when floor[√n] is even. For example,

 ```sqrt(15450) = 124.3 integer root is 124 sqrt(15475) = 124.4 integer root is 124 sqrt(15500) = 124.5 integer root is 125 sqrt(15525) = 124.6 integer root is 125 ```

This makes sense since the least significant bit of an even root is "0" and thus offers a place for the algorithm to "start a new bit" right before shifting the bit-under-examination bit out of its variable and, since that bit mask is now zero, terminating the algorithm.

Odd roots already have a least significant bit of "1", so there is no way to "start a new bit" and incrementing the bit to round up the root would propagate changes back through the more significant bits. For example,

 ``` sqrt(81) = 9 integer root is 9 sqrt(86) = 9.3 integer root is 9 sqrt(89) = 9.4 integer root is 9 sqrt(91) = 9.5 integer root is 9 sqrt(92) = 9.6 integer root is 9 sqrt(99) = 9.95 integer root is 9 sqrt(100) = 10 integer root is 10 ```

I wanted `nbrUnsignedSQRT()` to return the largest integer less than or equal to the actual square root — floor[√n]. My function originally performed an extra "iteration" that transformed the least significant bit and a virtual first fractional bit from "1.0" to "0.1", which resulted in the desired behavior.

Tristan Muntsinger's optimization moves the shift and test of the bit-under-examination mask above/before the "start new bit" operation, so his algorithm terminates before it can "start a new bit" in the least-significant bit location. Since his algorithm always returns floor[√n], when I incorporated his optimization into my code, the extra iteration was no longer necessary.

As an addendum to the differences above, I show here how my code evolved with better understanding and with exposure to others' algorithms.

Using Tristan Muntsinger's shorter and cleaner variable names, here's my original interpretation of Ross Fosler's algorithm, based on his example:

```    -- Variable names: g = root, c = bitmask, n = number

g = 1                   -- Find most significant bit of square root.
while (g < n/g)         -- (Instead of hard-coding 0x8000.)
g <<= 1
g >>= 1

c = g >> 1              -- Refine value of square root.
while (c != 0) {
if (g > n/g)        -- Intermediate root too high?
g ^= c << 1     -- Begin changing "...10..." to "...01..."
else if (g == n/g)  -- Final square root?
c = 0           -- Exit loop.
g |= c              -- Insert 1-bit if root too high or low.
c >>= 1
}

if (g > n/g)  g--       -- Ensure root is floor[√n].

return (g)
```

A little loose and slow. Here's Ross Fosler's actual, much tighter algorithm, gleaned from his code:

```    -- Variable names: g = RES, c = BITLOC, n = ARGA, save = TEMP

save = 0
g = c = 0x8000
do {
if (g*g > n)        -- Intermediate root too high?
g = save        -- Get last guess with current bit cleared.
else
save = g        -- Save current guess before setting next bit.
c >>= 1
g |= c              -- Insert 1-bit.
} while (c != 0)

return (g)              -- Final root may not be floor[√n].
```

ASIDE: There are no PIC18 arithmetic shift instructions, only rotate instructions, with or without the carry flag. Fosler's `Sqrt16()` function computes the 8-bit root of a 16-bit number. The location bit, c/BITLOC, is shifted right with a rotate right without the carry flag. The algorithm terminates not when c/BITLOC is zero, but when the location bit circles around from bit 0 to bit 7 of c/BITLOC! His `Sqrt32()` function already has to use the rotate right with carry instruction to shift the two-byte c/BITLOC, so the algorithm terminates after the location bit rotates out of bit 0 of c/BITLOC into the carry flag. Testing bit 7 in `Sqrt16()` and the carry flag in `Sqrt32()` are semantically equivalent to testing if c/BITLOC is zero.

When the intermediate root is too high, my code handled the "...10..." to "...01..." transition by clearing the previous bit and setting the current bit. Ross Fosler's code retrieved the last good value (in which the current bit was zero) and set the next bit. (NOTE that my code's "current bit" is in advance of Fosler's and Muntsinger's current bits; i.e., my current bit is the 0-bit in "...10..." while Fosler's and Muntsinger's current bit is the immediately preceding 1-bit. This happens when you try to deduce an algorithm from an example!)

Tristan Muntsinger's ingenious tweak to the general algorithm eliminates the need for my previous-bit fiddling or Fosler's last-good-value variable by moving the test for the final square root (last bit shifted out of bitmask c, leaving it zero),

```        c >>= 1          -- Fragment of Fosler's code.
g |= c
} while (c != 0)
```

up before the insertion of the 1-bit:

```    c >>= 1              -- Fragment of Muntsinger's code.
if (c == 0)  return (g)
g |= c
```

Here's his algorithm in full:

```    g = c = 0x8000
for ( ; ; ) {
if (g*g > n)  g ^= c
c >>= 1
if (c == 0)  return (g)
g |= c
}
```

Muntsigner's algorithm also ensures that floor[√n] is returned. Ultimately, I adopted his algorithm, adapting it for whatever size, fixed-width, unsigned long integers supported by the platform. The characteristics of the platform are computed on the first call to `nbrUnsignedSQRT()` and cached for subsequent use. The characteristics are (i) the precision, odd or even, of unsigned longs and (ii) the most significant bit of the maximum square root, √ULONG_MAX.

My final algorithm, with support for odd-precision numbers:

```                                       -- e.g., 0x80000000 (64-bit).
g = c = cached-most-significant-bit
for ( ; ; ) {
if (cached-odd-precision-flag) {
if (g < n/g)  g ^= c    -- Odd precision requires division.
} else if (g*g < n) {       -- Even precision allows multiplication.
g ^= c
}
c <<= 1
if (c == 0)  return (g)
g |= c
}
```

## Performance

Ross Fosler's algorithm is a straightforward (now that he came up with it for us!) and efficient means of computing the integer square root of a number. The algorithm is fast (i) when compared to an adaptation of the Newton-Raphson method to integers and (ii) because it avoids division operations, which the Newton-Raphson method would require. The latter was especially important because division was a slow operation on the microprocessor he was targeting.

(I don't know if Fosler invented the algorithm or if the algorithm is independently discovered by people who put their mind to it. His application note, however, is widely cited on the internet and in some academic papers.)

As I mentioned in the Differences section, my use of division to avoid overflow in root to number comparisons, "root relop number/root", was very slow and switching to "root×root relop number" for even-precision integers produced a big performance boost. The former, division-based comparisons are, of necessity, still used for odd-precision integers, so computing the square root on those platforms will be slow.

I benchmarked different algorithms using my integer square root test program, squint. The following algorithms, identified by their name on the test program's command line, can be tested:

nbr - the default `nbrUnsignedSQRT()` in the separate LIBGPL library.

alternate - an alternate function used to test changes that will, if they work, eventually be incorporated into `nbrUnsignedSQRT()`.

crenshaw - is Jack Crenshaw's integer square root function. His original column, "Integer Square Roots", appeared in Embedded Systems Programming, February 1, 1998. The web page is missing the images and code listings, however the code listings can be found at https://cms.edn.com/uploads/SourceCode/cren98_21.txt.

The algorithm is also explained in identical or more detail in chapter 4 of his book, Math Toolkit for Real-Time Development, later renamed Math Toolkit for Real-Time Programming. The code I used is the one presented in Listing 4 of his column and Listing 4.7 of his book. (If you know how to use Google, you can find chapter 4 online at Google books.)

martin - is Martin Buchanan's function, which bears some resemblance to Crenshaw's function. I used his 3rd integer square root function found at the Code Codex:

(The 3rd function is identical to his second function, but with different variable names.)

ross - is my original interpretation of Ross Fosler's algorithm.

tristan - is Tristan Muntsinger's integer square root function, which is a variation of Ross Fosler's algorithm. Martin Buchanan included this function on the same Code Codex page:

The "crenshaw", "martin", and "tristan" functions were modified to work at different precisions and were tested using 64-bit unsigned longs. All the functions use the same, one-time determination of precision as `nbrUnsignedSQRT()`. I don't yet understand how they work, so my versions of Crenshaw's and Martin's functions bail out when faced with odd-precision integers.

Since no changes are being contemplated, the "alternate" function is currently identical to `nbrUnsignedSQRT()`. And Tristan Muntsinger's function is nearly identical to the first two. Consequently, we should expect similar timing numbers.

Instead, I was met with very odd timing. My testing platform was an old PC running 64-bit Fedora Core 15, GCC 4.6.3 20120306, and with an "AMD Athlon(tm) 64 X2 Dual Core Processor 3800+" CPU. (I also did some tests on a 2015 or 2016 laptop running Linux Mint Mate 17 or 18 and on an Android 5.1 tablet with a 32-bit app.) I won't go into the details of the anomalies I saw except to mention the oddest one.

ASIDE: While researching this problem, I came across this lengthy StackOverflow discussion of the x86 `popcnt` instruction and pipelining. The title, "Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations", is innocently misleading because the observed performance deviations are not caused by the loop counter. It's scary that you can't write "deterministic" code that has consistent performance. The performance of your code is dependent, no pun intended, on what's in the pipeline, which version of a processor you're using, hidden dependencies in the instruction set, the compiler writers, etc. As with division operations being expensive, I know all these things, but this discussion really made plain to me how insidious they can be.

NOTE: To avoid confusion, the following paragraph purposefully speaks of my square root test program, squint. At the time of the testing described in the paragraph, I was actually testing both prime number algorithms and square root algorithms with a single program, prime. As hinted at in the succeeding paragraph, I eventually split prime into two, more sensibly focused programs: primal and squint.

LIBGPL's `nbrUnsignedSQRT()` is identical to squint's "alternate" function, both in the C code and in the GCC-generated, "-O1" assembly listings: the same sequence of instructions, the same registers, and the same addressing modes. Yet `nbrUnsignedSQRT()` took 14 seconds for the 100 million calculations, while squint's `unsignedSQRT_A()` took 17 seconds. Why the big difference? The assembly code for the call site in squint simply loaded the function pointer, sqrtF, into a register and made an indirect call to the selected function through the register. So there was no differentiation in the function call between the external `nbrUnsignedSQRT()` and the internal, static `unsignedSQRT_A()`. Moving `unsignedSQRT_A()` to a separate source file resulted in the timing for the alternate function dropping from 17 seconds to 15 seconds, a significant reduction, but still one second slower than `nbrUnsignedSQRT()`.

Ultimately, I moved the square root program and different functions into `squint.c` and tested the algorithms using (i) no optimization, (ii) "-O1", and (iii) "-O3". I don't know why, but I got what seemed like reasonable results from this setup. The command lines looked as follows (√15241578750190521 = 123456789):

```    % squint 15241578750190521[,algo] -time -repeat 100000000
```

With no optimization, "-O0, all the algorithms took longer than 30 seconds.

Using "-O1" in LIBGPL and in squint:

```    LIBGPL sqrt() ...
14.44 seconds (nbr)
14.40 seconds (nbr)
Alternate sqrt() ...
14.42 seconds (alt)
14.42 seconds (alt)
Crenshaw sqrt() ...
21.21 seconds (crenshaw)
21.27 seconds (crenshaw)
Martin sqrt() ...
11.30 seconds (martin)
11.21 seconds (martin)
Tristan sqrt() ...
14.31 seconds (tristan)
14.32 seconds (tristan)
```

Note that Martin Buchanan's function takes about 11 seconds and Jack Crenshaw's function is nearly twice as slow, taking about 21 seconds.

Using "-O3":

```    LIBGPL sqrt() ...
14.37 seconds (nbr)
14.27 seconds (nbr)
Alternate sqrt() ...
14.25 seconds (alt)
14.24 seconds (alt)
Crenshaw sqrt() ...
10.81 seconds (crenshaw)
10.80 seconds (crenshaw)
Martin sqrt() ...
8.94 seconds (martin)
8.95 seconds (martin)
Tristan sqrt() ...
14.33 seconds (tristan)
14.29 seconds (tristan)
```

Buchanan's function drops a little over 2 seconds, but Crenshaw's cuts its "-O1" times in half!

Conclusion: Buchanan's and Crenshaw's functions are sensitive to optimization levels. The functions based on Tristan Muntsinger's algorithm have pretty consistent performance at any non-zero level of optimization.

All in all, I've been pleased with the performance of Fosler's and Muntsinger's algorithms and I'm satisfied with the decision to use their algorithms in `nbrUnsignedSQRT()`.

## Understanding Ross Fosler's Assembly Code

A good book on designing and developing PIC18-based projects is Han-Way Huang's, PIC Microcontroller: An Introduction to Software and Hardware Interfacing. Chapter 2, "PIC18 Assembly Language Programming", can be found as a PDF on various websites. I grabbed my copy from here:

and used it to figure out how Ross Fosler's code worked. The chapter is a good tutorial, but groups of related instructions are listed in tables scattered throughout the chapter and I had to browse around looking for the instructions I was trying to decipher. (And if you're serious about PIC18 programming, you'll need Chapter 1 to learn about the CPU architecture: registers, memory layout, etc.) Section 4.10.1, in Chapter 4, presents a flowchart for computing an integer square root that is basically the same as Ross Fosler's. The book's flowchart is at a more abstract level, using a counter, i, and array notation, NUM[i], to address individual bits in a number. Actually implementing the book's algorithm would require bit masking and shifting ... in short, designing and writing code similar to Fosler's. (The PIC18 does have bit set, clear, and toggle instructions that operate on numbered bits — 0-7 in a byte! Fosler was counting clock cycles, so determining the byte and bit offset of a numbered bit in a 16-bit word would be more trouble and more time consuming than bit masking and shifting.)

Alex Measday  /  E-mail