Constant gmp_mpfr_sys::C::GMP::Algorithms [−][src]
pub const Algorithms: ();
This constant is a place-holder for documentation; do not use it in code.
Next: Internals, Previous: Language Bindings, Up: Top [Index]
15 Algorithms
This chapter is an introduction to some of the algorithms used for various GMP operations. The code is likely to be hard to understand without knowing something about the algorithms.
Some GMP internals are mentioned, but applications that expect to be compatible with future GMP releases should take care to use only the documented functions.
Next: Division Algorithms, Previous: Algorithms, Up: Algorithms [Index]
15.1 Multiplication
NxN limb multiplications and squares are done using one of seven algorithms, as the size N increases.
Algorithm Threshold Basecase (none) Karatsuba MUL_TOOM22_THRESHOLD
Toom-3 MUL_TOOM33_THRESHOLD
Toom-4 MUL_TOOM44_THRESHOLD
Toom-6.5 MUL_TOOM6H_THRESHOLD
Toom-8.5 MUL_TOOM8H_THRESHOLD
FFT MUL_FFT_THRESHOLD
Similarly for squaring, with the SQR
thresholds.
NxM multiplications of operands with different sizes above
MUL_TOOM22_THRESHOLD
are currently done by special Toom-inspired
algorithms or directly with FFT, depending on operand size (see Unbalanced Multiplication).
Next: Karatsuba Multiplication, Previous: Multiplication Algorithms, Up: Multiplication Algorithms [Index]
15.1.1 Basecase Multiplication
Basecase NxM multiplication is a straightforward rectangular set of cross-products, the same as long multiplication done by hand and for that reason sometimes known as the schoolbook or grammar school method. This is an O(N*M) algorithm. See Knuth section 4.3.1 algorithm M (see References), and the mpn/generic/mul_basecase.c code.
Assembly implementations of mpn_mul_basecase
are essentially the same
as the generic C code, but have all the usual assembly tricks and
obscurities introduced for speed.
A square can be done in roughly half the time of a multiply, by using the fact that the cross products above and below the diagonal are the same. A triangle of products below the diagonal is formed, doubled (left shift by one bit), and then the products on the diagonal added. This can be seen in mpn/generic/sqr_basecase.c. Again the assembly implementations take essentially the same approach.
u0 u1 u2 u3 u4 +---+---+---+---+---+ u0 | d | | | | | +---+---+---+---+---+ u1 | | d | | | | +---+---+---+---+---+ u2 | | | d | | | +---+---+---+---+---+ u3 | | | | d | | +---+---+---+---+---+ u4 | | | | | d | +---+---+---+---+---+
In practice squaring isn’t a full 2x faster than multiplying, it’s
usually around 1.5x. Less than 1.5x probably indicates
mpn_sqr_basecase
wants improving on that CPU.
On some CPUs mpn_mul_basecase
can be faster than the generic C
mpn_sqr_basecase
on some small sizes. SQR_BASECASE_THRESHOLD
is
the size at which to use mpn_sqr_basecase
, this will be zero if that
routine should be used always.
Next: Toom 3-Way Multiplication, Previous: Basecase Multiplication, Up: Multiplication Algorithms [Index]
15.1.2 Karatsuba Multiplication
The Karatsuba multiplication algorithm is described in Knuth section 4.3.3 part A, and various other textbooks. A brief description is given here.
The inputs x and y are treated as each split into two parts of equal length (or the most significant part one limb shorter if N is odd).
high low +----------+----------+ | x1 | x0 | +----------+----------+ +----------+----------+ | y1 | y0 | +----------+----------+
Let b be the power of 2 where the split occurs, i.e. if x0 is k limbs (y0 the same) then b=2^(k*mp_bits_per_limb). With that x=x1*b+x0 and y=y1*b+y0, and the following holds,
x*y = (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0
This formula means doing only three multiplies of (N/2)x(N/2) limbs, whereas a basecase multiply of NxN limbs is equivalent to four multiplies of (N/2)x(N/2). The factors (b^2+b) etc represent the positions where the three products must be added.
high low +--------+--------+ +--------+--------+ | x1*y1 | | x0*y0 | +--------+--------+ +--------+--------+ +--------+--------+ add | x1*y1 | +--------+--------+ +--------+--------+ add | x0*y0 | +--------+--------+ +--------+--------+ sub | (x1-x0)*(y1-y0) | +--------+--------+
The term (x1-x0)*(y1-y0) is best calculated as an absolute value, and the sign used to choose to add or subtract. Notice the sum high(x0*y0)+low(x1*y1) occurs twice, so it’s possible to do 5*k limb additions, rather than 6*k, but in GMP extra function call overheads outweigh the saving.
Squaring is similar to multiplying, but with x=y the formula reduces to an equivalent with three squares,
x^2 = (b^2+b)*x1^2 - b*(x1-x0)^2 + (b+1)*x0^2
The final result is accumulated from those three squares the same way as for the three multiplies above. The middle term (x1-x0)^2 is now always positive.
A similar formula for both multiplying and squaring can be constructed with a middle term (x1+x0)*(y1+y0). But those sums can exceed k limbs, leading to more carry handling and additions than the form above.
Karatsuba multiplication is asymptotically an O(N^1.585) algorithm,
the exponent being log(3)/log(2), representing 3 multiplies
each 1/2 the size of the inputs. This is a big improvement over the
basecase multiply at O(N^2) and the advantage soon overcomes the extra
additions Karatsuba performs. MUL_TOOM22_THRESHOLD
can be as little
as 10 limbs. The SQR
threshold is usually about twice the MUL
.
The basecase algorithm will take a time of the form M(N) = a*N^2 + b*N + c and the Karatsuba algorithm K(N) = 3*M(N/2) + d*N + e, which expands to K(N) = 3/4*a*N^2 + 3/2*b*N + 3*c + d*N + e. The
factor 3/4 for a means per-crossproduct speedups in the
basecase code will increase the threshold since they benefit M(N) more
than K(N). And conversely the 3/2 for b means
linear style speedups of b will increase the threshold since they
benefit K(N) more than M(N). The latter can be seen for
instance when adding an optimized mpn_sqr_diagonal
to
mpn_sqr_basecase
. Of course all speedups reduce total time, and in
that sense the algorithm thresholds are merely of academic interest.
Next: Toom 4-Way Multiplication, Previous: Karatsuba Multiplication, Up: Multiplication Algorithms [Index]
15.1.3 Toom 3-Way Multiplication
The Karatsuba formula is the simplest case of a general approach to splitting inputs that leads to both Toom and FFT algorithms. A description of Toom can be found in Knuth section 4.3.3, with an example 3-way calculation after Theorem A. The 3-way form used in GMP is described here.
The operands are each considered split into 3 pieces of equal length (or the most significant part 1 or 2 limbs shorter than the other two).
high low +----------+----------+----------+ | x2 | x1 | x0 | +----------+----------+----------+ +----------+----------+----------+ | y2 | y1 | y0 | +----------+----------+----------+
These parts are treated as the coefficients of two polynomials
X(t) = x2*t^2 + x1*t + x0 Y(t) = y2*t^2 + y1*t + y0
Let b equal the power of 2 which is the size of the x0, x1, y0 and y1 pieces, i.e. if they’re k limbs each then b=2^(k*mp_bits_per_limb). With this x=X(b) and y=Y(b).
Let a polynomial W(t)=X(t)*Y(t) and suppose its coefficients are
W(t) = w4*t^4 + w3*t^3 + w2*t^2 + w1*t + w0
The w[i] are going to be determined, and when they are they’ll give the final result using w=W(b), since x*y=X(b)*Y(b)=W(b). The coefficients will be roughly b^2 each, and the final W(b) will be an addition like,
high low +-------+-------+ | w4 | +-------+-------+ +--------+-------+ | w3 | +--------+-------+ +--------+-------+ | w2 | +--------+-------+ +--------+-------+ | w1 | +--------+-------+ +-------+-------+ | w0 | +-------+-------+
The w[i] coefficients could be formed by a simple set of cross products, like w4=x2*y2, w3=x2*y1+x1*y2, w2=x2*y0+x1*y1+x0*y2 etc, but this would need all nine x[i]*y[j] for i,j=0,1,2, and would be equivalent merely to a basecase multiply. Instead the following approach is used.
X(t) and Y(t) are evaluated and multiplied at 5 points, giving values of W(t) at those points. In GMP the following points are used,
Point Value t=0 x0 * y0, which gives w0 immediately t=1 (x2+x1+x0) * (y2+y1+y0) t=-1 (x2-x1+x0) * (y2-y1+y0) t=2 (4*x2+2*x1+x0) * (4*y2+2*y1+y0) t=inf x2 * y2, which gives w4 immediately
At t=-1 the values can be negative and that’s handled using the absolute values and tracking the sign separately. At t=inf the value is actually X(t)*Y(t)/t^4 in the limit as t approaches infinity, but it’s much easier to think of as simply x2*y2 giving w4 immediately (much like x0*y0 at t=0 gives w0 immediately).
Each of the points substituted into W(t)=w4*t^4+…+w0 gives a linear combination of the w[i] coefficients, and the value of those combinations has just been calculated.
W(0) = w0 W(1) = w4 + w3 + w2 + w1 + w0 W(-1) = w4 - w3 + w2 - w1 + w0 W(2) = 16*w4 + 8*w3 + 4*w2 + 2*w1 + w0 W(inf) = w4
This is a set of five equations in five unknowns, and some elementary linear
algebra quickly isolates each w[i]. This involves adding or
subtracting one W(t) value from another, and a couple of divisions by
powers of 2 and one division by 3, the latter using the special
mpn_divexact_by3
(see Exact Division).
The conversion of W(t) values to the coefficients is interpolation. A polynomial of degree 4 like W(t) is uniquely determined by values known at 5 different points. The points are arbitrary and can be chosen to make the linear equations come out with a convenient set of steps for quickly isolating the w[i].
Squaring follows the same procedure as multiplication, but there’s only one
X(t) and it’s evaluated at the 5 points, and those values squared to
give values of W(t). The interpolation is then identical, and in fact
the same toom_interpolate_5pts
subroutine is used for both squaring and
multiplying.
Toom-3 is asymptotically O(N^1.465), the exponent being log(5)/log(3), representing 5 recursive multiplies of 1/3 the original size each. This is an improvement over Karatsuba at O(N^1.585), though Toom does more work in the evaluation and interpolation and so it only realizes its advantage above a certain size.
Near the crossover between Toom-3 and Karatsuba there’s generally a range of
sizes where the difference between the two is small.
MUL_TOOM33_THRESHOLD
is a somewhat arbitrary point in that range and
successive runs of the tune program can give different values due to small
variations in measuring. A graph of time versus size for the two shows the
effect, see tune/README.
At the fairly small sizes where the Toom-3 thresholds occur it’s worth remembering that the asymptotic behaviour for Karatsuba and Toom-3 can’t be expected to make accurate predictions, due of course to the big influence of all sorts of overheads, and the fact that only a few recursions of each are being performed. Even at large sizes there’s a good chance machine dependent effects like cache architecture will mean actual performance deviates from what might be predicted.
The formula given for the Karatsuba algorithm (see Karatsuba Multiplication) has an equivalent for Toom-3 involving only five multiplies, but this would be complicated and unenlightening.
An alternate view of Toom-3 can be found in Zuras (see References), using a vector to represent the x and y splits and a matrix multiplication for the evaluation and interpolation stages. The matrix inverses are not meant to be actually used, and they have elements with values much greater than in fact arise in the interpolation steps. The diagram shown for the 3-way is attractive, but again doesn’t have to be implemented that way and for example with a bit of rearrangement just one division by 6 can be done.
Next: Higher degree Toom'n'half, Previous: Toom 3-Way Multiplication, Up: Multiplication Algorithms [Index]
15.1.4 Toom 4-Way Multiplication
Karatsuba and Toom-3 split the operands into 2 and 3 coefficients, respectively. Toom-4 analogously splits the operands into 4 coefficients. Using the notation from the section on Toom-3 multiplication, we form two polynomials:
X(t) = x3*t^3 + x2*t^2 + x1*t + x0 Y(t) = y3*t^3 + y2*t^2 + y1*t + y0
X(t) and Y(t) are evaluated and multiplied at 7 points, giving values of W(t) at those points. In GMP the following points are used,
Point Value t=0 x0 * y0, which gives w0 immediately t=1/2 (x3+2*x2+4*x1+8*x0) * (y3+2*y2+4*y1+8*y0) t=-1/2 (-x3+2*x2-4*x1+8*x0) * (-y3+2*y2-4*y1+8*y0) t=1 (x3+x2+x1+x0) * (y3+y2+y1+y0) t=-1 (-x3+x2-x1+x0) * (-y3+y2-y1+y0) t=2 (8*x3+4*x2+2*x1+x0) * (8*y3+4*y2+2*y1+y0) t=inf x3 * y3, which gives w6 immediately
The number of additions and subtractions for Toom-4 is much larger than for Toom-3. But several subexpressions occur multiple times, for example x2+x0, occurs for both t=1 and t=-1.
Toom-4 is asymptotically O(N^1.404), the exponent being log(7)/log(4), representing 7 recursive multiplies of 1/4 the original size each.
Next: FFT Multiplication, Previous: Toom 4-Way Multiplication, Up: Multiplication Algorithms [Index]
15.1.5 Higher degree Toom’n’half
The Toom algorithms described above (see Toom 3-Way Multiplication, see Toom 4-Way Multiplication) generalizes to split into an arbitrary number of pieces. In general a split of two equally long operands into r pieces leads to evaluations and pointwise multiplications done at 2*r-1 points. To fully exploit symmetries it would be better to have a multiple of 4 points, that’s why for higher degree Toom’n’half is used.
Toom’n’half means that the existence of one more piece is considered for a single operand. It can be virtual, i.e. zero, or real, when the two operand are not exactly balanced. By choosing an even r, Toom-r+1/2 requires 2r points, a multiple of four.
The quadruplets of points include 0, inf, +1, -1 and +-2^i, +-2^-i . Each of them giving shortcuts for the evaluation phase and for some steps in the interpolation phase. Further tricks are used to reduce the memory footprint of the whole multiplication algorithm to a memory buffer equal in size to the result of the product.
Current GMP uses both Toom-6’n’half and Toom-8’n’half.
Next: Other Multiplication, Previous: Higher degree Toom'n'half, Up: Multiplication Algorithms [Index]
15.1.6 FFT Multiplication
At large to very large sizes a Fermat style FFT multiplication is used, following Schönhage and Strassen (see References). Descriptions of FFTs in various forms can be found in many textbooks, for instance Knuth section 4.3.3 part C or Lipson chapter IX. A brief description of the form used in GMP is given here.
The multiplication done is x*y mod 2^N+1, for a given N. A full product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x and y with high zero limbs. The modular product is the native form for the algorithm, so padding to get a full product is unavoidable.
The algorithm follows a split, evaluate, pointwise multiply, interpolate and
combine similar to that described above for Karatsuba and Toom-3. A k
parameter controls the split, with an FFT-k splitting into 2^k
pieces of M=N/2^k bits each. N must be a multiple of
(2^k)*mp_bits_per_limb
so
the split falls on limb boundaries, avoiding bit shifts in the split and
combine stages.
The evaluations, pointwise multiplications, and interpolation, are all done
modulo 2^N'+1 where N' is 2M+k+3 rounded up to a
multiple of 2^k and of mp_bits_per_limb
. The results of
interpolation will be the following negacyclic convolution of the input
pieces, and the choice of N' ensures these sums aren’t truncated.
--- \ b w[n] = / (-1) * x[i] * y[j] --- i+j==b*2^k+n b=0,1
The points used for the evaluation are g^i for i=0 to 2^k-1 where g=2^(2N'/2^k). g is a 2^k'th root of unity mod 2^N'+1, which produces necessary cancellations at the interpolation stage, and it’s also a power of 2 so the fast Fourier transforms used for the evaluation and interpolation do only shifts, adds and negations.
The pointwise multiplications are done modulo 2^N'+1 and either recurse into a further FFT or use a plain multiplication (Toom-3, Karatsuba or basecase), whichever is optimal at the size N'. The interpolation is an inverse fast Fourier transform. The resulting set of sums of x[i]*y[j] are added at appropriate offsets to give the final result.
Squaring is the same, but x is the only input so it’s one transform at the evaluate stage and the pointwise multiplies are squares. The interpolation is the same.
For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm, the exponent representing 2^k recursed
modular multiplies each 1/2^(k-1) the size of the original.
Each successive k is an asymptotic improvement, but overheads mean each
is only faster at bigger and bigger sizes. In the code, MUL_FFT_TABLE
and SQR_FFT_TABLE
are the thresholds where each k is used. Each
new k effectively swaps some multiplying for some shifts, adds and
overheads.
A mod 2^N+1 product can be formed with a normal
NxN->2N bit multiply plus a subtraction, so an FFT
and Toom-3 etc can be compared directly. A k=4 FFT at
O(N^1.333) can be expected to be the first faster than Toom-3 at
O(N^1.465). In practice this is what’s found, with
MUL_FFT_MODF_THRESHOLD
and SQR_FFT_MODF_THRESHOLD
being between
300 and 1000 limbs, depending on the CPU. So far it’s been found that only
very large FFTs recurse into pointwise multiplies above these sizes.
When an FFT is to give a full product, the change of N to 2N
doesn’t alter the theoretical complexity for a given k, but for the
purposes of considering where an FFT might be first used it can be assumed
that the FFT is recursing into a normal multiply and that on that basis it’s
doing 2^k recursed multiplies each 1/2^(k-2) the size of
the inputs, making it O(N^(k/(k-2))). This would mean
k=7 at O(N^1.4) would be the first FFT faster than Toom-3.
In practice MUL_FFT_THRESHOLD
and SQR_FFT_THRESHOLD
have been
found to be in the k=8 range, somewhere between 3000 and 10000 limbs.
The way N is split into 2^k pieces and then 2M+k+3 is
rounded up to a multiple of 2^k and mp_bits_per_limb
means that
when 2^k>=mp_bits_per_limb
the effective N is a
multiple of 2^(2k-1) bits. The +k+3 means some values of
N just under such a multiple will be rounded to the next. The
complexity calculations above assume that a favourable size is used, meaning
one which isn’t padded through rounding, and it’s also assumed that the extra
+k+3 bits are negligible at typical FFT sizes.
The practical effect of the 2^(2k-1) constraint is to introduce a
step-effect into measured speeds. For example k=8 will round N
up to a multiple of 32768 bits, so for a 32-bit limb there’ll be 512 limb
groups of sizes for which mpn_mul_n
runs at the same speed. Or for
k=9 groups of 2048 limbs, k=10 groups of 8192 limbs, etc. In
practice it’s been found each k is used at quite small multiples of its
size constraint and so the step effect is quite noticeable in a time versus
size graph.
The threshold determinations currently measure at the mid-points of size
steps, but this is sub-optimal since at the start of a new step it can happen
that it’s better to go back to the previous k for a while. Something
more sophisticated for MUL_FFT_TABLE
and SQR_FFT_TABLE
will be
needed.
Next: Unbalanced Multiplication, Previous: FFT Multiplication, Up: Multiplication Algorithms [Index]
15.1.7 Other Multiplication
The Toom algorithms described above (see Toom 3-Way Multiplication, see Toom 4-Way Multiplication) generalizes to split into an arbitrary number of pieces, as per Knuth section 4.3.3 algorithm C. This is not currently used. The notes here are merely for interest.
In general a split into r+1 pieces is made, and evaluations and pointwise multiplications done at 2*r+1 points. A 4-way split does 7 pointwise multiplies, 5-way does 9, etc. Asymptotically an (r+1)-way algorithm is O(N^(log(2*r+1)/log(r+1))). Only the pointwise multiplications count towards big-O complexity, but the time spent in the evaluate and interpolate stages grows with r and has a significant practical impact, with the asymptotic advantage of each r realized only at bigger and bigger sizes. The overheads grow as O(N*r), whereas in an r=2^k FFT they grow only as O(N*log(r)).
Knuth algorithm C evaluates at points 0,1,2,…,2*r, but exercise 4 uses -r,…,0,…,r and the latter saves some small multiplies in the evaluate stage (or rather trades them for additions), and has a further saving of nearly half the interpolate steps. The idea is to separate odd and even final coefficients and then perform algorithm C steps C7 and C8 on them separately. The divisors at step C7 become j^2 and the multipliers at C8 become 2*t*j-j^2.
Splitting odd and even parts through positive and negative points can be thought of as using -1 as a square root of unity. If a 4th root of unity was available then a further split and speedup would be possible, but no such root exists for plain integers. Going to complex integers with i=sqrt(-1) doesn’t help, essentially because in Cartesian form it takes three real multiplies to do a complex multiply. The existence of 2^k'th roots of unity in a suitable ring or field lets the fast Fourier transform keep splitting and get to O(N*log(r)).
Floating point FFTs use complex numbers approximating Nth roots of unity. Some processors have special support for such FFTs. But these are not used in GMP since it’s very difficult to guarantee an exact result (to some number of bits). An occasional difference of 1 in the last bit might not matter to a typical signal processing algorithm, but is of course of vital importance to GMP.
Previous: Other Multiplication, Up: Multiplication Algorithms [Index]
15.1.8 Unbalanced Multiplication
Multiplication of operands with different sizes, both below
MUL_TOOM22_THRESHOLD
are done with plain schoolbook multiplication
(see Basecase Multiplication).
For really large operands, we invoke FFT directly.
For operands between these sizes, we use Toom inspired algorithms suggested by Alberto Zanoni and Marco Bodrato. The idea is to split the operands into polynomials of different degree. GMP currently splits the smaller operand onto 2 coefficients, i.e., a polynomial of degree 1, but the larger operand can be split into 2, 3, or 4 coefficients, i.e., a polynomial of degree 1 to 3.
Next: Greatest Common Divisor Algorithms, Previous: Multiplication Algorithms, Up: Algorithms [Index]
15.2 Division Algorithms
• Single Limb Division | ||
• Basecase Division | ||
• Divide and Conquer Division | ||
• Block-Wise Barrett Division | ||
• Exact Division | ||
• Exact Remainder | ||
• Small Quotient Division |
Next: Basecase Division, Previous: Division Algorithms, Up: Division Algorithms [Index]
15.2.1 Single Limb Division
Nx1 division is implemented using repeated 2x1 divisions from high to low, either with a hardware divide instruction or a multiplication by inverse, whichever is best on a given CPU.
The multiply by inverse follows “Improved division by invariant integers” by
Möller and Granlund (see References) and is implemented as
udiv_qrnnd_preinv
in gmp-impl.h. The idea is to have a
fixed-point approximation to 1/d (see invert_limb
) and then
multiply by the high limb (plus one bit) of the dividend to get a quotient
q. With d normalized (high bit set), q is no more than 1
too small. Subtracting q*d from the dividend gives a remainder, and
reveals whether q or q-1 is correct.
The result is a division done with two multiplications and four or five arithmetic operations. On CPUs with low latency multipliers this can be much faster than a hardware divide, though the cost of calculating the inverse at the start may mean it’s only better on inputs bigger than say 4 or 5 limbs.
When a divisor must be normalized, either for the generic C
__udiv_qrnnd_c
or the multiply by inverse, the division performed is
actually a*2^k by d*2^k where a is the dividend and
k is the power necessary to have the high bit of d*2^k set.
The bit shifts for the dividend are usually accomplished “on the fly”
meaning by extracting the appropriate bits at each step. Done this way the
quotient limbs come out aligned ready to store. When only the remainder is
wanted, an alternative is to take the dividend limbs unshifted and calculate
r = a mod d*2^k followed by an extra final step r*2^k mod d*2^k. This can help on CPUs with poor bit shifts or
few registers.
The multiply by inverse can be done two limbs at a time. The calculation is basically the same, but the inverse is two limbs and the divisor treated as if padded with a low zero limb. This means more work, since the inverse will need a 2x2 multiply, but the four 1x1s to do that are independent and can therefore be done partly or wholly in parallel. Likewise for a 2x1 calculating q*d. The net effect is to process two limbs with roughly the same two multiplies worth of latency that one limb at a time gives. This extends to 3 or 4 limbs at a time, though the extra work to apply the inverse will almost certainly soon reach the limits of multiplier throughput.
A similar approach in reverse can be taken to process just half a limb at a time if the divisor is only a half limb. In this case the 1x1 multiply for the inverse effectively becomes two (1/2)x1 for each limb, which can be a saving on CPUs with a fast half limb multiply, or in fact if the only multiply is a half limb, and especially if it’s not pipelined.
Next: Divide and Conquer Division, Previous: Single Limb Division, Up: Division Algorithms [Index]
15.2.2 Basecase Division
Basecase NxM division is like long division done by hand, but in base 2^mp_bits_per_limb. See Knuth section 4.3.1 algorithm D, and mpn/generic/sb_divrem_mn.c.
Briefly stated, while the dividend remains larger than the divisor, a high quotient limb is formed and the Nx1 product q*d subtracted at the top end of the dividend. With a normalized divisor (most significant bit set), each quotient limb can be formed with a 2x1 division and a 1x1 multiplication plus some subtractions. The 2x1 division is by the high limb of the divisor and is done either with a hardware divide or a multiply by inverse (the same as in Single Limb Division) whichever is faster. Such a quotient is sometimes one too big, requiring an addback of the divisor, but that happens rarely.
With Q=N-M being the number of quotient limbs, this is an O(Q*M) algorithm and will run at a speed similar to a basecase QxM multiplication, differing in fact only in the extra multiply and divide for each of the Q quotient limbs.
Next: Block-Wise Barrett Division, Previous: Basecase Division, Up: Division Algorithms [Index]
15.2.3 Divide and Conquer Division
For divisors larger than DC_DIV_QR_THRESHOLD
, division is done by dividing.
Or to be precise by a recursive divide and conquer algorithm based on work by
Moenck and Borodin, Jebelean, and Burnikel and Ziegler (see References).
The algorithm consists essentially of recognising that a 2NxN division can be done with the basecase division algorithm (see Basecase Division), but using N/2 limbs as a base, not just a single limb. This way the multiplications that arise are (N/2)x(N/2) and can take advantage of Karatsuba and higher multiplication algorithms (see Multiplication Algorithms). The two “digits” of the quotient are formed by recursive Nx(N/2) divisions.
If the (N/2)x(N/2) multiplies are done with a basecase multiplication
then the work is about the same as a basecase division, but with more function
call overheads and with some subtractions separated from the multiplies.
These overheads mean that it’s only when N/2 is above
MUL_TOOM22_THRESHOLD
that divide and conquer is of use.
DC_DIV_QR_THRESHOLD
is based on the divisor size N, so it will be somewhere
above twice MUL_TOOM22_THRESHOLD
, but how much above depends on the
CPU. An optimized mpn_mul_basecase
can lower DC_DIV_QR_THRESHOLD
a
little by offering a ready-made advantage over repeated mpn_submul_1
calls.
Divide and conquer is asymptotically O(M(N)*log(N)) where M(N) is the time for an NxN multiplication done with FFTs. The actual time is a sum over multiplications of the recursed sizes, as can be seen near the end of section 2.2 of Burnikel and Ziegler. For example, within the Toom-3 range, divide and conquer is 2.63*M(N). With higher algorithms the M(N) term improves and the multiplier tends to log(N). In practice, at moderate to large sizes, a 2NxN division is about 2 to 4 times slower than an NxN multiplication.
Next: Exact Division, Previous: Divide and Conquer Division, Up: Division Algorithms [Index]
15.2.4 Block-Wise Barrett Division
For the largest divisions, a block-wise Barrett division algorithm is used. Here, the divisor is inverted to a precision determined by the relative size of the dividend and divisor. Blocks of quotient limbs are then generated by multiplying blocks from the dividend by the inverse.
Our block-wise algorithm computes a smaller inverse than in the plain Barrett algorithm. For a 2n/n division, the inverse will be just ceil(n/2) limbs.
Next: Exact Remainder, Previous: Block-Wise Barrett Division, Up: Division Algorithms [Index]
15.2.5 Exact Division
A so-called exact division is when the dividend is known to be an exact multiple of the divisor. Jebelean’s exact division algorithm uses this knowledge to make some significant optimizations (see References).
The idea can be illustrated in decimal for example with 368154 divided by 543. Because the low digit of the dividend is 4, the low digit of the quotient must be 8. This is arrived at from 4*7 mod 10, using the fact 7 is the modular inverse of 3 (the low digit of the divisor), since 3*7 ≡ 1 mod 10. So 8*543=4344 can be subtracted from the dividend leaving 363810. Notice the low digit has become zero.
The procedure is repeated at the second digit, with the next quotient digit 7 (7 ≡ 1*7 mod 10), subtracting 7*543=3801, leaving 325800. And finally at the third digit with quotient digit 6 (8*7 mod 10), subtracting 6*543=3258 leaving 0. So the quotient is 678.
Notice however that the multiplies and subtractions don’t need to extend past the low three digits of the dividend, since that’s enough to determine the three quotient digits. For the last quotient digit no subtraction is needed at all. On a 2NxN division like this one, only about half the work of a normal basecase division is necessary.
For an NxM exact division producing Q=N-M quotient limbs, the saving over a normal basecase division is in two parts. Firstly, each of the Q quotient limbs needs only one multiply, not a 2x1 divide and multiply. Secondly, the crossproducts are reduced when Q>M to Q*M-M*(M+1)/2, or when Q<=M to Q*(Q-1)/2. Notice the savings are complementary. If Q is big then many divisions are saved, or if Q is small then the crossproducts reduce to a small number.
The modular inverse used is calculated efficiently by binvert_limb
in
gmp-impl.h. This does four multiplies for a 32-bit limb, or six for a
64-bit limb. tune/modlinv.c has some alternate implementations that
might suit processors better at bit twiddling than multiplying.
The sub-quadratic exact division described by Jebelean in “Exact Division with Karatsuba Complexity” is not currently implemented. It uses a rearrangement similar to the divide and conquer for normal division (see Divide and Conquer Division), but operating from low to high. A further possibility not currently implemented is “Bidirectional Exact Integer Division” by Krandick and Jebelean which forms quotient limbs from both the high and low ends of the dividend, and can halve once more the number of crossproducts needed in a 2NxN division.
A special case exact division by 3 exists in mpn_divexact_by3
,
supporting Toom-3 multiplication and mpq
canonicalizations. It forms
quotient digits with a multiply by the modular inverse of 3 (which is
0xAA..AAB
) and uses two comparisons to determine a borrow for the next
limb. The multiplications don’t need to be on the dependent chain, as long as
the effect of the borrows is applied, which can help chips with pipelined
multipliers.
Next: Small Quotient Division, Previous: Exact Division, Up: Division Algorithms [Index]
15.2.6 Exact Remainder
If the exact division algorithm is done with a full subtraction at each stage and the dividend isn’t a multiple of the divisor, then low zero limbs are produced but with a remainder in the high limbs. For dividend a, divisor d, quotient q, and b = 2^mp_bits_per_limb, this remainder r is of the form
a = q*d + r*b^n
n represents the number of zero limbs produced by the subtractions, that being the number of limbs produced for q. r will be in the range 0<=r<d and can be viewed as a remainder, but one shifted up by a factor of b^n.
Carrying out full subtractions at each stage means the same number of cross products must be done as a normal division, but there’s still some single limb divisions saved. When d is a single limb some simplifications arise, providing good speedups on a number of processors.
The functions mpn_divexact_by3
, mpn_modexact_1_odd
and the
internal mpn_redc_X
functions differ subtly in how they return r,
leading to some negations in the above formula, but all are essentially the
same.
Clearly r is zero when a is a multiple of d, and this leads to divisibility or congruence tests which are potentially more efficient than a normal division.
The factor of b^n on r can be ignored in a GCD when d is
odd, hence the use of mpn_modexact_1_odd
by mpn_gcd_1
and
mpz_kronecker_ui
etc (see Greatest Common Divisor Algorithms).
Montgomery’s REDC method for modular multiplications uses operands of the form of x*b^-n and y*b^-n and on calculating (x*b^-n)*(y*b^-n) uses the factor of b^n in the exact remainder to reach a product in the same form (x*y)*b^-n (see Modular Powering Algorithm).
Notice that r generally gives no useful information about the ordinary
remainder a mod d since b^n mod d could be anything. If
however b^n ≡ 1 mod d, then r is the negative of the
ordinary remainder. This occurs whenever d is a factor of
b^n-1, as for example with 3 in mpn_divexact_by3
. For a 32 or
64 bit limb other such factors include 5, 17 and 257, but no particular use
has been found for this.
Previous: Exact Remainder, Up: Division Algorithms [Index]
15.2.7 Small Quotient Division
An NxM division where the number of quotient limbs Q=N-M is small can be optimized somewhat.
An ordinary basecase division normalizes the divisor by shifting it to make the high bit set, shifting the dividend accordingly, and shifting the remainder back down at the end of the calculation. This is wasteful if only a few quotient limbs are to be formed. Instead a division of just the top 2*Q limbs of the dividend by the top Q limbs of the divisor can be used to form a trial quotient. This requires only those limbs normalized, not the whole of the divisor and dividend.
A multiply and subtract then applies the trial quotient to the M-Q unused limbs of the divisor and N-Q dividend limbs (which includes Q limbs remaining from the trial quotient division). The starting trial quotient can be 1 or 2 too big, but all cases of 2 too big and most cases of 1 too big are detected by first comparing the most significant limbs that will arise from the subtraction. An addback is done if the quotient still turns out to be 1 too big.
This whole procedure is essentially the same as one step of the basecase algorithm done in a Q limb base, though with the trial quotient test done only with the high limbs, not an entire Q limb “digit” product. The correctness of this weaker test can be established by following the argument of Knuth section 4.3.1 exercise 20 but with the v2*q>b*r+u2 condition appropriately relaxed.
Next: Powering Algorithms, Previous: Division Algorithms, Up: Algorithms [Index]
15.3 Greatest Common Divisor
• Binary GCD | ||
• Lehmer's Algorithm | ||
• Subquadratic GCD | ||
• Extended GCD | ||
• Jacobi Symbol |
Next: Lehmer's Algorithm, Previous: Greatest Common Divisor Algorithms, Up: Greatest Common Divisor Algorithms [Index]
15.3.1 Binary GCD
At small sizes GMP uses an O(N^2) binary style GCD. This is described in many textbooks, for example Knuth section 4.5.2 algorithm B. It simply consists of successively reducing odd operands a and b using
a,b = abs(a-b),min(a,b)
strip factors of 2 from a
The Euclidean GCD algorithm, as per Knuth algorithms E and A, repeatedly computes the quotient q = floor(a/b) and replaces a,b by v, u - q v. The binary algorithm has so far been found to be faster than the Euclidean algorithm everywhere. One reason the binary method does well is that the implied quotient at each step is usually small, so often only one or two subtractions are needed to get the same effect as a division. Quotients 1, 2 and 3 for example occur 67.7% of the time, see Knuth section 4.5.3 Theorem E.
When the implied quotient is large, meaning b is much smaller than
a, then a division is worthwhile. This is the basis for the initial
a mod b reductions in mpn_gcd
and mpn_gcd_1
(the latter
for both Nx1 and 1x1 cases). But after that initial reduction,
big quotients occur too rarely to make it worth checking for them.
The final 1x1 GCD in mpn_gcd_1
is done in the generic C
code as described above. For two N-bit operands, the algorithm takes about
0.68 iterations per bit. For optimum performance some attention needs to be
paid to the way the factors of 2 are stripped from a.
Firstly it may be noted that in twos complement the number of low zero bits on a-b is the same as b-a, so counting or testing can begin on a-b without waiting for abs(a-b) to be determined.
A loop stripping low zero bits tends not to branch predict well, since the condition is data dependent. But on average there’s only a few low zeros, so an option is to strip one or two bits arithmetically then loop for more (as done for AMD K6). Or use a lookup table to get a count for several bits then loop for more (as done for AMD K7). An alternative approach is to keep just one of a or b odd and iterate
a,b = abs(a-b), min(a,b)
a = a/2 if even
b = b/2 if even
This requires about 1.25 iterations per bit, but stripping of a single bit at each step avoids any branching. Repeating the bit strip reduces to about 0.9 iterations per bit, which may be a worthwhile tradeoff.
Generally with the above approaches a speed of perhaps 6 cycles per bit can be achieved, which is still not terribly fast with for instance a 64-bit GCD taking nearly 400 cycles. It’s this sort of time which means it’s not usually advantageous to combine a set of divisibility tests into a GCD.
Currently, the binary algorithm is used for GCD only when N < 3.
Next: Subquadratic GCD, Previous: Binary GCD, Up: Greatest Common Divisor Algorithms [Index]
15.3.2 Lehmer’s algorithm
Lehmer’s improvement of the Euclidean algorithms is based on the observation
that the initial part of the quotient sequence depends only on the most
significant parts of the inputs. The variant of Lehmer’s algorithm used in GMP
splits off the most significant two limbs, as suggested, e.g., in “A
Double-Digit Lehmer-Euclid Algorithm” by Jebelean (see References). The
quotients of two double-limb inputs are collected as a 2 by 2 matrix with
single-limb elements. This is done by the function mpn_hgcd2
. The
resulting matrix is applied to the inputs using mpn_mul_1
and
mpn_submul_1
. Each iteration usually reduces the inputs by almost one
limb. In the rare case of a large quotient, no progress can be made by
examining just the most significant two limbs, and the quotient is computed
using plain division.
The resulting algorithm is asymptotically O(N^2), just as the Euclidean
algorithm and the binary algorithm. The quadratic part of the work are
the calls to mpn_mul_1
and mpn_submul_1
. For small sizes, the
linear work is also significant. There are roughly N calls to the
mpn_hgcd2
function. This function uses a couple of important
optimizations:
- It uses the same relaxed notion of correctness as
mpn_hgcd
(see next section). This means that when called with the most significant two limbs of two large numbers, the returned matrix does not always correspond exactly to the initial quotient sequence for the two large numbers; the final quotient may sometimes be one off. - It takes advantage of the fact the quotients are usually small. The division operator is not used, since the corresponding assembler instruction is very slow on most architectures. (This code could probably be improved further, it uses many branches that are unfriendly to prediction).
- It switches from double-limb calculations to single-limb calculations half-way through, when the input numbers have been reduced in size from two limbs to one and a half.
Next: Extended GCD, Previous: Lehmer's Algorithm, Up: Greatest Common Divisor Algorithms [Index]
15.3.3 Subquadratic GCD
For inputs larger than GCD_DC_THRESHOLD
, GCD is computed via the HGCD
(Half GCD) function, as a generalization to Lehmer’s algorithm.
Let the inputs a,b be of size N limbs each. Put S = floor(N/2) + 1. Then HGCD(a,b) returns a transformation matrix T with non-negative elements, and reduced numbers (c;d) = T^{-1} (a;b). The reduced numbers c,d must be larger than S limbs, while their difference abs(c-d) must fit in S limbs. The matrix elements will also be of size roughly N/2.
The HGCD base case uses Lehmer’s algorithm, but with the above stop condition
that returns reduced numbers and the corresponding transformation matrix
half-way through. For inputs larger than HGCD_THRESHOLD
, HGCD is
computed recursively, using the divide and conquer algorithm in “On
Schönhage’s algorithm and subquadratic integer GCD computation” by Möller
(see References). The recursive algorithm consists of these main
steps.
- Call HGCD recursively, on the most significant N/2 limbs. Apply the resulting matrix T_1 to the full numbers, reducing them to a size just above 3N/2.
- Perform a small number of division or subtraction steps to reduce the numbers to size below 3N/2. This is essential mainly for the unlikely case of large quotients.
- Call HGCD recursively, on the most significant N/2 limbs of the reduced numbers. Apply the resulting matrix T_2 to the full numbers, reducing them to a size just above N/2.
- Compute T = T_1 T_2.
- Perform a small number of division and subtraction steps to satisfy the requirements, and return.
GCD is then implemented as a loop around HGCD, similarly to Lehmer’s
algorithm. Where Lehmer repeatedly chops off the top two limbs, calls
mpn_hgcd2
, and applies the resulting matrix to the full numbers, the
sub-quadratic GCD chops off the most significant third of the limbs (the
proportion is a tuning parameter, and 1/3 seems to be more efficient
than, e.g, 1/2), calls mpn_hgcd
, and applies the resulting
matrix. Once the input numbers are reduced to size below
GCD_DC_THRESHOLD
, Lehmer’s algorithm is used for the rest of the work.
The asymptotic running time of both HGCD and GCD is O(M(N)*log(N)), where M(N) is the time for multiplying two N-limb numbers.
Next: Jacobi Symbol, Previous: Subquadratic GCD, Up: Greatest Common Divisor Algorithms [Index]
15.3.4 Extended GCD
The extended GCD function, or GCDEXT, calculates gcd(a,b) and also
cofactors x and y satisfying a*x+b*y=gcd(a,b). All the algorithms used for plain GCD are extended to
handle this case. The binary algorithm is used only for single-limb GCDEXT.
Lehmer’s algorithm is used for sizes up to GCDEXT_DC_THRESHOLD
. Above
this threshold, GCDEXT is implemented as a loop around HGCD, but with more
book-keeping to keep track of the cofactors. This gives the same asymptotic
running time as for GCD and HGCD, O(M(N)*log(N))
One difference to plain GCD is that while the inputs a and b are
reduced as the algorithm proceeds, the cofactors x and y grow in
size. This makes the tuning of the chopping-point more difficult. The current
code chops off the most significant half of the inputs for the call to HGCD in
the first iteration, and the most significant two thirds for the remaining
calls. This strategy could surely be improved. Also the stop condition for the
loop, where Lehmer’s algorithm is invoked once the inputs are reduced below
GCDEXT_DC_THRESHOLD
, could maybe be improved by taking into account the
current size of the cofactors.
Previous: Extended GCD, Up: Greatest Common Divisor Algorithms [Index]
15.3.5 Jacobi Symbol
Jacobi symbol (a/b)
Initially if either operand fits in a single limb, a reduction is done with
either mpn_mod_1
or mpn_modexact_1_odd
, followed by the binary
algorithm on a single limb. The binary algorithm is well suited to a single limb,
and the whole calculation in this case is quite efficient.
For inputs larger than GCD_DC_THRESHOLD
, mpz_jacobi
,
mpz_legendre
and mpz_kronecker
are computed via the HGCD (Half
GCD) function, as a generalization to Lehmer’s algorithm.
Most GCD algorithms reduce a and b by repeatatily computing the quotient q = floor(a/b) and iteratively replacing
a, b = b, a - q * b
Different algorithms use different methods for calculating q, but the core algorithm is the same if we use Lehmer's Algorithm or HGCD.
At each step it is possible to compute if the reduction inverts the Jacobi symbol based on the two least significant bits of a and b. For more details see “Efficient computation of the Jacobi symbol” by Möller (see References).
A small set of bits is thus used to track state
- current sign of result (1 bit)
- two least significant bits of a and b (4 bits)
- a pointer to which input is currently the denominator (1 bit)
In all the routines sign changes for the result are accumulated using fast bit twiddling which avoids conditional jumps.
The final result is calculated after verifying the inputs are coprime (GCD = 1) by raising (-1)^e
Much of the HGCD code is shared directly with the HGCD implementations, such
as the 2x2 matrix calculation, See Lehmer's Algorithm basecase and
GCD_DC_THRESHOLD
.
The asymptotic running time is O(M(N)*log(N)), where M(N) is the time for multiplying two N-limb numbers.
Next: Root Extraction Algorithms, Previous: Greatest Common Divisor Algorithms, Up: Algorithms [Index]
15.4 Powering Algorithms
• Normal Powering Algorithm | ||
• Modular Powering Algorithm |
Next: Modular Powering Algorithm, Previous: Powering Algorithms, Up: Powering Algorithms [Index]
15.4.1 Normal Powering
Normal mpz
or mpf
powering uses a simple binary algorithm,
successively squaring and then multiplying by the base when a 1 bit is seen in
the exponent, as per Knuth section 4.6.3. The “left to right”
variant described there is used rather than algorithm A, since it’s just as
easy and can be done with somewhat less temporary memory.
Previous: Normal Powering Algorithm, Up: Powering Algorithms [Index]
15.4.2 Modular Powering
Modular powering is implemented using a 2^k-ary sliding window algorithm, as per “Handbook of Applied Cryptography” algorithm 14.85 (see References). k is chosen according to the size of the exponent. Larger exponents use larger values of k, the choice being made to minimize the average number of multiplications that must supplement the squaring.
The modular multiplies and squarings use either a simple division or the REDC method by Montgomery (see References). REDC is a little faster, essentially saving N single limb divisions in a fashion similar to an exact remainder (see Exact Remainder).
Next: Radix Conversion Algorithms, Previous: Powering Algorithms, Up: Algorithms [Index]
15.5 Root Extraction Algorithms
• Square Root Algorithm | ||
• Nth Root Algorithm | ||
• Perfect Square Algorithm | ||
• Perfect Power Algorithm |
Next: Nth Root Algorithm, Previous: Root Extraction Algorithms, Up: Root Extraction Algorithms [Index]
15.5.1 Square Root
Square roots are taken using the “Karatsuba Square Root” algorithm by Paul Zimmermann (see References).
An input n is split into four parts of k bits each, so with b=2^k we have n = a3*b^3 + a2*b^2 + a1*b + a0. Part a3 must be “normalized” so that either the high or second highest bit is set. In GMP, k is kept on a limb boundary and the input is left shifted (by an even number of bits) to normalize.
The square root of the high two parts is taken, by recursive application of the algorithm (bottoming out in a one-limb Newton’s method),
s1,r1 = sqrtrem (a3*b + a2)
This is an approximation to the desired root and is extended by a division to give s,r,
q,u = divrem (r1*b + a1, 2*s1) s = s1*b + q r = u*b + a0 - q^2
The normalization requirement on a3 means at this point s is either correct or 1 too big. r is negative in the latter case, so
if r < 0 then r = r + 2*s - 1 s = s - 1
The algorithm is expressed in a divide and conquer form, but as noted in the paper it can also be viewed as a discrete variant of Newton’s method, or as a variation on the schoolboy method (no longer taught) for square roots two digits at a time.
If the remainder r is not required then usually only a few high limbs of r and u need to be calculated to determine whether an adjustment to s is required. This optimization is not currently implemented.
In the Karatsuba multiplication range this algorithm is O(1.5*M(N/2)), where M(n) is the time to multiply two numbers of n limbs. In the FFT multiplication range this grows to a bound of O(6*M(N/2)). In practice a factor of about 1.5 to 1.8 is found in the Karatsuba and Toom-3 ranges, growing to 2 or 3 in the FFT range.
The algorithm does all its calculations in integers and the resulting
mpn_sqrtrem
is used for both mpz_sqrt
and mpf_sqrt
.
The extended precision given by mpf_sqrt_ui
is obtained by
padding with zero limbs.
Next: Perfect Square Algorithm, Previous: Square Root Algorithm, Up: Root Extraction Algorithms [Index]
15.5.2 Nth Root
Integer Nth roots are taken using Newton’s method with the following iteration, where A is the input and n is the root to be taken.
1 A a[i+1] = - * ( --------- + (n-1)*a[i] ) n a[i]^(n-1)
The initial approximation a[1] is generated bitwise by successively powering a trial root with or without new 1 bits, aiming to be just above the true root. The iteration converges quadratically when started from a good approximation. When n is large more initial bits are needed to get good convergence. The current implementation is not particularly well optimized.
Next: Perfect Power Algorithm, Previous: Nth Root Algorithm, Up: Root Extraction Algorithms [Index]
15.5.3 Perfect Square
A significant fraction of non-squares can be quickly identified by checking whether the input is a quadratic residue modulo small integers.
mpz_perfect_square_p
first tests the input mod 256, which means just
examining the low byte. Only 44 different values occur for squares mod 256,
so 82.8% of inputs can be immediately identified as non-squares.
On a 32-bit system similar tests are done mod 9, 5, 7, 13 and 17, for a total 99.25% of inputs identified as non-squares. On a 64-bit system 97 is tested too, for a total 99.62%.
These moduli are chosen because they’re factors of 2^24-1 (or
2^48-1 for 64-bits), and such a remainder can be quickly taken just
using additions (see mpn_mod_34lsub1
).
When nails are in use moduli are instead selected by the gen-psqr.c
program and applied with an mpn_mod_1
. The same 2^24-1 or
2^48-1 could be done with nails using some extra bit shifts, but
this is not currently implemented.
In any case each modulus is applied to the mpn_mod_34lsub1
or
mpn_mod_1
remainder and a table lookup identifies non-squares. By
using a “modexact” style calculation, and suitably permuted tables, just one
multiply each is required, see the code for details. Moduli are also combined
to save operations, so long as the lookup tables don’t become too big.
gen-psqr.c does all the pre-calculations.
A square root must still be taken for any value that passes these tests, to verify it’s really a square and not one of the small fraction of non-squares that get through (i.e. a pseudo-square to all the tested bases).
Clearly more residue tests could be done, mpz_perfect_square_p
only
uses a compact and efficient set. Big inputs would probably benefit from more
residue testing, small inputs might be better off with less. The assumed
distribution of squares versus non-squares in the input would affect such
considerations.
Previous: Perfect Square Algorithm, Up: Root Extraction Algorithms [Index]
15.5.4 Perfect Power
Detecting perfect powers is required by some factorization algorithms.
Currently mpz_perfect_power_p
is implemented using repeated Nth root
extractions, though naturally only prime roots need to be considered.
(See Nth Root Algorithm.)
If a prime divisor p with multiplicity e can be found, then only roots which are divisors of e need to be considered, much reducing the work necessary. To this end divisibility by a set of small primes is checked.
Next: Other Algorithms, Previous: Root Extraction Algorithms, Up: Algorithms [Index]
15.6 Radix Conversion
Radix conversions are less important than other algorithms. A program dominated by conversions should probably use a different data representation.
• Binary to Radix | ||
• Radix to Binary |
Next: Radix to Binary, Previous: Radix Conversion Algorithms, Up: Radix Conversion Algorithms [Index]
15.6.1 Binary to Radix
Conversions from binary to a power-of-2 radix use a simple and fast O(N) bit extraction algorithm.
Conversions from binary to other radices use one of two algorithms. Sizes
below GET_STR_PRECOMPUTE_THRESHOLD
use a basic O(N^2) method.
Repeated divisions by b^n are made, where b is the radix and
n is the biggest power that fits in a limb. But instead of simply
using the remainder r from such divisions, an extra divide step is done
to give a fractional limb representing r/b^n. The digits of r
can then be extracted using multiplications by b rather than divisions.
Special case code is provided for decimal, allowing multiplications by 10 to
optimize to shifts and adds.
Above GET_STR_PRECOMPUTE_THRESHOLD
a sub-quadratic algorithm is used.
For an input t, powers b^(n*2^i) of the radix are
calculated, until a power between t and sqrt(t) is
reached. t is then divided by that largest power, giving a quotient
which is the digits above that power, and a remainder which is those below.
These two parts are in turn divided by the second highest power, and so on
recursively. When a piece has been divided down to less than
GET_STR_DC_THRESHOLD
limbs, the basecase algorithm described above is
used.
The advantage of this algorithm is that big divisions can make use of the sub-quadratic divide and conquer division (see Divide and Conquer Division), and big divisions tend to have less overheads than lots of separate single limb divisions anyway. But in any case the cost of calculating the powers b^(n*2^i) must first be overcome.
GET_STR_PRECOMPUTE_THRESHOLD
and GET_STR_DC_THRESHOLD
represent
the same basic thing, the point where it becomes worth doing a big division to
cut the input in half. GET_STR_PRECOMPUTE_THRESHOLD
includes the cost
of calculating the radix power required, whereas GET_STR_DC_THRESHOLD
assumes that’s already available, which is the case when recursing.
Since the base case produces digits from least to most significant but they
want to be stored from most to least, it’s necessary to calculate in advance
how many digits there will be, or at least be sure not to underestimate that.
For GMP the number of input bits is multiplied by chars_per_bit_exactly
from mp_bases
, rounding up. The result is either correct or one too
big.
Examining some of the high bits of the input could increase the chance of getting the exact number of digits, but an exact result every time would not be practical, since in general the difference between numbers 100… and 99… is only in the last few bits and the work to identify 99… might well be almost as much as a full conversion.
The r/b^n scheme described above for using multiplications to bring out digits might be useful for more than a single limb. Some brief experiments with it on the base case when recursing didn’t give a noticeable improvement, but perhaps that was only due to the implementation. Something similar would work for the sub-quadratic divisions too, though there would be the cost of calculating a bigger radix power.
Another possible improvement for the sub-quadratic part would be to arrange for radix powers that balanced the sizes of quotient and remainder produced, i.e. the highest power would be an b^(n*k) approximately equal to sqrt(t), not restricted to a 2^i factor. That ought to smooth out a graph of times against sizes, but may or may not be a net speedup.
Previous: Binary to Radix, Up: Radix Conversion Algorithms [Index]
15.6.2 Radix to Binary
This section needs to be rewritten, it currently describes the algorithms used before GMP 4.3.
Conversions from a power-of-2 radix into binary use a simple and fast O(N) bitwise concatenation algorithm.
Conversions from other radices use one of two algorithms. Sizes below
SET_STR_PRECOMPUTE_THRESHOLD
use a basic O(N^2) method. Groups
of n digits are converted to limbs, where n is the biggest
power of the base b which will fit in a limb, then those groups are
accumulated into the result by multiplying by b^n and adding. This
saves multi-precision operations, as per Knuth section 4.4 part E
(see References). Some special case code is provided for decimal, giving
the compiler a chance to optimize multiplications by 10.
Above SET_STR_PRECOMPUTE_THRESHOLD
a sub-quadratic algorithm is used.
First groups of n digits are converted into limbs. Then adjacent
limbs are combined into limb pairs with x*b^n+y, where x
and y are the limbs. Adjacent limb pairs are combined into quads
similarly with x*b^(2n)+y. This continues until a single block
remains, that being the result.
The advantage of this method is that the multiplications for each x are
big blocks, allowing Karatsuba and higher algorithms to be used. But the cost
of calculating the powers b^(n*2^i) must be overcome.
SET_STR_PRECOMPUTE_THRESHOLD
usually ends up quite big, around 5000 digits, and on
some processors much bigger still.
SET_STR_PRECOMPUTE_THRESHOLD
is based on the input digits (and tuned
for decimal), though it might be better based on a limb count, so as to be
independent of the base. But that sort of count isn’t used by the base case
and so would need some sort of initial calculation or estimate.
The main reason SET_STR_PRECOMPUTE_THRESHOLD
is so much bigger than the
corresponding GET_STR_PRECOMPUTE_THRESHOLD
is that mpn_mul_1
is
much faster than mpn_divrem_1
(often by a factor of 5, or more).
Next: Assembly Coding, Previous: Radix Conversion Algorithms, Up: Algorithms [Index]
15.7 Other Algorithms
• Prime Testing Algorithm | ||
• Factorial Algorithm | ||
• Binomial Coefficients Algorithm | ||
• Fibonacci Numbers Algorithm | ||
• Lucas Numbers Algorithm | ||
• Random Number Algorithms |
Next: Factorial Algorithm, Previous: Other Algorithms, Up: Other Algorithms [Index]
15.7.1 Prime Testing
The primality testing in mpz_probab_prime_p
(see Number Theoretic Functions) first does some trial division by small factors and then uses the
Miller-Rabin probabilistic primality testing algorithm, as described in Knuth
section 4.5.4 algorithm P (see References).
For an odd input n, and with n = q*2^k+1 where q is odd, this algorithm selects a random base x and tests whether x^q mod n is 1 or -1, or an x^(q*2^j) mod n is 1, for 1<=j<=k. If so then n is probably prime, if not then n is definitely composite.
Any prime n will pass the test, but some composites do too. Such composites are known as strong pseudoprimes to base x. No n is a strong pseudoprime to more than 1/4 of all bases (see Knuth exercise 22), hence with x chosen at random there’s no more than a 1/4 chance a “probable prime” will in fact be composite.
In fact strong pseudoprimes are quite rare, making the test much more powerful than this analysis would suggest, but 1/4 is all that’s proven for an arbitrary n.
Next: Binomial Coefficients Algorithm, Previous: Prime Testing Algorithm, Up: Other Algorithms [Index]
15.7.2 Factorial
Factorials are calculated by a combination of two algorithms. An idea is shared among them: to compute the odd part of the factorial; a final step takes account of the power of 2 term, by shifting.
For small n, the odd factor of n! is computed with the simple observation that it is equal to the product of all positive odd numbers smaller than n times the odd factor of [n/2]!, where [x] is the integer part of x, and so on recursively. The procedure can be best illustrated with an example,
23! = (23.21.19.17.15.13.11.9.7.5.3)(11.9.7.5.3)(5.3)2^{19}
Current code collects all the factors in a single list, with a loop and no recursion, and compute the product, with no special care for repeated chunks.
When n is larger, computation pass trough prime sieving. An helper function is used, as suggested by Peter Luschny:
n ----- n! | | L(p,n) msf(n) = -------------- = | | p [n/2]!^2.2^k p=3
Where p ranges on odd prime numbers. The exponent k is chosen to obtain an odd integer number: k is the number of 1 bits in the binary representation of [n/2]. The function L(p,n) can be defined as zero when p is composite, and, for any prime p, it is computed with:
--- \ n L(p,n) = / [---] mod 2 <= log (n) . --- p^i p i>0
With this helper function, we are able to compute the odd part of n! using the recursion implied by n!=[n/2]!^2*msf(n)*2^k. The recursion stops using the small-n algorithm on some [n/2^i].
Both the above algorithms use binary splitting to compute the product of many small factors. At first as many products as possible are accumulated in a single register, generating a list of factors that fit in a machine word. This list is then split into halves, and the product is computed recursively.
Such splitting is more efficient than repeated Nx1 multiplies since it forms big multiplies, allowing Karatsuba and higher algorithms to be used. And even below the Karatsuba threshold a big block of work can be more efficient for the basecase algorithm.
Next: Fibonacci Numbers Algorithm, Previous: Factorial Algorithm, Up: Other Algorithms [Index]
15.7.3 Binomial Coefficients
Binomial coefficients C(n,k) are calculated by first arranging k <= n/2 using C(n,k) = C(n,n-k) if necessary, and then evaluating the following product simply from i=2 to i=k.
k (n-k+i) C(n,k) = (n-k+1) * prod ------- i=2 i
It’s easy to show that each denominator i will divide the product so far, so the exact division algorithm is used (see Exact Division).
The numerators n-k+i and denominators i are first accumulated
into as many fit a limb, to save multi-precision operations, though for
mpz_bin_ui
this applies only to the divisors, since n is an
mpz_t
and n-k+i in general won’t fit in a limb at all.
Next: Lucas Numbers Algorithm, Previous: Binomial Coefficients Algorithm, Up: Other Algorithms [Index]
15.7.4 Fibonacci Numbers
The Fibonacci functions mpz_fib_ui
and mpz_fib2_ui
are designed
for calculating isolated F[n] or F[n],F[n-1]
values efficiently.
For small n, a table of single limb values in __gmp_fib_table
is
used. On a 32-bit limb this goes up to F[47], or on a 64-bit limb
up to F[93]. For convenience the table starts at F[-1].
Beyond the table, values are generated with a binary powering algorithm, calculating a pair F[n] and F[n-1] working from high to low across the bits of n. The formulas used are
F[2k+1] = 4*F[k]^2 - F[k-1]^2 + 2*(-1)^k F[2k-1] = F[k]^2 + F[k-1]^2 F[2k] = F[2k+1] - F[2k-1]
At each step, k is the high b bits of n. If the next bit of n is 0 then F[2k],F[2k-1] is used, or if it’s a 1 then F[2k+1],F[2k] is used, and the process repeated until all bits of n are incorporated. Notice these formulas require just two squares per bit of n.
It’d be possible to handle the first few n above the single limb table with simple additions, using the defining Fibonacci recurrence F[k+1]=F[k]+F[k-1], but this is not done since it usually turns out to be faster for only about 10 or 20 values of n, and including a block of code for just those doesn’t seem worthwhile. If they really mattered it’d be better to extend the data table.
Using a table avoids lots of calculations on small numbers, and makes small n go fast. A bigger table would make more small n go fast, it’s just a question of balancing size against desired speed. For GMP the code is kept compact, with the emphasis primarily on a good powering algorithm.
mpz_fib2_ui
returns both F[n] and F[n-1], but
mpz_fib_ui
is only interested in F[n]. In this case the last
step of the algorithm can become one multiply instead of two squares. One of
the following two formulas is used, according as n is odd or even.
F[2k] = F[k]*(F[k]+2F[k-1]) F[2k+1] = (2F[k]+F[k-1])*(2F[k]-F[k-1]) + 2*(-1)^k
F[2k+1] here is the same as above, just rearranged to be a
multiply. For interest, the 2*(-1)^k term both here and above
can be applied just to the low limb of the calculation, without a carry or
borrow into further limbs, which saves some code size. See comments with
mpz_fib_ui
and the internal mpn_fib2_ui
for how this is done.
Next: Random Number Algorithms, Previous: Fibonacci Numbers Algorithm, Up: Other Algorithms [Index]
15.7.5 Lucas Numbers
mpz_lucnum2_ui
derives a pair of Lucas numbers from a pair of Fibonacci
numbers with the following simple formulas.
L[k] = F[k] + 2*F[k-1] L[k-1] = 2*F[k] - F[k-1]
mpz_lucnum_ui
is only interested in L[n], and some work can be
saved. Trailing zero bits on n can be handled with a single square
each.
L[2k] = L[k]^2 - 2*(-1)^k
And the lowest 1 bit can be handled with one multiply of a pair of Fibonacci
numbers, similar to what mpz_fib_ui
does.
L[2k+1] = 5*F[k-1]*(2*F[k]+F[k-1]) - 4*(-1)^k
Previous: Lucas Numbers Algorithm, Up: Other Algorithms [Index]
15.7.6 Random Numbers
For the urandomb
functions, random numbers are generated simply by
concatenating bits produced by the generator. As long as the generator has
good randomness properties this will produce well-distributed N bit
numbers.
For the urandomm
functions, random numbers in a range 0<=R<N
are generated by taking values R of ceil(log2(N)) bits each until one satisfies R<N. This will normally
require only one or two attempts, but the attempts are limited in case the
generator is somehow degenerate and produces only 1 bits or similar.
The Mersenne Twister generator is by Matsumoto and Nishimura (see References). It has a non-repeating period of 2^19937-1, which is a Mersenne prime, hence the name of the generator. The state is 624 words of 32-bits each, which is iterated with one XOR and shift for each 32-bit word generated, making the algorithm very fast. Randomness properties are also very good and this is the default algorithm used by GMP.
Linear congruential generators are described in many text books, for instance Knuth volume 2 (see References). With a modulus M and parameters A and C, an integer state S is iterated by the formula S <- A*S+C mod M. At each step the new state is a linear function of the previous, mod M, hence the name of the generator.
In GMP only moduli of the form 2^N are supported, and the current implementation is not as well optimized as it could be. Overheads are significant when N is small, and when N is large clearly the multiply at each step will become slow. This is not a big concern, since the Mersenne Twister generator is better in every respect and is therefore recommended for all normal applications.
For both generators the current state can be deduced by observing enough output and applying some linear algebra (over GF(2) in the case of the Mersenne Twister). This generally means raw output is unsuitable for cryptographic applications without further hashing or the like.
Previous: Other Algorithms, Up: Algorithms [Index]
15.8 Assembly Coding
The assembly subroutines in GMP are the most significant source of speed at small to moderate sizes. At larger sizes algorithm selection becomes more important, but of course speedups in low level routines will still speed up everything proportionally.
Carry handling and widening multiplies that are important for GMP can’t be
easily expressed in C. GCC asm
blocks help a lot and are provided in
longlong.h, but hand coding low level routines invariably offers a
speedup over generic C by a factor of anything from 2 to 10.
Next: Assembly Basics, Previous: Assembly Coding, Up: Assembly Coding [Index]
15.8.1 Code Organisation
The various mpn subdirectories contain machine-dependent code, written in C or assembly. The mpn/generic subdirectory contains default code, used when there’s no machine-specific version of a particular file.
Each mpn subdirectory is for an ISA family. Generally 32-bit and 64-bit variants in a family cannot share code and have separate directories. Within a family further subdirectories may exist for CPU variants.
In each directory a nails subdirectory may exist, holding code with
nails support for that CPU variant. A NAILS_SUPPORT
directive in each
file indicates the nails values the code handles. Nails code only exists
where it’s faster, or promises to be faster, than plain code. There’s no
effort put into nails if they’re not going to enhance a given CPU.
Next: Assembly Carry Propagation, Previous: Assembly Code Organisation, Up: Assembly Coding [Index]
15.8.2 Assembly Basics
mpn_addmul_1
and mpn_submul_1
are the most important routines
for overall GMP performance. All multiplications and divisions come down to
repeated calls to these. mpn_add_n
, mpn_sub_n
,
mpn_lshift
and mpn_rshift
are next most important.
On some CPUs assembly versions of the internal functions
mpn_mul_basecase
and mpn_sqr_basecase
give significant speedups,
mainly through avoiding function call overheads. They can also potentially
make better use of a wide superscalar processor, as can bigger primitives like
mpn_addmul_2
or mpn_addmul_4
.
The restrictions on overlaps between sources and destinations
(see Low-level Functions) are designed to facilitate a variety of
implementations. For example, knowing mpn_add_n
won’t have partly
overlapping sources and destination means reading can be done far ahead of
writing on superscalar processors, and loops can be vectorized on a vector
processor, depending on the carry handling.
Next: Assembly Cache Handling, Previous: Assembly Basics, Up: Assembly Coding [Index]
15.8.3 Carry Propagation
The problem that presents most challenges in GMP is propagating carries from
one limb to the next. In functions like mpn_addmul_1
and
mpn_add_n
, carries are the only dependencies between limb operations.
On processors with carry flags, a straightforward CISC style adc
is
generally best. AMD K6 mpn_addmul_1
however is an example of an
unusual set of circumstances where a branch works out better.
On RISC processors generally an add and compare for overflow is used. This sort of thing can be seen in mpn/generic/aors_n.c. Some carry propagation schemes require 4 instructions, meaning at least 4 cycles per limb, but other schemes may use just 1 or 2. On wide superscalar processors performance may be completely determined by the number of dependent instructions between carry-in and carry-out for each limb.
On vector processors good use can be made of the fact that a carry bit only
very rarely propagates more than one limb. When adding a single bit to a
limb, there’s only a carry out if that limb was 0xFF…FF
which on
random data will be only 1 in 2^mp_bits_per_limb. mpn/cray/add_n.c is an example of this, it adds
all limbs in parallel, adds one set of carry bits in parallel and then only
rarely needs to fall through to a loop propagating further carries.
On the x86s, GCC (as of version 2.95.2) doesn’t generate particularly good code
for the RISC style idioms that are necessary to handle carry bits in
C. Often conditional jumps are generated where adc
or sbb
forms
would be better. And so unfortunately almost any loop involving carry bits
needs to be coded in assembly for best results.
Next: Assembly Functional Units, Previous: Assembly Carry Propagation, Up: Assembly Coding [Index]
15.8.4 Cache Handling
GMP aims to perform well both on operands that fit entirely in L1 cache and those which don’t.
Basic routines like mpn_add_n
or mpn_lshift
are often used on
large operands, so L2 and main memory performance is important for them.
mpn_mul_1
and mpn_addmul_1
are mostly used for multiply and
square basecases, so L1 performance matters most for them, unless assembly
versions of mpn_mul_basecase
and mpn_sqr_basecase
exist, in
which case the remaining uses are mostly for larger operands.
For L2 or main memory operands, memory access times will almost certainly be more than the calculation time. The aim therefore is to maximize memory throughput, by starting a load of the next cache line while processing the contents of the previous one. Clearly this is only possible if the chip has a lock-up free cache or some sort of prefetch instruction. Most current chips have both these features.
Prefetching sources combines well with loop unrolling, since a prefetch can be initiated once per unrolled loop (or more than once if the loop covers more than one cache line).
On CPUs without write-allocate caches, prefetching destinations will ensure
individual stores don’t go further down the cache hierarchy, limiting
bandwidth. Of course for calculations which are slow anyway, like
mpn_divrem_1
, write-throughs might be fine.
The distance ahead to prefetch will be determined by memory latency versus throughput. The aim of course is to have data arriving continuously, at peak throughput. Some CPUs have limits on the number of fetches or prefetches in progress.
If a special prefetch instruction doesn’t exist then a plain load can be used, but in that case care must be taken not to attempt to read past the end of an operand, since that might produce a segmentation violation.
Some CPUs or systems have hardware that detects sequential memory accesses and initiates suitable cache movements automatically, making life easy.
Next: Assembly Floating Point, Previous: Assembly Cache Handling, Up: Assembly Coding [Index]
15.8.5 Functional Units
When choosing an approach for an assembly loop, consideration is given to what operations can execute simultaneously and what throughput can thereby be achieved. In some cases an algorithm can be tweaked to accommodate available resources.
Loop control will generally require a counter and pointer updates, costing as much as 5 instructions, plus any delays a branch introduces. CPU addressing modes might reduce pointer updates, perhaps by allowing just one updating pointer and others expressed as offsets from it, or on CISC chips with all addressing done with the loop counter as a scaled index.
The final loop control cost can be amortised by processing several limbs in each iteration (see Assembly Loop Unrolling). This at least ensures loop control isn’t a big fraction the work done.
Memory throughput is always a limit. If perhaps only one load or one store
can be done per cycle then 3 cycles/limb will the top speed for “binary”
operations like mpn_add_n
, and any code achieving that is optimal.
Integer resources can be freed up by having the loop counter in a float register, or by pressing the float units into use for some multiplying, perhaps doing every second limb on the float side (see Assembly Floating Point).
Float resources can be freed up by doing carry propagation on the integer side, or even by doing integer to float conversions in integers using bit twiddling.
Next: Assembly SIMD Instructions, Previous: Assembly Functional Units, Up: Assembly Coding [Index]
15.8.6 Floating Point
Floating point arithmetic is used in GMP for multiplications on CPUs with poor
integer multipliers. It’s mostly useful for mpn_mul_1
,
mpn_addmul_1
and mpn_submul_1
on 64-bit machines, and
mpn_mul_basecase
on both 32-bit and 64-bit machines.
With IEEE 53-bit double precision floats, integer multiplications producing up to 53 bits will give exact results. Breaking a 64x64 multiplication into eight 16x32->48 bit pieces is convenient. With some care though six 21x32->53 bit products can be used, if one of the lower two 21-bit pieces also uses the sign bit.
For the mpn_mul_1
family of functions on a 64-bit machine, the
invariant single limb is split at the start, into 3 or 4 pieces. Inside the
loop, the bignum operand is split into 32-bit pieces. Fast conversion of
these unsigned 32-bit pieces to floating point is highly machine-dependent.
In some cases, reading the data into the integer unit, zero-extending to
64-bits, then transferring to the floating point unit back via memory is the
only option.
Converting partial products back to 64-bit limbs is usually best done as a signed conversion. Since all values are smaller than 2^53, signed and unsigned are the same, but most processors lack unsigned conversions.
Here is a diagram showing 16x32 bit products for an mpn_mul_1
or
mpn_addmul_1
with a 64-bit limb. The single limb operand V is split
into four 16-bit parts. The multi-limb operand U is split in the loop into
two 32-bit parts.
+---+---+---+---+ |v48|v32|v16|v00| V operand +---+---+---+---+ +-------+---+---+ x | u32 | u00 | U operand (one limb) +---------------+ --------------------------------- +-----------+ | u00 x v00 | p00 48-bit products +-----------+ +-----------+ | u00 x v16 | p16 +-----------+ +-----------+ | u00 x v32 | p32 +-----------+ +-----------+ | u00 x v48 | p48 +-----------+ +-----------+ | u32 x v00 | r32 +-----------+ +-----------+ | u32 x v16 | r48 +-----------+ +-----------+ | u32 x v32 | r64 +-----------+ +-----------+ | u32 x v48 | r80 +-----------+
p32 and r32 can be summed using floating-point addition, and likewise p48 and r48. p00 and p16 can be summed with r64 and r80 from the previous iteration.
For each loop then, four 49-bit quantities are transferred to the integer unit, aligned as follows,
|-----64bits----|-----64bits----| +------------+ | p00 + r64' | i00 +------------+ +------------+ | p16 + r80' | i16 +------------+ +------------+ | p32 + r32 | i32 +------------+ +------------+ | p48 + r48 | i48 +------------+
The challenge then is to sum these efficiently and add in a carry limb, generating a low 64-bit result limb and a high 33-bit carry limb (i48 extends 33 bits into the high half).
Next: Assembly Software Pipelining, Previous: Assembly Floating Point, Up: Assembly Coding [Index]
15.8.7 SIMD Instructions
The single-instruction multiple-data support in current microprocessors is aimed at signal processing algorithms where each data point can be treated more or less independently. There’s generally not much support for propagating the sort of carries that arise in GMP.
SIMD multiplications of say four 16x16 bit multiplies only do as much work as one 32x32 from GMP’s point of view, and need some shifts and adds besides. But of course if say the SIMD form is fully pipelined and uses less instruction decoding then it may still be worthwhile.
On the x86 chips, MMX has so far found a use in mpn_rshift
and
mpn_lshift
, and is used in a special case for 16-bit multipliers in the
P55 mpn_mul_1
. SSE2 is used for Pentium 4 mpn_mul_1
,
mpn_addmul_1
, and mpn_submul_1
.
Next: Assembly Loop Unrolling, Previous: Assembly SIMD Instructions, Up: Assembly Coding [Index]
15.8.8 Software Pipelining
Software pipelining consists of scheduling instructions around the branch point in a loop. For example a loop might issue a load not for use in the present iteration but the next, thereby allowing extra cycles for the data to arrive from memory.
Naturally this is wanted only when doing things like loads or multiplies that take several cycles to complete, and only where a CPU has multiple functional units so that other work can be done in the meantime.
A pipeline with several stages will have a data value in progress at each stage and each loop iteration moves them along one stage. This is like juggling.
If the latency of some instruction is greater than the loop time then it will be necessary to unroll, so one register has a result ready to use while another (or multiple others) are still in progress. (see Assembly Loop Unrolling).
Next: Assembly Writing Guide, Previous: Assembly Software Pipelining, Up: Assembly Coding [Index]
15.8.9 Loop Unrolling
Loop unrolling consists of replicating code so that several limbs are
processed in each loop. At a minimum this reduces loop overheads by a
corresponding factor, but it can also allow better register usage, for example
alternately using one register combination and then another. Judicious use of
m4
macros can help avoid lots of duplication in the source code.
Any amount of unrolling can be handled with a loop counter that’s decremented by N each time, stopping when the remaining count is less than the further N the loop will process. Or by subtracting N at the start, the termination condition becomes when the counter C is less than 0 (and the count of remaining limbs is C+N).
Alternately for a power of 2 unroll the loop count and remainder can be established with a shift and mask. This is convenient if also making a computed jump into the middle of a large loop.
The limbs not a multiple of the unrolling can be handled in various ways, for example
- A simple loop at the end (or the start) to process the excess. Care will be wanted that it isn’t too much slower than the unrolled part.
- A set of binary tests, for example after an 8-limb unrolling, test for 4 more limbs to process, then a further 2 more or not, and finally 1 more or not. This will probably take more code space than a simple loop.
- A
switch
statement, providing separate code for each possible excess, for example an 8-limb unrolling would have separate code for 0 remaining, 1 remaining, etc, up to 7 remaining. This might take a lot of code, but may be the best way to optimize all cases in combination with a deep pipelined loop. - A computed jump into the middle of the loop, thus making the first iteration handle the excess. This should make times smoothly increase with size, which is attractive, but setups for the jump and adjustments for pointers can be tricky and could become quite difficult in combination with deep pipelining.
Previous: Assembly Loop Unrolling, Up: Assembly Coding [Index]
15.8.10 Writing Guide
This is a guide to writing software pipelined loops for processing limb vectors in assembly.
First determine the algorithm and which instructions are needed. Code it without unrolling or scheduling, to make sure it works. On a 3-operand CPU try to write each new value to a new register, this will greatly simplify later steps.
Then note for each instruction the functional unit and/or issue port requirements. If an instruction can use either of two units, like U0 or U1 then make a category “U0/U1”. Count the total using each unit (or combined unit), and count all instructions.
Figure out from those counts the best possible loop time. The goal will be to find a perfect schedule where instruction latencies are completely hidden. The total instruction count might be the limiting factor, or perhaps a particular functional unit. It might be possible to tweak the instructions to help the limiting factor.
Suppose the loop time is N, then make N issue buckets, with the final loop branch at the end of the last. Now fill the buckets with dummy instructions using the functional units desired. Run this to make sure the intended speed is reached.
Now replace the dummy instructions with the real instructions from the slow but correct loop you started with. The first will typically be a load instruction. Then the instruction using that value is placed in a bucket an appropriate distance down. Run the loop again, to check it still runs at target speed.
Keep placing instructions, frequently measuring the loop. After a few you will need to wrap around from the last bucket back to the top of the loop. If you used the new-register for new-value strategy above then there will be no register conflicts. If not then take care not to clobber something already in use. Changing registers at this time is very error prone.
The loop will overlap two or more of the original loop iterations, and the computation of one vector element result will be started in one iteration of the new loop, and completed one or several iterations later.
The final step is to create feed-in and wind-down code for the loop. A good way to do this is to make a copy (or copies) of the loop at the start and delete those instructions which don’t have valid antecedents, and at the end replicate and delete those whose results are unwanted (including any further loads).
The loop will have a minimum number of limbs loaded and processed, so the feed-in code must test if the request size is smaller and skip either to a suitable part of the wind-down or to special code for small sizes.
Previous: Assembly Loop Unrolling, Up: Assembly Coding [Index]