How to do fast multiplication using the FFT

Current Deep Learning workflows rely on thousands of integer multiplications, so achieving efficient multiplication performance is critical nowadays.

Adrian PD
Towards Data Science


Photo by Gsightfotos on Unsplash

The multiplication operation is the workhorse kernel in many scientific and Data Science applications. Its complexity is not linear, so it is hard to scale as inputs grow. Current Deep Learning workflows rely on thousands of integer multiplications, and this number grows with the complexity of the DL architectures and the amount of data available to train models. Therefore, achieving efficient multiplication performance is critical nowadays. This article shows how to perform integer multiplications using one of the most important signal-processing discoveries of the 20th century: the Fast Fourier Transform.

Deep Learning convolutions are not the only workloads that depend on integer multiplication; other scientific and computing applications, such as rendering fractal images at high magnification and public-key cryptography, rely on it as well. The classical vector multiplication has a complexity of O(N²), where N is the number of digits. Roughly speaking, any complexity above O(logN) is difficult to scale on many-core platforms, which means it will remain the bottleneck of our computations no matter how many computing units we add to our Cloud infrastructures. Compiler optimizations, together with other performance techniques like tiling or caching strategies, are useless unless the quadratic complexity is reduced. Is it possible to decrease the complexity of the classical multiplication operation? Yes, it is.

It is all about Polynomials

Before explaining how to reduce the complexity of the multiplication operation, let me show you that any integer in base x can be decomposed into a polynomial coefficient vector. The resulting polynomial is described by a set of coefficients, and the original integer is recovered by evaluating the polynomial at the base x:

Image by the author

The number of coefficients is equal to the number of digits; that is, the size of the polynomial. This is called the coefficient representation. Remember from your math lessons that the product of two polynomials of size N is a third polynomial of degree 2N−2, that is, up to 2N−1 coefficients (2N for simplicity), and that computing its coefficients is a vector convolution.
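To make this concrete, here is a minimal Python sketch of the round trip between an integer and its coefficient vector (the function names are mine, just for illustration):

```python
def to_coefficients(n, base=10):
    """Decompose a non-negative integer into its coefficient vector,
    least-significant digit first: 1234 -> [4, 3, 2, 1]."""
    coeffs = []
    while n > 0:
        coeffs.append(n % base)
        n //= base
    return coeffs or [0]

def from_coefficients(coeffs, base=10):
    """Evaluate the polynomial at x = base to rebuild the integer."""
    return sum(c * base**i for i, c in enumerate(coeffs))

assert from_coefficients(to_coefficients(1234)) == 1234
```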

Above I claimed that any integer with N digits can be decomposed into a polynomial of size N, and that the multiplication of two polynomials of size N yields another polynomial of size 2N. The next question is how to perform the polynomial multiplication:

Image by the author

As the image above shows, the natural way of convolving two polynomials is to use the distributive property, multiplying term by term and then merging redundant terms by adding the coefficients with the same polynomial degree. This is the classical multiplication I mentioned before, with O(N²) steps. However, look at the points marked on the image. Each point has coordinates (i, P(i)), where P(i) is the evaluation of P(x) at x=i. Any polynomial of degree d is uniquely defined by d+1 points on the plane, and this is called the value representation. The idea behind FFT multiplication is to sample A(x) and B(x) at d+1 or more points, (x_i, A(x_i)) and (x_i, B(x_i)), and then simply multiply the function values one by one (a pairwise product) to get the value representation of the product of the two polynomials:

Multiplying in the value representation significantly reduces the number of operations performed compared to multiplying in the coefficient representation. Hence, the idea is to change from the coefficient representation to the value representation, perform the multiplication in a pairwise fashion, and transform the value representation back to the coefficient representation. We need a few more concepts before we can see how to do it.
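To make the contrast concrete, here is a small Python sketch of both approaches (illustrative code of my own, not a production kernel):

```python
def convolve_naive(a, b):
    """Classical O(N^2) multiplication in coefficient representation."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj  # terms of equal degree accumulate
    return c

def multiply_values(va, vb):
    """Multiplication in value representation: a pairwise product, O(N)."""
    return [x * y for x, y in zip(va, vb)]

# (x + 2) * (3x + 4) = 3x^2 + 10x + 8
assert convolve_naive([2, 1], [4, 3]) == [8, 10, 3]
```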

The Discrete Fourier Transform for dummies

Having refreshed our polynomial knowledge, we can now dig into the world of signals. If you have never heard of the Fourier Transform, you just have to know that signals can be decomposed into a sum of other signals. While a time-domain graph shows how a signal's amplitude changes over time, a frequency-domain graph shows how much of the signal lies within each frequency band over a range of frequencies:

Image by author

The Fourier Transform decomposes any signal into a sum of complex sine and cosine waves. Although it can be applied to both continuous and discrete waves, I am only focusing on discrete waves in this article, i.e., the Discrete Fourier Transform (DFT). The DFT is the function that allows you to change from one domain to the other, and the FFT is an algorithm that computes the DFT very efficiently, but we will talk about it later.

Image by the author
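To get a feeling for this domain change, numpy's FFT can take a discrete signal apart in a couple of lines (an illustrative sketch; the sampling rate and frequencies are arbitrary choices of mine):

```python
import numpy as np

# One second of a 4 Hz sine plus a weaker 9 Hz sine, sampled at 64 Hz.
t = np.arange(64) / 64.0
signal = np.sin(2 * np.pi * 4 * t) + 0.5 * np.sin(2 * np.pi * 9 * t)

spectrum = np.fft.fft(signal)       # time domain -> frequency domain
magnitudes = np.abs(spectrum[:32])  # keep the non-negative frequency bins

print(np.argsort(magnitudes)[-2:])  # -> [9 4]: the two component frequencies
```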

Now that we have a general idea of polynomials and the DFT, we can move forward with our multiplication problem. We said that:

Hence, the idea is to change from the coefficient representation to the value representation, perform the multiplication in a pairwise fashion, and transform back the value representation to the coefficient representation.

An intuitive way of understanding the FFT

Maybe you are thinking at this point that the DFT can help us transform from one representation to the other. You are right; let's see why. We said that we need to sample at least d+1 points to obtain the polynomial values that uniquely define the polynomial. There are some tricks that help reduce the number of sample evaluations even further:

Image by the author

If we are able to apply some polynomial properties, we can reduce the number of sample evaluations we need. Keep in mind that for an even polynomial P(-x)=P(x), which means it is symmetric with respect to the y-axis. Let's split P(x) into its even-degree and odd-degree terms: P(x) = P_e(x²) + x·P_o(x²), where P_e and P_o collect the even and odd coefficients, respectively:

Image by the author

With this decomposition, the polynomial evaluation at the point -k reuses the polynomial evaluation at k, since P(k) = P_e(k²) + k·P_o(k²) and P(-k) = P_e(k²) - k·P_o(k²):

Image by the author

This means that we only need to evaluate n/2 points (the complementary n/2 points are obtained instantly), which reduces the number of operations even more. At this point you are probably thinking of a recursive procedure, and you are right. We can build this process intuitively as follows. For a given polynomial P(x) and n points (in fact, n/2 paired points), we write P(x) = P_e(x²) + x·P_o(x²) and recursively evaluate P_e over the n/2 squared points on the one hand, and P_o over the same n/2 squared points on the other. After getting their results, it is easy to obtain the evaluation of P(x) at the remaining n/2 negative points:

Image by the author
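To see the trick in action, here is a tiny Python illustration (the polynomial is an arbitrary example of mine):

```python
def split_eval(coeffs, x):
    """Evaluate P at x and -x at once using P(x) = P_e(x^2) + x * P_o(x^2)."""
    pe = coeffs[0::2]  # even-degree coefficients
    po = coeffs[1::2]  # odd-degree coefficients
    e = sum(c * (x * x) ** i for i, c in enumerate(pe))
    o = sum(c * (x * x) ** i for i, c in enumerate(po))
    return e + x * o, e - x * o  # P(x), P(-x)

# P(x) = 1 + 5x + 2x^2 + 3x^3: one pass yields both P(2) = 43 and P(-2) = -25
assert split_eval([1, 5, 2, 3], 2) == (43, -25)
```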

The next question to answer is which n points we choose to evaluate our polynomial. Can we select any set of paired numbers (x, -x)? No, we cannot:

  1. We are passing half of the points down to the next recursive level. The algorithm assumes that, on every recursive call, the evaluation points still come in positive/negative pairs. However, each point is squared on the way down, and squares of real numbers are always positive, so the pairing breaks. Is there any number whose square is negative? Yes: the imaginary unit i.
  2. In addition to this, each recursive level relies on the fact that the second half of the received points are equal to the first half but with the opposite sign. This ends up forming a system of equations whose bottom level of recursion contains a single point with value 1. There are different ways of solving it, but the most immediate is to use the n-th roots of unity (the complex numbers that yield 1 when raised to the positive integer power n).

Imagine that our n equals 12; solving for the 12th roots of unity gives the following points to evaluate:

Source askIITians
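For reference, these evaluation points are easy to generate with numpy (a small sketch of my own):

```python
import numpy as np

n = 12
roots = np.exp(2j * np.pi * np.arange(n) / n)  # the 12th roots of unity
assert np.allclose(roots ** n, 1.0)            # each yields 1 when raised to n
```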

Note that the set of points is halved at each of the log₂(n) levels of recursion, so it is convenient to choose n as the smallest power of two greater than d+1, even though we only evaluate n/2 of the points at each level. The good news: we have just arrived at the most important algorithm in signal processing! This algorithm is known as the Fast Fourier Transform.

Let’s put it all together into a pseudo-code:

Reducible Youtube Channel
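For readers who prefer running code, here is a minimal recursive sketch of the same idea in Python (my own translation, assuming the input length is a power of two and making no attempt at performance):

```python
import cmath

def fft(p):
    """Evaluate the polynomial p (list of coefficients, length a power of
    two) at the len(p)-th roots of unity: its value representation."""
    n = len(p)
    if n == 1:
        return p
    even = fft(p[0::2])  # P_e evaluated at the (n/2)-th roots of unity
    odd = fft(p[1::2])   # P_o evaluated at the same points
    values = [0] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n)       # k-th n-th root of unity
        values[k] = even[k] + w * odd[k]           # P(w)
        values[k + n // 2] = even[k] - w * odd[k]  # P(-w)
    return values
```

The inverse transform, which brings us back from the value representation to the coefficient representation, follows the same recursion with conjugated roots plus a final division by n.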

Thanks to the FFT, we obtain the value representation of each polynomial by applying the Discrete Fourier Transform in just O(N logN) operations. To multiply two polynomials in value representation, we simply take the pairwise product of the function evaluations at each point, where pairwise product means multiplying the two vectors element by element, which is computationally very cheap.

The Strassen FFT algorithm for multiplying large integers

This algorithm was invented by Schönhage and Strassen in 1971, but at this point in the article you will be able to understand it easily.

If we want to multiply two large integers A and B of size N, we first transform them into their polynomial coefficient representations in base x and store the resulting coefficients in vectors a and b, respectively. According to the convolution theorem, if c is the convolution of the two input vectors, c = a ∗ b, then the Discrete Fourier Transform (DFT) of c equals the pairwise product of the DFTs of the inputs: DFT(c) = DFT(a)·DFT(b).

Thus, the vector c can also be obtained as the Inverse Discrete Fourier Transform (IDFT) of this pairwise product: c = IDFT(DFT(a)·DFT(b)).

If the input vectors do not have the same length, M < N, the shorter one is padded with zeros until M = N. Finally, the coefficients have to be normalized to the same base as the one in which the integer is represented, a step commonly known as carry propagation. This is important because, after the inverse FFT, each element of c may hold an integer larger than the base x due to the nature of the pairwise multiplication. The following image shows an example of the carry propagation method, where the elements end up in base 10.

Image by the author
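Putting the whole pipeline together, here is a compact sketch of the approach using numpy's FFT (illustrative code of my own; real implementations pad to powers of two and handle precision as discussed in the next section):

```python
import numpy as np

def multiply(a_int, b_int, base=10):
    """Multiply two non-negative integers via FFT convolution."""
    a = [int(d) for d in str(a_int)][::-1]  # coefficient vectors,
    b = [int(d) for d in str(b_int)][::-1]  # least-significant digit first
    n = len(a) + len(b) - 1                 # size of the product polynomial
    fa = np.fft.fft(a, n)                   # zero-pads to length n internally
    fb = np.fft.fft(b, n)
    c = np.fft.ifft(fa * fb).real           # pairwise product, then inverse
    c = [int(round(v)) for v in c]          # rounding removes FFT noise
    # Carry propagation: bring every coefficient back below the base.
    result, carry = 0, 0
    for i, coef in enumerate(c):
        carry, digit = divmod(coef + carry, base)
        result += digit * base**i
    return result + carry * base**len(c)

assert multiply(123456789, 987654321) == 123456789 * 987654321
```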

Computing approaches

When implementing the Strassen algorithm, most existing libraries do not use the FFT in the complex field. In the complex field, the integer coefficients are transformed into complex numbers, which ends up producing values like 14.999878 instead of 15 after the inverse FFT. In contrast, most implementations use the finite field Z/pZ, with p prime. In this case, it is desirable to choose a p that minimizes the latency of the modulo operation, and Fermat primes are chosen for this purpose in most cases.

Nevertheless, the finite field comes with several important restrictions, in addition to finding the n-th root of unity:

- The maximum value must fit in the field, that is, (n/2)·(x-1)² < p (see the quick check after this list).
- Multiplication in Z/pZ must be performed modulo p, thus a fast modulo-p operator is desirable (like the Montgomery reduction algorithm).
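As a quick sanity check of the first restriction (an example of my own; the prime p = 2⁶⁴ - 2³² + 1 is a popular choice for number-theoretic transforms, not one taken from the paper), even very long base-10 inputs fit comfortably:

```python
x = 10                 # working in base 10
p = 2**64 - 2**32 + 1  # an NTT-friendly prime (assumption, not from the paper)
n = 2**50              # number of evaluation points

assert (n // 2) * (x - 1) ** 2 < p  # product coefficients still fit in the field
```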

Traditionally, implementations in the finite field have been less efficient than implementations in the complex field, as additional modulo operations are required. In this recent paper, we demonstrated that the Strassen algorithm with FFTs working in the complex field outperforms the other approaches and is well suited to GPU architectures. We also demonstrated mathematically that it is safe to use (despite working with complex numbers) for integers with fewer than 500K digits.

The important thing to understand here is that although classical multiplication can perform well for small sizes, the complexity never lies. When the number of digits increases, performance drops despite optimizations like tiling because of the O(N²) complexity. The Strassen algorithm, with O(N logN loglogN) complexity when counting all the operations involved, is much faster. The following graph plots the performance of the Real/Complex FFT Strassen implementations (built on two different libraries, ID/CuFFT) against the tiling-based implementation of the classical algorithm:

Image by the author

Adrian Perez is a researcher at Berkeley Lab and holds a PhD in Parallel Algorithms for Supercomputing. You can check out more of his Spanish and English content on his Medium profile.

Bibliography

Dieguez et al. Efficient high-precision integer multiplication on the GPU. International Journal of High Performance Computing Applications (2022).
