1

In 1966, Lander and Parkin published a paper containing exactly two sentences. They reported that they had used a direct-search program on a CDC 6600 to obtain one counterexample to Euler's Sum of Powers Conjecture. The result:

27^5 + 84^5 + 110^5 + 133^5 = 144^5


A small program, written in Fortran and using OpenMP, reproduces this result (and others of a similar nature) in two minutes on a current PC (8 threads).
I am curious to know if anyone can estimate how long the CDC 6600 (which cost about $10 million in 1966, equivalent to about $70 million this year) would have run before it produced the reported result.

I am grateful to John Nichols for bringing the paper to my attention in the Intel Fortran Forum a few years ago.

Some details on the search algorithm used in the 1966 paper may be seen in another paper, https://www.ams.org/journals/mcom/1967-21-097/S0025-5718-1967-0220669-3/S0025-5718-1967-0220669-3.pdf.

4 minutes seems like a lot of time.

I just tested this with:

! sum_conjecture.f90
!
! 27^5 + 84^5 + 110^5 + 133^5 = 144^5
!
use, intrinsic :: iso_fortran_env, only: int64
implicit none
integer(int64) :: i,j,k,l,n

outer: do i = 1_int64, 10000_int64
  do j = 1_int64, i
    do k = 1_int64, i
      do l = 1_int64, i
        do n = 1_int64, i
          if (j**5 + k**5 + l**5 + n**5 == i**5) then
            print *, "i^5 = ", i**5
            print *, "j^5 + k^5 + l^5 + n^5 = ", j**5 + k**5 + l**5 + n**5
            print *, j,k,l,n,i
            exit outer
          end if
        end do
      end do
    end do
  end do
end do outer

end

On an Apple M2 Pro I get:

$ time ./a.out
 i^5 =           61917364224
 j^5 + k^5 + l^5 + n^5 =           61917364224
                   27                   84                  110                  133                  144

real	0m5.956s
user	0m5.873s
sys	0m0.060s

3

0.4 seconds, but I made the loops hyper-triangular.

4

0.2 seconds if you precompute all integer fifth-powers.

5

It is an interesting example since it is dominated by floating point multiplication, with a relatively small number of integer adds, and minimal memory references.

A floating-point multiply functional unit on the 6600 took 10 cycles to complete, and it wasn't pipelined. There were actually two FP multiply units, so with a clock speed of 10 MHz it could theoretically do ~2 million multiplies/sec. The 6600 allowed instructions in other functional units to operate in parallel via its "scoreboard" logic, and the compiler could schedule integer adds and memory loads/stores into the gaps while the multiply units were busy.

In the later CDC 7600, Mr. Cray ditched the scoreboard and went to fully pipelined functional units (except floating divide), so a new floating-point multiply could be issued every clock period. The clock speed was faster (36 MHz) as well. But I'd guess the bottleneck would have moved to the eight 60-bit X registers.

Both the 6600 and 7600 did integer multiplies via floating point multiply units. Integer adds were done via a separate integer add functional unit.
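Taking the ~2 million multiplies/sec figure above at face value, here is a purely illustrative back-of-envelope estimate (my own arithmetic, not from the 1966 paper, whose actual search algorithm was different) for a triangular search up to i = 144:

```python
from math import comb

# Back-of-envelope only: assume a triangular search (1 <= m <= l <= k <= j <= i)
# up to i = 144, and roughly one multiply-equivalent of work per iteration.
# The number of 4-tuples with m <= l <= k <= j <= i is comb(i + 3, 4), and
# summing that over i = 1..144 telescopes (hockey-stick identity) to comb(148, 5).
iterations = sum(comb(i + 3, 4) for i in range(1, 145))
assert iterations == comb(148, 5)   # about 5.5e8 iterations

mults_per_sec = 2_000_000           # the 6600 throughput figure quoted above
seconds = iterations / mults_per_sec
print(iterations, seconds)          # a few hundred seconds under these assumptions
```

Under these (very rough) assumptions, the triangular search to i = 144 works out to a few minutes of 6600 time; the real program's cost would depend on how the search space was actually ordered.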

Do you know how many clock cycles an integer multiply and a floating point multiply took on these machines? The integer instruction supposedly could skip the exponent operations.

BTW, the inner loop in that code really only does one n**5 operation, one addition, and one comparison. The compiler should move all the other stuff to the outside of the inner loop.

7

Timing is identical for integer vs FP multiply. It is really an FP instruction either way: just unrounded vs rounded variants and, of course, zeros in the upper 12 bits for sign and exponent. There is also a 'double precision' version of the multiply, which gives the upper 48 bits of a 96-bit result.

8

It’s been a long time since I’ve thought about these details… The DP product instruction returns the lower 48 bits…

Perusing my copy of Ralph Grishman's classic "Assembly Language Programming", I am reminded that the earliest versions of the 6600 did not have the ability to multiply integers directly via the FP multiply unit. So one had to use the Pack instruction to convert the integers to FP, then the DP product instruction, then convert back to integer via the Unpack instruction.

CDC fairly quickly realized they could make a small change to the DP product instruction to recognize that "if an integer smaller than 2^48 is used as an operand to a floating point instruction, it is treated as a floating point number with a biased exponent of 0 and hence a true exponent of -1777 (base 8)." Before the change was made, "if both operands were integers less than 2^48 the machine computed a true exponent of -3776 (base 8). Since a floating point number that small cannot be represented in 6600 floating point format, the instruction would return a zero result."

The hardware change was made to most of the earliest machines.

9

I was going to say, 5s is way too long. 0.4s seems about right for this kind of problem. :slight_smile:

10

On my AMD Ryzen 5 5600X running Linux Mint 21.3, I get the following times for Ivan's code using AMD AOCC 5.0 flang, nvfortran 25.5, and ifx 2025.1.1:

flang - 8.122 secs
nvfortran - 8.09 secs
ifx - 42.68 secs

using the Linux time command. Since both the AMD and Nvidia compilers are based on Classic Flang, I'm not surprised they give nearly the same times; ifx, though, is a puzzle. All runs were compiled with -O3 optimization.

Good idea. That gets the inner loop down to an integer addition and a couple of array lookups.

program sum_conjecture
   ! sum_conjecture.f90
   !
   ! 27^5 + 84^5 + 110^5 + 133^5 = 144^5
   !
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer(int64), parameter :: nmax = 10000_int64
   integer(int64) :: i, j, k, l, n, sk, sl, sn
   integer(int64) :: n5(nmax)
   character(*), parameter :: cfmt = '(*(g0,1x))'

   outer: do i = 1_int64, nmax
      n5(i) = i**5
      do j = 1_int64, i
         do k = 1_int64, j
            sk = n5(j) + n5(k)
            do l = 1_int64, k
               sl = sk + n5(l)
               do n = 1_int64, l
                  sn = sl + n5(n)
                  if ( sn == n5(i) ) then
                     print cfmt, "i^5 = ", n5(i)
                     print cfmt, j, k, l, n, i
                     exit outer
                  end if
               end do
            end do
         end do
      end do
   end do outer

end program sum_conjecture

$ gfortran -O3 sum_conjecture.f90 
$ time ./a.out
i^5 =  61917364224
133 110 84 27 144

real    0m0.188s
user    0m0.174s
sys     0m0.007s

The timings are with an Apple M1 with MacOS.

12

It's also possible to compute the integer fifth powers at compile time.

integer(int64), parameter :: n5(nmax) = int([(i,i=1_int64,nmax)],int64)**5_int64

It's not faster for me, but it gives overflow warnings upfront.

I don't think this warning applies here because the array is small, but the danger of this approach in general is that the executable file must contain the compile-time computed data, which means it takes time to load that data from disk (SSD, etc.) into the process memory on startup. In contrast, if the memory is allocated at run time directly by the executable and then filled by CPU instructions, it can be some 10^3 to 10^5 times faster. You can also see the difference by looking at the size of the executable image ( ls -l a.out ). With most modern compilers, you do not see the size of a declared array reflected in the executable size unless it is a parameter array, an array initialized at compile time, or an array in a common block.

Also, if done at compile time, the whole array would need to be computed. That is 10^4 elements in this case (and integer overflows would be generated in doing so). The above run time code only computes 144 elements of the array before finding a solution and stopping.

A further question is where does that extra effort get charged? This extra effort is appropriate if the user is paying for that time (e.g. with money, or with elapsed time, or with total throughput through the machine), but not if that overhead is somehow not charged as user time (e.g. in a timeshare or batch environment with many other users). This is why the programmer sometimes writes code to minimize wall time and sometimes to minimize cpu time. Those two goals are not always exactly the same.

In the original code, the POSIX time command was used for the timings. That command returns three different time values for exactly this reason. If you alone own the machine you are running on, and you want to maximize throughput through it every 24-hour period, then the elapsed time is what matters. If you are one of many users sharing the machine, then it is the user time that you want to minimize; the system time is not charged to you because, while your job is stalled waiting for the disk to respond, someone else's job is swapped in and is charged for its own user time.

Here is the timing result of that last version of the code, using gfortran -O3 on an Apple M2 CPU under MacOS. It computes the i**5 values at run time, so the system time is minimal.

i^5 =  61917364224
133 110 84 27 144

real    0m0.141s
user    0m0.139s
sys     0m0.002s

One other comment about timings is that they are almost never really consistent from run to run. For small segments of code like this, one can do

$ time a.out; time a.out; time a.out

The first result is usually longer than the others, which are then usually consistent, if not identical, at the millisecond level. But timings measured at the microsecond level would show variations too.

I think int64 just doesn’t have enough range.

> nagfor test_int64.f90
NAG Fortran Compiler Release 7.2(Shin-Urayasu) Build 7203
Warning: test_int64.f90, line 8: Unused PARAMETER N5
Error: test_int64.f90, line 8: Integer overflow for exponentiation 6209**5
Errors in declarations, no further processing for TEST
[NAG Fortran Compiler error termination, 1 error, 1 warning]
Click for test_int64.f90
program test
use, intrinsic :: iso_fortran_env, only: int64
implicit none

integer(int64) :: i

integer(int64), parameter :: nmax = 10000_int64
integer(int64), parameter :: n5(nmax) = int([(i,i=1_int64,nmax)],int64)**5_int64

end program

Yes, i=6208 is the largest integer for which i**5 fits in int64 without overflow, so nmax=10000 is overkill. But imagine a situation with an array say 10^8 or 10^9 in size, and the choice is to compute it at compile time or at run time. Then you can see how the differences in executable size and the differences in compile-time vs. run-time efficiency become relevant.
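That 6208 limit is easy to verify with a standalone sketch in Python, whose arbitrary-precision integers make the overflow check safe:

```python
# Find the largest integer whose fifth power fits in a signed 64-bit
# integer (the int64 range discussed above) by simple upward search.
limit = 2**63 - 1

i = 1
while (i + 1)**5 <= limit:
    i += 1

print(i)  # 6208
assert 6208**5 <= limit and 6209**5 > limit
```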

16

@RonShepard, here is a shortened version of your code, guided by the motto "calculate just in time". This version is short enough, fast enough, and the aim of the program rich enough in historical interest that it should be made part of the TIOBE benchmark suite. Does anyone know the process for submission?

```
program euler
   ! 27^5 + 84^5 + 110^5 + 133^5 = 144^5
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer, parameter :: nmax = 150
   integer(int64) :: i,j, k,l,m
   integer(int64) :: n5(nmax)
   character(*), parameter :: cfmt = '(*(g0,1x))'

   outer: do i = 1, nmax
      n5(i) = i**5
      do j = 1, i
         do k = 1, j
            do l = 1, k
               do m = 1, l
                  if ( n5(j)+n5(k)+n5(l)+n5(m) == n5(i) ) then
                     print cfmt, "i^5 = ", n5(i)
                     print cfmt, j, k, l, m, i
                     exit outer
                  end if
               end do
            end do
         end do
      end do
   end do outer

end program euler
```
When compiled with the Intel ifort compiler and run on a Windows PC with a Ryzen 7 6800U, this program ran in 0.3 s.

17

A further speed-up may be achieved by noting that mod(i**5, 11) has to be in the set {0, 1, 10}.
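This claim can be verified exhaustively with a small Python check (it also follows from Fermat's little theorem: i^10 ≡ 1 (mod 11) whenever 11 does not divide i, so i^5 ≡ ±1):

```python
# Every fifth power is congruent to 0, 1, or 10 modulo 11.
# Since i**5 % 11 depends only on i % 11, checking the 11 residue
# classes of i covers all integers.
residues = {i**5 % 11 for i in range(11)}
print(sorted(residues))  # [0, 1, 10]
```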

Is there a reason why m is an integer and not an int64? On Windows with Intel i7-528K, I got 0.375 s (gfortran) and 0.30 s (Python + Numba).

19

That 0.3 s for Python is impressive. Please consider posting your code!

One would hope that the compiler would automatically move the additions outside of the inner loops. However, one additional reason for manually moving them outside is that you can exit some of the loops early.

         do k = 1, j
            sk = n5(j)+n5(k)
            if ( sk > n5(i) ) exit
            do l = 1, k
               sl = sk +n5(l)
               if ( sl > n5(i) ) exit
               do m = 1, l
                  sm = sl +n5(m)
                  if ( sm > n5(i) ) exit

I doubt that a compiler optimizer would do those early loop exits.

I notice that if you have a solution to the equation j^5 + k^5 + l^5 + m^5 = i^5, then multiplying the integers by a common integer factor x gives (x*j)^5 + (x*k)^5 + (x*l)^5 + (x*m)^5 = (x*i)^5, which is also a solution. So when one violation of the conjecture is found, that automatically means an infinite number have been found.

I was curious whether there are any violations of the conjecture other than the multiples of the (133, 110, 84, 27, 144) sequence, so I removed the exit outer statement from the program and let it run up to i=1728, and those multiples are the only ones found. I expect that mathematicians familiar with this problem already know whether there are others, but it was a simple thing to do, so I did it. That code ran several hours overnight.

I wonder if there are algorithms to do this search in a more efficient, less brute-force way. For example, can you remove the outer i loop, do the other four loops, and then efficiently search the n5(:) array for matches to sm? If you could do that, it would reduce the effort from O(N^5) to O(N^4). @mecej4 mentioned using modular arithmetic; maybe that could be used to limit some of the possibilities early in the outer loops. Of course, all these things will make the code more complicated.
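The O(N^4) idea can be sketched like this (in Python rather than Fortran for brevity; nmax = 150 is my own choice, just large enough to reach the known solution, and the early exits mirror the ones shown earlier in the thread):

```python
# Sketch: drop the outer i loop and instead look each candidate sum
# j^5 + k^5 + l^5 + m^5 up in a precomputed dict of fifth powers,
# reducing the search from O(N^5) to O(N^4) probes.
def search(nmax):
    p5 = [i**5 for i in range(nmax + 1)]
    fifths = {p5[i]: i for i in range(1, nmax + 1)}  # fifth power -> its root
    top = p5[nmax]
    for j in range(1, nmax + 1):
        for k in range(1, j + 1):
            sk = p5[j] + p5[k]
            if sk > top:
                break            # sums only grow with k; nothing left to find
            for l in range(1, k + 1):
                sl = sk + p5[l]
                if sl > top:
                    break
                for m in range(1, l + 1):
                    s = sl + p5[m]
                    if s > top:
                        break
                    i = fifths.get(s)
                    if i is not None:
                        return (j, k, l, m, i)
    return None

print(search(150))  # (133, 110, 84, 27, 144)
```

The dict lookup replaces the outer i loop with an O(1) probe per candidate sum; a sorted array plus binary search of n5(:) would do the same job in Fortran.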