How to use the AMD Threadripper 2990 WX for Scientific Computations


The AMD Threadripper 2990 WX is a processor well suited for scientific computations, but since only two of its four NUMA nodes have direct memory access, it is difficult to exploit its power completely.

Its main advantages are
  • price per thread - about 1800/64 = 28.125 $/thread
  • power consumption per thread - 250/64 = 3.9 W/thread at stock, up to 350/64 ≈ 5.5 W/thread overclocked
These numbers look much worse if we don't utilize all 64 parallel threads, so we have to find a way to use them all while minimizing the penalty caused by the NUMA architecture of the 2990 WX.

Overclocking / Power Efficiency

Since power consumption grows super-linearly with clock speed, only limited overclocking (<= 3.8 GHz on all cores) makes sense. In my experience, using a fixed processor speed increases stability for a given VCore (I used VCore = 1.25 V for 3.8 GHz), but limits single-core performance. Power consumption relative to performance is best at stock speed and can be further optimized by lowering VCore (undervolting).

Operating System Used

In the scientific context the Linux OS has many advantages:
  • open source libraries are much easier to compile and to use
  • support for NUMA node assignment 
so I will focus on Linux here. I used Linux Mint 19 for my experiments.

NUMA Node Allocation

The best way to utilize all 64 threads in a scientific context on the 2990 WX is to run four experiments/tasks in parallel, each using 16 threads - one NUMA node. You have to install numactl to assign NUMA nodes to tasks. Example shell script:

# bind one task each to the NUMA nodes with directly attached memory (0 and 2)
numactl -l -N0 ant Messenger | tee mess1.txt &
numactl -l -N2 ant Messenger | tee mess2.txt &
# start two more tasks with local memory allocation but no explicit node binding
numactl -l ant Messenger | tee mess3.txt &
numactl -l ant Messenger | tee mess4.txt &

The script assigns two tasks to the NUMA nodes with direct memory connection and starts two tasks without a specific node assignment. N1 and N3 cannot be explicitly assigned; numactl complains about the missing direct memory connection. "tee" is used to see the output of all 4 tasks while at the same time routing their output into separate files.

"ant Messenger" in this case is a command executing the task, which here is Java code calling C++ libraries via JNI. In Java it is easy to parallelize tasks, the code uses exactly 16 threads. "ant" is used because you can export "Run tasks" from eclipse as ant build.xml files to call them from the command line.

Scientific Example Task

For testing our NUMA allocation scheme we chose a concrete scientific task related to continuous optimization and space mission design. GTOP Task Messenger Full is a simplified task related to the MESSENGER mission, whose final Mercury flyby took place at the end of 2009 and which achieved orbit insertion at Mercury in 2011. Although the task is simplified, the best results correspond very well with the real mission.

The GTOP Task Messenger Full results show that it took over 5 years to find the best solution; messenger_full is one of the hardest solvable continuous optimization tasks.

The best solution found so far is described in MIDACO Messenger, based on a parallel implementation of the commercial MIDACO Solver. The paper compares its results with an open source algorithm called CMA-ES, showing the superiority of the parallel MIDACO framework.

The author of this blog contributed the original Java implementation of CMA-ES to apache.commons.math (Apache CMAES), but there are many other open source implementations of this algorithm available for many programming languages.

We want to answer the following questions:
  • Is messenger_full really solved? A strong indication would be if the same solution could be found using a different algorithm.
  • Is it possible to solve messenger_full using only open source algorithms?
The AMD TR 2990 WX could answer both questions in a few days. We could verify the bad results of applying CMA-ES directly reported in MIDACO Messenger. So there are two options: use another algorithm, or find another way to apply CMA-ES. Because of the author's familiarity with CMA-ES we opted for the latter approach.

Normally a mission designer would not solve a whole mission at once, but first compute partial solutions that are later extended to a full solution. A whole search tree is built where bad branches are truncated early (beam search). Since the order of the visited planets is fixed here, a simpler approach can be used for messenger_full.

Messenger's flight path is Earth - Venus - Venus - Mercury - Mercury - Mercury - Mercury (short: EVVMMMM), performing gravity assist maneuvers at the intermediate flybys. The target of the optimization process is the sum of the delta velocities (delta-v), corresponding to the propellant needed for the mission in relation to the mass of the spacecraft.

We know that a good final EVVMMMM solution contains a good partial EVVMM solution, with the exception of the orbit insertion at Mercury, since we will continue the flight in the full mission. On the other hand, the EVVMM orbit insertion dv should not be completely neglected, since the purpose of the Mercury flybys is to reduce the energy of the spacecraft to prepare for the orbit insertion at the last Mercury encounter. A low orbit insertion dv at the second Mercury encounter indicates that we made progress here even if we continue our flight. In practice a weight factor of 0.25 has proven to be useful, as in the sketch below.
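As a sketch, the weighted objective could look like this in Java; evvmmDeltaVs is a hypothetical function standing in for the real EVVMM trajectory model, returning the delta-v of each maneuver with the orbit insertion dv at the second Mercury encounter as the last element:

// weighted EVVMM objective: full weight for all maneuvers except the
// orbit insertion at the second Mercury encounter, which is weighted 0.25
static double evvmmObjective(double[] x) {
    double[] dvs = evvmmDeltaVs(x); // hypothetical trajectory model
    double sum = 0;
    for (int i = 0; i < dvs.length - 1; i++)
        sum += dvs[i];
    return sum + 0.25 * dvs[dvs.length - 1];
}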

Our open source approach to messenger_full is:

1) In a loop using CMA-ES, compute partial solutions for EVVMM, using weight 0.25 for the orbit insertion dv at the second Mercury encounter. If an overall dv <= 4.0 is reached, immediately stop the optimization and continue with the second step (a sketch of this loop follows after the list).
  • Use random values as the initial "solution" to seed CMA-ES.
  • Use random limits for the start time to further "diversify" the partial solutions found.
  • Use low sigma values (sigma is the initial step size used by CMA-ES) to "bind" the partial solutions more strongly to the initial random seeds, enhancing diversification.
  • Use 16 parallel loops to utilize the 16 available threads.
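Here is a minimal sketch of this loop based on the Apache Commons Math CMAESOptimizer mentioned above. evvmmObjective is the weighted objective sketched earlier; the sigma scaling factor 0.1, the population size 31 and the evaluation budget are illustrative assumptions, and the random start time limits are omitted for brevity:

import java.util.Random;
import org.apache.commons.math3.optim.*;
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType;
import org.apache.commons.math3.optim.nonlinear.scalar.ObjectiveFunction;
import org.apache.commons.math3.optim.nonlinear.scalar.noderiv.CMAESOptimizer;
import org.apache.commons.math3.random.MersenneTwister;

static PointValuePair searchPartial(double[] lower, double[] upper, Random rnd) {
    int dim = lower.length;
    while (true) {
        double[] guess = new double[dim]; // random initial seed
        double[] sigma = new double[dim]; // low sigma binds the run to its seed
        for (int i = 0; i < dim; i++) {
            guess[i] = lower[i] + rnd.nextDouble() * (upper[i] - lower[i]);
            sigma[i] = 0.1 * (upper[i] - lower[i]);
        }
        CMAESOptimizer opt = new CMAESOptimizer(
                10000, 4.0, true, 0, 0,        // stopFitness 4.0 = our dv threshold
                new MersenneTwister(), false,
                new SimpleValueChecker(1e-9, 1e-9));
        PointValuePair res = opt.optimize(
                new MaxEval(50000),
                new ObjectiveFunction(x -> evvmmObjective(x)),
                GoalType.MINIMIZE,
                new InitialGuess(guess),
                new SimpleBounds(lower, upper),
                new CMAESOptimizer.Sigma(sigma),
                new CMAESOptimizer.PopulationSize(31));
        if (res.getValue() <= 4.0)
            return res; // threshold reached, hand over to step 2
    }
}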
2) After a partial solution with dv <= 4.0 is found, in a second inner loop use this solution together with random values for the remaining parameters as the initial seed for CMA-ES applied to the full EVVMMMM problem (see the sketch below). Terminate the second loop after 3 to 70 iterations, depending on the best solution found in this loop.
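A sketch of the seeding for step 2, assuming for simplicity that the EVVMM parameters form a prefix of the full EVVMMMM parameter vector (in the real problem an index mapping between both vectors is needed):

// build an initial EVVMMMM seed from a partial EVVMM solution:
// keep the partial parameters, randomize the remaining ones
static double[] seedFull(double[] partial, double[] lowerFull,
                         double[] upperFull, Random rnd) {
    double[] seed = new double[lowerFull.length];
    System.arraycopy(partial, 0, seed, 0, partial.length);
    for (int i = partial.length; i < seed.length; i++)
        seed[i] = lowerFull[i] + rnd.nextDouble() * (upperFull[i] - lowerFull[i]);
    return seed;
}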

3) After a few hours (we used <= 10h) store the overall best solution of all 16 threads, resulting in four solutions since we run four such loop tasks in parallel.

4) Repeat 1) to 3) a few times (we ran it 5 times).

5) Now take the best and the second best overall solution and compute their differences in all arguments. Most arguments are very similar; a few differ significantly. Divide these differences by the ranges of the corresponding argument constraints. For example, for the start time (first argument) we got: best solution 2038.003, second best 2037.976, relative difference abs(2038.003 - 2037.976) / (2200 - 1900) = 9E-5. Use these relative differences for all arguments as individual sigma values for the final CMA-ES application. Restrict the boundary constraints around the best solution using the absolute differences of both solutions - we chose best[i] +/- 5*absDiff[i] as boundaries (see the sketch below).
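A sketch of this computation; best and second are the two solution vectors, lower and upper the original messenger_full boundary constraints. In practice a zero difference in some argument would need a small positive floor for its sigma:

// derive individual sigma values and tightened bounds from the two best solutions
int dim = best.length;
double[] sigma = new double[dim];
double[] lo = new double[dim];
double[] hi = new double[dim];
for (int i = 0; i < dim; i++) {
    double absDiff = Math.abs(best[i] - second[i]);
    sigma[i] = absDiff / (upper[i] - lower[i]);        // relative difference
    lo[i] = Math.max(lower[i], best[i] - 5 * absDiff); // best[i] +/- 5*absDiff,
    hi[i] = Math.min(upper[i], best[i] + 5 * absDiff); // clipped to the original bounds
}
// pass these to the final runs via new CMAESOptimizer.Sigma(sigma) and
// new SimpleBounds(lo, hi), with best as the initial guess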

In 16 parallel loops, repeatedly apply CMA-ES using the best solution as the initial seed. After a few minutes we found our final solution, which is equivalent to the MIDACO solution.

By choosing the relative differences of all arguments as individual sigma values we "encourage" CMA-ES to stick with the arguments we have already "nailed" and to vary the ones we are not so sure about yet.

Our final value is 1.9583071 using the following delta-vs: 4.05, 0.0, 0.0, 0.6, 0.077, 0.202, 0.172, 0.907, obtained by the solution:
  • [2038.1193686237, 4.0499924531, 0.5566379454, 0.6340829328, 451.5726113892, 224.6946668268, 221.2742984147, 266.0661442424, 357.9563776967, 534.1104206749, 0.6101722357, 0.3618933084, 0.6889501273, 0.743987932, 0.829178598, 0.9027736659, 1.7177946901, 1.1000021716, 1.0500326518, 1.0500055178, 1.0506288522, 2.7686345257, 1.5713456289, 2.636684432, 1.6618006594, 1.5709086642] 
The value of our best solution before step 5) was 2.068, the second best was 2.275. Most of the parameters were more or less "settled" in both solutions; here are their parameter values:
  • [2038.003, 4.05, 0.557, 0.634, 451.611, 224.695, 221.967, 263.913, 351.88, 449.964, 0.658, 0.155, 0.694, 0.657, 0.769, 0.868, 1.719, 1.1, 1.05, 1.314, 1.05, 2.768, 1.565, 2.598, 1.42, 2.886]
  • [2037.976, 4.05, 0.557, 0.634, 451.619, 224.694, 221.808, 263.923, 360.869, 522.043, 0.632, 0.466, 0.705, 0.65, 0.816, 0.875, 1.751, 1.1, 1.053, 1.21, 1.392, 2.763, 1.595, 2.587, 2.019, 1.519]
So we finally got a strong indication that messenger_full is really solved, since we obtained more or less the same solution as the MIDACO team using a completely different algorithm. Although the result value is almost the same, some of the parameters deviate significantly from the MIDACO solution. MIDACO is still the undisputed king of continuous optimization when we consider the problem as a black box. But viewing the problem as a black box seems not to be the best idea in the context of mission design.

Takeaways
  1. If you perform scientific experiments (simulation / search / optimization) and use a processor like the AMD TR 2990 WX that has NUMA nodes without direct memory access: use "numactl -l -Nx" to assign tasks to the NUMA nodes with direct memory access until they are fully utilized. For the 2990 WX this means 2 tasks each with 16 threads, or 4 tasks each with 8 threads, and so on. Then use "numactl -l" to assign tasks to the remaining NUMA nodes until the processor is fully utilized. Epyc processors and smaller AMD processors like the TR 2950X have a dedicated memory channel per NUMA node, as have Intel processors. But even then I recommend the "numactl -l -Nx" technique addressing each NUMA node separately. Expect a 15% slowdown on the NUMA nodes without direct memory access for memory intensive tasks. Not using numactl at all results in a similar slowdown for all tasks.
  2. Never view a non-trivial continuous optimization task as a black box if there is a sub-task which can be solved independently. "Independently" should be interpreted loosely here. For example, it is useful to solve a sub-task of the Messenger mission design even if most of these sub-solutions cannot be completed to a good full solution. Use heuristics to predict the final result (in our example the "0.25 * insertion dv at the 2nd Mercury encounter" heuristic). Stop the sub-task optimization as soon as a certain threshold is reached. Further computing causes "over-fitting" - a term borrowed from machine learning - if the sub-task is not completely independent. You not only lose time, you risk destroying the diversity of the generated sub-solutions. If the sub-task is completely independent, as for instance in rendezvous space mission design where each leg can be optimized separately, you don't need phase 3) and can invest more function evaluations in this phase. Use random seeds and additional parameter constraints to further increase the diversity of the sub-solutions.
  3. Use as many sub-solutions as possible to seed the optimization of the full problem. Fill in random values for the remaining parameters not covered by the sub-solutions. Don't invest too many function evaluations in this phase since we only aim for good solutions here, not the optimum.
  4. Use CMA-ES for a final optimization phase: compute the relative differences of two independently computed good solution vectors and use them as the sigma parameter vector for repeated CMA-ES runs, using one of the solutions as the initialization seed. This way the optimization "focuses" on parameters which are not "settled" yet. This technique should only be applied when we are already quite near the optimum.
  5. Applying 2) - 4) results in a "state of the art" algorithm solving hard continuous optimization problems, comparable to the best known approaches. It is not applicable to problems where no "more or less" independent sub-task exists. CMA-ES is only required in phase 4), although we used it exclusively in the prior phases as well - for lack of good alternatives. The requirement for the algorithm producing the sub-problem solutions is that it is able to maintain diversity when seeded with random initial parameters. CMA-ES achieves this when a low sigma parameter value is used.
  6. CMA-ES is very good at adapting the internal parameters of the algorithm "on the fly" during the optimization process, which enables its application to moderately hard problems without knowing how it works or what the meaning of parameters like sigma (initial step size vector) and lambda (population size) is. But be aware that this is not the case for hard problems like messenger_full. You can solve messenger_full efficiently with CMA-ES, but not by viewing it as a "black box" algorithm.

Appendix

Values reached by the process after n hours. Each line represents a NUMA node executing 16 computation loops in parallel. The first run was stopped after 5 hours, the second after 9.5 hours.



Here is a distribution of the values reached by individual CMA-ES computations seeded by sub-solutions.


The peak is at value = 7, whereas both plain MIDACO and plain CMA-ES have a peak at value = 16. Values near value = 2 exist but are so rare that they are not visible in this diagram.
