Adding Crossover to CMA-ES


In GCMA-ES we introduced a new optimization algorithm and applied it to the real world problems Around the World in Eighty Days and GTOC1. Not only are these problems related to a real world task - space flight planning - we also validated the optimization results by using them as a basis for further optimization with a more detailed model of reality: continuous low thrust. Using a genetic crossover operation in the context of a CMA-ES retry mechanism feels quite natural, so we are probably not the first ones exploring this idea. But the devil is in the details:
  • How to define the crossover operation 
  • How to select ancestors for crossover 
  • How to parameterize the CMA-ES runs 
Here we will present more details together with some thoughts about meta parameter optimization. We are aware that the notion G-CMA-ES appears in some early CMA-ES papers in a different context, but we nevertheless think Genetic CMA-ES (GCMA-ES) is a good name for this retry approach. Please comment if you disagree and want to propose a different name, or if you know about other interesting real world optimization problems we should apply GCMA-ES to.

Introduction

Constrained real parameter optimization aims at finding the assignment of real valued decision variables that generates the optimal real valued solution for a given problem. Normally this means finding an argument vector X which minimizes a function value f(X) for given lower and upper bounds of the decision variables xi. Often the solution space is full of local minima, so the challenge for an optimization algorithm is to avoid getting stuck in these local minima and to find the global optimum.
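As an illustration of the setting, here is a minimal sketch in Python. The Rastrigin test function and the naive random sampling baseline are our own example (not part of GCMA-ES) - they just show a bounded multimodal objective and why something smarter than random sampling is needed:

```python
import numpy as np

# Rastrigin: a standard multimodal test function standing in for f(X);
# the global minimum f(0) = 0 is surrounded by many local minima.
def f(x):
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

# lower and upper bounds of the decision variables
lower, upper = np.full(3, -5.12), np.full(3, 5.12)

# naive baseline: sample random argument vectors inside the bounds
rng = np.random.default_rng(42)
best_x, best_f = None, float("inf")
for _ in range(10_000):
    x = rng.uniform(lower, upper)
    fx = f(x)
    if fx < best_f:
        best_x, best_f = x, fx
# random sampling rarely gets close to the global optimum at the origin
```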

Back in 2005, in the context of the IEEE Congress on Evolutionary Computation, a competition using a consolidated benchmark was performed for the first time, with CMA-ES as the clear winner.

Team Jena, the only team participating in the GTOC competition using CMA-ES as its main optimization method, gathered a lot of experience applying this algorithm. We never investigated artificial test functions, but we know its limitations for practical applications in the context of space flight planning quite well. There are real world problems, like the messenger full problem GTOP Task Messenger Full, which no existing algorithm including CMA-ES can solve in reasonable time using a single processor.

Usually in this situation, instead of optimization, some variant of beam search is applied which exploits the linear structure of a space trajectory consisting of multiple legs. For trajectories involving rendezvous maneuvers this works well, since each leg can be independently optimized. For flyby trajectories each leg depends on its neighbors, and transitively on all other legs of the trajectory. Extending the capabilities of optimization means that for these trajectories search algorithms can be replaced by optimization, which has three main advantages:

  • Optimization is performed simultaneously on all legs. 
  • Reduced memory consumption. 
  • No need to find a suitable search heuristics. 
This was the main motivation for experimenting with a CMA-ES retry mechanism, which resulted in genetic CMA-ES (GCMA-ES).

Genetic CMA-ES

GCMA-ES is a retry mechanism for CMA-ES which can be viewed as a genetic algorithm with CMA-ES runs as its population members.

Two CMA-ES configuration parameters are important for GCMA-ES:

  • The initial guess, the initial assignment of the decision variables. 
  • The sigma value / initial mutation rate of the decision variables. 
The crossover operation determines these values for its descendant.

GCMA-ES population

A member of the GCMA-ES population is “born” by performing a CMA-ES run:
  • Either using a random initial guess inside the given boundaries and a fixed sigma value for all decision variables.
  • Or using an initial seed vector together with sigma values and boundaries defined separately for each decision variable, derived by crossover of two GCMA-ES population members. 
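The two "birth" modes above can be sketched as follows. The function name new_member_config and the fixed sigma of 0.3 are our own assumptions for illustration; the actual CMA-ES run itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def new_member_config(lower, upper, from_crossover=None, fixed_sigma=0.3):
    """CMA-ES start configuration for a newly 'born' GCMA-ES member.

    Returns (initial guess, per-variable sigma, lower bounds, upper bounds).
    fixed_sigma = 0.3 is an assumed example value.
    """
    if from_crossover is not None:
        # seed vector, sigma values and boundaries derived by crossover
        return from_crossover
    # random initial guess inside the given boundaries,
    # one fixed sigma for all decision variables
    x0 = rng.uniform(lower, upper)
    return x0, np.full(len(lower), fixed_sigma), lower, upper
```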

Global history of intermediate solutions

A global history of the intermediate solutions of all CMA-ES runs inside the GCMA-ES population is maintained:
  • After 1000, 2000, 3000, … function evaluations the temporary solution values are stored. 
  • A CMA-ES run is stopped and removed from the population as soon as its temporary solution is ranked lower than a defined “breadth” threshold b compared to all history values after the same number of function evaluations. We used b = 1000 for messenger full. 
  • If a CMA-ES run is removed from the population its history is deleted. 
  • There is a global limit for the number of function evaluations – we used 20000 evaluations – which is increased every 1000 CMA-ES runs. 
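The elimination mechanism can be sketched like this. The helper names (checkpoint, remove_run, run_log) are our own; the sketch only shows the bookkeeping, not the CMA-ES runs themselves:

```python
from collections import defaultdict

# intermediate values of all runs, keyed by the evaluation checkpoint
history = defaultdict(list)

def checkpoint(run_log, n_evals, value, b=1000):
    """Store an intermediate value; return False if the run is stopped.

    A run is eliminated as soon as at least b stored history values at
    the same checkpoint are better. b = 1000 was used for messenger full.
    """
    if sum(1 for v in history[n_evals] if v < value) >= b:
        remove_run(run_log)  # stopped: the run's history is deleted
        return False
    history[n_evals].append(value)
    run_log.append((n_evals, value))
    return True

def remove_run(run_log):
    # delete everything this run contributed to the global history
    for n, v in run_log:
        history[n].remove(v)
    run_log.clear()
```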
There are three conditions under which a CMA-ES run terminates:
  1. Removal because of low rank compared to the history of previous runs. 
  2. The function evaluation limit is reached. 
  3. Any other CMA-ES termination condition is fulfilled. 
In case of 2. and 3. a configured threshold value together with a diversity check decides whether the member is removed or finally added to the GCMA-ES population. Each population member stores the resulting function value – the local minimum computed by CMA-ES – together with the corresponding input vector.

Diversity check

Before a retry is added to the population we check for similarity with existing members using the Euclidean distance of the two solutions normalized to the interval [0,1] using the boundaries. Only members ranked one above or one below the new retry are used for the comparison. If the Euclidean distance of the normalized solutions is below a threshold, only the better of the two solutions survives in the population. This avoids accumulating equivalent solutions, which would reduce diversification. The distance threshold is a configuration parameter of GCMA-ES.
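A sketch of this check, assuming the population is kept as a list of (value, solution) pairs sorted by value. The threshold of 0.1 and the helper names are our own example values, not the configuration actually used:

```python
import bisect
import numpy as np

def too_similar(x_a, x_b, lower, upper, threshold=0.1):
    # Euclidean distance of the solutions normalized to [0, 1] via the bounds
    na = (x_a - lower) / (upper - lower)
    nb = (x_b - lower) / (upper - lower)
    return np.linalg.norm(na - nb) < threshold

def try_insert(population, value, x, lower, upper, threshold=0.1):
    """Insert (value, x) into the sorted population unless a directly
    neighbouring member is too similar; then only the better one survives."""
    i = bisect.bisect([v for v, _ in population], value)
    for j in (i - 1, i):  # only the members ranked directly above / below
        if 0 <= j < len(population):
            v_j, x_j = population[j]
            if too_similar(x, x_j, lower, upper, threshold):
                if value < v_j:
                    population[j] = (value, x)  # the better solution survives
                return
    population.insert(i, (value, x))
```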

Crossover

CMA-ES uses covariance adaption to guide the mutation of its decision vector population. GCMA-ES adds a crossover operation by comparing the computed solution vectors of two previous CMA-ES runs.
  • The two ancestors are chosen randomly, but better valued members have a higher chance of being chosen. 
  • As initial guess the better ranked of the two is used. 
  • The sigma value for each decision variable is derived from the difference of this variable's value in the two ancestors. We used a quadratic relation. If for instance the interval for a decision variable is [0.0, 1.0] and the two ancestor values are {0.3, 0.4}, we use sigma = c*((0.4 – 0.3) / (1.0 – 0.0))² for some constant factor c for this decision variable. 
  • The boundaries for each decision variable are derived from the difference of this variable's value in the two ancestors. 
The idea behind the sigma configuration is that some variables are seen as already “settled” while others are free to mutate. We tried different approaches for defining sigma and the initial guess; the proposed one worked best for messenger full. For both the sigma values and the boundaries we use stochastic values derived from the decision variable differences and some configuration parameters. 
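The quadratic sigma relation can be written down directly. Here c = 1.0 for illustration; the stochastic perturbation of sigma and the boundary derivation mentioned above are omitted:

```python
import numpy as np

def crossover_sigma(x_a, x_b, lower, upper, c=1.0):
    # quadratic relation: normalized ancestor difference, squared, scaled by c
    # variables where the ancestors agree get a tiny sigma ("settled"),
    # variables where they differ remain free to mutate
    diff = np.abs(x_a - x_b) / (upper - lower)
    return c * diff ** 2

# the example from the text: interval [0.0, 1.0], ancestor values 0.3 and 0.4
sigma = crossover_sigma(np.array([0.3]), np.array([0.4]),
                        np.array([0.0]), np.array([1.0]))
# sigma[0] = c * ((0.4 - 0.3) / (1.0 - 0.0))**2 ≈ 0.01
```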

Parallelization / NUMA configuration

We execute CMA-ES single threaded, but each GCMA-ES execution uses eight parallel CMA-ES runs. We execute eight GCMA-ES runs in parallel on a 32 core / 64 thread CPU (AMD 2990WX), but only one on a 4 core / 8 thread laptop. 

We found that this way the CPU is optimally utilized – which is not the case if we use parallelization for a single CMA-ES run. Note that consumer processors like the AMD 2990WX have limited NUMA node to memory connectivity, so the numactl command should be used to minimize the memory access overhead. Server processors like Epyc don’t have this problem since they support eight memory channels – as long as at least 8 memory slots are populated. See the GCMA-ES section "How to fully utilize the 32 Cores of the 2990WX" for more details.

Meta Parameter Optimization

Automated meta parameter optimization for GCMA-ES would require much more CPU resources than are available to us, so we tried to tune the parameters manually using a very limited number of problems. The future will show whether these parameters work well with other problems. Important meta parameters are:
  • Initial maximum number of function evaluations for CMA-ES. 
  • Increment of the maximum number of function evaluations. 
  • “breadth” threshold to eliminate CMA-ES runs using the stored history of other runs. 
  • Number of "checkpoints" for the history based elimination. 
  • Distance threshold for the diversity check.
  • Threshold function value for inclusion into the population.
  • Rate at which crossover is performed compared to random CMA-ES runs. 
  • lambda – the initial population size – used for the CMA-ES runs. 
  • Parameters which determine the sigma values and boundaries derived from the decision variable differences used for crossover. 
  • Parameters guiding the stochastic selection of crossover ancestors. 
Especially the last two sets of parameters are crucial for the overall performance. We observed that for Messenger Full we can easily miss the <= 2.0 results by choosing inferior parameters. Compared to the results presented in GCMA-ES we could improve the results for the Messenger Full problem.

Example: Messenger Full

The messenger full problem GTOP Task Messenger Full is part of the GTOP set of trajectory benchmark problems. Its best known solution models the real messenger mission quite well. But its original implementation, based on ESA's AstroToolbox, is seriously flawed. The reason is that it uses an old implementation of the Lambert algorithm which increases the “roughness” of the solution landscape. This makes it much harder to “escape” local minima. By replacing the Lambert algorithm implementation with a newer one, the PYKEP Lambert Solver, this problem can easily be fixed, but we lose comparability with previous attempts to solve the problem like MIDACO Messenger. For this reason we show results for both the original and the “fixed” version of the messenger full problem.

See Messenger Video for a video of the best solution found using the new Lambert solver.





The two diagrams above illustrate 48 GCMA-ES runs using the original problem (above) and the new Lambert solver (below). The best value achieved by each run as a function of computation time is shown. Nearly 1E9 evaluations per hour are performed in a single run. An AMD TR2990WX processor performs 8 runs in parallel which means about 7E9 evaluations per hour.





These two diagrams show the same information but list the number of evaluations. Convergence to lower values is much faster using the new Lambert solver, resistance at local minima is lower, and a better optimal value of 1.925 is reached after a few hours in one of the 48 runs. 





These diagrams "zoom in" to show the differences in detail. 

The following diagram shows the number of solutions < 2.0 found using 60 * 1E9 evaluations. The original benchmark is shown on the left, the new version on the right: on average 21 solutions for the original version, about 196 for the new version.



A TR2990WX processor, configured to perform 8 optimizations in parallel, can compute about 8 * 21 / 7h = 24 solutions per hour with value < 2.0 for the original problem and 8 * 196 / 7h = 224 solutions per hour with value < 2.0 for the "fixed" problem. 

Next we show the same information, but for solutions < 1.956 – nearly optimal solutions for the original Messenger Full problem. We used 60 * 1E9 evaluations because this is roughly the number of evaluations used in the 10 test runs shown in MIDACO Messenger. Not one of those runs reached a value < 2.0.


Messenger Meta Parameter Optimization

Let's try to increase the maximum evaluation number increment from 2000 to 10000 every 1000 runs. We keep the initial maximum evaluation number at 20000. Here is the value reached after n function evaluations for the original messenger full problem:
Next the same information using the new Lambert solver:
At first glance it looks like a regression, especially with the new Lambert solver: the "resistance" around 2.4 has increased. Nevertheless we prefer the new setting; we have to look at different diagrams to understand why. We list the probability distribution of the results after 3E9 evaluations for the original problem using the old setting (max eval increment = 2000):
We reach the optimal solution 1.958 in 8.7% of the runs using 3E9 function evaluations. Now for the new setting (max eval increment = 10000)
We now reach the optimal solution 1.958 in 44% of the runs using 3E9 function evaluations. Next is the probability distribution of the results after 2E9 evaluations for the fixed problem (new Lambert solver) using the old setting (max eval increment = 2000):
And now using the new setting (max eval increment = 10000):
The new setting increases the diversity of the results. It produces some results around 2.4, much worse than the worst result using the old setting. But on the other hand this diversity also helps to find better solutions: the best value improved from 1.927 to 1.925. If our goal is to find the optimal solution, diversity provides a significant advantage. 






