.. _genetic_automl-label:

Genetic Optimization in AutoML
-------------------------------
Currently the `auto_ml` module(inclusive of AutomaticClassification, AutomaticRegression and AutomaticTimeSeries) in `hana_ml.algorithms.pal`
package uses genetic algorithm(GA) for pipeline optimization, in which NSGA-II is used as selection algorithm to support multiple objectives.

In general, the evolution of genetic optimization is an iterative process, with the population in each iteration called a generation.
It usually starts from a population of randomly generated individuals, the fitness of every individual in the population is then evaluated
(the fitness is usually the value of the objective function in the optimization problem being solved). The better fitted individuals are
stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated)
to form a new generation. The new generation of candidates(individuals) is then used in the next iteration of the algorithm,
and it repeats until either the maximum number of generations has been reached for the current population,
or a satisfying fitness level has been reached.

Individual Representation
^^^^^^^^^^^^^^^^^^^^^^^^^
In the auto_ml module of hana_ml.algorithms.pal package, each individual is nothing but
a chain of operators that form a pipeline. In GA, any such individual is represented by a *latent* tree structure, illustrated as follows:

    .. image:: individual_tree.png
       :align: center
       :scale: 50 %
       :alt: Tree Structure in GA.


By specifying the maximum and minimum number of layers of the tree structure, GA can randomly generate individuals that fullfill the condition
specified by the given config dictionary.


Selection
^^^^^^^^^
Selection is a process whereby certain individuals from the current generation are selected as parents of the next generation.
Currently the selection algorithm is NSGA-II.

Crossover
^^^^^^^^^^
For two selected individuals, randomly exchange a pair of leaf nodes of equal type but different values,
or a pair of child nodes with different structures.

    .. image:: crossover.png
       :align: center
       :scale: 50 %
       :alt: Crossover

Mutation
^^^^^^^^
There are two types of mutation in AutoML:

    - Leaf Node Mutation: Randomly replaces one of the leaf nodes with other leaf nodes of the same type.

      .. image:: leaf_node_mutation.png
         :align: center
         :scale: 50 %
         :alt: Leaf Node Mutation
    - Child Node Mutation: Randomly replaces the entire subtree under the child node with a newly generated subtree
      that meets the validity requirements.

      .. image:: child_node_mutation.png
         :align: center
         :scale: 50 %
         :alt: Child Node Mutation

Evolutional Iteration Step
^^^^^^^^^^^^^^^^^^^^^^^^^^

In each evolutional iteration, AutoML randomly selects an operation from `crossover`, `mutation`, and (random) `regeneration` with user
input probability to generate a new individual and evaluate its fitness score. Then the selection algorithm is used to select
``population_size`` individuals from the population and offsprings as the next generation population.

Control Parameters
^^^^^^^^^^^^^^^^^^
Main control parameters of GA are listed as follows:

    - ``generations``: Specifies to the number of iterations in GA optimization. A larger value usually results in longer total training time,
      but in most cases, the iterative process will terminate when the ``early_stop`` condition is reached.
      The user can also stop the training process manually, and get the optimal solution from logging information.
    - ``population_size``: Specifies the number of individuals in each generation in GA optimization. If there are too few individuals,
      GA has few possibilities to perform crossover and only a small part of search space is explored; on the contrary,
      if there are too many individuals, GA may take too much time in generating not-so-promising individuals via crossover,
      which could be time/resource inefficient.
    - ``offspring_size``: Specifies the number of offsprings to produce in each generation. It controls of number of
      new individuals generated by genetic operations from population.
    - ``mutation_rate``: Specifies the mutation rate for GA optimization. The value of this parameter reflects the
      random(global) search ability of GA, and a suitable value may prevent the algorithms from falling into a local optimum.
    - ``crossover_rate``: Specifies the crossover rate for GA optimization. The value of this parameter reflects the local
      search ability of GA. If larger crossover rate is used, GA may converge faster, but only to local optimum.
    - ``min_layer`` : Specifies the minimum number of operators to form a valid pipeline.
    - ``max_layer`` : Specifies the maximum number of operators to form a valid pipeline.

    .. note::

        The summation of ``mutation_rate`` and ``crossover_rate`` cannot be bigger than 1. In this case, the value of \
        1-``mutation_rate``-``crossover_rate`` is the `reproduction rate`, which is the probability used in GA to (randomly)\
        regenerate new individuals(as offsprings) in each iteration.