Evolutionary Latent Space Exploration of Generative Adversarial Networks

Generative Adversarial Networks (GANs) are Machine Learning models capable of training generators to produce very realistic and high quality samples that closely follow the distribution of the input dataset.

Much research has gone into making GANs more reliable and able to produce better and more diverse images. Most of it, however, revolves around the architecture of the models that compose a GAN and how they are trained.

However, the latent space of the generator, which provides the input for generating an image, is what ultimately determines the output. It is unique to each model and encodes patterns and information that may be used to produce better samples.

In this work, we propose an evolutionary approach to explore the latent space of a GAN by traversing it with predetermined criteria. The objective is to find sets of latent vectors that produce sets of diverse images. In the end, we compare how our approach fares against the traditional approach of randomly sampling latent vectors to produce new images.

Figure 1

The overview.

Our approach integrates a GAN and an Evolutionary engine to find sets of synthetic images that fit certain criteria. It consists of two main steps:

  1. Training of GAN with the original dataset
  2. Exploration of latent space using Evolutionary Algorithms


Generative Model (GAN)


A GAN combines 2 models: a discriminator, which distinguishes real from fake images, and a generator, which produces fake (synthetic) images. To train them, they are pitted against each other in a minimax game. The discriminator learns by telling the images of the original dataset apart from those created by the generator. The generator learns from the feedback of the discriminator on the generated images.


Figure 2

An overview of the Generative Adversarial Network.


For each image to be generated, the generator takes as input a vector from a high dimensional space called the latent space. As the training progresses, it learns to produce images that follow the distribution of the original dataset, codifying in the latent space the patterns to produce those images.
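
The following minimal sketch illustrates the traditional way of producing new images: draw a random latent vector and pass it through the generator. It assumes a standard normal prior and a latent size of 100, neither of which is stated in the text, and `toy_generator` is a hypothetical stand-in for a trained network. Images are represented as flat lists of pixel values to keep the example dependency-free.

```python
import random

LATENT_DIM = 100  # hypothetical size; the latent dimensionality is not stated in the text


def sample_latent(rng, dim=LATENT_DIM):
    """Draw one latent vector from a standard normal prior (an assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_generator(z, n_pixels=28 * 28):
    """Stand-in for a trained generator: maps a latent vector to pixels in [0, 1).

    A real generator is a neural network. This deterministic mapping only
    illustrates that the latent vector alone determines the output image.
    """
    return [abs(z[i % len(z)]) % 1.0 for i in range(n_pixels)]


rng = random.Random(0)
z = sample_latent(rng)      # genotype: a point in the latent space
image = toy_generator(z)    # phenotype: the image produced from it
```

The evolutionary exploration described next replaces the purely random choice of `z` with a guided search.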


Experiments in Latent Space Exploration


The latent space exploration used Evolutionary Computation to build sets of images according to a given objective, which in this case was to maximize diversity. To this end, we compared 3 different evolutionary algorithms and 2 distinct image similarity metrics. The setup was designed to analyse how the different algorithms and metrics affect the diversity of the end result in different situations. We were also interested in whether the diversity measured by an algorithm was consistent with the diversity we perceive.


In all algorithms, each individual corresponds to a set of images. The images are encoded in the genotype of the individual as a set of latent vectors. The algorithms start with the random generation of the initial population. At each step, the existing population is used to generate new individuals using variation operators (crossover and mutation). Each new individual then goes through the GAN, which transforms its latent vectors (genotype) into images (phenotype). The corresponding set of images is evaluated by a fitness function that uses a similarity metric to measure diversity, and the individual is assigned the resulting score. The fitness function is the aspect most responsible for the effective evolution of the individuals. Likewise, the choice of metric is very important, since different metrics compare images differently and therefore steer the evolution in distinct ways. The result is a new pool of individuals that comes closer to the objective. The cycle is then repeated until a stopping criterion is met.


Figure 3

The Evolutionary Exploration Cycle.


The 3 algorithms used in the experiment were:

  1. Random Sampling (RS) – At each step, a new single random individual is generated. It is compared with the existing one and the best is kept.
  2. Genetic Algorithm (GA) – Works with a pool of individuals whose size stays constant across iterations. New individuals are generated through uniform crossover and creep mutation, with parents chosen by tournament selection. At each step a new pool is generated with an elitism of 1.
  3. Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) – A growing archive of individuals is kept at all times. The phenotype of individuals is mapped into this archive using multiple feature dimensions. The archive is divided into cells. Each individual that is generated, both at the initialization and the evolution phases, is mapped into a single cell of the map and only the best is kept. Each step generates a single individual using 2 random parents from the archive performing uniform crossover and creep mutation. 
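
As an illustration of the first two algorithms, here is a minimal sketch of one RS step and one GA generation. All numeric settings (mutation rate, noise scale, tournament size, pool and set sizes) are hypothetical, creep mutation is implemented as small Gaussian perturbations, and `toy_fitness` is a placeholder that scores the latent vectors directly. In the actual setup the individual would first be decoded into images by the GAN. MAP-Elites is left out because the feature dimensions of its archive are not specified here.

```python
import random


def random_individual(rng, set_size=4, latent_dim=8):
    """An individual is a set of latent vectors (sizes here are hypothetical)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(set_size)]


def uniform_crossover(parent_a, parent_b, rng):
    """Take each gene (one coordinate of one latent vector) from either parent."""
    return [[a if rng.random() < 0.5 else b for a, b in zip(va, vb)]
            for va, vb in zip(parent_a, parent_b)]


def creep_mutation(individual, rng, rate=0.1, sigma=0.2):
    """Nudge a fraction of the genes by small Gaussian noise."""
    return [[g + rng.gauss(0.0, sigma) if rng.random() < rate else g for g in vec]
            for vec in individual]


def tournament(population, scores, rng, size=3):
    """Return the lowest-scoring individual among `size` random contestants."""
    contestants = rng.sample(range(len(population)), size)
    return population[min(contestants, key=lambda i: scores[i])]


def ga_generation(population, fitness, rng):
    """Build the next pool: keep the single best individual (elitism of 1), breed the rest."""
    scores = [fitness(ind) for ind in population]
    elite = population[min(range(len(population)), key=lambda i: scores[i])]
    children = [elite]
    while len(children) < len(population):
        child = uniform_crossover(tournament(population, scores, rng),
                                  tournament(population, scores, rng), rng)
        children.append(creep_mutation(child, rng))
    return children


def random_search_step(current, fitness, rng):
    """RS: generate one random individual and keep the better of the two."""
    candidate = random_individual(rng)
    return candidate if fitness(candidate) < fitness(current) else current


def toy_fitness(individual):
    """Placeholder to minimize: negative mean squared distance between latent vectors."""
    total, pairs = 0.0, 0
    for i in range(len(individual)):
        for j in range(i + 1, len(individual)):
            total += sum((a - b) ** 2 for a, b in zip(individual[i], individual[j]))
            pairs += 1
    return -total / pairs


rng = random.Random(42)
population = [random_individual(rng) for _ in range(10)]
initial_best = min(toy_fitness(ind) for ind in population)
for _ in range(20):
    population = ga_generation(population, toy_fitness, rng)
final_best = min(toy_fitness(ind) for ind in population)
```

Because the elite is copied unchanged into the next pool, the best score in the pool can never get worse from one generation to the next.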


The fitness function consisted of the average similarity: each image was compared to every other image in the set using a similarity metric, and the fitness was the average of all these values. Since we wanted to maximize diversity, the objective was to minimize similarity.
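
A minimal sketch of this fitness computation follows. The function `toy_similarity` is a hypothetical placeholder for the actual metric, and images are flat lists of pixel values. For a symmetric metric, averaging over unordered pairs gives the same value as comparing each image to every other image in both directions.

```python
from itertools import combinations


def average_pairwise_similarity(images, similarity):
    """Mean of similarity(a, b) over every unordered pair of images in the set."""
    pairs = list(combinations(images, 2))
    if not pairs:
        raise ValueError("need at least two images")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def toy_similarity(a, b):
    """Hypothetical similarity in (0, 1]: 1 for identical images, lower as pixels differ."""
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mean_abs_diff)


identical_set = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
varied_set = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

fitness_identical = average_pairwise_similarity(identical_set, toy_similarity)
fitness_varied = average_pairwise_similarity(varied_set, toy_similarity)
```

The more varied set gets the lower (better) fitness, which is the direction the search pushes towards.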


We compared 2 similarity metrics. They were selected because they are fast to compute and because they work in different ways, which let us observe the impact of each:

  1. Root Mean Squared Error (RMSE) – strict and direct pixel by pixel comparison
  2. Normalized Cross-Correlation (NCC) – looks for certain contrast in the pixel intensities 
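
For reference, the two metrics can be sketched as below, again on flat lists of pixel values. The NCC shown is the zero-mean normalized form, which is one common definition and an assumption here. Note that RMSE is 0 for identical images and grows as they differ, so it behaves as a distance. How it was turned into a similarity score is not stated in the text.

```python
import math


def rmse(a, b):
    """Root mean squared error between two equally sized images."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def ncc(a, b):
    """Zero-mean normalized cross-correlation, in [-1, 1]."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if norm == 0.0:
        return 0.0  # convention chosen here for constant (flat) images
    return sum(x * y for x, y in zip(da, db)) / norm


ramp = [0.0, 1.0, 2.0, 3.0]
brighter_ramp = [x + 10.0 for x in ramp]

pixel_difference = rmse(ramp, brighter_ramp)   # large: every pixel differs
pattern_agreement = ncc(ramp, brighter_ramp)   # high: same intensity pattern
```

The example shows why the two metrics steer the search differently: a uniform brightness shift changes RMSE but leaves this form of NCC unchanged.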

And the experiments were performed with 3 different datasets:

  1. MNIST – handwritten digits – 60000 samples (28 x 28 x 1)
  2. Fashion-MNIST – clothing – 60000 samples (28 x 28 x 1)
  3. Facity – faces – 4204 samples (128 x 128 x 3)

Figure 4

Samples from each original dataset.


Results in Latent Space Exploration


Regarding the effectiveness of the algorithms, RS and MAP-Elites did not provide much improvement over the initial population, and their results were very similar to each other both in objective fitness and visually. The GA, however, proved much more effective at optimizing the fitness function, and the difference is noticeable in the resulting sets: the sets created by the GA are clearly much more diverse.

On the other hand, analysing the results by similarity metric shows what is particular to each one. RMSE tended to pick sets of generated images that minimize the overlap of elements and have dissimilar backgrounds (Facity). NCC, however, tended to promote contrast between the different images. This shows that the approach succeeds in searching for and reaching the predetermined objective.


Figure 5

The visual results using the RMSE as the similarity metric for all algorithms.

Figure 6

The visual results using the NCC as the similarity metric for all algorithms.

The attained results led us to use this approach in another experiment which uses GANs to perform data augmentation on datasets in order to improve the performance of classifiers.


Experiments in Classifier Improvement


We assess the viability and potential of the framework for real-world problems. The framework employs a supervisor module that uses an Evolutionary Computation approach to evolve sets of images drawn from the latent space of a GAN. The fitness function is based on the dissimilarity of the subsets generated by the GAN. This module handles the generated samples and chooses which set should be added to the training dataset. To test the framework, we explore the Human Sperm Head Morphology dataset, a multi-class biomedical problem whose small number of samples poses a challenge for supervised classification approaches.

Figure 7

Overview of approach with the inclusion of a Supervisor module.




To test our hypothesis, we performed the experiments on the Human Sperm Head Morphology dataset (HuSHeM). In biomedicine, sperm morphology analysis is a critical factor in diagnosing male infertility. The dataset is divided into 4 classes of sperm head images: Normal (54 instances), Tapered (53 instances), Pyriform (57 instances) and Amorphous (52 instances), for a total of 216 images. A small dataset like this one is an opportunity to explore Data Augmentation approaches. The dataset has no predefined split, so we used 40 instances of each class for training and cross-validation, leaving the remaining images for testing. Although the original images are in RGB, we decided to perform the first tests using the images in grayscale.

Figure 8

Examples from the original dataset for each one of the classes.

As we can see, telling the classes apart is not a simple task since they are very similar. It takes the knowledge of an expert to do so with confidence. This, of course, makes the classifiers’ task more difficult. With GANs, it is possible to generate more examples for each class, which should help the classifier.




For the experiments, we decided to focus on a single similarity metric, the Normalized Cross-Correlation (NCC), and a single EC algorithm, the GA. 

We also modified the fitness function. First, we compute the centroid of the set formed by the images of the original dataset together with the image set of the individual. Then we take the average NCC similarity between each image of the generated set and the centroid image.
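
The sketch below shows one way to compute this fitness. It assumes the centroid is the pixel-wise mean image and uses the zero-mean form of NCC. Both are assumptions, since the text does not define them, and images are again flat lists of pixel values. Whether the value is minimized or maximized during evolution is not restated here, so the function only returns the score.

```python
import math


def ncc(a, b):
    """Zero-mean normalized cross-correlation, in [-1, 1] (assumed definition)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return 0.0 if norm == 0.0 else sum(x * y for x, y in zip(da, db)) / norm


def centroid(images):
    """Pixel-wise mean image of a collection (assumed meaning of 'centroid')."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]


def centroid_fitness(original_images, generated_images):
    """Average NCC between each generated image and the centroid of both sets."""
    center = centroid(original_images + generated_images)
    return sum(ncc(img, center) for img in generated_images) / len(generated_images)


originals = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 4.0]]
close_to_originals = [[0.0, 1.0, 2.0, 3.0]]
far_from_originals = [[3.0, 2.0, 1.0, 0.0]]

score_close = centroid_fitness(originals, close_to_originals)
score_far = centroid_fitness(originals, far_from_originals)
```

A generated set that resembles the originals scores close to 1, while a set with an opposite intensity pattern scores much lower.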

Figure 9

Visual demonstration of the calculation of the fitness function. Dotted arrows for calculation of the centroid. Blue arrow for calculation of image similarity


Now, when it comes to the data augmentation, we decided to double the size of the training set, meaning that we added 40 new images to each class, making it a total of 320 training images. It is important to note that, in order to control how the images were being generated for each class, we had to evolve a separate set for each, meaning 1 generator and 1 supervisor for a single class, making a total of 4 generators and 4 supervisors.
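
The set sizes implied by these choices can be checked with a few lines of arithmetic. The class counts are those given for HuSHeM above, and the variable names are only illustrative.

```python
class_counts = {"Normal": 54, "Tapered": 53, "Pyriform": 57, "Amorphous": 52}
TRAIN_PER_CLASS = 40       # original images kept per class for training
GENERATED_PER_CLASS = 40   # images added per class by the class's supervisor

# Images left for testing in each class after removing the training images
test_counts = {name: n - TRAIN_PER_CLASS for name, n in class_counts.items()}

total_images = sum(class_counts.values())
original_train_size = TRAIN_PER_CLASS * len(class_counts)
augmented_train_size = original_train_size + GENERATED_PER_CLASS * len(class_counts)
test_size = sum(test_counts.values())
```

This gives 160 original training images, 320 after augmentation, and 56 test images (14 Normal, 13 Tapered, 17 Pyriform, 12 Amorphous).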


As for the training of the classifiers, both the augmented and the non-augmented training were performed using 5-fold cross-validation, across 5 evolutionary runs with different random seeds.
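
A minimal sketch of how such a protocol can be laid out is given below. It assumes the 5 folds are drawn over the 160 original training images and that each seed gets its own shuffle. The text does not say how the seeds and folds were combined, nor whether generated images were kept out of the validation folds, so this is only one plausible arrangement.

```python
import random


def k_fold_indices(n_samples, k, seed):
    """Shuffle sample indices with `seed` and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = k_fold_indices(n_samples, k, seed)
    for held_out in range(k):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation


# 5 seeds x 5 folds = 25 train/validation splits in total
splits_per_seed = {
    seed: list(cross_validation_splits(160, k=5, seed=seed)) for seed in range(5)
}
```

With 160 images and 5 folds, each split trains on 128 images and validates on 32.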


Results in Classifier Improvement


When comparing the images of the original set with the images generated by the trained GANs, we can see that, in general, the images are generated with fairly good quality.

Figure 10

Comparison between original and generated images, left and right column respectively, for each one of the classes, Normal, Tapered, Pyriform, Amorphous, from top to bottom

The latent space exploration produces clear differences between the best individuals of the initial and final generations, which even lay people can notice.

Figure 11

Comparison between the initial and final, 0th and 500th, generations of the supervision process for the Normal class

Finally, the test results for the 5 runs of classifier training show that, on average, the classifiers trained with the augmented dataset performed better across all metrics than those trained without augmentation.

Figure 12

Performance results comparison, between original and augmented training of classifiers, on the test set


Even though the difference is small, and 5 runs are not enough to draw definite conclusions, this shows that our hypothesis is plausible and encourages further work and improvement.


To conclude, this topic requires further exploration, but the current results are promising. Future work will include experiments with RGB images, supervision with a previously trained classifier, and supervision during the training of a classifier using the classifier being trained itself.




  • P. Fernandes, J. Correia, and P. Machado, “Evolutionary Latent Space Exploration of Generative Adversarial Networks,” in Applications of Evolutionary Computation – 23rd European Conference, EvoApplications 2020, Held as Part of EvoStar 2020, Seville, Spain, April 15-17, 2020, Proceedings, 2020, pp. 595-609.

  • P. Fernandes, J. Correia, and P. Machado, “Towards Latent Space Exploration for Classifier Improvement,” in 24th European Conference on Artificial Intelligence (ECAI 2020) – ADGN20: First workshop on Applied Deep Generative Networks, 2020.