Using Autoencoders to Generate Skeleton-based Typography

Despite the emergence of these new possibilities, most fonts continue to be developed by type designers who study the shape and design of each letter with great precision. Type design is a hugely complex discipline, and this expertise ensures typographic quality. Moreover, with the proliferation of web typography and online reading, the use of variable and dynamic fonts has increased, giving more options to both font designers and font users. Even though new computational systems create expressive and out-of-the-box results, they lack the knowledge of an expert. Yet this can also be an advantage, enabling a less constrained exploration that extends the range of possibilities. A balance is therefore necessary to take advantage of both computational systems and expert labour. Moreover, most generative systems that create typography focus on the letters’ filling and do not treat the structure of a glyph as a variation parameter.


To overcome these limitations, we propose a Variational Autoencoder model that creates new glyph skeletons by interpolating between existing ones. Our skeleton-based approach uses the skeletons of glyphs from existing fonts as input to ensure the quality of the generated results. The separation of the structure and the filling of the glyphs adds variability to the results: different glyphs can be created by changing just the structure or just the filling. The proposed approach enables the exploration of a continuous range of font styles by navigating the latent space learnt by the Autoencoder. With the results of this approach, it is also possible to apply different filling methods that use the stroke width of the original letters to produce new glyphs.




One of the most important aspects of our approach is the collection and pre-processing of the dataset. We compile a collection of fonts in TTF format with different weights from Google Fonts. This dataset comprises five different types of fonts: Serif, Sans Serif, Display, Handwriting and Monospace. We opted not to use the Handwriting and Display fonts because they were largely distinct from the rest, which is not desirable for our approach. Their ornamental components, sometimes not even filled, complicate the extraction of a representative skeleton.


After this selection, we were left with 2623 TTF files. Then, we use a skeletonisation library to extract the skeleton from a font file. It applies the Zhang-Suen thinning algorithm to derive the structural lines of a binary image. The library also allows extracting the points of the skeletons as well as the connections between them, and it can calculate the distance between each point and its closest borderline pixel, returning the stroke width of the original glyph at that point. For each font, we rasterise the vectors that compose the skeleton of each glyph into a 64x64 px black-and-white image. We also save all point positions and the stroke widths of the original glyph in a file, to be used later to generate the filling of the glyphs.
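The thinning-plus-stroke-width step can be sketched as follows. This is a minimal stand-in, not the exact library used in the paper: it uses scikit-image's `skeletonize` (which yields one-pixel-wide structural lines comparable to Zhang-Suen thinning) and a Euclidean distance transform to estimate the stroke width at each skeleton point.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import skeletonize

def extract_skeleton_and_widths(glyph: np.ndarray):
    """Thin a binary glyph image to its skeleton and estimate stroke widths.

    Stand-in for the skeletonisation library described in the text: the
    distance from a skeleton pixel to the closest background pixel is half
    the local stroke width, so doubling it recovers the width.
    """
    binary = glyph > 0
    skeleton = skeletonize(binary)               # one-pixel-wide skeleton
    dist = distance_transform_edt(binary)        # distance to nearest border
    widths = 2.0 * dist[skeleton]                # stroke width per skeleton pixel
    return skeleton, widths

# Toy 64x64 "glyph": a vertical bar 8 px wide.
glyph = np.zeros((64, 64), dtype=np.uint8)
glyph[8:56, 28:36] = 1
skeleton, widths = extract_skeleton_and_widths(glyph)
```

In the real pipeline the skeleton points and widths would then be saved per glyph, as described above, for the later filling step.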




Figure 1 presents a summary of the architecture used. The encoder processes the greyscale images and encodes each one into two 64-D latent vectors: a set of means (mu) and a set of standard deviations (sigma) of a Gaussian distribution. Using these, we draw a sample z from the Gaussian, which is fed to both decoders: the image decoder and the sketch decoder. The image decoder is a neural network that decodes z into a greyscale image, which is compared with the original input. The sketch decoder is an LSTM that transforms z into a sequence of 30 points forming a single continuous path. This path is rasterised with a differentiable vector graphics library to produce an output image, which is also compared with the original. While a standard Variational Autoencoder works at the pixel level, the output of our sketch decoder is a sequence of points, enabling the generation of scalable vector graphics that can be manipulated without loss of quality.
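The sampling step between the encoder and the two decoders is the standard reparameterisation trick. The sketch below shows only that step with NumPy (the convolutional encoder, image decoder and LSTM sketch decoder are omitted); the 64-D size follows the architecture described above.

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Sampling this way keeps z differentiable with respect to the
    encoder-predicted mu and sigma, which is what lets the Variational
    Autoencoder be trained end-to-end.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
mu = np.zeros(64)          # encoder-predicted means (64-D latent)
log_sigma = np.zeros(64)   # encoder-predicted log standard deviations
z = reparameterize(mu, log_sigma, rng)  # input to both decoders
```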


Figure 1

Diagram of the architecture of our approach.


Reconstruction of skeletons


Our model returns a sequence of points that, when connected, create a reconstruction of the skeleton image used as input. In most cases, the generated strokes reconstruct the basic features of the skeleton. For example, in the case of the letter “A”, the network first creates one stem, then the crossbar that connects both stems, and finally draws the second stem. Even though there is nothing to control the distance between points or to enforce their proximity, the network learns that it needs to connect both stems at the beginning and at the end of the sequence. Another interesting feature observable in the reconstruction is related to how the Artificial Neural Network (ANN) handles the letter “T”. This letter presents one of the simplest skeletons of the alphabet, so the network learns to generate the whole structure of the letter much more quickly than for the others.
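Because the decoder's output is a point sequence rather than pixels, turning it into a scalable vector graphic is a one-liner. The helper below is hypothetical (not part of the published code) and assumes the output is simply a list of (x, y) tuples:

```python
def points_to_svg_path(points):
    """Convert a decoded point sequence into a single continuous SVG path.

    Hypothetical helper: joins the points with straight line segments,
    matching the single continuous stroke produced by the sketch decoder.
    """
    cmds = [f"M {points[0][0]} {points[0][1]}"]           # move to first point
    cmds += [f"L {x} {y}" for x, y in points[1:]]          # line to each next point
    return " ".join(cmds)

path = points_to_svg_path([(0, 0), (10, 0), (10, 10)])
# → "M 0 0 L 10 0 L 10 10"
```

The resulting path string can be dropped into an SVG `<path d="…">` element and scaled freely, which is the manipulability advantage mentioned above.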


Latent representation of font style


To understand whether the trained model learns a latent representation of the different letters that is smooth and interpretable, we visualised the 64-dimensional z vectors for the dataset. We take all 68,198 images of the dataset and encode them using our network. Then, using the means and standard deviations of each encoded image, we take a sample from the distribution. Finally, we reduce the dimensionality of all the z vectors using the t-SNE algorithm. Figure 2 presents the visualisation of the results.
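The dimensionality-reduction step can be sketched with scikit-learn's t-SNE implementation. Here random vectors stand in for the encoded z samples, and the perplexity value is an illustrative choice, not the one used in the paper:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the encoded dataset: 50 random 64-D z vectors
# (the real visualisation uses all 68,198 encoded glyph images).
rng = np.random.default_rng(0)
z_vectors = rng.standard_normal((50, 64))

# Reduce the 64-D latent samples to 2-D points for plotting.
embedding = TSNE(n_components=2, perplexity=15,
                 random_state=0).fit_transform(z_vectors)
```

Each 2-D point can then be scattered and coloured by letter class to reveal the clusters discussed below.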


Figure 2

Visualisation of the latent space of the trained model, reduced to two dimensions with t-SNE.


As can be observed, in general the model separates the different letters into clusters. In some cases, similar letters are also placed near each other, for example the letters “B”, “R” and “P”. These three letters present similar anatomical characteristics: they share a top bowl and a vertical stem, so they end up close together. The same happens for the letters “T” and “I”, which are placed further from the rest but near each other. Although the majority of the skeletons for the letter “I” consist of a single stem, in serifed fonts they resemble the letter “T” with a cross stroke at both the top and the bottom of the letter. This strong similarity between the two letters places them together in the latent space.


Exploring the latent space


After analysing whether the latent space captures font characteristics in a meaningful representation, we explore linear interpolations between pairs of skeletons of a given glyph. First, we encode two randomly selected fonts from the dataset into their corresponding z vectors. Then, we linearly interpolate between the two vectors and, using the trained sketch decoder, reconstruct the skeletons for the intermediate vectors (see Figure 3).
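The interpolation itself is a straight line in latent space. A minimal NumPy sketch (the 64-D size follows the architecture described earlier; in the full system each intermediate vector would be passed to the sketch decoder):

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps):
    """Return `steps` vectors linearly interpolated between z_a and z_b.

    Row 0 is z_a, the last row is z_b; each intermediate row is decoded
    into a skeleton by the trained sketch decoder.
    """
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - t) * z_a + t * z_b for t in ts])

rng = np.random.default_rng(1)
z_a = rng.standard_normal(64)   # encoded font A
z_b = rng.standard_normal(64)   # encoded font B
path = interpolate_latents(z_a, z_b, steps=8)
```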

The results show that the model is not only able to decode meaningful skeletons but also to control several of their characteristics. In the example of the letter “N”, the model controls not only the width of the letter but also its height. As the width of the letter changes, its height is modified to match its parents, which allows wider control over the skeletons that can be created. In the case of the letter “T”, the model can also control the degree of slant: moving from the left input skeleton to the right, the stem of the letter approaches a vertical position.


Figure 3

Results of the latent space interpolation between different skeletons of the same letter.


We also interpolate between skeletons of different letters (see Figure 4). The results demonstrate that the model is able to morph the skeleton of one letter into that of another. Smooth morphings are not always to be expected, because some letters have completely different anatomical parts, for instance “Z” and “T”. The generated skeleton starts as a “Z” but gradually loses its bottom cross stroke; meanwhile, its diagonal stroke slightly changes its angle and transforms itself into the stem of a “T”. Other transformations are expected, such as between “P” and “F”, which share a stem. Along the interpolation, the generated skeleton gradually opens its bowl to create the arms of the “F”, while slightly inclining the stem to match the italic inclination of the “F”.


Figure 4

Results of the latent space interpolation between skeletons of different letters.


Transforming skeletons into glyphs


As mentioned before, the skeleton extraction library allows, in addition to extracting the points, obtaining the stroke width at each point of the skeleton. When building the dataset, extracting the skeletons of the uppercase letters of the Latin alphabet for each selected font file, we saved the points of each skeleton and their stroke widths for later use. With these values, we can interpolate the stroke width along the generated skeleton. Figure 5 shows some results in which each row represents a different interpolation. The generated glyphs look similar to a regular font and, with a few adjustments, could be used as a variable font. With interpolated filling, the contrast between variations becomes more visible, because we add another parameter to the glyph design. By splitting the skeleton and the filling we gain more visual possibilities, as we are no longer tied to a single filling. In these tests, we use the filling of the original fonts to fill the intermediate ones, but this is not mandatory: we can use some fonts to create the skeleton and others to create the filling, or even use a fixed stroke width along the skeleton.
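The stroke-width interpolation can be sketched as follows. The helper is hypothetical (the resampling to a common length is our assumption for handling skeletons with different point counts); it blends the saved per-point widths of two parent fonts, which are then used to fill the interpolated skeleton:

```python
import numpy as np

def interpolate_stroke_widths(widths_a, widths_b, t):
    """Blend the saved stroke widths of two parent fonts at position t in [0, 1].

    Hypothetical helper: the two width sequences are resampled onto a common
    parameterisation of the skeleton path, then linearly interpolated.
    """
    n = max(len(widths_a), len(widths_b))
    xs = np.linspace(0.0, 1.0, n)
    wa = np.interp(xs, np.linspace(0.0, 1.0, len(widths_a)), widths_a)
    wb = np.interp(xs, np.linspace(0.0, 1.0, len(widths_b)), widths_b)
    return (1.0 - t) * wa + t * wb

# Halfway between a thin parent (4 px strokes) and a bold one (8 px strokes).
blended = interpolate_stroke_widths(np.full(30, 4.0), np.full(30, 8.0), t=0.5)
```

Setting t=0 or t=1 recovers one parent's filling, and a constant array reproduces the fixed-width variant mentioned above.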


Figure 5

Results of the latent space interpolation filling the skeleton with an interpolated stroke width.


With this tool, designers can generate skeletons and develop a filling to create their own versions of glyphs. Moreover, visual identities created nowadays are becoming more dynamic: museums, institutions, organisations, events and media increasingly rely on this type of identity. Designers can adapt their work to these new possibilities by creating dynamic identities with animations and mutations. As mentioned before, our system provides a tool that facilitates the process of building these dynamic identities with a typographic component. To demonstrate the application of our system, we made a series of experiments with different ways of using the skeletons generated by our model (see Figures 6 and 7).


Figure 6

Example of application of the generated skeletons into glyphs to create a typographic identity.


Figure 7

Example of application of the generated skeletons into glyphs to create a typographic identity.


Code at: Github


  • J. Parente, L. Gonçalo, T. Martins, J. M. Cunha, J. Bicker, and P. Machado, “Using Autoencoders to Generate Skeleton-based Typography,” in (to be published in) Artificial Intelligence in Music, Sound, Art and Design – 12th International Conference, EvoMUSART 2023, Held as Part of EvoStar 2023, Brno, Czech Republic, April 12-14, 2023, Proceedings, 2023.