Towards a Visual Language Using Neural Networks

Even before the existence of a formal writing system, humans communicated knowledge through proto-writing, which consisted of ideographic symbols that represented a limited number of concepts [1]. In recent years, there have been proposals to design a universal written language based on ideograms. There have also been many attempts to study the origin and evolution of language using computer simulations. However, most of these approaches focus on the evolution of symbolic or textual communication.




In this project, we proposed a system that aims to generate images that visually represent concepts by jointly training an encoder and a decoder to transfer a representation of a word through a noisy channel, following the Shannon model of communication [2]. Shannon states that a general communication system can be divided into five elements: the information source, which produces the message; the transmitter, which creates the signal; the channel, which is the medium that carries the information; the receiver, which transforms the received signal back into the original message; and the destination, for which the message is intended.


Figure 1

Schematic diagram of the Shannon model of communication


To obtain a vector representation of concepts, we used a word-to-vector dataset called Global Vectors for Word Representation (GloVe), in which each distinct word is represented by its own vector. For the generation of the images, we used two neural networks based on the Deep Convolutional Generative Adversarial Network (DCGAN) architecture [3]. We modified the decoder implementation so that its output is a reconstruction of the encoder's input vector.

The training process is similar to the one used in autoencoders. First, the encoder tries to encode a set of word vectors from the GloVe dataset as RGB or grayscale images. Then, noise is applied to the generated images through a set of transformations: rotations, translations and normalization of the pixel data. Finally, the decoder tries to reconstruct the original vectors from the images it receives. The quality of the encoder and the decoder is assessed by how well the decoder reconstructs the original vectors. In this way, the two networks are forced to cooperate and converge on a vocabulary that both can understand.
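The round trip described above (encode, distort, decode, compare) can be sketched as follows. This is only an illustration under stated assumptions: the two networks are replaced by single random linear layers, the vector size and image size are arbitrary, and the channel uses wrap-around shifts and 90-degree rotations as stand-ins for the transformations used in the project. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

WORD_DIM = 50   # assumed word-vector size
IMG_SIDE = 8    # assumed (tiny) image side, for illustration only

# Hypothetical stand-ins for the encoder and decoder networks:
# one random linear layer each instead of a DCGAN-style model.
W_enc = rng.normal(scale=0.1, size=(IMG_SIDE * IMG_SIDE, WORD_DIM))
W_dec = rng.normal(scale=0.1, size=(WORD_DIM, IMG_SIDE * IMG_SIDE))


def encode(word_vec):
    """Map a word vector to a grayscale image with pixels in [-1, 1]."""
    return np.tanh(W_enc @ word_vec).reshape(IMG_SIDE, IMG_SIDE)


def noisy_channel(image, max_shift=1):
    """Distort the image with a random shift and a random 90-degree rotation."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))  # wrap-around translation
    return np.rot90(shifted, k=rng.integers(0, 4))


def decode(image):
    """Map a received image back to a word-vector estimate."""
    return W_dec @ image.reshape(-1)


def reconstruction_loss(word_vec):
    """Mean squared error between the original and the reconstructed vector."""
    received = noisy_channel(encode(word_vec))
    return float(np.mean((decode(received) - word_vec) ** 2))


word_vec = rng.normal(size=WORD_DIM)  # placeholder for a real GloVe vector
print(reconstruction_loss(word_vec))
```

In a training loop, this loss would be minimised with respect to the parameters of both networks at once, which is what forces the encoder and the decoder to cooperate.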


Visualization Tool

To explore the results more easily, we developed a webpage that presents the generated images. We used t-distributed stochastic neighbour embedding (t-SNE) to transform our high-dimensional vectors into a representation that we can visualise, in our case two dimensions used as x and y values, and created a 2D world by placing the images at the corresponding positions.
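The placement step can be illustrated with a small sketch. It assumes the two-dimensional coordinates have already been computed by the embedding, and it rescales them to pixel positions on a canvas; the function name, canvas size and margin are hypothetical choices, not details of the actual webpage.

```python
def to_canvas(points, width=1000, height=1000, margin=50):
    """Rescale 2D embedding coordinates to pixel positions on a canvas."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    # Fall back to 1.0 so that identical coordinates do not divide by zero.
    span_x = (max(xs) - min_x) or 1.0
    span_y = (max(ys) - min_y) or 1.0

    placed = []
    for x, y in points:
        px = margin + (x - min_x) / span_x * (width - 2 * margin)
        py = margin + (y - min_y) / span_y * (height - 2 * margin)
        placed.append((round(px), round(py)))
    return placed


# Made-up embedding coordinates for three images.
print(to_canvas([(0.0, 0.0), (10.0, 5.0), (5.0, 2.5)]))
```

Each image would then be drawn at its returned position, so that images whose vectors are neighbours in the embedding also appear close together on the page.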


Figure 2

A visualisation of the website developed for this project.



Usually, the quality of autoencoders [4] can be analysed based on the similarity of the vectors produced for each image. If the original images are strongly similar, the vectors produced should also be similar. Conversely, when two images are different, the autoencoder should produce distant vectors for them. Our model works in the opposite direction, from vectors to images, so we compared images of similar and different concepts to infer whether the model creates similar images for related concepts while keeping them distinct from images of unrelated concepts.


Figure 3

Comparison between similar images (‘man’ and ‘woman’) and different images (‘islam’ and ‘chrysler’).


Figure 3 presents two pairs of images: a similar pair (‘man’ and ‘woman’) and a different pair (‘islam’ and ‘chrysler’). The two images that represent the words ‘man’ and ‘woman’ have very similar characteristics. Even though the two words are antonyms, they appear in similar contexts, which results in similar images. Word-to-vector training focuses on word associations rather than on the meaning of the words; therefore, the word vectors that are closest to each other are the ones that appear in similar contexts. The second pair (‘islam’ and ‘chrysler’) consists of the two words in the dataset with the largest distance between them. As can be observed, the two images are very different from each other.
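Selecting the most distant pair of words can be sketched as a search over all pairs. The example assumes Euclidean distance, which the text does not specify, and uses made-up two-dimensional toy vectors rather than real GloVe values; the function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def farthest_pair(vectors):
    """Return the pair of words whose vectors are farthest apart."""
    return max(
        combinations(vectors, 2),
        key=lambda pair: euclidean(vectors[pair[0]], vectors[pair[1]]),
    )


# Toy vectors, invented for illustration only.
toy = {
    "man": [1.0, 1.0],
    "woman": [1.0, 1.2],
    "islam": [-3.0, 0.0],
    "chrysler": [4.0, -1.0],
}

print(farthest_pair(toy))
print(euclidean(toy["man"], toy["woman"]))
```

Checking every pair is quadratic in the vocabulary size, so for a full vocabulary a sampled or approximate search may be more practical.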

One of the properties of a word-to-vector architecture is that semantic and syntactic patterns can be reproduced using vector arithmetic. In our analysis, we investigated whether these patterns can also be reproduced in our model. In a first experiment, we calculated the offset (the difference vector) from the vector that represents a country to the vector that represents an inhabitant of that country. Then, we created another word vector by adding this offset to the vector of a different country and generated the corresponding image.
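The arithmetic behind this experiment is simple enough to show directly. In the sketch below, the helper names and the toy two-dimensional vectors are hypothetical; real word vectors have many more dimensions, and the country pair used to compute the difference is a placeholder.

```python
def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def add(u, v):
    """Element-wise sum u + v."""
    return [a + b for a, b in zip(u, v)]


def transfer_relation(source, source_variant, target):
    """Apply the difference (source_variant - source) to a different target vector."""
    return add(target, subtract(source_variant, source))


# Toy vectors, invented for illustration only.
country = [1.0, 0.0]        # a known country
inhabitant = [1.0, 2.0]     # the inhabitant of that country
other_country = [3.0, 0.5]  # the country whose inhabitant we want to synthesise

print(transfer_relation(country, inhabitant, other_country))
```

The resulting vector can then be passed to the encoder in the same way as a vector taken from the dataset.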

Figure 4

Comparison between the image generated from the ‘american’ vector produced using vector arithmetic and the image generated from the ‘american’ vector in the dataset.


As can be observed in Figure 4, the image generated using the real vector (‘american’) and the image generated using the word vector created with the method described above (‘interpolated american’) are very similar, which indicates that our model can produce images similar to those of existing vectors through vector operations.

We expanded our analysis beyond countries and nationalities to assess whether these properties can also be observed in verb tenses. First, we calculated the vector difference between a verb in the present tense and its past tense, and then used it to synthetically create past tenses for other verbs.


Figure 5

Comparison between the image generated from the ‘took’ vector produced using vector arithmetic and the image generated from the ‘took’ vector in the dataset.


Figure 5 compares the two images: the one generated from the vector that we created synthetically and the one generated from the original vector that represents the word ‘took’. As can be observed, our model is also able to generate images that represent past tenses from the present tense. These properties might be useful when a concept is not available in the dataset, since we can generate its word vector synthetically and then the image that represents it.

Our visualisation tool also reveals some behaviours that may be unexpected. One example involves the word ‘one’. In the visualisation, related words form groups. The numbers form such a group as well; however, because the word ‘one’ more often appears in the sense of a single unit or individual, it is placed farther from the rest of the numerals.


Figure 6

Visualisation of the group formed by the numbers.


Related links



[1] Schmandt-Besserat, D. (2014). The evolution of writing. Austin, Texas: University of.

[2] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379-423.

[3] Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR, abs/1511.06434.

[4] Makhzani, A., Shlens, J., Jaitly, N., & Goodfellow, I. (2016). Adversarial autoencoders. In International Conference on Learning Representations.