Driving school for autonomous cars

What are datasets? A dataset is a large collection of data, for instance a database of handwritten letters and digits, images of human faces, or stock market records, that scientists can use to test their algorithms. If two research groups wish to find out whose algorithm performs better at recognizing traffic signs, they run their techniques on one of these datasets and compare their methods on an equal footing. For instance, CamVid, the Cambridge-driving Labeled Video Database, offers several hundred images depicting a variety of driving scenarios. It is meant to be used to test classification techniques: the input is an image, and the question is, for each pixel, which class it belongs to. Classes include roads, vegetation, vehicles, pedestrians, buildings, trees, and more.
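To make this concrete, here is a minimal sketch of what such a per-pixel labeling looks like. The class IDs below are made up for illustration; the real CamVid label set defines 32 classes with its own numbering.

```python
# Hypothetical numeric class IDs, not the actual CamVid labels.
ROAD, VEHICLE, PEDESTRIAN, BUILDING = 0, 1, 2, 3

# A tiny 4x4 label map: each entry is the class of the corresponding pixel.
labels = [
    [BUILDING, BUILDING, BUILDING,   BUILDING],
    [ROAD,     ROAD,     VEHICLE,    ROAD],
    [ROAD,     VEHICLE,  VEHICLE,    ROAD],
    [ROAD,     ROAD,     PEDESTRIAN, ROAD],
]

# Count how many pixels fall into each class.
counts = {}
for row in labels:
    for c in row:
        counts[c] = counts.get(c, 0) + 1
print(counts)  # {3: 4, 0: 8, 1: 3, 2: 1}
```

A classifier is judged on how often its predicted label map agrees with a hand-annotated one like this, pixel by pixel.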

These regions are labeled with the different colors that you see in these images. To have a usable dataset, we have to label tens of thousands of these images, and as you may imagine, creating such labeled images requires a ton of human labor. One annotator has to accurately trace the edges of each individual object in every image, and a second annotator should cross-check the result to make sure everything is in order.

That’s quite a chore. And we haven’t even talked about all the other problems that arise from processing footage shot with handheld cameras: stabilization and calibration take quite a bit of time and effort as well. So how do we create huge and accurate datasets without investing a remarkable amount of human labor?

Well, hear out this incredible idea: what if we recorded a video of ourselves wandering about in an open-world computer game and annotated those images? This way, we enjoy several advantages:

1. Since we have recorded continuous video, after annotating the very first image we also have information from the following frames, so if we do it well, we can propagate the labels from one image to the next. That’s a huge time saver.

2. In a computer game, one can stage and record animations of important but rare situations that would otherwise be extremely difficult to film. Adding rain or day-and-night cycles to a set of images is also trivial, because we can simply query the game engine to do this for us.

3. Not only that, but the algorithm also has some knowledge about the rendering process itself. This means that it looks at how the game communicates with the software drivers and the video card, tracks when the geometry and textures for a given type of car are being loaded or discarded, and uses this information to further help the label propagation process.

4. We don’t have any of the problems that stem from using handheld cameras. Noise, blurriness, problems with the lens, and so on are all non-issues. With the earlier CamVid dataset, annotating one image takes around 60 minutes; with this dataset, about 7 seconds.
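The label propagation idea from point 1 can be sketched in a few lines. This is only a toy stand-in: a single uniform shift plays the role of the per-pixel motion that a real system would obtain from the game engine or from optical flow, and pixels newly revealed by the motion are flagged for human review.

```python
UNKNOWN = -1  # pixels with no propagated label still need a human annotator

def propagate(labels, dx):
    """Carry a label map over to the next frame, assuming the whole scene
    shifted dx pixels to the right (a stand-in for real per-pixel motion)."""
    h, w = len(labels), len(labels[0])
    nxt = [[UNKNOWN] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if 0 <= x + dx < w:
                nxt[y][x + dx] = labels[y][x]
    return nxt

frame0 = [[0, 0, 1, 0],
          [0, 1, 1, 0]]
frame1 = propagate(frame0, 1)
print(frame1)  # [[-1, 0, 0, 1], [-1, 0, 1, 1]]
```

Only the newly revealed left column comes back as UNKNOWN, so instead of relabeling the whole frame, the annotator touches up a thin strip.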

Thus, the authors have published almost 25,000 high-quality images and their annotations to aid future computer vision and machine learning research. That’s a lot of images, but of course the ultimate question arises: how do we know whether these are really high-quality training samples? They were only taken from a computer game, after all! Well, the results show that using this dataset, we can achieve a quality of learning equivalent to the CamVid dataset while using only one-third as many images. Excellent piece of work, absolutely loving the idea of using video game footage as a surrogate for real-world data.

Are you a gamer? Then you could help us with Open Source Self-Driving Car (OSSDC.org) development! OSSDC can use your help to generate datasets from lots of driving and simulator games. Follow @gtarobotics for more details.


See the scene around minute 17; that would be tricky to do in a self-driving car (SDC).


AI to Simplify

There are two main types of image files: raster and vector. Raster formats such as JPG, GIF, and PNG are more common in general and are widely used on the web. Vector graphics are common for images that will be applied to a physical product.

When using a raster program, you paint an image, much like dipping a brush in paint and applying it to a canvas. You can blend colors to soften the transition from one color to another.

When using a vector program, you draw the outlines of shapes: e.g. an eye shape, a nose shape, a lip shape. Each shape displays a single color.


Many images can be made with either a raster or a vector program and look exactly the same in both. Images with a subtle gradation from one color to another are the ones that will look most different, since vector programs need to create a separate shape for each shade of color.
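The difference between the two representations can be sketched in a few lines. Here a vector shape is just a handful of numbers (coordinates, not pixels), so scaling it is exact, while the raster version is a fixed grid of samples. The dictionary format below is made up for illustration; real vector formats like SVG work on the same principle.

```python
# A vector description: resolution-independent numbers, not pixels.
vector_rect = {"x": 1, "y": 1, "w": 2, "h": 1}

def rasterize(rect, width, height):
    """Sample the vector shape onto a pixel grid (1 = filled, 0 = empty)."""
    return [[1 if rect["x"] <= x < rect["x"] + rect["w"]
                  and rect["y"] <= y < rect["y"] + rect["h"] else 0
             for x in range(width)]
            for y in range(height)]

def scale(rect, s):
    """Scaling vector art is exact: just multiply the coordinates."""
    return {k: v * s for k, v in rect.items()}

small = rasterize(vector_rect, 4, 3)            # a 4x3 raster image
big = rasterize(scale(vector_rect, 2), 8, 6)    # rescale, then re-rasterize: stays crisp
```

Enlarging the raster grid `small` directly would only stretch its pixels; re-rasterizing the scaled vector description gives a crisp result at any size.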


Some vector programs do have the ability to create color gradients within one single shape, but these are actually raster effects. A vector graphic with gradients contains both vector and raster elements and won’t be suitable for a process that requires 100% vector or true vector art. 

Why are we not using vector graphics everywhere?

1) The smoother the color transitions and the more detail we have in our images, the quicker the advantage of vectorization evaporates. 

2) The image tracing procedure is not trivial, and the output quality depends heavily on the vectorization algorithm. It is often unclear in advance whether it will work well on a given input. So now we know everything we need to be able to understand and appreciate this amazing piece of work. The input is a rough sketch, that is, a raster image, and the output is a simplified, cleaned-up, and vectorized version of it. We are not only doing vectorization but simplification as well. This is a game changer, because this way we can exploit the additional knowledge that these input raster images are sketches, hand-drawn images, and therefore contain a lot of extra fluff that would be undesirable to retain in the vectorized output; hence the name, sketch simplification.


In each of these cases, it is very impressive how well it works. The next question, obviously, is: how does this wizardry happen? It happens by using a classic deep learning technique, a convolutional neural network.


A convolutional neural network is trained on a large number of input and output pairs. However, this is no ordinary convolutional neural network! This particular variant differs from the standard, well-known architecture in that it is augmented with a series of upsampling convolution steps. Intuitively, the algorithm learns a sparse and concise representation of the input sketches: it focuses on the most defining features and throws away all the unneeded fluff. The upsampling convolution steps make it able not only to understand, but to synthesize new, simplified, high-resolution images that we can easily vectorize using standard algorithms. It is fully automatic and requires no user intervention.
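To give a rough intuition for the shrink-then-enlarge idea (this is a toy stand-in, not the authors' network): average pooling plays the role of the downsampling half, and nearest-neighbour copying plays the role of upsampling. In the real architecture, both directions are learned convolutions, so the upsampling half synthesizes clean strokes instead of merely copying pixels.

```python
def avg_pool2(img):
    """Downsample by 2: each output pixel averages a 2x2 block (the shrinking half)."""
    h, w = len(img), len(img[0])
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) / 4.0
             for x in range(w // 2)]
            for y in range(h // 2)]

def upsample2(img):
    """Nearest-neighbour upsampling by 2 (the enlarging half); a learned
    upsampling convolution would refine this instead of just copying."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

img = [[0, 0, 4, 4],
       [0, 0, 4, 4],
       [8, 8, 0, 0],
       [8, 8, 0, 0]]
code = avg_pool2(img)       # 2x2 compact representation: [[0.0, 4.0], [8.0, 0.0]]
restored = upsample2(code)  # back to the full 4x4 resolution
```

The compact middle representation is where the "unneeded fluff" gets squeezed out; the upsampling steps are what let the network output a full-resolution image again.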

The sketch simplification researchers performed a user study and compared their method against top vectorization tools that work directly on raster images. Users preferred this approach over 97% of the time with respect to either of the two tools.


Neural networks are remarkably efficient tools for solving a number of complicated problems. The first applications of neural networks usually revolved around classification problems. Classification means that we have an image as an input, and the output is, let’s say, a simple decision: does it depict a cat or a dog? The input layer has as many nodes as there are pixels in the input image, and the output layer has 2 units; we look at whichever of the two fires more strongly to decide whether the network thinks it is a dog or a cat. Between these two, there are hidden layers where the neural network is asked to build an inner representation of the problem that is efficient at recognizing these animals.
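The "whichever fires more strongly" step is simple to sketch. The two activation values below are made up; in a trained network they would come out of the last layer. Softmax turns them into probabilities, and the larger one decides the class.

```python
import math

def softmax(logits):
    """Turn raw output-unit activations into probabilities that sum to 1."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer activations: unit 0 = "cat", unit 1 = "dog".
logits = [2.0, 0.5]
probs = softmax(logits)
label = ["cat", "dog"][probs.index(max(probs))]
print(label)  # cat
```

Reading off the most active output unit like this is the standard final step of nearly every classification network.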


So what is an autoencoder? An autoencoder is an interesting variant with two important changes. First, the number of neurons is the same in the input and the output, so we can expect the output to be an image that is not only the same size as the input, but actually the same image. Now, this normally wouldn’t make any sense: why would we want to invent a neural network to do the job of a copying machine? So here comes the second part: we have a bottleneck in one of the layers. This means that the number of neurons in that layer is much smaller than usual, so the network has to find a way to represent the data as best it can with a much smaller number of neurons. If you have a smaller budget, you have to let go of all the fluff and concentrate on the bare essentials, so we can’t expect the output image to be identical to the input, but it is hopefully quite close. These autoencoders are capable of creating sparse representations of the input data and can therefore be used for image compression. I consciously avoid saying “they are useful for image compression”.
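Here is the bottleneck idea in miniature. A real autoencoder learns its encoder and decoder weights from many training samples; in this hand-picked linear sketch the "weights" are chosen in advance so that a single bottleneck neuron suffices, purely to show why compression through a bottleneck can work when the data has structure.

```python
# The data secretly lies along one direction in 4-D, so one number can describe it.
direction = [0.5, 0.5, 0.5, 0.5]  # unit-length direction vector

def encode(x):
    """4 input 'neurons' squeezed into 1 bottleneck neuron (a dot product)."""
    return sum(a * b for a, b in zip(x, direction))

def decode(z):
    """1 bottleneck neuron expanded back into 4 output neurons."""
    return [z * d for d in direction]

x = [2.0, 2.0, 2.0, 2.0]  # a sample that fits the hidden structure
z = encode(x)             # the whole input compressed into one number: 4.0
x_hat = decode(z)         # perfect reconstruction: [2.0, 2.0, 2.0, 2.0]
```

Samples that fit the structure reconstruct perfectly; samples that don't lose whatever the bottleneck couldn't represent, which is exactly the "bare essentials" trade-off described above.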

Autoencoders offer no tangible advantage over classical image compression algorithms like JPEG. However, as a crumb of comfort, many different variants exist that are useful for tasks other than compression.


There are denoising autoencoders that, after learning these sparse representations, can be presented with noisy images. As they more or less know what this kind of data should look like, they can help in denoising these images. That’s pretty cool for starters! What is even better is a variant called the variational autoencoder, which not only learns these sparse representations but can also draw new images. We can, for instance, ask it to create a new Mona Lisa image, and we can actually make her smile 🙂


Recurrent Neural Networks

Artificial neural networks are very useful tools that are able to learn and recognize objects in images, or learn the style of Van Gogh and paint new pictures in his style.
The “recurrent” nature of recurrent neural networks can be explained as follows. With an ordinary artificial neural network, we usually have a one-to-one relation between the input and the output: for one image input, we get one classification output, namely whether the image depicts a cat or a dog. With recurrent neural networks, we can have a one-to-many relation between the input and the output. The input would still be an image, but the output would not be a single word; it would be a sequence of words, a sentence that describes what we see in the image. For a many-to-one relation, a good example is sentiment analysis.


This means that a sequence of inputs, for instance a sentence, is classified as either negative or positive. This is very useful for processing product reviews, where we’d like to know whether the user liked or hated the product without reading whole paragraphs. And finally, recurrent neural networks can also deal with many-to-many relations by translating an input sequence into an output sequence.
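What makes all of these sequence relations possible is one small mechanism: the network keeps a hidden state that mixes the current input with what it has seen before. The sketch below uses a single scalar state and arbitrary hand-picked weights (0.5 and 0.8); real recurrent networks use vectors and learned weight matrices, but the recurrence is the same.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: the new hidden state is a squashed mix of the
    current input x and the previous hidden state h."""
    return math.tanh(w_x * x + w_h * h + b)

h = 0.0
for x in [1.0, 0.0, 0.0]:  # the input arrives only at the first step...
    h = rnn_step(x, h)
# ...yet the hidden state still remembers it two steps later (h stays positive).
```

Because the hidden state is fed back in at every step, information from earlier inputs can influence later outputs, which is exactly what a one-to-one feed-forward network cannot do.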
Examples of this are machine translation systems that take an input sentence and translate it into an output sentence in a different language. For another example of a many-to-many relation, let’s see what the algorithm learned after reading Tolstoy’s War and Peace by asking it to write in that style. It should be noted that the new text is generated letter by letter, so the algorithm is not allowed to memorize words.
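The letter-by-letter generation loop itself is easy to sketch. Here a toy bigram table built from a three-word corpus stands in for the trained network: at each step we emit one character conditioned only on the previous one, whereas a real character-level RNN conditions on its entire hidden state.

```python
import random

# Toy stand-in for the trained model: next-character counts from a tiny corpus.
corpus = "the theme then "
bigrams = {}
for a, b in zip(corpus, corpus[1:]):
    bigrams.setdefault(a, []).append(b)

def generate(seed, length, rng):
    """Emit one character at a time, each conditioned on the previous one."""
    out = seed
    for _ in range(length):
        out += rng.choice(bigrams[out[-1]])
    return out

text = generate("t", 20, random.Random(0))
print(text)  # e.g. something like "the theme the then..."
```

Even this crude model reproduces the first things the narration describes the RNN learning: letters clump into word-like runs separated by spaces, long before anything resembling grammar appears.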
Let’s take a look at the results at different stages of the training process. The initial results are, well, gibberish. But the algorithm seems to recognize immediately that words are basically a big bunch of letters separated by spaces. If we wait a bit more, we see that it starts to get a very basic understanding of structure: for instance, a quotation mark that has been opened must be closed, and a sentence can be closed by a period, which is followed by an uppercase letter. Later, it starts to learn shorter and more common words, such as fall, that, the, for, me. If we wait longer, we see that it already grasps longer words, and smaller parts of sentences actually start to make sense. Here is a piece of Shakespeare that was written by the algorithm after reading all of his works.

You see names that make sense, and you really have to check the text thoroughly to conclude that it’s indeed not the real deal. It can also try to write math papers. I had to look for quite a bit until I realized that something was fishy here. It is not unreasonable to think that it could very easily deceive a non-expert reader.


Can you believe this? This is insanity. It is also capable of learning from the source code of the Linux operating system and generating new code that looks quite sensible.
So, recurrent neural networks are really amazing tools that open up completely new horizons for solving problems where either the inputs or the outputs are not one thing, but a sequence of things.