Understanding NNs intuitively with AutoEncoders

Deepanshu Bagotia
6 min read · Nov 2, 2024


Image taken from the internet

You must have seen the above image thousands of times. But what exactly is going on with the weights, and how does the network actually learn to do what we expect? Let’s understand.

I am assuming you are familiar with things like SGD and all. You basically take an image of a dog, make a pixel vector of that image, and multiply it with a matrix (or two or more); we call them weights. Now, the output is compared with the expected output; by comparing, we mean checking the difference between the two values. But there are many ways of measuring the difference between two vectors, and what you choose defines everything your NN will learn. Finally, in the non-convex landscape of our NN’s weights, we can take the gradient and update the weights to decrease the loss, and our NN will finally learn.
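
If that paragraph feels abstract, here is roughly what one training step looks like in code. This is just a toy sketch in PyTorch; the shapes (a 32x32 image, 10 outputs) and the choice of MSE are made up for illustration, not taken from any particular setup:

```python
import torch
import torch.nn as nn

model = nn.Sequential(            # a couple of weight matrices ("or two or more")
    nn.Linear(32 * 32, 256),
    nn.ReLU(),
    nn.Linear(256, 10),           # say, 10 output values
)
loss_fn = nn.MSELoss()            # one of many ways to "check the difference"
optim = torch.optim.SGD(model.parameters(), lr=0.01)

image = torch.rand(1, 32, 32)     # stand-in for the dog photo
target = torch.zeros(1, 10)       # stand-in for the expected output
target[0, 3] = 1.0

pixels = image.view(1, -1)        # the "pixel vector"
output = model(pixels)
loss = loss_fn(output, target)

optim.zero_grad()
loss.backward()                   # gradient in the non-convex weight space
optim.step()                      # update weights to decrease the loss
```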

This was basic stuff; let’s understand AutoEncoders quickly:

Auto Encoders

When I first read about it, I judged by the name that it must be a very complex thing. But it’s not. It’s one of the most basic ideas of ML that everybody should be familiar with. You see, if you pass an image through an NN, it will produce some vector output, but what does this vector mean? Literally nothing. It’s useless. Yeah, no kidding.
What you have done is transform an image from one dimension to another. A 32x32 image of a dog made sense to me; you flattened it into a 1024-size vector (32*32), passed it through an NN, and got, say, a 256-size vector. If you convert that back into a 16x16 image, it will probably look like random pixels arranged here and there that only aliens can understand.

But that’s the beauty sometimes: you don’t need to know EXACTLY what you want, and it will still do it for you. And yeah, I am still talking about NNs, not your girlfriend. Still not convinced? Let’s take a toy example.
If your dataset is images of circles, you know that you don’t need such high-dimension images (let’s say 32x32) to store the same information. You can store a vector of 2 dimensions, the centre and radius, and you can recreate your dataset any time. How will you convey this to your girlfriend? Oh, sorry, I mean your NN. Let’s see:

image from the internet

In the above image, the Encoder and Decoder are nothing but simple NNs. Let’s say the original input is a circle image, the compressed representation is a 2-dimensional vector, and the Decoder outputs an image of the same dimensions as the original input. Now, do just one thing: compare the decoder’s output with the original input and apply a loss function that brings the two closer. Eventually, the loss will drive the NNs to learn what you want, i.e., the compressed representation will learn to capture the centre and radius from the images (because that is what lets the decoder reconstruct the circle from that 2d vector). Did you get it now?
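
Here is a rough PyTorch sketch of that Encoder/Decoder pair for the circle example. The layer sizes are made up, and a real run would of course loop over a dataset for many steps:

```python
import torch
import torch.nn as nn

latent_dim = 2                                   # the "compressed representation"

encoder = nn.Sequential(
    nn.Linear(32 * 32, 128), nn.ReLU(),
    nn.Linear(128, latent_dim),                  # hopefully something like (centre, radius)
)
decoder = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, 32 * 32), nn.Sigmoid(),       # back to pixel space
)

loss_fn = nn.MSELoss()
optim = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

circle_batch = torch.rand(16, 32 * 32)           # stand-in for flattened circle images

z = encoder(circle_batch)                        # compress
reconstruction = decoder(z)                      # decompress
loss = loss_fn(reconstruction, circle_batch)     # compare decoder output with the input

optim.zero_grad()
loss.backward()
optim.step()
```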

You can argue that the compressed 2d vector could be something else other than the centre and radius, and the decoder could still reconstruct the circle. But as humans, the (centre, radius) example was understandable. But it’s not like what we can’t see doesn’t exist; there could be some other 2d hidden vector that can express a circle, and NNs could learn that. I don’t have a math degree, so please don’t shout at me on this one.

Now, take this knowledge to the dog’s example. You asked your NN, “Bro, learn something in a 64-dim space that can express a dog photo”. You will get a compressed vector for your dog photo too. Unlike the circle example, this may or may not be a vector you can comprehend with your human brain, but it has that information; we just can’t see it (or can we?)

Where can you use Auto Encoders?

After training the whole pipeline through the decoder, you can use the encoder alone to convert images to a low-dimension space and feed that vector to an image classifier, as sketched below. Auto Encoders can also help you with feature selection, denoising, etc.
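
Continuing the toy sketch from the circle example above (reusing its encoder, latent_dim, and circle_batch), this is one hypothetical way to reuse the trained encoder as a frozen feature extractor for classification:

```python
# Freeze the trained encoder and use its low-dimensional output as features.
for p in encoder.parameters():
    p.requires_grad = False

classifier = nn.Linear(latent_dim, 10)           # e.g. 10 classes, trained separately

features = encoder(circle_batch)                 # compressed representations
logits = classifier(features)
labels = torch.randint(0, 10, (16,))             # stand-in class labels
clf_loss = nn.functional.cross_entropy(logits, labels)
```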

Now, if you came here just to get a different perspective on NNs, the blog is over. I just wanted to emphasise that NNs just spit out otherwise-useless vectors in another dimension, and what those vectors end up meaning is guided by the loss. What do you want your vector to be? Define your loss that way. (Now, this is not to say architecture is of no use; if an architecture can learn something, it will be guided by the loss; if there are architectural limitations, the loss won’t help much.) If you want to learn more about Auto Encoders and VAEs, please keep reading.

Where do Auto Encoders fail?

If I had to give you the answer in one single line, it’s simple: NNs are like your lazy friend who tries to memorize the Q-A mapping before the exam instead of understanding the logic (unless you make them learn the logic). You see, it’s easy for an encoder to map an image to a single point in latent space, and then that point gets decoded back to the image. The problem is that it’s a point; given the decoder parameters, a point can only be converted to a single image. What does that mean?

Take an autoencoder trained on some digit dataset: 1 and 7 are translated back to 1 and 7 just fine. But for an image of ‘T’, which you might expect to land somewhere between 1 and 7 (best case: it generates exactly a T), it produces something utterly random. What does that tell you? There is no generalization; only the mapping is learned (and that, too, is sharp, not smooth).

One naive solution to this problem: instead of learning points in the vector space, what if we learn a distribution for each image in the data? That would generalize better, and we could expect a different image of a number every time the decoder runs (because each time a random point is picked from the latent distribution), not just the memorized one. Also, distributions would fuse together, e.g., those of 1 and 7, and give a smooth transition instead of some random image like above.

Sounds good? I'm sorry for spoiling it for you, but don’t try this approach. It won’t work. The logic is that NNs are trained for what you ask, not what you want. With only a reconstruction loss, let’s say MSE, the encoder will collapse the distribution it is supposed to learn, i.e., it will keep the variance very low, because that way the loss is reduced more.
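
A tiny toy illustration of why the variance collapses (a made-up linear decoder, nothing trained): with only a reconstruction loss, any noise added in latent space makes the expected MSE worse, so the encoder is rewarded for shrinking the variance towards zero, i.e., collapsing back to a point.

```python
import torch

torch.manual_seed(0)
decoder = torch.nn.Linear(2, 1024)              # some fixed toy decoder
x = torch.rand(1024)                            # the "image" we want to reconstruct
mu = torch.zeros(2)                             # latent mean for this image

for sigma in [0.0, 0.1, 0.5, 1.0]:
    losses = []
    for _ in range(1000):                       # average over many samples
        z = mu + sigma * torch.randn(2)         # sample from N(mu, sigma^2)
        losses.append(torch.mean((decoder(z) - x) ** 2).item())
    print(f"sigma={sigma:.1f}  avg reconstruction MSE={sum(losses) / len(losses):.4f}")

# The average MSE is smallest at sigma=0, so pure reconstruction
# training pushes the learned variance to (near) zero.
```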

Moving Towards Variational Auto Encoders (VAEs):

Again, the NN fools you by pretending to learn a distribution (what you want) but eventually collapsing it to a point (what you asked for), because there was no condition not to. So let’s put some conditions in. What we want is for the latent space to be a distribution, so we keep a standard Gaussian as the base and add a loss that penalizes us every time we learn something that’s not near that Gaussian (for example, a KL-divergence between the Gaussian and the learned distribution). Now, this approach will not only prevent the collapse but will also keep all distributions close to each other (because our base is 0-mean for everyone). Also, this will help cover a wide area of the latent space rather than a few isolated points.
So, instead of learning the latent space vector directly from the encoder, we will learn a mean and a variance (basically learning a distribution) and then sample from it.

image from the internet
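
Here is a rough PyTorch sketch of that idea (sizes made up, training loop and other details omitted): the encoder outputs a mean and a log-variance, we sample a latent point from that distribution, and the loss is the reconstruction term plus the KL-divergence against a standard Gaussian.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """A minimal VAE sketch, not a full implementation."""
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(32 * 32, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)        # learn a mean ...
        self.to_logvar = nn.Linear(128, latent_dim)    # ... and a (log) variance
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 32 * 32), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)           # sample from the learned distribution
        return self.decoder(z), mu, logvar


def vae_loss(recon, x, mu, logvar):
    # reconstruction loss + KL( N(mu, sigma^2) || N(0, 1) )
    recon_loss = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```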

So our final loss is the reconstruction loss plus the KL-divergence loss. There are a few more important technical details we are skipping, but this blog is meant to give a high-level intuitive idea. We’ll go through the detailed math of VAEs in another blog. Till then, keep exploring.

Thanks for reading!
