How to generate node embeddings using an encoder-decoder in graphs
This is the third blog of the Graph ML Blogs series. If you have followed along so far, or your prior knowledge brings you straight here, let's begin.
As previously discussed, we can convert a network into numbers (a matrix, basically), which is more or less understandable because we design those numbers by hand (features). But then you ask yourself: are these really the best numbers I can convert my graph into, the ones that give the best results? Well, maybe not; let's find out.
What’s New?
Well, nothing. The task is the same: convert every node to a vector to use it for some downstream task. But instead of us doing all the manual labour, let's devise an algorithm that does the work for us, and we expect it to do better than our hand-designed features (why?). Because node features are hand-designed for specific tasks and do not generalize across tasks.
Node Embeddings
What are embeddings, you ask? These are basically vectors, but algorithm-generated, not hand-designed (features). We want embeddings to encode network information, like node similarity, topology, etc.
Now, to make embeddings, people use a framework called the Encoder-Decoder structure. People from an NLP background might be well aware of it, but let's quickly define it for our GraphML use.
Encoder-Decoder:
So, in our use case, the encoder will convert the nodes to individual embeddings, and then the decoder will give them a similarity score (for a pair of nodes, obviously). An analogy for understanding would be, let’s say nodes are different types of steel, and our encoder is a newbie katana maker who is converting them into some beautiful katana. And our decoder, you ask? He is none other than Mr. Masamune. He takes two Katanas in his hands and tells if they came from the same steel or not (basically giving them a similarity score).
The picture above shows exactly this. If A and B were more similar in the original network than B and C, we expect the A-B pair to get a higher similarity score than B-C. But what if Mr Masamune misjudges and gives an incorrect similarity score? We'll discuss that later in the blog. Another question you might want to ask is: what is this similarity we are talking about? So, let's formalize things a little for a better understanding.
Encoder Function: V → Rᵈ
Decoder Function: (Rᵈ, Rᵈ) → R
Where d is the embedding dimension and V is the set of nodes in the graph.
Similarity: It’s a bit subjective and depends on the use case. For example, you can define connected nodes as similar and non-connected nodes as non-similar, etc.
Now, people build encoders from learnable parameters because you want enough plasticity in your student to learn the complexity of making good swords. But what about Mr Masamune being wrong? Since the true legend Mr Masamune doesn't live among us anymore, we want our decoder to be plastic enough to learn from its own mistakes too, i.e., the decoder can also be learnable. What will they learn from? The similarity between nodes in the original network drives the learning of both the encoder and the decoder. You'll find different types of encoder-decoder pairs in the literature, depending on the use case.
Embedding
I hope the basic encoder-decoder architecture makes sense now. Let's go through one practical example to ground all these terms. For a network, a very basic encoder is an "Embedding Lookup", which is nothing but a matrix that stores an embedding for each node. A node comes in, and the matrix returns its embedding vector (equivalent to multiplying the embedding matrix with the node's one-hot vector). The decoder could be as simple as a dot product (since we are measuring the similarity between two vectors); plus, it's parameter-free. And similarity could be defined as nodes sharing an edge, a neighbourhood, a structural role, etc.
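To make this concrete, here is a minimal sketch in plain NumPy of an embedding-lookup encoder paired with a dot-product decoder; the sizes and variable names are made up for illustration, not taken from any library.

```python
import numpy as np

# A minimal sketch of the "shallow" encoder-decoder described above.
# All sizes and names here are illustrative.

num_nodes, dim = 5, 8                      # 5 nodes, 8-dimensional embeddings
Z = np.random.randn(num_nodes, dim) * 0.1  # embedding-lookup matrix, one row per node

def encode(node_id):
    # "Embedding lookup": equivalent to multiplying a one-hot node vector with Z
    return Z[node_id]

def decode(z_u, z_v):
    # Parameter-free decoder: dot-product similarity between two embeddings
    return float(np.dot(z_u, z_v))

print(decode(encode(0), encode(1)))        # similarity score for nodes 0 and 1
```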
Let’s talk about one example from this, Node2Vec. But to understand it, we need to understand Random Walks.
Random Walk Approach
We need a more general scenario than simply taking directly connected nodes to measure similarity. Here, we will try to define similarity using a random walk, and you will see why it is more general.
What is a random walk, you ask? Well, stand on a node and start walking. This is the easiest explanation I think I can come up with.
Here, the random walk from node A (the starting node) will be A-B-C-D-E. Note: some people allow the same edge or node to appear multiple times, like A-B-C-D-C. Now, coming back to similarity: given this random walk, I'll treat the visited nodes (B, C, D & E) as similar to A, and all other nodes of the network as dissimilar. Some people take multiple random walks for a more robust comparison.
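Here is a small sketch of how such walks, and the resulting "similar node" sets, could be generated, assuming the graph is stored as a plain adjacency list; the toy graph mirrors the A-E chain above, and names like walk_length and num_walks are purely illustrative.

```python
import random

# Toy adjacency list mirroring the A-B-C-D-E chain from the picture above.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}

def random_walk(graph, start, walk_length=5):
    # Uniform random walk: repeatedly hop to a random neighbour of the current node.
    walk = [start]
    for _ in range(walk_length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

def neighborhood(graph, u, num_walks=10, walk_length=5):
    # "Similarity set" N_R(u): every node that shows up on random walks from u.
    visited = set()
    for _ in range(num_walks):
        visited.update(random_walk(graph, u, walk_length))
    visited.discard(u)                      # u is trivially similar to itself
    return visited

print(random_walk(graph, "A"))              # e.g. ['A', 'B', 'C', 'D', 'C']
print(neighborhood(graph, "A"))             # nodes treated as "similar" to A
```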
Now that all the basics are done, let's generate the embeddings. The steps would be: generate "similar nodes" for every node (using random walks); initialise a random embedding for every node (a random-valued matrix for the embedding lookup, or it could be a feature matrix); and use a decoder to check how similar the embeddings of two network-similar nodes actually are. Remember Mr. Masamune checking katanas? This is the version of the algorithm where he gives his student (the encoder) feedback, and the student iteratively improves his katana-making skills; some nerds call this process backpropagation (you know why).
Now, for backpropagation, we need a loss function that can tell us whether we are doing well or not. Think about it: we have initial embeddings Zᵤ⁰ for node u and Zᵥ⁰ for node v (the superscript 0 denotes the initial embedding), and the two nodes co-occur on a random walk (which implies similarity). The embedding similarity can be measured with the dot-product decoder (and the loss function will compare it against the network similarity), but we also want to emphasise that a node is dissimilar to all the nodes that do not occur on its random walks.
So our Decoder could be a simple dot product, DEC(Zᵤ, Zᵥ) = Zᵤᵀ Zᵥ.
And we can apply a softmax over these scores for fun. Just kidding: the softmax, P(v | u) = exp(ZᵤᵀZᵥ) / Σₙ exp(ZᵤᵀZₙ) (with the sum running over all nodes n in the network), makes sure that node A comes out more similar to node B than to any other node (making the rich richer).
Note that B, C, D & E are all treated as equally similar to A. You see, this is a black-and-white scenario: either a node is similar, or it's not. Maybe in your problem statement you can define a "more-similar, less-similar" scenario, e.g., based on how often a node occurs in the random walks.
So our total loss sums the negative log of that probability over every node u and every node v that co-occurs with u on a random walk (call this set N_R(u)):
L = Σᵤ Σ_{v ∈ N_R(u)} −log P(v | u)
This is the end-to-end training: the loss function provides the gradients that eventually update the parameters. Given that the encoder is just an embedding lookup, the whole pipeline could look like this:
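Below is a compact, illustrative sketch of that pipeline, with a hand-written list of (u, v) pairs standing in for the random-walk co-occurrences and the gradients of the softmax loss written out by hand; all names, sizes and hyperparameters are made up.

```python
import numpy as np

# Illustrative end-to-end pipeline: embedding lookup -> dot-product + softmax
# decoder -> negative log-likelihood loss -> gradient step.

rng = np.random.default_rng(0)
nodes = ["A", "B", "C", "D", "E"]
idx = {n: i for i, n in enumerate(nodes)}
dim, lr = 8, 0.05
Z = rng.normal(scale=0.1, size=(len(nodes), dim))    # embedding-lookup matrix

# Pretend these (u, v) pairs co-occurred on random walks.
pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]

for epoch in range(100):
    total_loss = 0.0
    for u, v in pairs:
        u_i, v_i = idx[u], idx[v]
        scores = Z @ Z[u_i]                          # dot product of u with every node
        scores -= scores.max()                       # numerical stability
        p = np.exp(scores) / np.exp(scores).sum()    # softmax over all nodes
        total_loss += -np.log(p[v_i])                # -log P(v | u)

        # Analytic gradients of -log P(v | u) with respect to the embeddings.
        grad_u = -Z[v_i] + p @ Z                     # dL/dZ_u
        grad_all = np.outer(p, Z[u_i])               # dL/dZ_n = p_n * Z_u ...
        grad_all[v_i] -= Z[u_i]                      # ... minus Z_u for the similar node v

        Z[u_i] -= lr * grad_u
        Z -= lr * grad_all
    if epoch % 25 == 0:
        print(f"epoch {epoch}: loss {total_loss:.3f}")
```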
Now, look at the denominator term in our decoder: it has to be computed over all nodes, for every node, which is a total disaster for very large networks. The solution? Negative sampling. Basically, the denominator was there to give us a kind of dissimilarity signal, right? We can keep that, but instead of comparing "all nodes with all nodes", we compare each node with just a few randomly chosen nodes. In practice, people typically take around 5–20 negative samples.
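As a sketch, the negative-sampling version of the per-pair loss could look like the function below. Uniform sampling of a handful of negatives is assumed for simplicity; in practice negatives are usually drawn from a degree-biased distribution and the positive node is excluded.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u_i, v_i, num_neg=5, rng=None):
    # Negative sampling: score the positive pair (u, v) against a few random nodes
    # instead of normalizing over the whole network.
    rng = rng or np.random.default_rng()
    pos = np.log(sigmoid(Z[u_i] @ Z[v_i]))                 # pull the similar pair together
    neg_idx = rng.integers(0, Z.shape[0], size=num_neg)    # a few random "dissimilar" nodes
    neg = np.log(sigmoid(-Z[neg_idx] @ Z[u_i])).sum()      # push them away from u
    return -(pos + neg)                                    # minimize the negative log-likelihood
```

In the training loop sketched earlier, this kind of term could stand in for the full softmax, so the cost per pair no longer depends on the size of the network.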
Finally, this has become a standard optimization problem, which we can solve easily with (stochastic) gradient descent.
Examples
This blog has already become longer than I expected, but I still want to include an example section. I’ll try to keep it short.
Node2Vec:
This paper generates multiple random walks for each node. And what do random walks and sentences have in common? Well, a random walk can be thought of as a sentence whose words are the nodes. So they apply Word2Vec to the random walks.
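A sketch of that idea, assuming gensim (4.x) is installed and reusing the toy graph and random_walk helper from the earlier snippet; all hyperparameter values are illustrative.

```python
from gensim.models import Word2Vec

# Treat each walk as a "sentence" whose "words" are node ids.
walks = [random_walk(graph, node, walk_length=10)
         for node in graph for _ in range(20)]            # several walks per node

# Skip-gram (sg=1) Word2Vec over the walks, exactly as if they were text.
model = Word2Vec(sentences=walks, vector_size=64, window=5,
                 min_count=1, sg=1, workers=1, epochs=5)

print(model.wv["A"])                 # learned embedding for node "A"
print(model.wv.most_similar("A"))    # nodes closest to "A" in embedding space
```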
However, the novelty lies in the walk-generation process. They take a node and run multiple fixed-length random walks from it. At every step, there are three options relative to the node we just came from: return to it, move to another node at the same distance from it, or move farther away. Two hyperparameters, the return parameter p and the in-out parameter q, bias these choices. Using this biased, second-order random walk, they were able to beat the SOTA at the time.
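A rough sketch of that biased step is below, again assuming an adjacency-list graph like the one used earlier; p, q and the function names are illustrative, and a real implementation would precompute the transition probabilities instead of recomputing them at every step.

```python
import random

def biased_step(graph, prev, current, p=1.0, q=2.0):
    # node2vec-style second-order bias: the weight of each candidate neighbour x
    # depends on its distance from the node we just came from (prev).
    neighbours = graph[current]
    weights = []
    for x in neighbours:
        if x == prev:                  # distance 0: go back where we came from
            weights.append(1.0 / p)    # p is the "return" parameter
        elif x in graph[prev]:         # distance 1: stay close (BFS-like)
            weights.append(1.0)
        else:                          # distance 2: move farther away (DFS-like)
            weights.append(1.0 / q)    # q is the "in-out" parameter
    return random.choices(neighbours, weights=weights, k=1)[0]

def node2vec_walk(graph, start, walk_length=10, p=1.0, q=2.0):
    walk = [start, random.choice(graph[start])]           # the first step is unbiased
    while len(walk) < walk_length:
        walk.append(biased_step(graph, walk[-2], walk[-1], p, q))
    return walk
```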
Apart from some clever Markov-chain machinery and an extension of the algorithm to pairs of nodes (edges), that is the summary of the whole paper.
Useless Features?
Now that the Node Embedding chapter is complete, I want to drop some more knowledge bombs from my own graph experience.
The first thing is that features are 'designed' while embeddings are 'learned'. Basically, features are more task-specific, while embeddings try to capture semantic and topological information. But is there a way to 'learn' features? There are supervised feature-learning techniques, where the objective function is designed to directly increase downstream prediction accuracy, but they are not popular in practice: being tied to a single task limits their usefulness, and they are computationally demanding. Even though they can sometimes deliver better results than general-purpose embeddings, they are not used much.
Summary
Thank you very much for reading the whole blog. I hope it helps you understand how embeddings are generated in graphs using an encoder-decoder structure. Keep reading to understand the beautiful world of graphs :)