Designing Features for Graph ML

Deepanshu Bagotia
6 min read · Feb 12, 2024


This is the second blog in the Graph ML Blogs series. If you have followed along so far, or your prior knowledge has brought you straight to this point, let's begin.


How to Solve the Problem?

Now, as discussed, the broad spectrum of problems in Graph ML falls into node-level, edge-level, and graph-level tasks. So, where do we start applying ML?
If this were a standard machine learning task, we would have checked the feature set of our data, fed it to our algorithm, and DONE! But how will our algorithm (we'll discuss what it is later) eat our raw graph? So let's cook.
As you will see, people generate specific features for a specific level of problem. E.g., in a node-level prediction problem, brainstorm how to develop a node feature, then compute it for every node in the graph. This is how we get feature data that looks more like traditional ML, and designing these features is the key to achieving good results.
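Before diving into each level, here is a minimal sketch of what that end-to-end pipeline looks like once node features exist. It assumes Python with networkx and scikit-learn; the toy graph, the two features, and the labels are purely illustrative:

```python
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy graph standing in for any node-level task; labels are illustrative.
G = nx.karate_club_graph()
y = np.array([1 if G.nodes[v]["club"] == "Officer" else 0 for v in G.nodes])

# Hand-designed node features: degree and clustering coefficient.
degree = dict(G.degree())
clustering = nx.clustering(G)
X = np.array([[degree[v], clustering[v]] for v in G.nodes])

# Once every node has a feature vector, it looks like traditional ML.
clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))
```

With that framing, let's discuss each level in detail: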

Node Level

Intro:

Let's say you have a Twitter user graph, and you want to find who is a celebrity and who is an average user. Setting the node feature proportional to the node degree (follower count) should work fine.
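As a tiny, hypothetical illustration (the follower graph below is made up), the "celebrity" feature really can be just the follower count:

```python
import networkx as nx

# Hypothetical follower graph: an edge u -> v means "u follows v".
G = nx.DiGraph([("alice", "celeb"), ("bob", "celeb"), ("carol", "celeb"), ("carol", "bob")])

# One feature per node: in-degree (follower count). A simple threshold, or a
# classifier trained on this feature, separates celebrities from average users.
followers = dict(G.in_degree())
print(sorted(followers.items(), key=lambda kv: -kv[1]))  # celeb first, with 3 followers
```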
Voila! You have used feature design and solved a graph problem using ML. Unfortunately, problems in the real world are a bit tricky, and we need to design the features according to our problem's needs. Ideally, your node feature should represent/capture the graph's structure (typically local) and the node's position/importance (w.r.t. other nodes). But naively, for a Twitter user, their # of posts, age, gender, etc., are also features; how relevant they are to your specific problem is a different question.

How?

You don't always need to design the features from scratch. Some ready-made options people use are (a quick code sketch follows the list):
Node Centrality: It extends node degree by accounting for node importance. Common variants are eigenvector centrality, closeness centrality, etc.
Clustering coefficient: It measures how connected the neighbouring nodes are. As the name suggests, the stronger the cluster (more internal edges), the bigger the coefficient. Essentially, it’s counting the # of triangles the node is involved in.
Graphlets:
These count the # of graphlets the node is involved in. Graphlets are a topic in themselves, so I'll motivate you to give them some time and understand them separately.
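Here is a quick sketch of these ready-made node features using networkx; the graph is just a built-in toy example, and graphlet counts are omitted since they need dedicated tooling:

```python
import networkx as nx

G = nx.les_miserables_graph()  # toy co-occurrence graph shipped with networkx

degree      = dict(G.degree())                             # raw node degree
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)  # importance via important neighbours
closeness   = nx.closeness_centrality(G)                   # inverse average shortest-path distance
clustering  = nx.clustering(G)                             # fraction of closed triangles around the node
triangles   = nx.triangles(G)                              # raw triangle counts

# One feature vector per node; graphlet counts could be appended similarly.
features = {v: [degree[v], eigenvector[v], closeness[v], clustering[v], triangles[v]]
            for v in G.nodes}
```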

Example:

In a graph of users and items, if you want to build a recommendation system, how will you leverage the node feature techniques you have just learned? Well, a reasonable assumption is that users with high clustering coefficients are more likely to be interested in items popular within their community, while those with low values may be more open to exploring diverse recommendations. So, if we take the clustering coefficient as one of the node features, it could well turn out to be an important signal for recommendations.
Like this, node features can be exploited based on the use case.

Link Level

Intro:

Let's say we want to predict missing links in the graph or suggest new links (recommending Facebook friends, new shopping items, etc.). Here, we will design features for pairs of nodes. Training could go like this: remove some edges from the graph, compute pair features on the remaining graph, score the non-existing node pairs, and then evaluate against the edges we removed. Evaluation could be as simple as counting how many of the removed edges appear among the top-k scored predictions.
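A minimal sketch of that train-and-evaluate loop, assuming an undirected networkx graph and using a plain common-neighbour count as a stand-in scoring model:

```python
import random
import networkx as nx

G = nx.karate_club_graph()

# Hold out a fraction of edges as the "missing" links we want to recover.
random.seed(0)
held_out = random.sample(list(G.edges), k=len(G.edges) // 5)
G_train = G.copy()
G_train.remove_edges_from(held_out)

# Score every non-existing node pair on the training graph
# (a plain common-neighbour count stands in for the model here).
scores = {(u, v): len(list(nx.common_neighbors(G_train, u, v)))
          for u, v in nx.non_edges(G_train)}

# Evaluate: how many held-out edges show up among the top-k scored pairs?
k = len(held_out)
top_k = sorted(scores, key=scores.get, reverse=True)[:k]
held = {frozenset(e) for e in held_out}
hits = sum(frozenset(p) in held for p in top_k)
print(f"{hits}/{k} held-out edges recovered in top-{k} predictions")
```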

How?

Let's try to think about how we can make some link-level features. After studying node-level features, a naive idea would be to aggregate both nodes' features for every node pair. Voila! You have created very basic link-level features. Using these features, you may suggest LinkedIn connections such as people from the same college, because their node-level features match. But what about a friend of a friend? Ah… that will not work.
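For what it's worth, that naive aggregation could look like the sketch below; the particular combinations (sum, absolute difference, element-wise product) are common conventions, not the only options:

```python
import numpy as np

def pair_features(x_u: np.ndarray, x_v: np.ndarray) -> np.ndarray:
    """Naive link-level features built only from the two nodes' feature vectors."""
    return np.concatenate([x_u + x_v,          # symmetric sum
                           np.abs(x_u - x_v),  # how different the two nodes are
                           x_u * x_v])         # element-wise (Hadamard) product
```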
So, let's see a few standard approaches people generally use (a short code sketch follows the list):
Distance-based features: For a pair of nodes, the feature is based on the shortest-path distance between them. This can be helpful for sparse graphs. However, it does not capture the degree of neighbourhood overlap.
Local Neighbourhood Overlap: Here, we use the shared neighbours of the two nodes to make features; standard examples are the # of common neighbours, Jaccard's coefficient, the Adamic-Adar index, etc. But if there are no common neighbours, these approaches fail.
Global Neighbourhood Overlap: Local features completely ignore neighbours that are 2, 3, or more hops away by giving them 0 weight, even though such pairs can still be potential future links. So here, we build features using the Katz index, which counts the (decayed) number of walks of all lengths between a given pair of nodes.
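A short code sketch of these pair features, assuming networkx and numpy; the node pairs are arbitrary, and the decay factor beta for the Katz index is an illustrative choice:

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
pairs = [(0, 33), (5, 16)]  # arbitrary node pairs, just for illustration

# Local neighbourhood overlap: ready-made in networkx.
common  = {(u, v): len(list(nx.common_neighbors(G, u, v))) for u, v in pairs}
jaccard = list(nx.jaccard_coefficient(G, pairs))
adamic  = list(nx.adamic_adar_index(G, pairs))

# Global neighbourhood overlap: Katz index, counting walks of all lengths with
# decay beta: S = sum_{l>=1} beta^l * A^l = (I - beta*A)^{-1} - I,
# which converges when beta < 1 / lambda_max(A).
A = nx.to_numpy_array(G)
beta = 0.9 / max(abs(np.linalg.eigvals(A)))
S = np.linalg.inv(np.eye(len(A)) - beta * A) - np.eye(len(A))
katz = {(u, v): S[u, v] for u, v in pairs}
```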

Example:

How can we take advantage of what we just studied in a social media graph if we want to suggest new user-to-user connections? Well, a very basic idea would be to add an extra feature, "# of common neighbours," to each pair's existing features. It will certainly help with new friend suggestions, even if it doesn't dramatically improve them, because, at the core, we believe that if two users share many common friends, they probably know each other.

Graph Level

Intro:

Sometimes, we want to build graph-level features to compare different protein structures, for molecular graph generation, etc. How do you make a feature that can capture the whole graph's information? Let's start thinking from the basics. A very naive idea would be to aggregate all the node-level features we made earlier, but that would neglect the graph topology. What if we take both node-level and edge-level features? Well, that might help, but it'll still miss spatial relationships and interdependencies, which I think you'll understand by the end.

How?

Now, in the first place, why do you need graph-level features? To compare graphs, right? Arguably and broadly, graph comparison algorithms can be set-based, frequent-subgraph-based, or kernel-based. Set-based methods are computationally efficient, even if not the most expressive. Frequent-subgraph-based methods are theoretically sound but lag computationally, which keeps them from taking centre stage. Kernel-based approaches sit between these two. I'll motivate you to dedicate some time to kernels separately. You can read this or any other source if you want. I can give a general intro here:

Kernels: These are functions from (G, G') → R, where G and G' are graphs. Ideally, we want two almost-similar graphs to be close in the kernel space and vice versa. But how are you going to take care of that? Well, here comes the notion of a Complete Graph Kernel: if a graph kernel is complete, the distance it induces is 0 for two isomorphic graphs and non-zero for non-isomorphic graphs. So, intuitively, a non-complete graph kernel can be problematic because it may map two different graphs to the same location. But computing a complete graph kernel is hard; it is at least as hard as deciding graph isomorphism.
But don't get sad; someday, NP may become P. Until then, the problem has been a trade-off between completeness and complexity (you don't want exponential computation time, but you do want to distinguish non-isomorphic graphs).
Traditionally, you must have seen kernels like K(G, G') = ϕ(G)·ϕ(G'). This ϕ takes a graph as input and outputs a low-dimensional vector. In this vector space, embeddings of similar graph structures should be close, and vice versa. And our age-old friend, the dot product, is used as the similarity function.
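As a toy illustration (not any particular kernel from the literature), ϕ could be something as simple as a normalized degree histogram, with the dot product acting as the similarity:

```python
import networkx as nx
import numpy as np

def phi(G, max_degree=10):
    """Toy feature map: normalized histogram of (capped) node degrees."""
    degs = [min(d, max_degree) for _, d in G.degree()]
    hist = np.bincount(degs, minlength=max_degree + 1).astype(float)
    return hist / hist.sum()

def kernel(G1, G2):
    """K(G, G') = phi(G) . phi(G')."""
    return float(np.dot(phi(G1), phi(G2)))

print(kernel(nx.cycle_graph(10), nx.cycle_graph(12)))  # structurally similar -> 1.0
print(kernel(nx.cycle_graph(10), nx.star_graph(10)))   # very different      -> 0.0
```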


Graphlets: The simplest definition is the set of possible k-node subgraph configurations (including disconnected ones, unlike node-level graphlets). Now think of it this way: comparing the graphlet frequency distributions of two graphs gives a reasonable estimate of how similar they are because, essentially, it tells us how the graphs compare at the scale of k nodes. If we do this for large enough k, we can be quite confident about the graph similarity, but that comes with a very expensive computation cost. Since graphlets represent graphs quite effectively, why not use them with kernels? Well, people do. But for starters, I recommend reading this.
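Here is a minimal sketch of the 3-node version of this idea: count how often each 3-node configuration (no edge, one edge, a path, a triangle) appears among all node triples, and use the normalized counts as the graph's graphlet feature vector, which could then be plugged into the dot-product kernel above (the brute-force enumeration is only feasible for small toy graphs):

```python
import itertools
import networkx as nx
import numpy as np

def graphlet3_distribution(G):
    """Frequencies of the four 3-node graphlets: 0, 1, 2, or 3 internal edges."""
    counts = np.zeros(4)
    for trio in itertools.combinations(G.nodes, 3):
        counts[G.subgraph(trio).number_of_edges()] += 1
    return counts / counts.sum()

# Structurally similar graphs give similar graphlet distributions.
print(graphlet3_distribution(nx.cycle_graph(8)))
print(graphlet3_distribution(nx.cycle_graph(9)))
```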

Examples:

In a financial graph of cards and merchants, we can compare graph structures to detect odd ones, flagging anomalies or suspicious transactions. We can also compare protein structures with kernel methods very efficiently.

Summary

That's it from my side on designing graph features for use in ML. Don't forget to do your own research on these topics for extra knowledge, and please point out anything wrong so I can correct it. Thanks a lot for reading.
