Introducing Graphs to ML
This is the first blog of the Graph ML Blogs series. It’s a basic starting guide for beginners to understand the graph world.
So, without any further ado, let’s dive into the field of graphs and apply ML.
What are Graphs?
I mean, let’s quickly clarify this before starting the actual topic.
Philosophy: Graphs are natural structures that exist in nature, or at least that’s what we believe because things naturally seem to live this way. We naturally generate and store information in them and even use them daily without noticing much.
Formally: Graph G is (V, E) where V is the set of nodes and E is the set of edges. The edge going from node u ∈ V to v ∈ V is defined as (u,v) ∈ E. For undirected (u,v), it is the same as (v,u).
An Easy way to represent graphs is through adjacency matrix A ∈ Rˣ where x ∈ |v| ×|v|
Graph G is (V, E) where V is the set of nodes and E is the set of edges.
Type of Graphs
Graphs are of many types based on the # of vertices, edges, and their overall structure.
Null Graph: Graph with no edge.
Trivial Graph: One vertex graph.
Simple Graph: No Parallel edge and no loops.
Undirected Graph: Edges do not have directions.
Directed Graph: Edges have directions.
Complete Graph: Every node is connected to every other node (excluding itself).
Connected Graph: A path exists between every two nodes.
Disconnected Graph: A path does not exist between every pair of nodes.
Regular Graph: All nodes have the same degree (number of incoming edges on a node).
Cyclic Graph: Graph having at least one cycle.
Acyclic Graph: Graph having zero cycles.
Bipartite Graph: Two sets of nodes have inter-set connections but no intra-set connection.
Weighted Graph: Where Edges carry some weights.
Apart from these basic categories, another category, called Multi-Relation Graphs, is based on a graph’s different types of edges.
Heterogeneous Graph: Here, we have different types of nodes and edges. E.g., a Twitter graph of users and their feed suggestions. The user-to-user edge will represent whether they follow each other, and the user-to-feed edge will represent the content they love to consume.
Multipartite is a special case of heterogeneous graph where edges connect only different types of nodes.
Multiplex Graph: A multiplex graph has multiple layers representing distinct relationships. Nodes remain constant across layers, but their connections vary. Inter-layer edges connect nodes, potentially across different edge types within the same layer, capturing multi-dimensional interactions.
For example, take all the cities of a state and arrange them in multiple layers. Now, within each layer, edges will represent connectivity via train in one layer and road in another. And inter-layer connections will connect only the same cities.
Feature Information: We have seen how a graph’s edges can represent different things because they connect different nodes. So, what are the different nodes? These different nodes are how we store information, or formally node-level attributes. For, in a biomedical graph of proteins, each node can store the protein name, its class, its structural information, and many more things.
Problems in Graph ML
Now that we have a basic background of graphs let’s see how ML can be used on graphs. One thing before starting: Graph ML is slightly different from traditional ML approaches. Algorithms can not always be directly replicated on them. Can you think of the reason why?
Well, the biggest reason is probably that traditional ML assumes IID behaviour, which Graphs usually don’t follow. Let’s see what kind of problem people try to solve using ML in Graphs.
Node Classification:
Problem: You have a Twitter graph of users. Some are labelled (and some are not) users, and others are bots. For the rest of the graph, you want to classify the whole graph into users and bots. It seems like an easy problem. Well, maybe not.
Solution: Firstly, we can not treat it as a standard supervised classification problem because the data points are not IID. So, instead of modelling an independent set problem, we now model an interconnected set of nodes. That is, GraphML models leverage graph connections to form GraphML algorithms.
One idea is homophily, the tendency for nodes to share attributes with other nodes. Another is structural equivalence, which is the idea that nodes with similar labels will share their local structure. And there are many more.
Relations Prediction:
Problem: Again, you have a Twitter graph of users and the content they consume; how would you suggest new content to them? One way to formalize this is by using a graph relations prediction problem. When a person is buying bread, we expect our model to predict the next relation between the person node and the milk node.
Solution: It is highly dependent on the type of graph. Whether it’s a simple Twitter graph, a complex biomedical graph, a single fixed graph, or multiple disjoint graphs, graph algorithms use the inductive bias of that particular graph structure.
Clustering and community detection
Problem: Again, if we only take a Twitter user’s graph and edges denote who you follow, it will not be one single big graph, but most likely, it will be a collection of many small graphs because everyone has their circle (with maybe some familiar celebrity). But we want to find these ‘circles’ or clusters.
Solution: Now, we are jumping to the unsupervised equivalent of graph ML. Here, we will find the clusters from the same group. Some real-world applications include uncovering functional modules in genetic interaction networks, uncovering fraudulent groups of users in financial transaction networks, etc.
Now there are more problems like Graph Classification, Regression, and Clustering. Which sounds more straightforward to traditional ML than other tasks we discussed. But as discussed, the datastructure is very different, and so are the algorithms.
Conclusion
This was the ‘hello world’ of the vast Graph ML field. But yeah, ‘hello world’ is also essential. Thanks for reading. I hope you find it useful in some way or another.