node2vec: Embeddings for Graph Data

Source: https://towardsdatascience.com/node2vec-embeddings-for-graph-data-32a866340fef

Motivation

Embeddings… A word that every data scientist has heard by now, but mostly in the context of NLP.
So why do we even bother embedding stuff?
As I see it, creating quality embeddings and feeding them into models is the exact opposite of the famous saying "garbage in, garbage out".
When you feed low quality data into your models, you put the entire load of learning on your model, as it will have to learn all the necessary conclusions that could be derived from the data.
On the contrary, when you use quality embeddings, you already put some knowledge in your data and thus make the task of learning the problem easier for your models.
Another point to think about is information vs domain knowledge.
For example, let’s consider word embeddings (word2vec) and bag of words representations.
While both of them can have the entire information about which words are in a sentence, word embeddings also include domain knowledge like relationship between words and such.
In this post, I’m going to talk about a technique called node2vec which aims to create embeddings for nodes in a graph (in the G(V, E, W) sense of the word).


I will explain how it works and finally supply my own implementation for Python 3, with some extras.

Embedding process

So how is it done?
The embeddings themselves are learnt in the same way as word2vec's embeddings are learnt — using a skip-gram model.
If you are familiar with the word2vec skip-gram model, great; if not, I recommend this great post, which explains it in great detail. From this point forward I assume you are familiar with it.

The most natural way I can think about explaining node2vec is to explain how node2vec generates a “corpus” — and if we understand word2vec we already know how to embed a corpus.

So how do we generate this corpus from a graph? That is exactly the innovative part of node2vec, and it does so in an intelligent way using a sampling strategy.

In order to generate our corpus from the input graph, let's think about a corpus as a group of directed acyclic graphs with a maximum out-degree of 1. If we think about it, this is a perfect representation for a text sentence, where each word in the sentence is a node that points to the next word in the sentence.


Sentence in a graph representation
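To make this concrete, here is a minimal sketch (using networkx, which also appears later in the post) of a sentence represented as a directed path graph with maximum out-degree 1. The example sentence itself is arbitrary:

```python
import networkx as nx

# A sentence as a directed acyclic graph: each word points to the next one
sentence = "node embeddings capture graph structure".split()
graph = nx.DiGraph()
graph.add_edges_from(zip(sentence, sentence[1:]))

# Every node has at most one outgoing edge, just like a word in a sentence
assert max(dict(graph.out_degree()).values()) == 1
```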

In this way, we can see that word2vec can already embed graphs, but only a very specific type of them.
Most graphs, though, aren't that simple; they can be (un)directed, (un)weighted, (a)cyclic, and are basically much more complex in structure than text.

In order to solve that, node2vec uses a tweakable (by hyperparameters) sampling strategy to sample these directed acyclic subgraphs. This is done by generating random walks from each node of the graph. Quite simple, right?

Before we delve into how the sampling strategy uses the hyperparameters to generate these subgraphs, let's visualize the process:


Node2vec embedding process

Sampling strategy

By now we get the big picture and it’s time to dig deeper.
Node2vec's sampling strategy accepts 4 arguments:
— Number of walks: Number of random walks to be generated from each node in the graph
— Walk length: How many nodes are in each random walk
— P: Return hyperparameter
— Q: In-out hyperparameter
and also the standard skip-gram parameters (context window size, number of iterations etc.)

The first two hyperparameters are pretty self explanatory.
The algorithm for the random walk generation will go over each node in the graph and will generate <number of walks> random walks, of length <walk length>.
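Sketched in plain Python, that generation loop could look like this (a simplified, unbiased version; the actual node2vec walk also biases each step using P and Q):

```python
import random
import networkx as nx

def generate_walks(graph, num_walks, walk_length):
    """Generate <num_walks> random walks of up to <walk_length> nodes from each node.

    Simplified sketch: each step is chosen uniformly, without the P/Q bias."""
    walks = []
    for _ in range(num_walks):
        for start in graph.nodes():
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:  # dead end, stop this walk early
                    break
                walk.append(random.choice(neighbors))
            walks.append(walk)
    return walks
```

Each walk then plays the role of a "sentence" for the skip-gram model.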
Q and P are better explained with a visualization.
Consider you are on the random walk, and have just transitioned from node <t> to node <v> in the following diagram (taken from the article).


The probability to transition from <v> to any one of its neighbors is
<edge weight>*<α> (normalized), where <α> depends on the hyperparameters.
P controls the probability to go back to <t> after visiting <v>.
Q controls the probability to go explore undiscovered parts of the graph.
Intuitively, this is somewhat like the perplexity parameter in tSNE; it allows you to emphasize the local/global structure of the graph.
Do not forget that the weight is also taken into consideration, so the final travel probability is a function of:
1. The previous node in the walk
2. P and Q
3. Edge weight


This part is important to understand as it is the essence of node2vec. If you did not fully comprehend the idea behind the sampling strategy I strongly advise you to read this part again.

Using the sampling strategy, node2vec will generate "sentences" (the directed subgraphs) which will be used for embedding, just like text sentences are used in word2vec. Why change something if it works, right?


Code (showcase)

Now it's time to put node2vec into action.
You can find the entire code for this node2vec test drive here.
For the example, I am using my implementation of the node2vec algorithm, which adds support for assigning node-specific parameters (q, p, num_walks and walk length).

What we are going to do, using formation of European football teams, is to embed the teams, players and positions of 7 different clubs.
The data I’m going to be using is taken from the FIFA 17 dataset on Kaggle.
In FIFA (by EASports) each team can be represented as a graph, see picture below.

Formation example from FIFA17, easily interpreted as a graph

As we can see, each position is connected to other positions, and when playing, each position is assigned a player.
There are dozens of different formations, and the connectivity between them differs. Also, there are positions that exist in some formations but not in others; for example, the 'LM' position does not exist in this formation but does in others.

This is how we are going to do this:
1. Nodes will be players, team names and positions
2. For each team, create a separate graph where each player node is connected to his team name node, connected to his teammates nodes and connected to his teammate position nodes.
3. Apply node2vec to the resulting graphs

*Notice: In order to create separate nodes for each position inside and between teams, I added suffixes to similar nodes, and after the walk generation I removed them. This is a technicality; inspect the code in the repo for a better understanding.
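The three steps can be sketched like this, using a few hypothetical rows of (player, position, team) in place of the full FIFA 17 data:

```python
import networkx as nx

# Hypothetical sample rows: (player, position, team)
rows = [
    ('toni_kroos', 'cm_1', 'real_madrid'),
    ('luka_modric', 'cm_2', 'real_madrid'),
    ('marcelo', 'lb', 'real_madrid'),
]

graph = nx.Graph()
for player, position, team in rows:
    graph.add_edge(player, team)                  # player -- team name node
    graph.add_edge(player, f'{position}_{team}')  # player -- his position (suffixed)
    for teammate, teammate_pos, _ in rows:
        if teammate != player:
            graph.add_edge(player, teammate)                  # player -- teammate
            graph.add_edge(player, f'{teammate_pos}_{team}')  # player -- teammate position
```

In the real code, one such graph is built per team using the position connectivity of the actual formation, and node2vec is then applied to the result.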

The first rows of the input data look like this (after some permutations):


Sample rows from the input data

Then we construct the graph, using the FIFA17 formations.
Using my node2vec package, the graph must be an instance of networkx.Graph.
Inspecting the graph edges after this, we will get the following:

for edge in graph.edges:
    print(edge)

>>> ('james_rodriguez', 'real_madrid')
>>> ('james_rodriguez', 'cm_1_real_madrid')
>>> ('james_rodriguez', 'toni_kroos')
>>> ('james_rodriguez', 'cm_2_real_madrid')
>>> ('james_rodriguez', 'luka_modric')
>>> ('lw_real_madrid', 'cm_1_real_madrid')
>>> ('lw_real_madrid', 'lb_real_madrid')
>>> ('lw_real_madrid', 'toni_kroos')
>>> ('lw_real_madrid', 'marcelo')
...

As we can see, each player is connected to his team, the positions and teammates according to the formation.
All of the suffixes attached to the positions will be returned to their original string after the walks are computed (lw_real_madrid → lw).

So now that we have the graph, we execute node2vec:

# pip install node2vec
from node2vec import Node2Vec

# Generate walks
node2vec = Node2Vec(graph, dimensions=20, walk_length=16, num_walks=100)

# Reformat position nodes
fix_formatted_positions = lambda x: x.split('_')[0] if x in formatted_positions else x
reformatted_walks = [list(map(fix_formatted_positions, walk)) for walk in node2vec.walks]
node2vec.walks = reformatted_walks

# Learn embeddings
model = node2vec.fit(window=10, min_count=1)

We give node2vec.Node2Vec a networkx.Graph instance, and after using .fit() (which accepts any parameter accepted by gensim.models.Word2Vec) we get in return a gensim.models.Word2Vec instance.

First we will inspect the similarity between different nodes.
We expect the most similar nodes to a team, would be its teammates:

for node, _ in model.most_similar('real_madrid'):
    print(node)

>>> james_rodriguez
>>> luka_modric
>>> marcelo
>>> karim_benzema
>>> cristiano_ronaldo
>>> pepe
>>> gareth_bale
>>> sergio_ramos
>>> carvajal
>>> toni_kroos

For those who are not familiar with European football, these are all indeed Real Madrid’s players!

Next, we inspect similarities to a specific position. We would expect to get players playing in that position, or near it at worst:

# Right wingers
for node, _ in model.most_similar('rw'):
    # Show only players
    if len(node) > 3:
        print(node)

>>> pedro
>>> jose_callejon
>>> raheem_sterling
>>> henrikh_mkhitaryan
>>> gareth_bale
>>> dries_mertens

# Goalkeepers
for node, _ in model.most_similar('gk'):
    # Show only players
    if len(node) > 3:
        print(node)

>>> thibaut_courtois
>>> gianluigi_buffon
>>> keylor_navas
>>> azpilicueta
>>> manuel_neuer

In the first try (right wingers) we indeed get different right wingers from different clubs; again, a perfect match.
In the second try, though, we get all goalkeepers except Azpilicueta, who is actually a defender. This could be due to the fact that goalkeepers are not very connected to the team, usually only to the central backs.

Works pretty well, right? Just before we finish, let's use tSNE to reduce dimensionality and visualize the player nodes.
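For reference, the reduction step might look roughly like this, sketched with scikit-learn's TSNE and random stand-in vectors (the real code would use the fitted model's vectors, model.wv[node], for the player nodes):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in embeddings: in the real code these come from model.wv[node]
player_nodes = ['toni_kroos', 'luka_modric', 'marcelo', 'pepe', 'sergio_ramos']
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(player_nodes), 20))  # dimensions=20, as above

# Reduce to 2D for a scatter plot; perplexity must be smaller than the sample count
coords = TSNE(n_components=2, perplexity=3, init='random', random_state=0).fit_transform(embeddings)
```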


Visualization of player nodes (tSNE reduced dimensionality)

Check it out, we get beautiful clusters based on the different clubs.


Final Words

Graph data is almost everywhere, and where it's not, you can usually put your data on a graph; yet the node2vec algorithm is not so popular.
The algorithm also grants great flexibility with its hyperparameters, so you can decide which kind of information you wish to embed, and if you have the option to construct the graph yourself (and it is not a given), your options are limitless.
Hopefully you will find this article useful and have added a new tool to your machine learning arsenal.

If someone wants to contribute to my node2vec implementation, please contact me.

