We demonstrate how to use CSV data to build a knowledge graph and then learn node embeddings that capture the similarity between individuals. The process has two major stages. First, we construct an undirected knowledge graph from our CSV files, where nodes represent people and their attributes, and edges connect each person to their attributes, weighted by the importance of each attribute. Then we learn vector embeddings for these nodes and reposition the person nodes in 2D space so that similar individuals sit closer together, reflecting the similarities derived from the embeddings.
Here is the sample data set that we will use for the graphs that follow. data.csv contains details such as name, education, employer, location, hobbies, etc., while weights.csv defines the relative importance of each field.
| | name | undergrad | grad | employer | location | hobbies | drinks |
|---|---|---|---|---|---|---|---|
| 0 | John | MIT | Yale | | Colorado | skiing | espresso |
| 1 | Janice | nan | nan | Freelancer | California | skiing | coldbrew |
| 2 | Alice | Stanford | Northeastern | | Massachusetts | cycling;reading | tea |
| 3 | Bob | Harvard | nan | Amazon | Massachusetts | running;cooking | rum |
| 4 | Charlie | MIT | Harvard | | Washington | skiing | espresso |
| 5 | Joe | BC | nan | Amazon | Massachusetts | running;cooking | rum |
| 6 | Chuck | BU | Harvard | | Washington | skiing | espresso |
| 7 | Aime | BU | nan | Kohls | Florida | TV | water |
| | field | weight |
|---|---|---|
| 0 | undergrad | 5.000000 |
| 1 | grad | 5.000000 |
| 2 | employer | 5.000000 |
| 3 | location | 10.000000 |
| 4 | hobbies | 3.000000 |
| 5 | drinks | 0.500000 |
Using the information from data.csv and weights.csv, we create an undirected graph.
In this graph, each person and each trait (e.g., school, employer, hobby) is represented as a node.
Edges connect people to their corresponding traits, with each edge weight reflecting the importance of that relationship. At this point the physical distance between nodes is arbitrary, and closeness can only be understood by examining how the nodes are connected.
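The graph construction described above can be sketched with the standard library alone. This is a minimal illustration, not the code behind the figures: the function name `build_graph`, the `(field, value)` node keys, and the rule of splitting a cell on `;` into several traits are all assumptions. It uses four rows from the sample table, with the `nan` cells written as empty cells.

```python
import csv
import io
from collections import defaultdict

# Four rows copied from the sample table; missing values are empty cells.
DATA_CSV = """name,undergrad,grad,employer,location,hobbies,drinks
Janice,,,Freelancer,California,skiing,coldbrew
Bob,Harvard,,Amazon,Massachusetts,running;cooking,rum
Joe,BC,,Amazon,Massachusetts,running;cooking,rum
Aime,BU,,Kohls,Florida,TV,water
"""

WEIGHTS_CSV = """field,weight
undergrad,5
grad,5
employer,5
location,10
hobbies,3
drinks,0.5
"""


def build_graph(data_text, weights_text):
    """Return {node: {neighbor: weight}} for an undirected person-trait graph."""
    weights = {
        row["field"]: float(row["weight"])
        for row in csv.DictReader(io.StringIO(weights_text))
    }
    graph = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(data_text)):
        person = ("person", row["name"])
        graph[person]  # register the person even if every trait is missing
        for field, weight in weights.items():
            # Assumption: ";" separates multiple values in one cell.
            values = (v.strip() for v in row[field].split(";"))
            for value in filter(None, values):  # skip empty (missing) cells
                trait = (field, value)
                # Undirected edge: store it in both directions.
                graph[person][trait] = weight
                graph[trait][person] = weight
    return dict(graph)


graph = build_graph(DATA_CSV, WEIGHTS_CSV)
bob = graph[("person", "Bob")]
joe = graph[("person", "Joe")]
shared = set(bob) & set(joe)
shared_weight = sum(bob[trait] for trait in shared)
print(sorted(shared), shared_weight)
```

In this subset Bob and Joe share five trait nodes (employer, location, two hobbies, and drink), with a combined edge weight of 5 + 10 + 3 + 3 + 0.5 = 21.5. Keying trait nodes by `(field, value)` keeps an undergraduate school and a graduate school with the same name as separate nodes; merging them into one node would be an equally plausible design.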
After constructing the graph, we apply a machine learning process to learn a vector embedding for every node. These embeddings capture the similarity between nodes, so that individuals with similar attributes have embeddings that are close together. We then reposition the person nodes using multidimensional scaling (MDS), resulting in a 2D layout where similar people are closer. Hovering over any person node in the graph reveals a list of other individuals sorted by similarity score, with 1 being identical. In this example, where we put a heavy weight on location, we see that Alice, Joe, and Bob, who all live in the same state, are very close together. Aime, who only shares a trait with Chuck, is far away from the pack.
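The text does not name the embedding algorithm or the MDS implementation, so the following is only a sketch of the last two steps under stated assumptions. In place of learned embeddings it uses a plain weighted trait vector per person (a stand-in, not the method used for the figures), scores pairs with cosine similarity so that 1 means identical, and lays the people out with classical (Torgerson) MDS implemented in NumPy. The names `trait_vectors`, `cosine_similarity`, `classical_mds`, and `most_similar` are hypothetical.

```python
import numpy as np

WEIGHTS = {"undergrad": 5, "employer": 5, "location": 10, "hobbies": 3, "drinks": 0.5}

# Four rows from the sample table; missing fields are simply omitted.
PEOPLE = {
    "Janice": {"employer": ["Freelancer"], "location": ["California"],
               "hobbies": ["skiing"], "drinks": ["coldbrew"]},
    "Bob": {"undergrad": ["Harvard"], "employer": ["Amazon"],
            "location": ["Massachusetts"], "hobbies": ["running", "cooking"],
            "drinks": ["rum"]},
    "Joe": {"undergrad": ["BC"], "employer": ["Amazon"],
            "location": ["Massachusetts"], "hobbies": ["running", "cooking"],
            "drinks": ["rum"]},
    "Aime": {"undergrad": ["BU"], "employer": ["Kohls"], "location": ["Florida"],
             "hobbies": ["TV"], "drinks": ["water"]},
}


def trait_vectors(people, weights):
    """Stand-in embeddings: one column per (field, value), filled with the field weight."""
    names = sorted(people)
    keys = sorted({(f, v) for traits in people.values()
                   for f, values in traits.items() for v in values})
    column = {key: i for i, key in enumerate(keys)}
    X = np.zeros((len(names), len(keys)))
    for r, name in enumerate(names):
        for field, values in people[name].items():
            for value in values:
                X[r, column[(field, value)]] = weights[field]
    return names, X


def cosine_similarity(X):
    """Pairwise cosine similarity; assumes no row is all zeros."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T


def classical_mds(dist, dims=2):
    """Classical MDS: double-center the squared distances, keep the top eigenvectors."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dist ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:dims]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0, None))


names, X = trait_vectors(PEOPLE, WEIGHTS)
sim = cosine_similarity(X)
coords = classical_mds(1.0 - sim)  # treat 1 - similarity as a distance


def most_similar(name):
    """Other people sorted by similarity to `name`, highest first."""
    i = names.index(name)
    order = np.argsort(-sim[i])
    return [(names[j], float(sim[i, j])) for j in order if j != i]


print(most_similar("Bob"))
print(coords)
```

With these stand-in vectors, Bob and Joe differ only in their undergraduate school, so their cosine similarity is 143.25 / 168.25, roughly 0.85, and they land almost on top of each other in the 2D layout. Learned embeddings would replace the matrix `X`; the similarity and MDS steps would stay the same.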
In summary, this visualization demonstrates the complete pipeline: CSV Data → Knowledge Graph → Node Embeddings → Interactive Visualization. The undirected graph shows the raw connections between people and their traits, while the embeddings graph provides a refined view based on learned similarities. Hover over nodes to explore the detailed similarity metrics.