The following Python script downloads the ogbn-arxiv
(i.e., arXiv CS MAG) knowledge graph from the Open Graph
Benchmark (OGB).
The ogbn-arxiv
dataset is described by OGB as
follows:
“The
ogbn-arxiv
dataset is a directed graph, representing the citation network between all Computer Science (CS) arXiv papers indexed by MAG [1]. Each node is an arXiv paper and each directed edge indicates that one paper cites another one. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are computed by running the skip-gram model [2] over the MAG corpus. We also provide the mapping from MAG paper IDs into the raw texts of titles and abstracts here. In addition, all papers are also associated with the year that the corresponding paper was published.”The task is to predict the 40 subject areas of arXiv CS papers, e.g., cs.AI, cs.LG, and cs.OS, which are manually determined (i.e., labeled) by the paper’s authors and arXiv moderators. With the volume of scientific publications doubling every 12 years over the past century, it is practically important to automatically classify each publication’s areas and topics. Formally, the task is to predict the primary categories of the arXiv papers, which is formulated as a 40-class classification problem.
import os
import pandas as pd
from ogb.nodeproppred import NodePropPredDataset
= NodePropPredDataset(name = "ogbn-arxiv")
dataset
= dataset.get_idx_split()
split_idx = split_idx["train"], split_idx["valid"], split_idx["test"]
train_idx, valid_idx, test_idx = dataset[0] graph, label
Read the label ID to arXiv category and node ID to MAG paper ID mappings.
= pd.read_csv("dataset/ogbn_arxiv/mapping/labelidx2arxivcategeory.csv.gz", compression = "gzip")
label_to_category = pd.read_csv("dataset/ogbn_arxiv/mapping/nodeidx2paperid.csv.gz", compression = "gzip")
nodeid_to_paperid nodeid_to_paperid.head()
Clean arXiv category labels to conform to arXiv notation.
'arxiv category'] = label_to_category['arxiv category'].str.replace("arxiv ", "")
label_to_category['arxiv category'] = label_to_category['arxiv category'].str.replace(" ", ".")
label_to_category['arxiv category'] = label_to_category['arxiv category'].str.upper()
label_to_category[ label_to_category.head()
Read mapping from MAG paper ID to paper titles and abstracts using
file provided by OGB in the ogbn-arxiv
description.
= pd.read_csv("dataset/titleabs.tsv", sep = "\t", names = ["Node", "NodeTitle", "NodeAbstract"])
title_abs title_abs.head()
Construct the node list as a DataFrame
(see
graph.keys()
for accessing elements in the
graph
dictionary). Abstracts are not included to save
space. Map between OGB node ID and MAG paper ID; then, use the MAG paper
ID to retrieve the title and abstract. Also map between OGB category
label and arXiv category label.
= []
node_list_of_lists
for node_id in range(graph["num_nodes"]):
# get node values from graph
= label[node_id][0]
node_type = graph["node_year"][node_id][0]
node_year
# retrieve MAG ID
= nodeid_to_paperid.loc[nodeid_to_paperid["node idx"] == node_id, "paper id"].values[0]
node_magid
# get title and abstract
= title_abs.loc[title_abs["Node"] == node_magid, "NodeTitle"].values[0]
node_title # node_abstract = title_abs.loc[title_abs["Node"] == node_magid, "NodeAbstract"].values[0]
# get category
= label_to_category.loc[label_to_category["label idx"] == node_type, "arxiv category"].values[0]
node_category
# add row to data frame
# node_abstract,
node_list_of_lists.append([node_id, node_type, node_year, node_magid, node_title, node_category])
= pd.DataFrame(node_list_of_lists, columns=["Node", "NodeType", "NodeYear", "NodeMAGID", "NodeTitle", "NodeCategory"]) # "NodeAbstract",
node_list node_list.head()
Similarly, construct the edge list as a DataFrame
. In
addition to origin and destination node IDs, also retrieve origin and
destination node types.
= []
edge_list_of_lists
for edge_index in range(len(graph["edge_index"][0])):
# get edge values from graph
= graph["edge_index"][0][edge_index]
edge_origin = graph["edge_index"][1][edge_index]
edge_destination
# get origin and destination types
= label[edge_origin][0]
origin_type = label[edge_destination][0]
destination_type
# add row to data frame
edge_list_of_lists.append([edge_origin, edge_destination, origin_type, destination_type])
= pd.DataFrame(edge_list_of_lists, columns=["Origin", "Destination", "OriginType", "DestinationType"]) edge_list
Save node and edge lists as .csv
files to be read into
R.
# save node and edge lists as CSV files
"inst/extdata/arxiv_node_list.csv", index = False)
node_list.to_csv("inst/extdata/arxiv_edge_list.csv", index = False) edge_list.to_csv(
Read the .csv
node and edge lists saved previously.
# load libraries
library(data.table)
library(purrr)
library(magrittr)
# load metapaths library
library(metapaths)
# read data
arxiv_node_list = fread("inst/extdata/arxiv_node_list.csv")
arxiv_edge_list = fread("inst/extdata/arxiv_edge_list.csv")
head(arxiv_node_list)
Coerce types to character vectors as needed.
arxiv_node_list %>%
.[, Node := as.character(Node)] %>%
.[, NodeType := as.character(NodeType)]
arxiv_edge_list = map_dfc(arxiv_edge_list, as.character) %>%
as.data.table()
Find similar papers; in this case, nodes that share “meta path” in
the title. Use the meta path c(26, 26, 24)
, which
corresponds to CS.SI - CS.SI - CS.AI
.
# find nodes that share meta path
metapath_papers = arxiv_node_list[grep("meta path", NodeTitle), ]
metapath_papers[1:2, .(Node, NodeTitle, NodeCategory)]
Check similarity between two nodes.
# test similarity between nodes
sim_test = get_similarity(
"21500", "37908",
mp = c("26", "26", "24"),
node_list = arxiv_node_list,
edge_list = arxiv_edge_list,
verbose = T)