Neo4j Graph Data Science 1.5: Exploring the HITS Algorithm
The Neo4j Graph Data Science Library provides efficiently implemented, parallel versions of common graph algorithms for Neo4j, exposed as Cypher procedures. It recently published version 1.5, which has lots of goodies to play with.
In this blog post, we’re going to explore the newly added HITS algorithm with the help of a citations dataset.
Launching Neo4j
We’re going to run Neo4j with the Graph Data Science Library using the following Docker Compose configuration:
version: '3.7'
services:
neo4j:
image: neo4j:4.2.3enterprise
container_name: "neo4j4.2gds1.5exploration"
volumes:
 ./plugins4.2:/plugins
 ./data4.2:/data
ports:
 "7474:7474"
 "7687:7687"
environment:
 NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
 NEO4J_AUTH=neo4j/neo
 NEO4JLABS_PLUGINS=["apoc", "graphdatascience"]
If you want to follow along with the examples used in the blog post, you can copy the configuration above to a file titled dockercompose.yml
We can now launch the Neo4j server by running the following command:
dockercompose up
After we’ve launched that command, we need to wait until we see the following output:
neo4j4.2gds1.5exploration  20210203 21:39:15.346+0000 INFO Bolt enabled on 0.0.0.0:7687.
neo4j4.2gds1.5exploration  20210203 21:39:16.053+0000 INFO Remote interface available at http://localhost:7474/
neo4j4.2gds1.5exploration  20210203 21:39:16.053+0000 INFO Started.
Importing the Citation Graph
We’re going to import the Citations Graph from the Using a Machine Learning Workflow for Link Prediction online training.
Let’s first connect to Neo4j using the Cypher Shell:
docker exec it neo4j4.2gds1.5exploration bin/cyphershell u neo4j p neo
Connected to Neo4j using Bolt protocol version 4.2 at neo4j://localhost:7687 as user neo4j.
Type :help for a list of available commands or :exit to exit the shell.
Note that Cypher queries must end with a semicolon.
neo4j@neo4j>
The data for the citation graph is available as a set of JSON lines files, which we can find at https://github.com/mneedham/linkprediction/tree/master/data. Below is an example of one line of one of the files:
{"authors": ["Tegegne Marew", "DooHwan Bae"], "n_citation": 1, "references": ["2134bf3bfd89472490ce5993b4fa3218", "906c17e0db09407bb76041df5a3f0293", "94f4382ecfa64aec92b83711fc55da54", "9f1725858d424fceb6aeaede321f3fd4", "a3aee287efd04b9d9cdad47dd192c9f4", "a9a7fd07ef714b3c8fcfd7fe114d2148", "d63dd4ae4b30484b8ffc88d21839ddad"], "title": "Using Classpects for Integrating NonFunctional and Functional Requirements.", "venue": "international conference on software engineering", "year": 2006, "id": "01f1d23180ae4cceb56c9d821e0924d0"}
We’re going to use some APOC procedures to convert these JSON files into the following graph structure:
So we’ll have the following node labels:

Article
 a paper on a topic 
Venue
 where the paper was presented 
Author
 the person/people that worked on theArticle
And these relationship types:

CITED
 which articles were cited by an article 
AUTHOR
 which articles were written by a person/people 
VENUE
 where an article was presented
Let’s first create a new database and setup some constraints to make sure we don’t end up with duplicates:
CREATE OR REPLACE DATABASE citationsblogpost;
:use citationsblogpost;
CREATE CONSTRAINT ON (a:Article) ASSERT a.index IS UNIQUE;
CREATE CONSTRAINT ON (a:Author) ASSERT a.name IS UNIQUE;
CREATE CONSTRAINT ON (v:Venue) ASSERT v.name IS UNIQUE;
Now we’ll import the data:
CALL apoc.periodic.iterate(
'UNWIND ["dblpref0.json", "dblpref1.json", "dblpref2.json", "dblpref3.json"] AS file
CALL apoc.load.json("https://github.com/mneedham/linkprediction/raw/master/data/" + file)
YIELD value WITH value
return value',
'MERGE (a:Article {index:value.id})
SET a += apoc.map.clean(value,["id","authors","references", "venue"],[0])
WITH a, value.authors as authors, value.references AS citations, value.venue AS venue
MERGE (v:Venue {name: venue})
MERGE (a)[:VENUE]>(v)
FOREACH(author in authors 
MERGE (b:Author{name:author})
MERGE (a)[:AUTHOR]>(b))
FOREACH(citation in citations 
MERGE (cited:Article {index:citation})
MERGE (a)[:CITED]>(cited))',
{batchSize: 1000, iterateList: true})
YIELD batches, total, timeTaken, committedOperations;
batches  total  timeTaken  committedOperations 

52 
51956 
21 
51956 
And finally, a bit of cleanup to remove articles that don’t have a title:
MATCH (a:Article)
WHERE not(exists(a.title))
DETACH DELETE a;
HITS Algorithm
The HITs algorithm, like many other graph algorithms, was invented to do link analysis on web pages. It is a centrality algorithm, which means that it indicates node importance based on some metric. We can learn more about it from the HITS Wikipedia page:
The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users direct to other authoritative pages.
The scheme therefore assigns two scores for each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.
So a page with a high authority score has high value content, whereas a page with a high hub score links out to important pages.
We’re going to use this algorithm to analyse the citations between articles in our graph, so what does those different scores mean for us?

An article with a high authority score will likely have a lot of citations, perhaps some of those by other important articles

An article with a high hub score can help direct us (via its citations) to the important articles. It’s not clear to me that the hub score makes so much sense in this graph because there aren’t really articles written with the intention of pointing people towards a bunch of other articles!
Let’s give the algorithm a try and see what we find. We can return a list of the available procedures by running the following query:
CALL gds.list("hits")
YIELD name, description
RETURN name, description;
name  description 

"gds.alpha.hits.mutate" 
"HyperlinkInduced Topic Search (HITS) is a link analysis algorithm that rates nodes" 
"gds.alpha.hits.mutate.estimate" 
"Returns an estimation of the memory consumption for that procedure." 
"gds.alpha.hits.stats" 
"HyperlinkInduced Topic Search (HITS) is a link analysis algorithm that rates nodes" 
"gds.alpha.hits.stats.estimate" 
"Returns an estimation of the memory consumption for that procedure." 
"gds.alpha.hits.stream" 
"HyperlinkInduced Topic Search (HITS) is a link analysis algorithm that rates nodes" 
"gds.alpha.hits.stream.estimate" 
"Returns an estimation of the memory consumption for that procedure." 
"gds.alpha.hits.write" 
"HyperlinkInduced Topic Search (HITS) is a link analysis algorithm that rates nodes" 
"gds.alpha.hits.write.estimate" 
"Returns an estimation of the memory consumption for that procedure." 
Before we run the algorithm, we’ll create a projected graph called citation_graph
, by running the following:
CALL gds.graph.create("citation_graph", "Article", "CITED");
nodeProjection  relationshipProjection  graphName  nodeCount  relationshipCount  createMillis 

{Article: {properties: {}, label: "Article"}} 
{CITED: {orientation: "NATURAL", aggregation: "DEFAULT", type: "CITED", properties: {}}} 
"citation_graph" 
51956 
28706 
149 
And now we’ll run the write version of the algorithm against the projected graph:
CALL gds.alpha.hits.write("citation_graph", {
hitsIterations: 20
})
YIELD writeMillis, nodePropertiesWritten, ranIterations, postProcessingMillis, createMillis, computeMillis;
writeMillis  nodePropertiesWritten  ranIterations  postProcessingMillis  createMillis  computeMillis 

174 
103912 
81 
0 
3 
390 
By default, this procedure will create pregel_auth
and pregel_hub
properties on each of the Article
nodes storing the computed scores.
Analysing authority scores
Let’s see which articles rank highest, starting with authority:
MATCH (a:Article)
RETURN a.title, a.year, substring(a.abstract, 0, 300) AS abstract,
[(a)[:AUTHOR]>(auth)  auth.name] AS authors,
round(a.pregel_auth, 3) AS auth
ORDER BY auth DESC
LIMIT 10;
a.title  a.year  abstract  authors  auth 

"Rough sets" 
1995 
"Rough set theory, introduced by Zdzislaw Pawlak in the early 1980s [11, 12], is a new mathematical tool to deal with vagueness and uncertainty. This approach seems to be of fundamental importance to artificial intelligence (AI) and cognitive sciences, especially in the areas of machine learning, kno" 
["Jerzy W. GrzymalaBusse", "Wojciech Ziarko", "Zdzisław Pawlak", "Roman Słowiński"] 
0.99 
"Fuzzy Similarity Relation as a Basis for Rough Approximations" 
1998 
"The rough sets theory proposed by Pawlak was originally founded on the idea of approximating a given set by means of indiscernibility binary relation, which was assumed to be an equivalence relation (reflexive, symmetric and transitive). With respect to this basic idea, two main theoretical developm" 
["Roman Słowiński", "Salvatore Greco", "Benedetto Matarazzo"] 
0.042 
"Toward Intelligent Systems: Calculi of Information Granules" 
2001 
"We present an approach based on calculi of information granules as a basis for approximate reasoning in intelligent systems. Approximate reasoning schemes are defined by means of information granule construction schemes satisfying some robustness constraints. In distributed environments such schemes" 
["Andrzej Skowron"] 
0.042 
"Approximation spaces and information granulation" 
2005 
"In this paper, we discuss approximation spaces in a granular computing framework. Such approximation spaces generalise the approaches to concept approximation existing in rough set theory. Approximation spaces are constructed as higher level information granules and are obtained as the result of com" 
["Andrzej Skowron", "Piotr Synak", "Roman Świniarski"] 
0.038 
"Layered learning for concept synthesis" 
2004 
"We present a hierarchical scheme for synthesis of concept approximations based on given data and domain knowledge. We also propose a solution, founded on rough set theory, to the problem of con structing the approximation of higher level concepts by composing the approximation of lower level concep" 
["Andrzej Skowron", "Jan G. Bazan", "Hung Son Nguyen", "Sinh Hoa Nguyen"] 
0.037 
"A Comparison of Several Approaches to Missing Attribute Values in Data Mining" 
2000 
"In the paper nine different approaches to missing attribute values are presented and compared. Ten input data files were used to investigate the performance of the nine methods to deal with missing attribute values. For testing both naive classification and new classification techniques of LERS (Lea" 
["Jerzy W. GrzymalaBusse", "Ming Hu"] 
0.036 
"Variable Consistency Model of DominanceBased Rough Sets Approach" 
2000 
"Consideration of preferenceorders requires the use of an extended rough set model called Dominancebased Rough Set Approach (DRSA). The rough approximations defined within DRSA are based on consistency in the sense of dominance principle. It requires that objects having notworse evaluation with re" 
["Benedetto Matarazzo", "Salvatore Greco", "Roman Słowiński", "Jerzy Stefanowski"] 
0.029 
"RSES and RSESlib  A Collection of Tools for Rough Set Computations" 
2000 
"Rough Set Exploration System  a set of software tools featuring a library of methods and a graphical user interface is presented. Methods, features and abilities of the implemented software are discussed and illustrated with a case study in data analysis." 
["Marcin S. Szczuka", "Jan G. Bazan"] 
0.026 
"A New Version of Rough Set Exploration System" 
2002 
"We introduce a new version of the Rough Set Exploration System  a software tool featuring a library of methods and a graphical user interface supporting variety of roughsetbased computations. Methods, features and abilities of the implemented software are discussed and illustrated with a case stu" 
["Marcin S. Szczuka", "Jakub Wróblewski", "Jan G. Bazan"] 
0.026 
"Rough sets and information granulation" 
2003 
"In this paper, the study of the evolution of approximation space theory and its applications is considered in the context of rough sets introduced by Zdzislaw Pawlak and information granulation as well as computing with words formulated by Lotfi Zadeh. Central to this evolution is the roughmereolog" 
["Piotr Synak", "James F. Peters", "Andrzej Skowron", "Sheela Ramanna"] 
0.026 
The top article by some distance on this metric is "Rough sets", which was written more than 25 years ago. I found it interesting that the abstract talks about it being an approach that is fundamental to AI and machine learning, which are important fields in 2021.
We can have a look at the hub nodes that point to these articles by running the following query:
MATCH (a:Article)
WITH a, [(a)<[:CITED](other)  other] AS citations
WITH a, apoc.coll.sortNodes(citations, "pregel_hub")[..5] AS topHubs
RETURN a.title, a.year,
round(a.pregel_auth, 3) AS auth,
[c in topHubs  {article: c.title, score: round(c.pregel_hub, 3)}] AS topHubs
ORDER BY auth DESC
LIMIT 10;
a.title  a.year  auth  topHubs 

"Rough sets" 
1995 
0.99 
[{score: 0.083, article: "Rough ethology: towards a biologicallyinspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.075, article: "Multimodal classification: case studies"}] 
"Fuzzy Similarity Relation as a Basis for Rough Approximations" 
1998 
0.042 
[{score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.075, article: "On generalized rough fuzzy approximation operators"}, {score: 0.074, article: "Lattices with Interior and Closure Operators and Abstract Approximation Spaces"}] 
"Toward Intelligent Systems: Calculi of Information Granules" 
2001 
0.042 
[{score: 0.083, article: "Rough ethology: towards a biologicallyinspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.072, article: "Rough sets and information granulation"}, {score: 0.071, article: "A Note on Ziarko’s Variable Precision Rough Set Model and Nonmonotonic Reasoning"}, {score: 0.071, article: "A Partition Model of Granular Computing"}] 
"Approximation spaces and information granulation" 
2005 
0.038 
[{score: 0.083, article: "Rough ethology: towards a biologicallyinspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.075, article: "On generalized rough fuzzy approximation operators"}, {score: 0.074, article: "Matching 2d image segments with genetic algorithms and approximation spaces"}] 
"Layered learning for concept synthesis" 
2004 
0.037 
[{score: 0.083, article: "Rough ethology: towards a biologicallyinspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.075, article: "Multimodal classification: case studies"}, {score: 0.072, article: "P300 wave detection based on rough sets"}] 
"A Comparison of Several Approaches to Missing Attribute Values in Data Mining" 
2000 
0.036 
[{score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.075, article: "The rough set exploration system"}, {score: 0.071, article: "Missing template decomposition method and its implementation in rough set exploration system"}, {score: 0.07, article: "Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction"}, {score: 0.07, article: "Characteristic relations for incomplete data: a generalization of the indiscernibility relation"}] 
"Variable Consistency Model of DominanceBased Rough Sets Approach" 
2000 
0.029 
[{score: 0.072, article: "Rough Set Analysis of PreferenceOrdered Data"}, {score: 0.072, article: "Variableprecision dominancebased rough set approach"}, {score: 0.071, article: "On variable consistency dominancebased rough set approaches"}, {score: 0.071, article: "Multicriteria choice and ranking using decision rules induced from rough approximation of graded preference relations"}, {score: 0.07, article: "Rough set approach to customer satisfaction analysis"}] 
"RSES and RSESlib  A Collection of Tools for Rough Set Computations" 
2000 
0.026 
[{score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.073, article: "Hybridization of rough sets and statistical learning theory"}, {score: 0.072, article: "Ontology driven concept approximation"}, {score: 0.072, article: "Processing of musical data employing rough sets and artificial neural networks"}, {score: 0.069, article: "A statistical method for determining importance of variables in an information system"}] 
"A New Version of Rough Set Exploration System" 
2002 
0.026 
[{score: 0.075, article: "Multimodal classification: case studies"}, {score: 0.072, article: "Processing of musical data employing rough sets and artificial neural networks"}, {score: 0.069, article: "Introducing a rule importance measure"}, {score: 0.069, article: "NetTRS induction and postprocessing of decision rules"}, {score: 0.069, article: "Classification of Swallowing Sound Signals: A Rough Set Approach"}] 
"Rough sets and information granulation" 
2003 
0.026 
[{score: 0.083, article: "Rough ethology: towards a biologicallyinspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.075, article: "On generalized rough fuzzy approximation operators"}, {score: 0.074, article: "Matching 2d image segments with genetic algorithms and approximation spaces"}, {score: 0.071, article: "Time complexity of decision trees"}] 
Based on the top hubs, it’s not really obvious why the authority score for "Rough sets" is so much higher than the other articles. Perhaps if we return the max, min, and average hub scores we’ll be able to figure it out?
MATCH (a:Article)
WITH a, [(a)<[:CITED](other)  other] AS citations
RETURN a.title, a.year,
round(a.pregel_auth, 3) AS auth,
round(apoc.coll.max([c in citations  c.pregel_hub]), 3) AS maxHub,
round(apoc.coll.min([c in citations  c.pregel_hub]), 3) AS minHub,
round(apoc.coll.avg([c in citations  c.pregel_hub]), 3) AS averageHub,
size(citations) AS citations
ORDER BY auth DESC
LIMIT 10;
a.title  a.year  auth  maxHub  minHub  averageHub  citations 

"Rough sets" 
1995 
0.99 
0.083 
0.068 
0.069 
211 
"Toward Intelligent Systems: Calculi of Information Granules" 
2001 
0.042 
0.083 
0.003 
0.036 
17 
"Fuzzy Similarity Relation as a Basis for Rough Approximations" 
1998 
0.042 
0.082 
0.003 
0.061 
10 
"Approximation spaces and information granulation" 
2005 
0.038 
0.083 
0.005 
0.055 
10 
"Layered learning for concept synthesis" 
2004 
0.037 
0.083 
0.003 
0.05 
11 
"A Comparison of Several Approaches to Missing Attribute Values in Data Mining" 
2000 
0.036 
0.082 
0.002 
0.048 
11 
"Variable Consistency Model of DominanceBased Rough Sets Approach" 
2000 
0.029 
0.072 
0.004 
0.061 
7 
"RSES and RSESlib  A Collection of Tools for Rough Set Computations" 
2000 
0.026 
0.079 
0.002 
0.032 
12 
"Rough sets and information granulation" 
2003 
0.026 
0.083 
0.005 
0.065 
6 
"A New Version of Rough Set Exploration System" 
2002 
0.026 
0.075 
0.002 
0.038 
10 
From this output we learn that "Rough sets" is being cited by a lot of articles with a good hub score.
The other articles have a similar maxHub
score and some even have a similar averageHub
, but their minHub
is significantly less.
It also has 10x as many citations as any of the other articles in the top 10, so that would contribute to the higher score as well.
HITS Authority vs PageRank
The HITS Authority score and the PageRank algorithm both compute scores that indicate the importance of a node in a graph, so I was curious whether there was any correlation between the scores. i.e. do the nodes with the highest HITS authority score also have a high PageRank score?
To recap, this is what PageRank measures:
The PageRank algorithm measures the importance of each node within the graph, based on the number incoming relationships and the importance of the corresponding source nodes.
We can compute the PageRank score for articles, by running the following query:
CALL gds.pageRank.write("citation_graph", {
maxIterations: 20,
writeProperty: "pagerank"
})
YIELD writeMillis, nodePropertiesWritten, ranIterations, postProcessingMillis, createMillis, computeMillis;
writeMillis  nodePropertiesWritten  ranIterations  postProcessingMillis  createMillis  computeMillis 

29 
51956 
20 
0 
0 
112 
And now let’s put the PageRank scores alongside the HITS Authority scores:
MATCH (a:Article)
RETURN a.title, a.year,
round(a.pregel_auth, 3) AS auth,
round(a.pagerank, 3) AS pagerank,
size([(a)<[:CITED](other)  other]) AS citations
ORDER BY auth DESC
LIMIT 10;
a.title  a.year  auth  pagerank  citations 

"Rough sets" 
1995 
0.99 
25.609 
211 
"Fuzzy Similarity Relation as a Basis for Rough Approximations" 
1998 
0.042 
0.738 
10 
"Toward Intelligent Systems: Calculi of Information Granules" 
2001 
0.042 
1.862 
17 
"Approximation spaces and information granulation" 
2005 
0.038 
0.418 
10 
"Layered learning for concept synthesis" 
2004 
0.037 
0.505 
11 
"A Comparison of Several Approaches to Missing Attribute Values in Data Mining" 
2000 
0.036 
0.896 
11 
"Variable Consistency Model of DominanceBased Rough Sets Approach" 
2000 
0.029 
0.471 
7 
"RSES and RSESlib  A Collection of Tools for Rough Set Computations" 
2000 
0.026 
1.296 
12 
"A New Version of Rough Set Exploration System" 
2002 
0.026 
0.682 
10 
"Rough sets and information granulation" 
2003 
0.026 
0.375 
6 
Rough Sets is the only one with a high PageRank score as well. In fact, its PageRank score is the 3rd highest in the graph, which we can see by running the following query:
MATCH (a:Article)
RETURN a.title, a.year,
round(a.pregel_auth, 5) AS auth,
round(a.pagerank, 5) AS pagerank,
size([(a)<[:CITED](other)  other]) AS citations
ORDER BY pagerank DESC
LIMIT 10;
a.title  a.year  auth  pagerank  citations 

"A method for obtaining digital signatures and publickey cryptosystems" 
1978 
5.0E5 
93.94313 
125 
"Secure communications over insecure channels" 
1978 
0.0 
79.86924 
7 
"Rough sets" 
1995 
0.9902 
25.60911 
211 
"An axiomatic basis for computer programming" 
1969 
4.4E4 
23.02937 
93 
"Pastry: Scalable, Decentralized Object Location, and Routing for LargeScale PeertoPeer Systems" 
2001 
0.0 
21.46956 
108 
"SCRIBE: The Design of a LargeScale Event Notification Infrastructure" 
2001 
0.0 
19.4863 
14 
"A field study of the software design process for large systems" 
1988 
0.0 
19.02815 
53 
"Productivity factors and programming environments" 
1984 
0.0 
18.49935 
5 
"Analyzing mediumscale software development" 
1978 
0.0 
16.45275 
5 
"A Calculus of Communicating Systems" 
1982 
0.0 
15.43059 
55 
I find it kinda interesting that while these articles have very high transitive importance, their HITS Authority score is very low. Many of them have a lot of citations as well, but presumably most of those citations aren’t from hub nodes.
Analysing hub scores
Speaking of hubs, let’s explore those in a bit more detail. We can find the articles with the highest hub score, by running the following query:
MATCH (a:Article)
WITH a, [(a)[:CITED]>(other)  other] AS cited
RETURN a.title, a.year,
round(a.pregel_hub, 3) AS hub,
round(apoc.coll.max([c in cited  c.pregel_auth]), 3) AS maxAuth,
round(apoc.coll.min([c in cited  c.pregel_auth]), 3) AS minAuth,
round(apoc.coll.avg([c in cited  c.pregel_auth]), 3) AS averageAuth,
size(cited) AS cited
ORDER BY a.pregel_hub DESC
LIMIT 10;
a.title  a.year  hub  maxAuth  minAuth  averageAuth  cited 

"Rough ethology: towards a biologicallyinspired study of collective behavior in intelligent systems with approximation spaces" 
2005 
0.083 
0.99 
0.006 
0.102 
12 
"Some Issues on Rough Sets" 
2004 
0.082 
0.99 
0.006 
0.134 
9 
"A treatise on rough sets" 
2005 
0.079 
0.99 
0.005 
0.145 
8 
"Approximate boolean reasoning: foundations and applications in data mining" 
2006 
0.079 
0.99 
0.005 
0.115 
10 
"Multimodal classification: case studies" 
2006 
0.075 
0.99 
0.005 
0.122 
9 
"The rough set exploration system" 
2005 
0.075 
0.99 
0.005 
0.157 
7 
"On generalized rough fuzzy approximation operators" 
2006 
0.075 
0.99 
0.026 
0.274 
4 
"Lattices with Interior and Closure Operators and Abstract Approximation Spaces" 
2009 
0.074 
0.99 
0.005 
0.136 
8 
"Matching 2d image segments with genetic algorithms and approximation spaces" 
2006 
0.074 
0.99 
0.005 
0.154 
7 
"Hybridization of rough sets and statistical learning theory" 
2011 
0.073 
0.99 
0.005 
0.214 
5 
The maxAuth
scores tell us that all of these articles cite the "Rough sets" article that we came across in the previous section.
There aren’t really any other articles with a high authority score, so we can assume that nearly all of the hub score is coming from citing "Rough sets".
In any case, let’s have a look at the other authorities that these articles have cited:
MATCH (a:Article)
WITH a, [(a)[:CITED]>(other)  other] AS cited
WITH a, apoc.coll.sortNodes(cited, "pregel_auth")[..5] AS topAuthorities
RETURN a.title, a.year,
round(a.pregel_hub, 3) AS hub,
[c in topAuthorities  {article: c.title, score: round(c.pregel_auth, 3)}] AS topAuthorities
ORDER BY hub DESC
LIMIT 10;
a.title  a.year  hub  topAuthorities 

"Rough ethology: towards a biologicallyinspired study of collective behavior in intelligent systems with approximation spaces" 
2005 
0.083 
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Toward Intelligent Systems: Calculi of Information Granules"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "Rough sets and information granulation"}] 
"Some Issues on Rough Sets" 
2004 
0.082 
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Toward Intelligent Systems: Calculi of Information Granules"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.036, article: "A Comparison of Several Approaches to Missing Attribute Values in Data Mining"}] 
"A treatise on rough sets" 
2005 
0.079 
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "Rough sets and information granulation"}] 
"Approximate boolean reasoning: foundations and applications in data mining" 
2006 
0.079 
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "RSES and RSESlib  A Collection of Tools for Rough Set Computations"}, {score: 0.021, article: "Some Issues on Rough Sets"}] 
"Multimodal classification: case studies" 
2006 
0.075 
[{score: 0.99, article: "Rough sets"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "A New Version of Rough Set Exploration System"}, {score: 0.015, article: "The rough set exploration system"}, {score: 0.01, article: "Rough Set Methods in Approximation of Hierarchical Concepts"}] 
"The rough set exploration system" 
2005 
0.075 
[{score: 0.99, article: "Rough sets"}, {score: 0.036, article: "A Comparison of Several Approaches to Missing Attribute Values in Data Mining"}, {score: 0.021, article: "Rough Sets and Decision Algorithms"}, {score: 0.021, article: "In Pursuit of Patterns in Data Reasoning from Data The Rough Set Way"}, {score: 0.015, article: "Classification of Swallowing Sound Signals: A Rough Set Approach"}] 
"On generalized rough fuzzy approximation operators" 
2006 
0.075 
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.026, article: "Rough sets and information granulation"}] 
"Lattices with Interior and Closure Operators and Abstract Approximation Spaces" 
2009 
0.074 
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.024, article: "Approximation Operators in Qualitative Data Analysis"}, {score: 0.015, article: "Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction"}, {score: 0.005, article: "Algebraic structures for rough sets"}] 
"Matching 2d image segments with genetic algorithms and approximation spaces" 
2006 
0.074 
[{score: 0.99, article: "Rough sets"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.026, article: "Rough sets and information granulation"}, {score: 0.01, article: "Kmeans Indiscernibility Relation over Pixels"}, {score: 0.006, article: "Rough ethology: towards a biologicallyinspired study of collective behavior in intelligent systems with approximation spaces"}] 
"Hybridization of rough sets and statistical learning theory" 
2011 
0.073 
[{score: 0.99, article: "Rough sets"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.026, article: "RSES and RSESlib  A Collection of Tools for Rough Set Computations"}, {score: 0.01, article: "Accuracy and Coverage in Rough Set Rule Induction"}, {score: 0.005, article: "Generalized indiscernibility relations: applications for missing values and analysis of structural objects"}] 
The top 2 articles both cited "Toward Intelligent Systems: Calculi of Information Granules", which gives them a marginally higher score than the other 8. But I don’t think these hub scores are telling us all that much about these articles.
In Summary
While I’m not sure that this is the greatest data set to show off this algorithm, I think the algorithm itself is an interesting addition to the library. I’m curious to see how well it would fare on a Twitter graph  perhaps the HITS Hub score would help to identify those accounts that primarily tweet out links to interesting content? I guess that exploration will have to wait for another post!
About the author
I'm currently working on realtime userfacing analytics with Apache Pinot at StarTree. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also I coauthored the O'Reilly Graph Algorithms Book with Amy Hodler.