Mark Needham

Thoughts on Software Development

Neo4j/R: Analysing London NoSQL meetup membership

with 4 comments

In my spare time I’ve been working on a Neo4j application that runs on tops of meetup.com’s API and Nicole recently showed me how I could wire up some of the queries to use her Rneo4j library:

The query used in that visualisation shows the number of members that overlap between each pair of groups but a more interesting query is the one which shows the % overlap between groups based on the unique members across the groups.

The query is a bit more complicated than the original:

MATCH (group1:Group), (group2:Group)
OPTIONAL MATCH (group1)<-[:MEMBER_OF]-()-[:MEMBER_OF]->(group2)
 
WITH group1, group2, COUNT(*) as commonMembers
MATCH (group1)<-[:MEMBER_OF]-(group1Member)
 
WITH group1, group2, commonMembers, COLLECT(id(group1Member)) AS group1Members
MATCH (group2)<-[:MEMBER_OF]-(group2Member)
 
WITH group1, group2, commonMembers, group1Members, COLLECT(id(group2Member)) AS group2Members
WITH group1, group2, commonMembers, group1Members, group2Members
 
UNWIND(group1Members + group2Members) AS combinedMember
WITH DISTINCT group1, group2, commonMembers, combinedMember
 
WITH group1, group2, commonMembers, COUNT(combinedMember) AS combinedMembers
 
RETURN group1.name, group2.name, toInt(round(100.0 * commonMembers / combinedMembers)) AS percentage		 
ORDER BY group1.name, group1.name

The next step is to wire that up to use Rneo4j and ggplot2. First we’ll get the libraries installed and loaded:

install.packages("devtools")
devtools::install_github("nicolewhite/Rneo4j")
install.packages("ggplot2")
 
library(Rneo4j)
library(ggplot2)

And now we’ll execute the query and create a chart from the results:

graph = startGraph("http://localhost:7474/db/data/")
 
query = "MATCH (group1:Group), (group2:Group)
         WHERE group1 <> group2
         OPTIONAL MATCH p = (group1)<-[:MEMBER_OF]-()-[:MEMBER_OF]->(group2)
         WITH group1, group2, COLLECT(p) AS paths
         RETURN group1.name, group2.name, LENGTH(paths) as commonMembers
         ORDER BY group1.name, group2.name"
 
group_overlap = cypher(graph, query)
 
ggplot(group_overlap, aes(x=group1.name, y=group2.name, fill=commonMembers)) + 
geom_bin2d() +
geom_text(aes(label = commonMembers)) +
labs(x= "Group", y="Group", title="Member Group Member Overlap") +
scale_fill_gradient(low="white", high="red") +
theme(axis.text = element_text(size = 12, color = "black"),
      axis.title = element_text(size = 14, color = "black"),
      plot.title = element_text(size = 16, color = "black"),
      axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
 
// as percentage
 
query = "MATCH (group1:Group), (group2:Group)
         WHERE group1 <> group2
         OPTIONAL MATCH path = (group1)<-[:MEMBER_OF]-()-[:MEMBER_OF]->(group2)
 
         WITH group1, group2, COLLECT(path) AS paths
 
         WITH group1, group2, LENGTH(paths) as commonMembers
         MATCH (group1)<-[:MEMBER_OF]-(group1Member)
 
         WITH group1, group2, commonMembers, COLLECT(id(group1Member)) AS group1Members
         MATCH (group2)<-[:MEMBER_OF]-(group2Member)
 
         WITH group1, group2, commonMembers, group1Members, COLLECT(id(group2Member)) AS group2Members
         WITH group1, group2, commonMembers, group1Members, group2Members
 
         UNWIND(group1Members + group2Members) AS combinedMember
         WITH DISTINCT group1, group2, commonMembers, combinedMember
 
         WITH group1, group2, commonMembers, COUNT(combinedMember) AS combinedMembers
 
         RETURN group1.name, group2.name, toInt(round(100.0 * commonMembers / combinedMembers)) AS percentage
 
         ORDER BY group1.name, group1.name"
 
group_overlap = cypher(graph, query)
 
ggplot(group_overlap, aes(x=group1.name, y=group2.name, fill=percentage)) + 
  geom_bin2d() +
  geom_text(aes(label = percentage)) +
  labs(x= "Group", y="Group", title="Member Group Member Overlap") +
  scale_fill_gradient(low="white", high="red") +
  theme(axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 14, color = "black"),
        plot.title = element_text(size = 16, color = "black"),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
2014 05 31 21 54 42

A first glance at the visualisation suggests that the Hadoop, Data Science and Big Data groups have the most overlap which seems to make sense as they do cover quite similar topics.

Thanks to Nicole for the library and the idea of the visualisation. Now we need to do some more analysis on the data to see if there are any more interesting insights.

Be Sociable, Share!

Written by Mark Needham

May 31st, 2014 at 9:32 pm

Posted in R

Tagged with ,

  • Nicole White

    Awesome!!!

    I think I am going to attempt a cluster analysis of the groups based on their shared topics using one of R’s clustering functions, just to see if the clustering coincides with the groups that have high co-membership.

  • http://www.markhneedham.com/blog Mark Needham

    @disqus_M81U6DngQK:disqus good idea! Each user has multiple ‘INTERESTED_IN’ relationships going out to topics and the meetup groups have a ‘HAS_TOPIC’ relationship to the topic as well.

  • Jean Adams

    The visualization would benefit from the rows and columns of the matrix being sorted in some more pattern-informative way than alphabetical. See, for example the seriation package in R.

  • http://www.markhneedham.com/blog Mark Needham

    Hi @Jean, I hadn’t come across seriation before but I’m going to give that a try and then I’ll re-post hopefully with a better looking visualisation.