Analysing Twitter Data with Neo4j

As part of a larger project to build a simulation of social influence (more on that in a later post), we’ve been exploring ways of ingesting and analysing Twitter data.

There are a lot of moving parts to this task, of course, but one of the more interesting ones is how best to store and query the data. Given that Twitter is a part of the social graph, it made sense to us to take a look at what graph databases, such as Neo4j, have to offer.

Graph databases are particularly well suited to capturing social data because they store data as nodes and edges. In our Neo4j Twitter database we have four kinds of node: Twitter users, tweets, retweets and mentions.

The coding rubric goes like this:

  • If a user posts a new tweet, this is stored as a user node, a tweet node and a ‘tweeting’ relationship between them.
  • If someone retweets the original tweet, the retweet and the retweeting user are added as new nodes. The retweeting user is connected to the retweet via the tweeting relationship, and the retweet is connected to the original tweet via the retweet relationship.
  • If a user mentions another user, the mention tweet is connected both to the user who tweets it and to the user mentioned.
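The rubric above can be sketched as a small in-memory model. This is only an illustration in Python, not the code we used to load Neo4j: the `Graph` class, the helper functions, the relationship names (`TWEETED`, `RETWEET_OF`, `MENTIONS`), the edge directions and the sample users are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy property graph: hypothetical stand-in for the Neo4j store."""
    nodes: dict = field(default_factory=dict)  # node id -> kind of node
    edges: list = field(default_factory=list)  # (source, relationship, target)

    def add_node(self, node_id, kind):
        # Keep the first kind recorded for a node; users recur across tweets.
        self.nodes.setdefault(node_id, kind)

    def add_edge(self, source, relationship, target):
        self.edges.append((source, relationship, target))


def record_tweet(g, user, tweet_id):
    # New tweet: a user node, a tweet node and a 'tweeting' relationship.
    g.add_node(user, "user")
    g.add_node(tweet_id, "tweet")
    g.add_edge(user, "TWEETED", tweet_id)


def record_retweet(g, user, retweet_id, original_id):
    # Retweet: the retweeting user tweets the retweet node,
    # and the retweet node points at the original tweet.
    g.add_node(user, "user")
    g.add_node(retweet_id, "retweet")
    g.add_edge(user, "TWEETED", retweet_id)
    g.add_edge(retweet_id, "RETWEET_OF", original_id)


def record_mention(g, user, mention_id, mentioned_user):
    # Mention: the mention tweet is linked to its author and to the user mentioned.
    # The direction of the second edge is an assumption.
    g.add_node(user, "user")
    g.add_node(mentioned_user, "user")
    g.add_node(mention_id, "mention")
    g.add_edge(user, "TWEETED", mention_id)
    g.add_edge(mention_id, "MENTIONS", mentioned_user)


g = Graph()
record_tweet(g, "alice", "t1")
record_retweet(g, "bob", "rt1", "t1")
record_mention(g, "carol", "m1", "alice")
print(len(g.nodes), "nodes,", len(g.edges), "edges")  # 6 nodes, 5 edges
```

Three pieces of activity produce six nodes and five relationships, which is why even a modest collection of tweets turns into a sizeable graph.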

This gives us a powerful and economical way of capturing the social graph, as becomes clear when you visualise what’s in the database (which we’ve done below using Gephi):

Figure: the graph (the full file is also available in PDF form)

This graph represents Twitter activity around the phrase ‘Future is Here’ (inspired by the exhibition at the Design Museum supported by the Technology Strategy Board), collected via the Twitter API between 22 October and 13 November 2013.

  • Green nodes (with names) are Twitter Users
  • Red nodes are tweets
  • Purple are retweets
  • Blue are mentions (or replies)

The size of a node is simply a function of how many relationships it has (it’s not an intrinsic property of the node). There are 13,631 nodes here (users and tweets of various kinds) and 25,966 edges.
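To make the sizing rule concrete, here is a minimal sketch of counting relationships per node. It is not how Gephi computes node sizes internally; the edge list is made up, and we are simply assuming that "how many relationships it has" means the number of edges touching the node in either direction.

```python
from collections import Counter

# Hypothetical edge list: (source, target) pairs between users and tweets.
edges = [
    ("alice", "t1"),
    ("bob", "rt1"),
    ("rt1", "t1"),
    ("carol", "m1"),
    ("m1", "alice"),
]

# Degree = number of edges touching a node, whichever way they point.
degree = Counter()
for source, target in edges:
    degree[source] += 1
    degree[target] += 1

for node, count in degree.most_common():
    print(node, count)
```

A node drawn larger in the visualisation is therefore just one that takes part in more relationships, such as a user whose tweets are widely retweeted.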

Looking at the graph as a whole, it’s striking how much Twitter activity around ‘Future is Here’ takes place in isolation. (Of course, followers may be reading ‘Future is Here’ tweets, but this doesn’t leave a trace for us to pick up.) Then, as we zoom in, we start to see some distinctive patterns of connected activity. We’ll take a closer look at just six, and then talk briefly about how Neo4j can help us with finding and quantifying such patterns.

1. Users getting retweeted ‘blindly’


Here’s PerezHilton, a celebrity blogger with lots of followers, some of whom are prepared to retweet anything he says, seemingly regardless of content.

2. Users getting retweeted and setting off additional tweets and replies


This is StacyCowley, an NYT journalist with 2,354 followers, who had something interesting to say about her job:

“The future is here: NYT stylebook just dropped hyphen from email.”

This was picked up and retweeted beyond her list of followers.

3. Users tweeting repeatedly without getting retweeted


There are quite a few of these. They may be automated ‘bots’ tweeting to raise their profile. MountainMutts, for example, bills itself as the:

“#1 Dog Grooming Boutique in the Land of the Waterfalls-Brevard, NC. We’re also a comprehensive online company with Newsletter, Facebook, valuable content…”

The repeated tweet we see here is the unremarkable:

“iPas – PROVEN income online. We Tested It. The Future is Here. The Future is iPas – Get Your Trial NOW HERE”

4. Users being retweeted by closely related followers


5. Hybrid patterns


Reading her tweets, this user looks like a bot. The main tweet we see here is:

“Law: The Future is Here :”

The difference from other bots is that she has been heavily retweeted and has generated additional tweets. A network of bots, perhaps.

6. Bridging users connecting two different clusters


The user in the middle here, jameseatontyler, is a Detroit resident with a picture of a wind farm on his profile. He bridges activity around GOOD, a community of people who share “what’s good”, and ClimateReality, a climate activism community.

The power of graph databases for exploring Twitter data

So the graph approach helps us to see patterns in the data. The real power of a graph database is that it lets us find those patterns (path types) with relatively simple queries.

Building on the last example, we can check whether anyone else bridges nodes in the way jameseatontyler does by using the following query in Neo4j:

MATCH (u1)-[:user_reply|user_retweet]->(t1)<-[:owner]-(u2)-[:owner]->(t2)<-[:user_reply|user_retweet]-(u3)
WHERE u1.screen_name = "GOOD"
RETURN u1.screen_name, t1.tweet_id, u2.screen_name, t2.tweet_id, u3.screen_name;

| u1.screen_name | t1.tweet_id        | u2.screen_name    | t2.tweet_id        | u3.screen_name   |
| "GOOD"         | 397033579290849282 | "adiskype"        | 397055916891058176 | "lucykellaway"   |
| "GOOD"         | 397034058623905792 | "jameseatontyler" | 397414864903094272 | "ClimateReality" |

This says: find me all the paths where users (u2) have retweeted or replied to both GOOD’s (u1’s) tweets and those of another user (u3).

(lucykellaway is a journalist for the Financial Times newspaper.)
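For readers who don’t have Neo4j to hand, the shape of that path can be sketched in plain Python. This is a rough illustration rather than how Neo4j evaluates the query: the sample edges and user names (`bridge_user`, `other_user`, `loner`) are made up, the function name `bridging_paths` is hypothetical, and the edge directions simply copy the arrows in the query as written.

```python
from collections import defaultdict

# Made-up sample data, directed as in the query: (source, relationship, target).
EDGES = [
    ("GOOD", "user_retweet", "t1"),
    ("bridge_user", "owner", "t1"),
    ("bridge_user", "owner", "t2"),
    ("other_user", "user_reply", "t2"),
    ("loner", "owner", "t3"),
]
ENGAGE = {"user_reply", "user_retweet"}


def bridging_paths(edges, start_user=None):
    """Yield (u1, t1, u2, t2, u3) tuples with the same shape as the query's path."""
    owned = defaultdict(list)    # u2 -> tweets reached by an owner edge
    engaged = defaultdict(list)  # tweet -> users with a reply/retweet edge into it
    for source, rel, target in edges:
        if rel == "owner":
            owned[source].append(target)
        elif rel in ENGAGE:
            engaged[target].append(source)

    for u2, tweets in owned.items():
        for t1 in tweets:
            for t2 in tweets:
                # Assumption: the two owner edges in one match must differ,
                # so with no duplicate edges t1 and t2 are different tweets.
                if t1 == t2:
                    continue
                for u1 in engaged[t1]:
                    if start_user is not None and u1 != start_user:
                        continue  # plays the role of the WHERE clause
                    for u3 in engaged[t2]:
                        yield (u1, t1, u2, t2, u3)


for path in bridging_paths(EDGES, start_user="GOOD"):
    print(path)  # ('GOOD', 't1', 'bridge_user', 't2', 'other_user')
```

The nested loops do by hand what the single pattern in the query expresses declaratively, which is the main attraction of writing it as a graph query.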

We can also generalise to find all instances where users have retweeted or replied to the tweets of two other users. Here, we’ve run the query and grouped users by the number of times they participate in this kind of path:

MATCH (u1)-[:user_reply|user_retweet]->(t1)<-[:owner]-(u2)-[:owner]->(t2)<-[:user_reply|user_retweet]-(u3)
RETURN u2.screen_name, count(*);
| u2.screen_name  | count(*) |
| innovate_uk     | 8372     |
| AGDaKid_        | 1640     |
| Huw_J           | 462      |
| bway            | 210      |
| QuadShock       | 132      |
| patrickdrive    | 110      |
| ThFuturePerfect | 30       |
| ElaineAnderson0 | 30       |
| DrResources     | 20       |
| Deb78clark      | 20       |
| KirstinStolz94  | 2        |
(123 users total)
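The grouping step can also be sketched without a database. The Python below is an illustration under assumptions: the edges and user names are made up, `paths_per_user` is a hypothetical helper, and the edge directions copy the arrows in the query. Rather than enumerating every path, it counts ordered pairs of distinct tweets reached from the same user, weighted by the reply/retweet edges arriving at each tweet. Every count in the table above has the form n(n−1) (for example 8372 = 92 × 91 and 2 = 2 × 1), which is consistent with this kind of ordered-pair counting, though we have not checked it against the database.

```python
from collections import Counter, defaultdict

# Made-up sample edges, directed as in the query: (source, relationship, target).
EDGES = [
    ("u_a", "user_retweet", "t1"),
    ("u_b", "user_reply", "t2"),
    ("u_c", "user_retweet", "t3"),
    ("hub", "owner", "t1"),
    ("hub", "owner", "t2"),
    ("hub", "owner", "t3"),
    ("u_a", "user_retweet", "t4"),
    ("u_b", "user_retweet", "t5"),
    ("pair", "owner", "t4"),
    ("pair", "owner", "t5"),
]


def paths_per_user(edges):
    """Count query-shaped paths per middle user (u2) without listing them."""
    incoming = Counter()       # tweet -> number of reply/retweet edges into it
    owned = defaultdict(list)  # user -> tweets reached by an owner edge
    for source, rel, target in edges:
        if rel == "owner":
            owned[source].append(target)
        elif rel in ("user_reply", "user_retweet"):
            incoming[target] += 1

    counts = Counter()
    for user, tweets in owned.items():
        total = sum(incoming[t] for t in tweets)
        same_tweet = sum(incoming[t] ** 2 for t in tweets)
        # Ordered pairs (t1, t2) with t1 != t2, weighted by incoming edges.
        paths = total * total - same_tweet
        if paths:
            counts[user] = paths
    return counts


for user, count in paths_per_user(EDGES).most_common():
    print(user, count)
```

With one incoming edge per tweet, a user linked to n such tweets takes part in n(n−1) paths, so `hub` (three tweets) scores 6 and `pair` (two tweets) scores 2.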

This query took around 2 seconds on a laptop in Neo4j, which is pretty quick. It’s difficult to imagine how you’d even set up the query if the data were stored in a relational (SQL) database.
