Smart Data

Everybody is talking about "open data", "big data" and similar buzzwords. The point seems to be to gather and analyze boatloads of raw information for no apparent reason, and only later give it a meaning and, sometimes, a use.

I've decided to play a little with the same approach, working not on some random dataset fetched from the Internet but on information about myself and my daily routine. For the sake of buzzwording, call it "smart data". I have a few reasons:

  • see what happens
  • play with Tracker, the GNOME metadata store, which comes with a rich RDF ontology for describing, as a semantic graph, most of the common content handled by any computer user
  • fun

The first and biggest collection of digitized personal data anyone has is probably their own mailbox: there we find messages (often in threads), relationships among many people, time series, and of course a lot of text to be indexed and harvested.

To begin with, I've indexed only the metadata of the messages in the "inbox" folder of one of my many mail accounts (a GMail one). That is: who sent them, who received them, the sending date, the subject, and any "in-reply-to" relations. I've hacked up this Python script, which packs everything into Tracker. It took 84 minutes to handle 26072 messages, so don't try this at home if you don't have at least an SSD (or a lot less mail to manage).
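As a rough idea of what extracting those fields involves, here is a minimal sketch using only the Python standard library. It is not the indexing script mentioned above: the function name, the dictionary keys and the sample message are made up for illustration, and the step that stores the result in Tracker is left out entirely.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def extract_metadata(raw_message):
    """Return sender, recipients, date, subject and reply link of one message."""
    msg = message_from_string(raw_message)
    # getaddresses() splits header values into (display name, address) pairs.
    sender = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    date_header = msg.get("Date")
    return {
        "message_id": msg.get("Message-ID"),
        "from": sender[0] if sender else None,
        "to": recipients,
        "date": parsedate_to_datetime(date_header) if date_header else None,
        "subject": msg.get("Subject", ""),
        "in_reply_to": msg.get("In-Reply-To"),
    }


# Hypothetical message, used only to exercise the function.
SAMPLE = """\
From: Alice Example <alice@example.org>
To: Bob Example <bob@example.org>
Cc: carol@example.org
Date: Mon, 02 Jan 2012 10:30:00 +0100
Subject: Re: lunch?
Message-ID: <2@example.org>
In-Reply-To: <1@example.org>

See you at noon.
"""

metadata = extract_metadata(SAMPLE)
```

Only the headers are read here; the body is parsed but ignored, which matches the "metadata only" scope of this first pass.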


The first analysis to run is the one about the most frequent senders, or: who spammed me the most? This can be found by running a hand-crafted SPARQL query with the tracker-sparql tool:

tracker-sparql -q "SELECT ?n COUNT(?e) WHERE {?a a nco:Contact . ?a nco:fullname ?n . ?e nmo:from ?a . ?e a nmo:Email} GROUP BY ?a ORDER BY DESC(COUNT(?e))"

At the top of the list I find (family names are hidden for privacy):

  • Facebook, 1614
  • Michele XXXXXX, 779
  • Webmaster ILS, 657
  • Andrea XXXXXXX, 377
  • Direzione ILS, 296
  • Luisa XXXXXXX, 250
  • Stefano XXXXXXXX, 249
  • Giulio XXXXXXX via LinkedIn, 243
  • Flavio XXXXXXX, 236
  • PayPal, 196

Some of them were predictable (Michele knows I'm easier to reach by mail than by phone...), and some are notifications sent by cron scripts I've placed on various servers. But I had never noticed the quantity of spammy notifications I receive from Facebook, and I still have to investigate the mails coming from a LinkedIn contact. By getting rid of those old, forgotten messages I can probably recover at least 2% of my GMail space...
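For those who don't have Tracker at hand, the same kind of sender ranking can be sketched in plain Python. This is only an illustration: the sample pairs are invented, and the sketch groups by sender name, while the SPARQL query above groups by contact.

```python
from collections import Counter

# Hypothetical sample: one (sender name, message id) pair per indexed mail.
messages = [
    ("Facebook", "<1@example.org>"),
    ("Michele", "<2@example.org>"),
    ("Facebook", "<3@example.org>"),
    ("PayPal", "<4@example.org>"),
    ("Facebook", "<5@example.org>"),
    ("Michele", "<6@example.org>"),
]


def top_senders(messages, limit=10):
    """Count messages per sender and return the most frequent ones first."""
    counts = Counter(sender for sender, _ in messages)
    return counts.most_common(limit)


ranking = top_senders(messages)
```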

Contact Relationships

Running meaningful SPARQL queries by hand is not that easy: I know only part of the language, and Tracker itself supports only some of the available operators. To go deeper in the analysis, I've used a few other dedicated Python scripts to fetch, process, and visualize the data. After a little googling I found this library, which makes it easy to manage complex graphs and for which good examples are available on the Net.

An interesting graph to draw is the one of relationships among people. Who sent mail to whom? Who else was in TO and CC besides me (or: whom did I put together as recipients of the same mail)? Here is the script, and here are the results:

Bear in mind that the result may be inaccurate, as the initial indexing script cannot recognize the same person using different addresses unless the name is exactly the same. Even so, you can spot different clusters. The dense group in the center is made of the people I'm most often in touch with, all mutually connected. Then there are scattered groups, probably related to past jobs, interest groups, and some mailing lists (I've moved most of my subscriptions to other accounts or filtered them into dedicated folders, so those must come from a distant past). Within each group, almost all of the contacts are closely connected to one another.
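The clusters can also be found without drawing anything. The following is a minimal sketch in plain Python, not the script linked above: it assumes each message has been reduced to a sender plus a list of recipients (the sample names are invented), links everyone appearing on the same message, and then collects the connected components. Unlike the graph described here, it does not leave the mailbox owner out.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(messages):
    """Link everyone who appears on the same message (sender and recipients)."""
    graph = defaultdict(set)
    for sender, recipients in messages:
        people = {sender, *recipients}
        for person in people:
            graph[person]  # make sure every person gets a node
        for a, b in combinations(sorted(people), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clusters(graph):
    """Return the connected components of the graph, largest first."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(component)
    return sorted(result, key=len, reverse=True)


# Hypothetical sample data: (sender, [recipients in TO and CC]).
sample = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("dave", ["erin"]),
]
groups = clusters(build_graph(sample))
```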

A different visualization of the same concept is one that skips redundant direct relations among people: when some other path in the graph already connects two nodes, the direct edge between them is ignored:

In this case, the separate, isolated clusters stand out even more clearly.
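One possible reading of that edge-skipping rule is a spanning forest: an edge is kept only if its two endpoints are not already connected through edges kept earlier. The sketch below implements that reading with a small union-find. It is an assumption about how the picture could be produced, not the original script, and which edges survive depends on the order in which they are visited.

```python
def spanning_forest(edges):
    """Keep an edge only if its endpoints are not already connected."""
    parent = {}

    def find(node):
        # Follow parent links up to the representative of the node's group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    kept = []
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups
            kept.append((a, b))
    return kept


# Hypothetical sample: alice, bob and carol all know each other.
edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("alice", "carol"),
    ("dave", "erin"),
]
kept = spanning_forest(edges)
```

The alice/carol edge is dropped because the two are already linked through bob, while the number of separate clusters stays the same.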


Another fun thing to visualize (using another dedicated script) is the complexity of the threads I've been involved in. Given that, as said, this mail account should be almost free of mailing list messages, there are no very deep trees, but some weirdly shaped graphs still emerge:

Most of them are direct question/reply exchanges, some involve a longer chain, and then there are the most intense discussions, branching in all directions.
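The shape of each thread can be summarized with two numbers, its size and its depth, both derived from the "in-reply-to" relations. The sketch below is an illustration under assumptions, not the script used for the picture: the input is an invented mapping from each message id to the id it replies to, and a reply whose parent was never indexed is treated as the start of its own thread.

```python
from collections import defaultdict


def thread_shapes(replies):
    """Map each thread root to (number of messages, depth of the reply tree).

    `replies` maps a message id to the id it replies to, or None for a
    message that starts a thread.
    """
    children = defaultdict(list)
    roots = []
    for message_id, parent_id in replies.items():
        # A reply whose parent was not indexed is treated as a root.
        if parent_id is None or parent_id not in replies:
            roots.append(message_id)
        else:
            children[parent_id].append(message_id)

    shapes = {}
    for root in roots:
        size, depth = 0, 0
        stack = [(root, 1)]
        while stack:
            node, level = stack.pop()
            size += 1
            depth = max(depth, level)
            stack.extend((child, level + 1) for child in children[node])
        shapes[root] = (size, depth)
    return shapes


# Hypothetical sample: one branching thread and one simple question/reply.
sample = {
    "<1>": None,
    "<2>": "<1>",
    "<3>": "<1>",
    "<4>": "<2>",
    "<5>": None,
    "<6>": "<5>",
}
shapes = thread_shapes(sample)
```

A direct question/reply exchange shows up as size 2 and depth 2, a long chain has size equal to depth, and a branching discussion has a size well above its depth.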

Conclusions (for now...)

What is the hourly distribution of the messages I receive? Or their distribution over time? Which is the busiest day of the week? Or, with even more information indexed: who sent me the most bytes? How many messages have I never replied to? Which relationships lie behind the most popular mailing lists?

I will try to go further, both in data harvesting and in analysis. Just in the name of exploration (and exploitation). Because the "little big data" from my own mail tells me about myself and the people around me, and about the dynamics and facts I have lived (and am living). Because those are my "smart data".