A GitHub for Data Scientists?

Teams engaged in exploratory data analysis or other rapid model-development activities have very specific requirements around how they access and manipulate data. In the course of a single week, they can produce many models, charts and reports as outputs, most of which will be discarded. These outputs may themselves be based on a rapidly-changing set of input data, from both internal and external sources. Datasets may be transformed or manipulated many times by multiple individuals before they are used in reports or other output documents. If you describe these data science challenges to anyone with a background in computer science, they’ll inevitably bring GitHub into the conversation.

VentureBeat has a nice article about four companies that are trying to build a GitHub for data scientists. There’s no explicit analysis of what requirements a GitHub for the data world might meet, but they seem to boil down to these:

  • A repository – a place to store data
  • Collaboration support – tools that help to avoid duplication of effort
  • Version control – the ability to see changes that have been made and roll them back if needed

We did a fair bit of thinking about this topic a while ago and developed our own – currently proprietary – solution called Sandgit (also “Git fo’ data”).

The overarching requirements we set out when we developed Sandgit have a slightly different emphasis from those foregrounded in the VentureBeat piece:

  • It must be possible at all times, and definitively, to identify via a trace-back mechanism the ultimate (external) source of data used in any piece of work. [Data Provenance]
  • It must be possible to step through (and replicate if needed) the sequence of transformations that resulted in the creation of a piece of data from its ultimate source. [Repeatability]
  • The integrity of individual data file versions should be maintained. [Integrity]
  • Individual data files and data tables should be uniquely and immutably identified, independently of their file names on a local file system. [Uniqueness]
  • Individual data files and data tables should be centrally logged and tagged for easier retrieval in the context of a date or project. [Searchability]

We also made it a requirement that the Sandgit system was itself usable and robust, supporting compliant behaviours without getting in the way of the work itself.

The collaboration piece, for us, is supported by the central logging system, though we also stay in step with one another through agile working and daily scrums on modelling projects. We also wanted to facilitate working across multiple repositories (including, for the moment, Dropbox), which is why we went down the route of allocating each file a unique identifier.
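
To give a flavour of the approach (Sandgit itself is proprietary, so the function names and file layout below are purely illustrative), a minimal Python sketch might hash a file’s contents to get an identifier that is independent of its filename, and append a provenance record to a central log:

```python
import datetime
import hashlib
import json

def file_digest(path, algo="sha256"):
    """Hash a file's contents so its identity doesn't depend on its filename."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def log_dataset(path, source, tags, registry="registry.jsonl"):
    """Append a provenance record to a central (here: local) log."""
    record = {
        "id": file_digest(path),          # unique, immutable identifier
        "path": path,                     # current location (may change)
        "source": source,                 # ultimate external source of the data
        "tags": tags,                     # project / date tags for retrieval
        "logged_at": datetime.datetime.utcnow().isoformat(),
    }
    with open(registry, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["id"]
```

Content hashing gives the Integrity and Uniqueness properties almost for free: two files with identical bytes share an identifier, and any change to the bytes changes it.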

Work on the project is ongoing. We’ve recently started using Redshift, and our next challenges are around managing the interface between files and tables in the cloud.

Python + Data: PyData 2014

At Sandtable we use Python throughout our stack. We’re huge fans.

Our Data Scientists use a full scientific Python stack for exploratory data analysis, machine learning and prototyping agent-based models. At the other end, our platform for running large-scale experiments is written mainly in Python.

We really love Python’s flexibility and agility.

So when the opportunity arose to attend PyData 2014 here in London, we were very excited. It was also the first PyData outside of the US.

The conference was a packed three-day event, Friday to Sunday (21-23 February), held at Level39, One Canada Square, Canary Wharf. As you can imagine, the views from the 39th floor were spectacular! Cue photo:

[Photo: the view from Level39]

The theme of the conference was Python and (Big) Data, with speakers talking on a range of topics, including machine learning, high-performance Python, and data visualisation.

The first day, Friday, was given over to tutorials: sessions that were more hands-on than the weekend’s talks. We particularly enjoyed the opening session by Yves Hilpisch on Python and financial analytics, and Bryan Van de Ven’s presentation on Bokeh, which shows great promise.

On the second day, we thought Ian Ozsvald did a great job elucidating the high-performance Python landscape. Lessons learnt: profile first (of course!), and Cython is still good for the win. However, stay agile and be careful of technical debt. There are also some interesting emerging projects to check out: Numba, Shedskin, and Pythran. We’re looking forward to his forthcoming book, High Performance Python. Ian’s own write-up of the conference can be found here.
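
As a trivial illustration of the “profile first” advice (a toy example of our own, not one from Ian’s talk), the standard library’s cProfile makes it easy to find a hotspot before reaching for Cython or Numba:

```python
import cProfile
import pstats

def pairwise_distances(points):
    # Naive O(n^2) double loop -- exactly the kind of hotspot profiling exposes.
    dists = []
    for i, (x1, y1) in enumerate(points):
        for x2, y2 in points[i + 1:]:
            dists.append(((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5)
    return dists

points = [(i * 0.1, i * 0.2) for i in range(500)]

# Profile the call, dump stats to a file, then print the top offenders.
cProfile.run("pairwise_distances(points)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)
```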

Later in the day, and something quite different, we particularly enjoyed one-eyed artist Eric Drass’ thoughtful and entertaining presentation on the mix of art and technology. Check out his topical piece: ‘Who watches the watchers?’ — there’s more than meets the eye, promise.

On Sunday, we found Gael Varoquaux’s keynote compelling. He mused on building a cutting-edge data-processing environment with limited resources, and reminded us that software development isn’t just about tools: it’s a social process. Again we were warned about technical debt; we must plan for change. We also found it fascinating to hear about the vision (reality?) behind scikit-learn: ‘Machine learning without learning the machinery.’

Later we were mesmerised by James Powell’s whizz-bang tour of Python generators. A Python feature we’ll be sure to harness more in future.
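
For anyone who hasn’t played with them, here’s a tiny (entirely illustrative) example of why generators appeal: they process a stream lazily, one value at a time, without materialising it all in memory.

```python
def running_mean(values):
    # Lazily yield the running mean of a stream of numbers.
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count

# Nothing is computed until we iterate; the generator expression is lazy too.
stream = running_mean(x * x for x in range(1, 6))
print(list(stream))  # [1.0, 2.5, 4.666..., 7.5, 11.0]
```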

In the afternoon, Bart Baddeley gave a very accessible introduction to similarity and clustering using scikit-learn.
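
To show how little machinery scikit-learn demands of its users (again, a toy example of our own, not Bart’s), clustering a synthetic dataset with k-means takes only a few lines:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated 2-D blobs of points.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit k-means and get a cluster label for every point.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:10])
print(kmeans.cluster_centers_)
```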

Lots of great talks! We really enjoyed the event. Thanks to all those who organised it!