Teams engaged in exploratory data analysis or other rapid model-development work have very specific requirements around how they access and manipulate data. In the course of a single week they can produce many models, charts, or reports as outputs, most of which will be discarded. These outputs may themselves be based on a rapidly changing set of input data, drawn from both internal and external sources. Datasets may be transformed or manipulated many times by multiple individuals before they are used in reports or other output documents. Describe these data science challenges to anyone with a background in computer science, and they’ll inevitably bring GitHub into the conversation.
VentureBeat has a nice article about four companies that are trying to build a GitHub for data scientists. It offers no explicit analysis of the requirements a GitHub for the data world might need to meet, but they seem to boil down to these:
- A repository – a place to store data
- Collaboration support – tools that help to avoid duplication of effort
- Version control – the ability to see changes that have been made and roll them back if needed
We did a fair bit of thinking about this topic a while ago and developed our own – currently proprietary – solution called Sandgit (also “Git fo’ data”).
The overarching requirements we set out when we developed Sandgit have a slightly different emphasis from those foregrounded in the VentureBeat piece:
- It must be possible at all times, and definitively, to identify via a trace-back mechanism the ultimate (external) source of data used in any piece of work. [Data Provenance]
- It must be possible to step through (and replicate if needed) the sequence of transformations that resulted in the creation of a piece of data from its ultimate source. [Repeatability]
- The integrity of individual data file versions should be maintained. [Integrity]
- Individual data files and data tables should be uniquely and immutably identified, independently of their file names on a local file system. [Uniqueness]
- Individual data files and data tables should be centrally logged and tagged for easier retrieval in the context of a date or project. [Searchability]
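Sandgit itself is proprietary, so as a sketch only: the Uniqueness, Integrity, and Provenance requirements above can be met by deriving a file's identifier from its content rather than its name, and recording that identifier in a central log alongside its source and parent versions. Every name and field below is illustrative, not Sandgit's actual API.

```python
import hashlib
from datetime import datetime, timezone

def file_id(path):
    """Derive an immutable identifier from file content (not its name),
    so renaming or moving the file never changes its ID, and any
    corruption of the bytes is detectable as an ID mismatch."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def log_entry(path, source, parents=(), tags=()):
    """A central-log record linking one file version back to its
    ultimate (external) source and to the parent versions it was
    derived from, with tags for later retrieval by project or date."""
    return {
        "id": file_id(path),
        "source": source,          # ultimate external origin of the data
        "parents": list(parents),  # IDs of the inputs to the transformation
        "tags": list(tags),        # project / date tags for searchability
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
```

Because each record names its parent IDs, walking the `parents` chain back to a record with no parents yields both the trace-back to the external source and the ordered sequence of transformations to replay.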
We also made it a requirement that the Sandgit system itself be usable and robust, supporting compliant behaviours without getting in the way of the work.
The collaboration piece, for us, is supported by the central logging system, but we stay in step with one another through agile working and daily scrums on modelling projects. We also wanted to facilitate working across multiple repositories (including, for the moment, Dropbox), which is why we went down the route of allocating unique identifiers to files.
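One consequence of repository-independent identifiers is that the central log, rather than any one storage location, becomes the point of retrieval: a record can be found by tag whether the bytes live on a local disk or in Dropbox. A minimal sketch, with hypothetical records and a hypothetical lookup function:

```python
# Hypothetical central log: each record carries a content-derived ID
# plus free-form tags, so a file can be located by project or date
# regardless of which repository (local disk, Dropbox, ...) holds it.
central_log = [
    {"id": "9f2c1a", "repo": "dropbox", "tags": ["projA", "2014-06"]},
    {"id": "41d8b3", "repo": "local",   "tags": ["projB", "2014-06"]},
]

def find_by_tags(log, *tags):
    """Return the records carrying all of the given tags."""
    wanted = set(tags)
    return [rec for rec in log if wanted <= set(rec["tags"])]
```

For example, `find_by_tags(central_log, "2014-06")` pulls everything logged for that month across both repositories, while adding `"projA"` narrows it to a single project's files.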
Work on the project is ongoing. We’ve recently started using Redshift, and our next challenges are around managing the interface between files and tables in the cloud.