Sandtable’s Principles for doing agile data science

Background

We spend quite a bit of time reflecting on our working practices and trying to improve them. The outcome of one of these bouts of reflection last year was the following set of working principles, which we have refined a few times since then. We developed them to help us work better, delight our clients, and ultimately be successful as an organisation. Like everything else we do, they are subject to continuing revision and improvement.

The principles flow from two observations about the nature of the work we do.

a) Our work involves dealing with complexity. Managing complexity, solving complex problems using diverse data sources, and making complex issues understandable to ourselves and our clients, lie at the heart of what we do.

b) We make progress by building and improving models, and any model can always be improved. A model is never finished: it represents the best picture of the world that there is at the time it is created. A model can never be a complete and final picture of reality.

Our principles are intended to be rational responses to these observations about the nature of our work. They allow us to manage complexity and support continuous improvement.

Anyway, here they are.

 

1. WE WORK COLLABORATIVELY.

EXPLANATION.

The problems we solve are often too big and too complex to be tackled by one person, all at once. Working collaboratively means getting out of our own heads and our own working spaces and into a shared understanding and a shared working space. It means putting our thoughts and our work where others can see them, and giving everyone the best possible chance of understanding us and working with us.

WHAT THIS MEANS IN PRACTICE

  • Building models together: all modellers should be able, quickly, to get a clear picture of any model that is in production.
  • Having a common understanding of the project we are working on and reflecting that understanding in a continuously maintained document of progress
  • Always being clear about who is responsible for what.
  • Putting our work (data / documents / code / finished models) into a shared domain at the earliest opportunity, and often
  • Publishing models so that others can see them and work with them

 

2. WE WORK ITERATIVELY.

EXPLANATION.

The problems we solve cannot be addressed all at once. At the start of a project it is not clear what the right approach might be or even what the right questions are. We have to build up our models step by step, layer by layer, advancing our understanding of the domain in parallel with the development of the model.

WHAT THIS MEANS IN PRACTICE

  • Using an agile development methodology
  • Maintaining a backlog of things to be addressed within a model
  • Identifying sprints
  • Managing sprints through daily scrumming
  • Reflecting on and learning from each sprint about the problem, the model, our methodology and tools

 

3. WE WORK EFFICIENTLY.

EXPLANATION.

Because what we do is complex, it’s easy for work to become messy, and for us to get lost, or go round in circles. Iteration doesn’t mean doing the same thing over and over; it means doing new and different things over and over, till we find something that works better. To do this well, we need to know where we have been, where we are, and where we’re going – and use the fastest route possible to move forward.

WHAT THIS MEANS IN PRACTICE.

  • Always knowing which model is which. Using a versioning protocol.
  • Always being able to find out what is in a given model version (starting with the current one), for example by tracking model components (input data, model, output) and keeping version notes: a published summary of the model contents
  • Always knowing the state of validation of the current model
  • Always being able to determine the input data for a given model version: population; environment (e.g. time series); behavioural rules and utility functions; configuration / parameterisation
  • Always knowing what we have tried and how it worked out: keeping track of all of our work, every step of the way, and archiving it; retaining and documenting any work we do on a project that may have value in future projects; conducting post-project and post-sprint reviews; following up with mini-projects to capture outputs
  • Always knowing where we are in a project: scrumming daily and keeping a record of tasks
  • Knowing where everything goes: using standard folder structures and file naming conventions across data stores; using a consistent approach to ingesting and storing newly received data
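To make the versioning points above concrete, here is a minimal sketch in Python of what a set of version notes could look like. It is an illustration under stated assumptions, not a description of any particular tool: the `ModelVersion` class, its field names, the `fingerprint` helper and the sample data are all hypothetical. The idea is simply that each model version records its state of validation and a content hash for every input component, so a change to any input is visible in the notes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def fingerprint(payload: bytes) -> str:
    """Return a short, stable content hash for a blob of input data."""
    return hashlib.sha256(payload).hexdigest()[:12]


def describe_inputs(components: dict) -> dict:
    """Map each input component label to the hash of its contents."""
    return {label: fingerprint(blob) for label, blob in components.items()}


@dataclass(frozen=True)
class ModelVersion:
    """Version notes for one model version (all field names are hypothetical)."""

    name: str
    version: str            # e.g. "1.2.0"
    validation_state: str   # e.g. "unvalidated" or "validated"
    inputs: dict = field(default_factory=dict)  # component label -> content hash
    notes: str = ""

    def to_json(self) -> str:
        """Serialise the notes so they can be published alongside the model."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Made-up input components, mirroring the categories listed above.
components = {
    "population": b"id,age\n1,34\n2,51\n",
    "environment": b"week,price\n1,2.50\n2,2.45\n",
    "behavioural_rules": b"utility = a * price + b * quality",
    "configuration": b'{"seed": 42, "agents": 1000}',
}

current = ModelVersion(
    name="demand_model",
    version="1.2.0",
    validation_state="unvalidated",
    inputs=describe_inputs(components),
    notes="Added weekly price series to the environment.",
)
print(current.to_json())
```

Storing hashes rather than the data itself keeps the notes small enough to publish with every version, while still making it possible to check whether two versions were built from the same inputs.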

 

4. WE WORK WITH CONFIDENCE.

EXPLANATION.

Clients need to have confidence in our models, but they don’t have the time to build that confidence for themselves, so they rely on us to provide it. If we pretend to understand why a model behaves as it does when we don’t, or if we allow ourselves to make mistakes, we will lose all of our clients’ confidence – and often, there will be no way back from that. We need to work carefully, so that we know what we are doing is right.

WHAT THIS MEANS IN PRACTICE

  • Having confidence in the code we produce, by following software engineering best practice, supported through training workshops
  • Being honest about what we do and don’t understand, always.
  • Having confidence, and inspiring confidence, in the models we present, by working with multiple visualisations of the model, and running multiple replications
  • Ensuring there is consistency in the way we present graphs and diagrams – they follow a consistent style, and they are readable and clear.
  • Being clear about where there is uncertainty in our models – both in terms of statistical uncertainty and grey areas in the conceptual model: setting out clear principles for handling noise; showing confidence intervals in results
  • Testing our code, always, before we hand it over to anyone else
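As an illustration of running multiple replications and showing confidence intervals, the following Python sketch runs a toy stochastic model under 30 different random seeds and reports the mean with an approximate 95% interval. Everything here is an assumption made for the example: the `run_replication` model, its 0.3 adoption probability, the number of replications and the use of a normal approximation are placeholders, not a real model or a recommended procedure.

```python
import random
import statistics


def run_replication(seed: int, n_agents: int = 500) -> float:
    """Toy stochastic model: share of agents that adopt, for one random seed."""
    rng = random.Random(seed)  # a separate generator per replication
    adopters = sum(1 for _ in range(n_agents) if rng.random() < 0.3)
    return adopters / n_agents


def summarise(results: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for an approximate 95% confidence interval.

    Uses the normal approximation mean +/- z * s / sqrt(n). With only a few
    replications, a t-quantile would give a wider, more honest interval.
    """
    mean = statistics.mean(results)
    standard_error = statistics.stdev(results) / len(results) ** 0.5
    return mean, mean - z * standard_error, mean + z * standard_error


# One result per seed, so every replication can be reproduced exactly.
results = [run_replication(seed) for seed in range(30)]
mean, lower, upper = summarise(results)
print(f"adoption share: {mean:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Seeding each replication explicitly means any single run can be repeated later, which helps when a client asks why one result looks different from the rest.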

 

5. WE WORK SAFELY.

EXPLANATION.

We are entrusted with a lot of valuable and sensitive data. It’s a huge risk to our clients’ organisations – and to our own – if it gets into the wrong hands. Our models, our code and our analysis are the results of many hours of hard work. If we lose them, we lose time and money and reserves of patience getting back to where we should be.

WHAT THIS MEANS IN PRACTICE.

  • Having robust security procedures, implemented according to a comprehensive security policy.

TOOLS WE TRUST

We are human, and often we need help in keeping to our principles. This is why we use tools. Some of the tools we use we have built ourselves; others are available to anyone.

In terms of public tools:

  • We use Trello to maintain a common view of what we are building, what we want to build (our backlog), and to track tasks and responsibilities
  • Slack helps us put more of our communication in the shared domain, where all team members can search through it, and benefit from it.
  • We store all of our code in Git on Bitbucket, so that all team members can access it, build on it, and document their builds under strong version control.
  • We carry out exploratory data analysis (EDA) using shared IPython Notebooks, which allow us to share analytical results

Our proprietary modelling platform supports fundamental elements of the data science process.

  • It helps us to ingest and keep track of data from a wide range of different external sources, and supports the workflows required to transform the data into model input
  • It allows us to keep track of many different versions of models, across multiple projects, as well as their state of validation.
  • It allows us to visualise model results and explore their behaviour under different parameter configurations – and store those results for future reference.
  • It holds together data, models and results, making sure that individual results can be tracked back to a specific model version and a specific set of input data.
  • It gives us computing power when we need it, so that we can iterate faster and more easily keep track of where we are in the process of development.
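The idea of tracking each result back to a specific model version and a specific set of input data can be sketched in a few lines of Python. This is a generic illustration only and says nothing about how the proprietary platform is actually built: the `Result` and `ResultStore` classes, their methods and the sample identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One stored result, tagged with where it came from (hypothetical fields)."""

    result_id: str
    model_version: str  # which model version produced the result
    input_id: str       # which set of input data it was run on
    summary: float


class ResultStore:
    """In-memory store that never overwrites a result once it is saved."""

    def __init__(self) -> None:
        self._results: dict[str, Result] = {}

    def save(self, result: Result) -> None:
        if result.result_id in self._results:
            raise ValueError(f"result {result.result_id!r} is already stored")
        self._results[result.result_id] = result

    def provenance(self, result_id: str) -> tuple[str, str]:
        """Return the (model version, input data) pair behind a result."""
        result = self._results[result_id]
        return result.model_version, result.input_id

    def for_version(self, model_version: str) -> list[Result]:
        """List every stored result produced by one model version."""
        return [r for r in self._results.values() if r.model_version == model_version]


store = ResultStore()
store.save(Result("run-001", "1.2.0", "inputs-a", 0.31))
store.save(Result("run-002", "1.2.0", "inputs-b", 0.28))
store.save(Result("run-003", "1.3.0", "inputs-b", 0.35))
print(store.provenance("run-002"))
```

Refusing to overwrite an existing result is a deliberate choice in this sketch: once a figure has been shown to someone, the record behind it should not change.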
