Strata+Hadoop 2015 London

The 2015 Strata+Hadoop World London conference was held last week at the Hilton Metropole on Edgware Road. The conference brings together practitioners in Big Data (BD) and Data Science (DS), particularly around the Hadoop ecosystem. It ran over three days, Tuesday to Thursday, with Sandtable attending, for the first time, on Wednesday and Thursday. The schedule was packed from start to finish with interesting sessions, ranging from the use of Big Data in healthcare to discussions on data strategy.

Note that many of the talks were recorded and are available, or will be shortly, with slides on the conference website.

Here are some of our highlights from the conference.

Day 1 – Wednesday

Keynotes

The conference format for keynote speeches involved a number of short, sharp talks, mainly on the use of BD in different organisations. The keynotes included talks from the Santander Group, Teradata, Shazam, Google, and The Financial Times.

Cait O’Riordan from Shazam gave a fascinating talk on the use of data at Shazam. Using data from their service, Shazam are able to predict hits and gain insights into the structure of successful songs. In particular, they can identify which parts of songs attract the most attention.

Shazaming to the top

Tim Harford (of Undercover Economist fame) gave an entertaining presentation on two types of innovation: incremental and long-shots. He referred to stories from sport, music, and science to demonstrate where ideas emerge from. In particular, he drew on the extraordinary life story of Nobel Prize-winning geneticist Mario Capecchi to highlight the character and vision often required to shoot for long-shots, and succeed.

Talks

The first talk we attended was given by Christine Foster from Shopkeep. Foster shared some important lessons learnt from running data science projects. Her comments on the future of data science as a set of conversations, or meetings, resonated with us. Data science is as much about communication as it is about analytics, both in determining which problems to prioritise and in effectively communicating findings.

Edd Dumbill of Silicon Valley Data Science (SVDS) gave a very interesting presentation about data strategy. Edd talked about what a data strategy is and why it’s important for organisations to develop one. It’s time to move beyond doing things to data (cleaning, validating, protecting) and towards driving business value, focussing more on what you do with data.

Martin Kleppmann gave a colourful talk on systems that enable data agility. The focus of the talk was on Apache Kafka and how it can function as a data exchange backbone for organisations. The overarching idea is that we should log everything, pushing events in real-time into a distributed log like Kafka, making them quickly available for analysis. Data is only as valuable as the decisions it enables, after all. Martin is also writing an O’Reilly book, Designing Data-Intensive Applications, which looks very interesting.

Maslow’s hierarchy of data needs
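
To make the “log everything” idea concrete, here is a minimal sketch of pushing application events into Kafka with the kafka-python client; the broker address, topic name, and event schema are all placeholders, not anything from the talk:

```python
import json
import time

from kafka import KafkaProducer

# Connect to a (hypothetical) local broker; in production this would be
# a list of bootstrap servers.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

# Log every event as it happens; downstream consumers (stream jobs,
# batch jobs, analysts) can replay the log independently, at their own pace.
event = {'user_id': 42, 'action': 'page_view', 'ts': time.time()}
producer.send('events', event)
producer.flush()
```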

Afterwards, Jim Scott expounded on his Zeta architecture: an architecture for enterprise scale and efficiency. The hexagonal architecture includes layers like Container System, Compute Model/Execution Engine, and Real-time Storage. At its core it’s an architecture for increasing the speed of integrating data into the business. The architecture feels familiar to us at Sandtable, as we’re preparing to deploy a Cloud-based platform built using S3, Mesos, Spark, and Docker. For more info on the architecture, see Scott’s Radar article.
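
As a flavour of that kind of stack, here is a minimal PySpark sketch reading raw event data straight from the storage layer (S3) into a distributed computation; the bucket, path, and job are hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext(appName='s3-events-count')

# Pull raw events out of S3 (hypothetical bucket and path); with the
# right Hadoop S3 support and credentials configured, Spark reads it
# like any other filesystem.
events = sc.textFile('s3n://example-bucket/events/2015-05-06/*.log')

# A trivial computation, just to show data flowing from storage to compute.
print(events.count())
```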

Our final talk of the day was given by Carlos Guestrin, CEO of Dato, on deep learning (DL) and how reusing deep features can make DL easier. Carlos started with an overview of DL and demonstrated a DL classifier for identifying objects in photographs running on the Dato Platform. Deep learning is a hot subject in Machine Learning/DS at the moment. The idea of reusing deep features (layers of the networks) for different tasks is very interesting, as deep learning requires huge amounts of data, computation, and parameter tuning, putting it out of the reach of many. Prof. Guestrin noted the need for a central repository for sharing deep features. If DL is your sort of thing (and you haven’t got infinite resources) then watch this space.
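
To illustrate the idea (a sketch, not Dato’s actual API): treat the activations of a pretrained network’s penultimate layer as generic features, and train only a cheap classifier on top for the new task. Here the deep features are stubbed out with random data; in practice they would come from a pretrained network:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Stand-in for deep features: 200 images x 4096 activations, as if taken
# from the penultimate layer of a pretrained network.
deep_features = rng.randn(200, 4096)
labels = rng.randint(0, 2, size=200)

# Only this small, cheap model is trained for the new task; the
# expensive deep network is reused as-is.
clf = LogisticRegression().fit(deep_features, labels)
print(clf.score(deep_features, labels))
```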

Day 2 – Thursday

Keynotes

The third day of the conference (our second day) followed the same structure as the previous one: a set of quick keynotes followed by longer sessions.

Of the keynotes, we found the talk by Rod Smith (IBM) the most compelling. Rod Smith, who apparently holds a patent for the progress bar (!), shared his thoughts on emerging technologies and practices in the field. He argued that real-time is on the horizon for business decisions (real-time was definitely a conference theme). Real-time systems are being enabled and simplified by technologies like Apache Spark, which adopts a unified programming model across batch, streaming, and interactive workloads. Moreover, these technologies allow developers to focus on the business problems at hand, rather than having to prematurely lock down technical decisions. Rod also discussed notebooks, calling them the next spreadsheets. They provide a web-based, interactive environment for collaborative DS work: one place for teams to share insights and visualisations. They’re so important, Rod claims, that for future college graduates in many domains, notebook IDEs will be the norm.
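
As a small illustration of that unified model (our sketch, not from the talk): the same transformation logic can run over a static file in batch mode and over a live stream in micro-batches. The file path, host, and port below are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def count_words(lines):
    # The same transformation logic, whether 'lines' is a batch RDD
    # or a live DStream.
    return (lines.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))

sc = SparkContext(appName='unified-model-demo')

# Batch: run the logic over a static file...
print(count_words(sc.textFile('input.txt')).collect())

# Streaming: ...then apply the very same function to a live socket
# stream, processed in one-second micro-batches.
ssc = StreamingContext(sc, batchDuration=1)
count_words(ssc.socketTextStream('localhost', 9999)).pprint()
ssc.start()
ssc.awaitTermination()
```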

Here at Sandtable we’re very passionate about notebooks, in particular open-source Jupyter notebooks (the language-agnostic successor to IPython notebooks). In fact, it wouldn’t be an exaggeration to say that our Data Scientists live and breathe in notebooks. They’re so important to us that we’ve built a collaborative Data Science Platform, the Sandtable Model Foundry, with Jupyter notebooks at the core.

Talks

Mikio Braun gave a good overview of how to do scalable machine learning. As the amount of input data and the number of models being developed increase, strategies are required for doing large-scale machine learning. A common approach is to parallelise bottlenecks in the process where possible. Several stages of the ML pipeline are, in general, easily parallelised, such as data preparation, extraction, and normalisation; even whole runs of the ML pipeline can be executed in parallel for evaluation (see the sketch below). Where parallelisation is difficult, as in model learning itself, other approaches are required: solving similar but simpler problems as approximations, or distributing the search across parameter servers.

The Pipeline
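
Here is a minimal sketch of the “parallelise the easy stages” idea: evaluating several candidate models in parallel with scikit-learn and joblib. The data is synthetic and the candidate models are arbitrary:

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=5)]

def evaluate(model):
    # Each candidate's cross-validated evaluation is independent of the
    # others, so the runs can be farmed out in parallel.
    return model.__class__.__name__, cross_val_score(model, X, y, cv=5).mean()

results = Parallel(n_jobs=-1)(delayed(evaluate)(m) for m in candidates)
print(results)
```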

In the afternoon, Dean Wampler gave an overview of Spark on Mesos, comparing Mesos with Hadoop YARN (Yet Another Resource Negotiator). We were glad to see this session scheduled, as we’re very excited about Mesos and are experimenting with Spark in this very setup.
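
From the application’s point of view, running Spark on Mesos rather than YARN is mostly a matter of the master URL. A minimal sketch, assuming a hypothetical ZooKeeper-managed Mesos cluster:

```python
from pyspark import SparkConf, SparkContext

# Point Spark at a Mesos cluster instead of YARN: only the master URL
# changes, not the application code. The ZooKeeper addresses are
# placeholders.
conf = (SparkConf()
        .setAppName('spark-on-mesos-demo')
        .setMaster('mesos://zk://zk1:2181,zk2:2181/mesos'))
sc = SparkContext(conf=conf)

# Any ordinary Spark job then runs unchanged on the Mesos-managed cluster.
print(sc.parallelize(range(1000)).sum())
```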

The final talk we attended was about a newcomer on the BD scene, Apache Flink. Stephan Ewen, one of the creators of Flink, gave a deep dive into Flink, discussing its current use-cases and some of the technical details. Flink looks interesting, although it currently only supports Java and Scala (we’re Python users), and appears to target similar use-cases to Spark. Flink appears to perform well against Hadoop, but when will we see a comparison with Spark?

Although Hadoop MapReduce (MR) has been the workhorse of big data processing for a few years, Apache Spark, a relative newcomer on the scene (with plenty of buzz), was very present at the conference, with plenty of mentions and several sessions dedicated to it. Will Spark go on to replace MR? Streaming and real-time analytics also appear to be hot topics, in particular around Apache Kafka. Is streaming data the end of the road for big data storage and processing? A unified model?

Overall, the general theme of the conference was summed up nicely by the title of Edd Dumbill’s talk: “It ain’t what you do to data, it’s what you do with it”. It’s time to convert data and technology into solutions and drive business value. Solutions involve technology, of course, but people too, and require integration and adaptation to suit different organisations’ needs; that has to be where the next set of (big) challenges lies. Onwards and upwards.
