PyData London MeetUp Launch

PyData London had its launch meetup last night, with pizza, beer, and Python at the Pivotal Labs space near Old Street.

Emlyn talked about the issues of converting Matlab code into Python, with a particular look at the domain of pharmacology and the life sciences.  What I found interesting were the big reasons for researchers in the life sciences to prefer Matlab to a tool like Python.  A company can pay a lot of money for something like Matlab and feel like it is getting guarantees and value for money; Python is free, which can be worrying as far as quality and assurance go.  Matlab also has a huge third-party library ecosystem for signal processing, hardware-specific hook-ups to lab equipment, and synchronising large experimental trials.  Finally, for a scientist, Matlab’s syntax is closer to the language of mathematics and makes it easier to express problems in terms of linear algebra (although there is interesting work in this direction with SymPy).  It’s also worth noting that Julia got a shout-out for its performance in scientific computing, though it’s still a young and growing language.
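To illustrate the syntax gap (a toy example, assuming NumPy is available — not from the talk): solving a linear system is `x = A \ b` in Matlab, while Python makes the operation an explicit, named function call.

```python
import numpy as np

# Matlab: x = A \ b   (solve the linear system A x = b)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Python/NumPy: the same solve, spelled out as a function call
x = np.linalg.solve(A, b)

# Verify the solution actually satisfies A x = b
assert np.allclose(A.dot(x), b)
print(x)  # [2. 3.]
```

Neither form is hard, but the Matlab spelling reads more like the maths on a whiteboard, which is part of its appeal to scientists.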

Jacqui from Flying Binary spoke about their “Mapping the Future” visualisation of tweets by data journalists.  She pointed out that if Gen Y is visual, then Gen Z (people born since 1993) is kinaesthetic, and that generation will absolutely expect interactivity.  Think of toddlers weaned on iPads trying to interact with a paper magazine.

Another Twitter-related talk came from Giles, who compared MCL clustering with Louvain clustering (à la Gephi) on a two-hop social network.  He found that MCL performed better, and that visual verification of large graphs in Gephi is very difficult.   He’s a big user of Neo4j (with Py2neo), as we are with our network modelling here at Sandtable.
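For anyone unfamiliar with the “two-hop” construction: it is the set of a user’s connections plus their connections’ connections, and the clustering then runs on the subgraph those nodes induce. A minimal stdlib sketch (not Giles’s actual pipeline) of extracting that node set from an adjacency mapping:

```python
def two_hop_network(adjacency, seed):
    """Return the set of nodes within two hops of `seed`.

    `adjacency` maps each node to the set of accounts it follows.
    """
    one_hop = set(adjacency.get(seed, set()))
    two_hop = set()
    for node in one_hop:
        two_hop |= set(adjacency.get(node, set()))
    # Seed plus both rings; MCL or Louvain clustering would then
    # run on the subgraph induced by these nodes.
    return {seed} | one_hop | two_hop

# Hypothetical follow graph for illustration.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"alice"},
    "dave": {"erin"},
}
print(sorted(two_hop_network(follows, "alice")))
# ['alice', 'bob', 'carol', 'dave']   (erin is three hops away)
```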

Ben from Goldsmiths is geeking out on scraping wordy beer reviews and, with some NLTK, mining out the key descriptors of beers.  What I found great was that he skipped the quantitative ratings that users gave a beer and went straight for the text, creating keyword clouds for each beer.
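The core of that idea — tokenise the review text, drop stopwords, count what’s left — can be sketched with the standard library alone (the talk used NLTK; the tiny stopword list here just stands in for NLTK’s corpus):

```python
import re
from collections import Counter

# A toy stopword list standing in for NLTK's stopwords corpus.
STOPWORDS = {"the", "a", "an", "this", "is", "with", "and", "of", "it"}

def beer_descriptors(reviews, top_n=3):
    """Count the most frequent non-stopword tokens across reviews."""
    counts = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z]+", review.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)

# Made-up reviews for illustration.
reviews = [
    "A hoppy ale with a citrus nose and a dry finish.",
    "Citrus and pine, hoppy but balanced.",
    "Dry, hoppy, faint citrus.",
]
print(beer_descriptors(reviews))
# [('hoppy', 3), ('citrus', 3), ('dry', 2)]
```

The resulting counts are exactly what feeds a keyword cloud: word size proportional to frequency.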

I also gave a lightning talk about my Kinect Stone project – using a Kinect to build a machine-collaborative carving system.  With OpenKinect it’s a snap to get your Kinect data out into numpy arrays.
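A minimal sketch of what that looks like, assuming OpenKinect’s `freenect` Python bindings are installed (the actual depth grab is shown commented out, with a synthetic frame substituted so the numpy step runs without hardware):

```python
import numpy as np

# With OpenKinect's Python bindings, a depth frame arrives as a
# 480x640 numpy array of raw 11-bit depth values:
#
#   import freenect
#   depth, _timestamp = freenect.sync_get_depth()
#
# Synthetic frame in the same shape and value range, so the
# processing below runs without a Kinect attached.
depth = np.random.randint(0, 2048, size=(480, 640))

# Boolean mask of "near" pixels; 600 is a made-up raw-depth
# threshold for illustration, not a calibrated distance.
near = depth < 600
print(near.shape, near.sum() / near.size)
```

From there it’s ordinary numpy: thresholding, differencing frames, or feeding the array into whatever drives the carving system.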

Finally, notable shout-outs to the OpenCorporates project and the PyLadies London chapter.  The PyData (and Data Science in general) community is only growing in London.  Great work by the organisers of this PyData meetup – Ian, Cecilia, and Emlyn!

AWS Summit London 2014

On April 30th a few of us joined around 2,000 other attendees at the 2014 Amazon Web Services Summit at the ExCeL Centre, here in London.

At Sandtable, we have been using AWS for over two years. We primarily use it as a platform for running our simulations. However, as we grow, we are pushing more of our Data Science workflows into the Cloud. Our experience of AWS until now has been very good.

The Summit was a sponsored, full day of talks and sessions on AWS and its ecosystem. In the morning, we were welcomed by Iain Gavin, Director of AWS UK, followed by a keynote speech by the VP of AWS, Stephen E. Schmidt. Schmidt gave a full round up of the state of AWS. AWS continues to improve and innovate, recently introducing new services like Workspaces and Kinesis.

Rapid innovation

Innovation, innovation, innovation…

We also heard from several users of AWS – Just Eat, SwiftKey, and Channel 4 – about their migrations to AWS and the dramatic cost savings since.

As users of AWS, we all benefit from sharing resources with other AWS customers, in particular demanding customers such as Netflix, who push Amazon to improve and update the infrastructure as their business needs shift and scale.

The clear theme of the talks was that AWS has been transformational for many, and continues to be, reducing costs and enabling continual innovation.

At Sandtable, as a small company, we have found AWS to be critical to our development, allowing us to develop and experiment cheaply. AWS allows us to easily provision the infrastructure required to run our simulations and execute Data Science workflows, and then shut down this infrastructure when we’re done. We can be very agile with our resources.
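That provision-run-teardown loop is a few API calls. A hedged sketch using boto3-style parameters (the AMI id and instance type are placeholders, not our real setup, and the actual AWS calls are shown commented since they need credentials):

```python
def simulation_fleet_spec(n_workers, instance_type="c3.xlarge"):
    """Build launch parameters for a batch of simulation workers.

    ImageId and instance_type are hypothetical placeholders.
    """
    return {
        "ImageId": "ami-00000000",   # placeholder AMI id
        "InstanceType": instance_type,
        "MinCount": n_workers,
        "MaxCount": n_workers,
    }

spec = simulation_fleet_spec(4)
print(spec["InstanceType"], spec["MaxCount"])  # c3.xlarge 4

# With boto3 installed and credentials configured, this would be:
#   import boto3
#   ec2 = boto3.client("ec2")
#   reservation = ec2.run_instances(**spec)
#   ...run the simulations...
#   ec2.terminate_instances(InstanceIds=[...])  # and pay nothing more
```

Paying only while the fleet is up is what makes the agility cheap.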

After lunch, we attended a number of breakout sessions. The sessions were billed at four levels of expertise (from introductory to expert), covering a wide range of topics.

The first session was ‘Uses & Best Practices for Amazon Redshift’, given by Ian Meyers. Redshift is AWS’s petabyte-scale Data Warehousing solution. It is a relational data warehouse that complements the likes of Hadoop (which deals well with semi-structured data). Redshift has some interesting features, like columnar storage and data compression, as well as a massively parallel architecture.
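A quick intuition for why columnar storage compresses so well: a column holds homogeneous, often repetitive values, so even naive run-length encoding collapses it dramatically (a toy illustration only — Redshift’s actual encodings are more sophisticated):

```python
from itertools import groupby

# A row store interleaves heterogeneous fields; a column store keeps
# each field contiguous, e.g. a sorted "country" column:
country_column = ["DE"] * 4 + ["FR"] * 3 + ["UK"] * 5

# Run-length encode the contiguous column: (value, run length) pairs.
rle = [(value, len(list(run))) for value, run in groupby(country_column)]
print(rle)  # [('DE', 4), ('FR', 3), ('UK', 5)]
```

Twelve stored values become three pairs, and analytic queries that touch only a few columns read correspondingly less data.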

We’re already using Redshift to store and serve our project data, and we’re finding its query latency very impressive. It’s also very easy to use.

We also heard a dramatic user story from Adrian Wiles, Enterprise Data Architect at the Financial Times, about the FT’s migration of their DW to AWS, and in particular Redshift. They’ve seen huge reductions in cost over their previous DW solution, and dramatic improvements in performance – in one case reducing query time from days to minutes. They now feed data in near real-time to dashboards where their journalists can see how their online articles are faring.

The next talk was ‘Dynamic Content Optimisation: Lightning Fast Web Apps’ by Glyn Smith, with a user story from Matt Painter, CTO at import.io. The premise of the talk was that by combining AWS services – CloudFront (its CDN) and Route 53 (its DNS service) – it’s possible to reduce the latency of dynamic, not just static, content in web apps. Of the three talks we attended, this was the least relevant to us; nevertheless, we were still impressed. Once again, AWS showed off the breadth and depth of their services.

The final talk was an expert-level session, ‘Amazon Elastic MapReduce (EMR) Deep Dive and Best Practices’, given, again, by Ian Meyers. EMR is Amazon’s Hadoop-as-a-service offering. Ian gave a brief overview of EMR before diving into planning for cost, EMR design patterns, and EMR best practices. The talk was packed with useful tips, not just about EMR but about ways of working in AWS generally.

Lastly, John Telford from Channel 4 gave an informative, and entertaining, talk on C4’s use of EMR. C4’s Data Scientists use EMR to derive insights about viewers’ watching interests, and to make recommendations about other shows they might like to watch.

In summary, we’re glad we attended the Summit. AWS continues to innovate at an impressive pace. Our use of AWS is relatively small but growing steadily, and we look forward to exploring more of its features.

Thanks to Amazon and all the sponsors for the day out.