Europandas summit 2019

Last weekend we attended the first Europandas summit in London, hosted by Oakam. The schedule included an evening of talks on pandas and related topics as well as two full hack days on Saturday and Sunday. So on Friday the pandas enthusiasts assembled.

Image from ecotravellerguide.com

The event was kicked off by its organiser, Marc Garcia, one of the core contributors to the pandas library. Pandas is the most popular Python library for analysing small to medium-sized datasets. At Sandtable it is one of the open source libraries we use daily, and the same likely goes for most data science companies that use Python. How well pandas handles different dataset sizes was also one of the main topics of the talks and discussions that followed. The schedule included talks by Python and pandas core contributors as well as by developers of related packages for analysing larger datasets, such as vaex.

Pablo Galindo gave a great overview of how Python handles memory allocation and management, a topic that primed everyone well for one of the central themes of the evening: the problem of how to analyse big data. One disadvantage of pandas, which prevents it from being able to analyse and explore any kind of dataset, is that it needs all the data to be loaded into memory. This naturally limits the dataset size that can be easily analysed to the memory of the machine that runs pandas. There are several strategies for dealing with this, most of which require the user to make decisions, such as loading and processing the data iteratively or subsampling it. However, in the long term the aim is to make the proper pandas-like experience available even for datasets that do not fit into memory. Given its long-term importance, several of the talks that followed addressed this topic in more detail.
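
To make the "load and process iteratively" strategy a bit more concrete, here is a minimal sketch of our own (it was not shown in the talk). It uses the chunksize parameter of pandas.read_csv to compute a mean without holding the whole dataset in memory. The tiny in-memory CSV and the column name "value" are made up for illustration and stand in for a file that is too large to load at once.

```python
import io

import pandas as pd

# Hypothetical data: a small in-memory CSV standing in for a very large file.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
count = 0

# With chunksize set, read_csv returns an iterator of DataFrames
# (here at most 4 rows each) instead of one big DataFrame.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()
    count += len(chunk)

# Combine the per-chunk partial results into the overall mean.
mean_value = total / count
print(mean_value)
```

This works nicely for statistics that can be assembled from partial results (sums, counts, minima, maxima), but it is also a good illustration of the decisions the user is left with: anything that needs to see all rows at once, such as a median or a sort, takes more thought.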

Keeping true to the pandas-style API that so many data scientists are already used to, Maarten Breddels presented his project vaex, which allows analysis of big datasets at around 1 billion rows per second (although not on every laptop, as the live demo showed). Vaex leverages lazy evaluation: it essentially saves the expressions for transformation operations on dataframes, and these only get evaluated once there is an assignment or some property (such as the mean of a column) is calculated. He also braved the challenge of a live demo, showing the performance of vaex on the NYC taxi dataset as well as some extremely neat IPython map widgets that let you view histogram-level data calculated on the fly as you change the map view in Jupyter.
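
The idea of lazy evaluation can be illustrated with a few lines of plain Python. To be clear, this is a toy sketch of the general technique and not how vaex is implemented; the LazyColumn class and its methods are hypothetical names we made up. A transformation only records what should happen, and the work is done when a property such as the mean is requested.

```python
class LazyColumn:
    """Toy lazy column (hypothetical, not the vaex API): stores a recipe, not results."""

    def __init__(self, producer):
        # producer is a zero-argument callable returning an iterator of values
        self._producer = producer

    @classmethod
    def from_values(cls, values):
        return cls(lambda: iter(values))

    def map(self, func):
        # Only records the transformation; nothing is computed here.
        return LazyColumn(lambda: (func(v) for v in self._producer()))

    def mean(self):
        # Asking for a property triggers the evaluation, one value at a time.
        total = 0.0
        count = 0
        for v in self._producer():
            total += v
            count += 1
        return total / count


calls = []


def double(v):
    calls.append(v)  # record each call so we can see when work happens
    return v * 2


col = LazyColumn.from_values([1, 2, 3, 4])
doubled = col.map(double)    # no call to double() yet
calls_before = len(calls)
result = doubled.mean()      # now double() runs once per value
print(calls_before, result, len(calls))
```

Because values are streamed through the recorded expression one at a time, no intermediate copy of the transformed column is ever materialised, which is the property that matters once datasets get large.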

View of all pickup locations in NYC using Vaex.

Still on the theme of improving performance when analysing big datasets, Sylvain Corlay and Johan Mabille presented their work on xtensor and xframe as well as a nice preview of their C++ kernel Xeus for Jupyter. Xtensor is a C++ library for numerical analysis on n-dimensional array objects that supports lazy broadcasting, among other things. Xframe, on the other hand, provides DataFrame support for C++ and was just released as an early developer preview. Sylvain is also a board member of the NumFOCUS foundation and one of the founders of QuantStack, the company behind xtensor and several other related products. Both did a great job of showcasing how you can use the xtensor API to do pretty much anything that NumPy does directly in C++, with a NumPy-like API on top.

Antoine Pitrou finished up this part of the talks with an overview of Apache Arrow and started a discussion on its capabilities and potential integration with pandas or a pandas-like API. Joris van den Bossche (another pandas core dev) brought things back into the realm of current pandas by talking about a relatively new addition to pandas: the extension arrays that he and Tom Augspurger worked on. These allow new data types provided by other libraries, such as geopandas and cyberpandas, to be integrated into pandas with incredible ease, and are sure to broaden the already great support for heterogeneous data types in pandas. Pietro Battiston gave a taste of potential work to be done over the weekend by talking about various ways the pandas indexing code could be restructured for improved performance.
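
As a small illustration of what an extension array looks like from the user's side, here is a sketch that uses the nullable integer dtype "Int64" that ships with pandas itself from version 0.24 onwards. We picked it only because it needs no extra packages; it is our own example rather than something from the talk, and third-party types such as those from geopandas or cyberpandas plug in through the same interface.

```python
import pandas as pd
from pandas.api.extensions import ExtensionArray

# "Int64" (capital I) is the nullable integer extension dtype, so the
# missing value does not force the column to become float.
s = pd.Series([1, None, 3], dtype="Int64")

# The values backing the Series are stored in an ExtensionArray subclass.
is_extension = isinstance(s.array, ExtensionArray)
dtype_name = str(s.dtype)

# Regular pandas operations work as usual; missing values are skipped.
total = s.sum()
missing = int(s.isna().sum())
print(dtype_name, is_extension, total, missing)
```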

The evening finished with a corporate user panel (including: Maren Eckhoff, QuantumBlack; Cecilia Liao, dunnhumby; Stephen Simmons, Oakam; Sylvain Corlay, NumFOCUS) moderated by Alexander Hendorf, who is involved in the community and helps organise many Python/PyData-related events in Europe (PyData Germany, EuroPython and about 10 more that I did not write down). It was a great discussion covering the biggest pain points of pandas (data size, big API, inconsistent data types when loading from databases), but overall everyone agreed that they would not want to imagine working without pandas. There was also a great discussion around how corporations can give back to the open source community that builds the tools on which immense profits are made. Ideas included further sponsoring of events, donations to projects, bounties for new features or bug fixes, and even hiring freelancers for some months to work on individual topics.

Corporate panel discussion. Thanks to Marc.

The remainder of the weekend consisted of working on the pandas library itself. A lot of the work, led by Marc Garcia, was dedicated to improving the documentation and bringing it to the next level. Other work went into fixing some of the bugs and issues that came with the newest release of pandas (0.24). On Sunday there was also a group of new contributors who managed to submit their first PR during the day, which was a great achievement. Overall it was a fantastic experience and we are looking forward to more of these events. If you are also interested in contributing to pandas, there will be a Python Sprints Meetup at the end of February, organised by Florian and Marc, that will continue on the path of this weekend and work on further improving the pandas library.
