I'm joined by Wes McKinney (@wesmckinn) and Hadley Wickham (@hadleywickham) on this episode to discuss their joint project Feather. Feather is a file format for storing data frames along with some metadata, to help with interoperability between languages. At the time of recording, libraries are available for R and Python, making it easy for data scientists working in these languages to quickly and effectively share datasets and collaborate.
Data Skeptic
Feather
(upbeat music) - Data Skeptic features interviews with experts on topics related to data science, all through the eye of scientific skepticism. (upbeat music) - Wes McKinney is a technologist and tool developer. He is the author of Python for Data Analysis and the creator of the pandas library in Python. He's presently a software engineer at Cloudera. Hadley Wickham is chief scientist at RStudio and an adjunct professor of statistics at Rice University. He is well known for writing a number of key R packages, lovingly referred to collectively as the Hadleyverse. He is also the author of several books on R. Wes and Hadley, thank you guys so much for coming on the show. - Thank you. - Thanks for having us. - I brought you guys on to talk primarily about Feather today. There are actually so many topics I could talk to each of you individually about, but I got really excited when I saw this project come online. So maybe to get started, could you give us the high-level description of what Feather is? - So Feather is a file format for storing data frames on disk that is interoperable between Python and R. Hadley and I got together and were talking about some of the work that's going on in the Python and R ecosystems, and also in the Apache software ecosystem with the Apache Arrow project. And we thought it would be interesting if we had a really efficient way to store data frames on disk that could be read in an identical, or at least interoperable, way between Python and R, which would enable easier sharing of data. But within each programming language, you would also have a super simple way to store tables on disk in binary that can be read and written very fast. - In some ways I see Feather as like CSV version 2.0: it's much faster to read and write, and it's typed, so you don't lose the fact that a column is a date or a date-time. Combined with the fact that it's a binary file format, it's just so much faster to read and write. - So speaking strictly from the end user's perspective, and I mean the end-user analyst or programmer working in R or Python, how do R's data frames differ from pandas data frames? - They're similar in a lot of ways. Pandas data frames can function in a mode that's nearly the same as R's data frames. They have a similar set of data types: strings, floating-point numbers, integers, Boolean values, and categories, or factors in R parlance. There are a couple of additional layers of complexity in pandas having to do with row and column labeling. There's a mechanism called hierarchical indexing which enables you to have multi-dimensional data represented in a flat two-dimensional data frame, and that's something that is less common in R. There are some small differences between the data types themselves and the spectrum of data types you can represent in R and Python data frames, but by and large users use them in a semantically equivalent way in a lot of applications. - Yeah, it's very similar. They're both very similar to the idea of a table in a database, where you've got basically a rectangular structure where each column can be a different type, but each row has the same structure.
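To make the "CSV 2.0" point concrete, here is a minimal sketch of the Feather round trip in Python, using the read_dataframe and write_dataframe functions from the feather-format package that was current at the time of recording; the file name and data are illustrative.

```python
import pandas as pd
import feather  # pip install feather-format

# Build a small frame with types a CSV round trip would not preserve.
df = pd.DataFrame({
    "city": pd.Categorical(["NYC", "LA", "NYC"]),   # factor/category column
    "when": pd.to_datetime(["2016-03-01", "2016-03-02",
                            "2016-03-03"]),         # datetime column
    "temp": [54.2, 71.0, 49.5],                     # float column
})

feather.write_dataframe(df, "weather.feather")
restored = feather.read_dataframe("weather.feather")
print(restored.dtypes)  # category, datetime64[ns], float64 -- types survive
```

On the R side, the same file would be read with feather::read_feather("weather.feather"), with the category column arriving as a factor.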
- Yeah, I was an R user myself first and later began working in Python as well, and I found the transition to using pandas really easy; it was very familiar and intuitive as a user. But I've always been curious from more of the language perspective: do the two languages differ in the way they approach data frames? Do they differ very much under the hood? - I think one difference is that in R, data frames are fundamentally baked into the base language; they're not implemented in an add-on package. And Wes can probably speak to this better than I can, but that means there's sort of one conception of a data frame that everything uses, whereas in Python I think there might be a few different ways of thinking about it. - Yeah, I would say that the implementation of data frames in pandas is somewhat opinionated, given that there wasn't a community effort to design a data frame that would work for the whole language and for the whole ecosystem. Pandas uses NumPy, Python's scientific array-computing library, to create a data frame-like interface and a set of tools for manipulating time series and tabular data. Under the hood, there are definitely implementation differences. I would say that R's internal implementation is much simpler, but in a good way. We have actually been in discussions in the pandas community about simplifying the internals of pandas data frames to be a lot more straightforward and less complex. Some of that complexity has historical legacy issues attached to it that I'm not going to go into in this interview, but over time the internals of pandas data frames have grown very complex. - So I've been on teams where we have work going on in parallel efforts, different groups using different programming languages, and I would guess that teams have had interoperability problems since the second compiler was ever written. Feather provides interoperability, but not in the classic code-exchange sense like Cython or Rcpp does, where it's language-level interoperability. How does Feather differ in this respect? - I think we are increasingly seeing teams who are using both R and Python at a minimum, if not other tools like Java and Scala and that whole ecosystem. Exchanging CSVs is kind of fine, but whenever you load that CSV, you've got to remember to specify the types of each column correctly, and it's quite easy to accidentally mistype one of the columns. To me, the advantage of Feather is that because all of that additional metadata is stored in the file, it's much, much lower risk to share data with other people. And that's getting more and more important as more and more teams use multiple products in a single data-analysis pipeline. - One thing I will point out is that the implementations of Feather in R and in Python share a core C++ library. One of the goals was to think about sharing more code between the R environment and the Python environment, and as you mentioned, projects like Cython and Rcpp make it easy to take C or C++ libraries and build Python or R user interfaces for those libraries. So I describe it to people as being like the submarine and the periscope: you have the user interface layer, which is what the user sees, but there might be some larger apparatus under the hood, and that can be shared amongst multiple programming languages.
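As a small aside illustrating Wes's point earlier in this exchange that pandas builds its data frame on top of NumPy, here is a minimal sketch, not specific to Feather:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

# Each pandas column is backed by a NumPy array under the hood.
print(type(df["x"].values))   # <class 'numpy.ndarray'>
print(df["x"].values.dtype)   # int64
print(df["y"].values.dtype)   # float64
```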
- Does that common C++ core help make extensibility easier for future languages? - It does. For example, the Julia community has put some effort into using the Feather core library to read and write Feather files from Julia; rather than having to create a Feather implementation from scratch, that makes things quite a lot easier. We can add new data types and additional metadata to the Feather format by modifying one code base instead of N, where N is the number of C++-based implementations of Feather, and that's quite nice. - Yeah, I see a huge win with that metadata layer. I've been on that path of exchanging CSV files as sort of the contract between languages. Hadley, as you said, it works okay, but it has a number of drawbacks that I think Feather can address. One way I've gotten around it in the past is to say, well, let's just share a database as our common interface: one team can write to the tables and we can read from them. How does Feather compare to a solution like that? - I'd say Feather is lower overhead, which is both a positive and a negative. You don't need to be running a database, and you can slurp data out of Feather into memory much, much more efficiently than you can from a database. I would hope that in the long term we might also see databases adopting Feather as a way to get data out of and into a database, particularly for databases that are used primarily for data analysis. - Feather is built on top of Apache Arrow. Can you tell me a little bit about what Arrow is? - Arrow is a new top-level Apache project. It is primarily a specification rather than a piece of software, and the goal is to have a common or shared memory representation for columnar data, so effectively for tabular data, with two goals. The first is to have an in-memory representation that is very efficient to process in memory, but that can also be moved between systems very fast, so you can send data from one system to another with very low overhead. That contrasts with traditional serialization tools, in which quite a bit of conversion has to be performed on both sides when exchanging data from one system to another. You can think of CSV files as being a crude serialization format to text. The reason Arrow helps Feather is that we are using the Arrow representation and putting those bytes directly on disk rather than converting to some other representation. - So then, regarding possible extensions: I believe at the moment there's no compression going on in Feather. Is that something you'd want to implement as part of Feather, or would an Apache Arrow developer provide that service that you would just inherit? - We've talked about putting compression within Feather. Currently we're writing unmodified Arrow memory to disk, but it would be straightforward to layer different encoding or compression schemes on top of the data. We've avoided that because, at the moment, the performance of the library is good enough and we don't feel a need to make the implementation a lot more complex, given that the project is best suited for ephemeral storage: basically, putting data frames on a shelf or moving data frames between R and Python as part of a workflow.
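Returning to the database comparison a few exchanges back, here is a rough, hypothetical benchmark sketch contrasting reading a table from SQLite with reading the same table from a Feather file. The file names are made up and the timings will vary widely by machine; it is only meant to illustrate the "lower overhead" point.

```python
import sqlite3
import time

import numpy as np
import pandas as pd
import feather  # feather-format package

# A million-row, two-column frame of random floats.
df = pd.DataFrame({"x": np.random.randn(10**6),
                   "y": np.random.randn(10**6)})

conn = sqlite3.connect("demo.db")
df.to_sql("demo", conn, index=False, if_exists="replace")
feather.write_dataframe(df, "demo.feather")

t0 = time.time()
pd.read_sql("SELECT * FROM demo", conn)
print("sqlite read:  %.2fs" % (time.time() - t0))
conn.close()

t0 = time.time()
feather.read_dataframe("demo.feather")
print("feather read: %.2fs" % (time.time() - t0))
```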
- How does Feather compare to something like the protobuf format? - Well, Google's protocol buffers project has a couple of pieces. It has an interface definition language, which enables you to describe data structures and messages in a generic way, and then it has a compiler that generates language bindings for creating the data structures you described in the protobuf interface language, and those structures can then be serialized and sent across the wire. They are similar in the sense that they both provide a way of representing data in a form that can be read and written by multiple programming languages, but protobuf is designed as a general serialization system; it isn't intended for fast reading and writing of large quantities of data. It's generally used as an efficient messaging layer between distributed systems, typically for smaller amounts of data being exchanged over the wire. - You'd mentioned speed as being a consideration. What sort of speed might I expect under read and write conditions in the current implementations of Feather? - We've done a little benchmarking that suggests you can read about 600 megabytes a second. I think writing is a little slower, if I remember correctly. - The goal ultimately is to perform at disk speed, so that you're not being throttled by some heavy computation involved in reading and writing the files; it's really just about moving memory to and from disk. There are some optimizations that we've left on the table, just out of time; parts of the write path could be made much faster, but eventually you're going to hit a wall at the disk, just purely the disk throughput. So with many file formats, the goal is to eventually saturate your I/O bandwidth: how fast can your computer's solid-state drive put bytes on disk or read bytes from disk? - As a user of Feather, do I have to worry about things like whether the data is big-endian or little-endian, Unicode, UTF-8, all these sorts of formatting issues? What concerns do I have if I adopt this for sharing data across a team? - Our goal is that there should be no concerns about that; we've adopted reasonable defaults: all strings are UTF-8, all data is little-endian, and all times are in UTC, just so that there aren't any of those common sources of frustration when sharing data. - Yeah, that's a huge win right there. I cannot tell you how much I'd like to not have those concerns anymore. - And I think the goal with the read and write functions for Feather is that when you read, you just give it a path, and when you write, you give it a data frame and a path. There are basically no options you can tweak; particularly when reading, it's always going to do the sensible thing. I think any data import function should basically be as unexciting and uninteresting as possible, because all you want to do is get your data out of whatever format it's currently stored in and into a data frame in memory.
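As a rough way to check the throughput figures mentioned above on your own hardware, here is a hypothetical timing sketch; the file name is made up, and results depend heavily on your disk and OS caching.

```python
import os
import time

import numpy as np
import pandas as pd
import feather

# Roughly 80 MB of float64 data: 1M rows x 10 columns x 8 bytes.
df = pd.DataFrame(np.random.randn(10**6, 10),
                  columns=["c%d" % i for i in range(10)])
feather.write_dataframe(df, "big.feather")

size_mb = os.path.getsize("big.feather") / 1e6
t0 = time.time()
feather.read_dataframe("big.feather")
elapsed = time.time() - t0
print("read %.0f MB in %.2fs -> %.0f MB/s" % (size_mb, elapsed, size_mb / elapsed))
```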
- What about typecasting? Different languages represent things in slightly different ways. In particular, I recall a talk Wes gave a while back that, to sum it up, walked through the steps it took to get datetime handling the way it currently is in pandas, and all the dependencies involved. Dates and times are surprisingly problematic to handle correctly. Does Feather help me with that? - Yeah, I think we've adopted what Wes and I believe are the common interchange formats. There are some details involved; we had to make some sacrifices in picking things that we can store on disk quickly but that might require some conversion when reading or writing, just because Python and R store different types of data differently. But the goal is that as an end user you shouldn't have to care about that: the Feather library for your language will convert, as best as possible, whatever data is stored in the Feather file into the most likely native type that you want. I think the only area where there's a little challenge currently is that R, for example, doesn't have any built-in support for 64-bit integers. So if you write out a 64-bit integer in Python, there's simply no way to load that into a 64-bit integer in R, because that data type doesn't exist. I think that's the only thing that's likely to cause any problems in practice. - How does Feather handle that use case today, if I try to load that big int into R? - It just converts it to a double, with a warning. - Yeah, that seems reasonable. - It's basically the best you can do. - So I know Feather currently works with R and Python, as we've talked about, but I see from your GitHub issues that there are several other languages under consideration, like Julia and JavaScript. What does the roadmap look like? - I guess we don't really have a roadmap; we hope that people using other programming languages will contribute other backends. I think the Julia one is pretty advanced. - I think I heard someone talking about a Go backend, and I think someone's working on a JavaScript backend, which is a bit more challenging because you can't easily take advantage of the C++. To me, Julia and JavaScript would be the two big ones, in my opinion, and maybe a Java one for the whole Hadoop ecosystem. - Feather has some really obvious use cases for helping collaboration; that was at least what struck me initially when I started looking at the project. Do you think there are also other potential use cases, like perhaps in an ETL pipeline? - It could be very useful when you're dealing with a step in an ETL pipeline where you're processing a large amount of memory and you need a fast way to push data to disk. This is the ephemeral storage use case: you aren't ready to do anything else with the data, but you need to move it out of memory. I think Feather is a very good tool for that use case.
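A minimal sketch of that ephemeral-storage pattern: spill an intermediate frame to disk mid-pipeline to free memory, then pick it up in a later step. The stage names, file paths, and extract step here are all hypothetical.

```python
import pandas as pd
import feather

def extract():
    # Stand-in for a real extraction step (database pull, CSV parse, etc.).
    return pd.DataFrame({"user": ["a", "b", "a"], "clicks": [3, 5, 2]})

# Stage 1: produce an intermediate result and park it on disk.
intermediate = extract().groupby("user", as_index=False).sum()
feather.write_dataframe(intermediate, "stage1.feather")
del intermediate  # free the memory; the bytes now live on disk

# Stage 2 (possibly a separate process, or an R script): reload and continue.
resumed = feather.read_dataframe("stage1.feather")
print(resumed)
```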
- I recall reading somewhere in the documentation that the Feather format is not meant for long-term storage, and I think the primary reason is that future enhancements might change the format in some way. With that in mind, I was curious to get your thoughts on something: suppose I, as an API creator, considered offering Feather as one of the possible response formats, alongside traditional things like JSON or XML output. If a developer did that, is it something you would endorse, and do you see any advantages in doing it? - It seems like a pretty big advantage to me; if you're sending data frames, you get a pretty efficient way of doing it. I think the main problem currently is that because Feather doesn't support compression, you'd probably want to compress the stream externally. But yeah, I think that seems reasonable. One thing we will need to do with Feather at some point, if this is going to be a long-term storage format that we want lots of people to use, is to version the format and think about backward compatibility and all that kind of stuff. - I can see many use cases for Feather in both academia and in industry. I'm curious whether you've seen adoption moving faster on either of those fronts, or any interesting use cases you weren't expecting? - Not yet. Feather seems like a sort of sleeper package, in the sense that the way I normally tell people are using my packages is when they complain about bugs or want to do something they can't, and Feather is just so simple and so well-scoped that it feels like we're not going to get a lot of feedback of that kind. That said, we've had quite a few GitHub issues from people using it on various different platforms, and I think that speaks to the wide interest in it as a file format. - Yeah, definitely. And this last question might be a bit out of scope for what Feather intends to be, but do you have any advice for data scientists or engineers who want to practice good data provenance procedures in conjunction with Feather? Meaning, are there any best practices for communicating the original source of a dataset or any transformations that were done to it, anything that might make someone question the validity of their results when they're simply getting data passed over? - To me, the key with any question of data provenance is making sure that not only can you reproduce the data reliably with a script, but that the script has been rewritten or refactored so you can actually understand what it does. I think that's just so vital. For anything important, you should be trying to write a script that basically anyone can read and follow, so that they can understand not only what you did but why you did it, and have some clue whether you made the right design decisions. - Are there any opportunities for people to contribute? We've talked about extending to other languages, and we've mentioned the GitHub issues list. Are there any particular features you're hoping some open-source contributors might come along and take an interest in? - I think working on additional language bindings is definitely an area where people can contribute. The functions for interacting with Feather files right now are also quite rudimentary, so there are plenty of additional tools and interfaces that could be created. For example, reading smaller sections of files: R already has arguments for reading a subset of columns from a Feather file, and that's something that needs to be added to the Python API. Internally, there's plenty of work that can be done on performance optimization, or proposing compression schemes for data that is compressible.
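On the earlier point about compressing the stream externally, since Feather itself doesn't compress: here is a minimal sketch using gzip from the Python standard library. The file names are illustrative.

```python
import gzip
import shutil

import pandas as pd
import feather

df = pd.DataFrame({"x": range(100000)})
feather.write_dataframe(df, "payload.feather")

# Compress the Feather file for transport (e.g., as an API response body).
with open("payload.feather", "rb") as src, \
        gzip.open("payload.feather.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Receiver side: decompress back to a plain Feather file before reading.
with gzip.open("payload.feather.gz", "rb") as src, \
        open("received.feather", "wb") as dst:
    shutil.copyfileobj(src, dst)
restored = feather.read_dataframe("received.feather")
```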
Somebody raised an issue about writing a Feather file in multiple chunks. Right now it's kind of a one-shot affair, whereas if you had some process where you're building a very large Feather file from, say, a directory of CSV files, where each of the CSV files might be several hundred megabytes, it might be nice to be able to build that very large Feather file incrementally, and that's something I would be happy to see people contributing to. - Yeah, that'd be very cool, especially when you're dealing with a really large dataset. So if the listeners want to follow you guys online, what's the best place? - For me, I think Twitter is probably best: just @hadleywickham on Twitter. - Same here, I'm @wesmckinn, so my name, but without the EY. - Cool, I'll be sure to have both of those in the show notes. Well, thank you guys so much. This was really great. Glad you came on to chat about this. I hope the listeners take an interest, maybe become contributors, and certainly integrate it where it makes sense for their projects. - Yeah, thanks for having us. - Perfect, thanks a lot. - Before we sign off, a quick announcement: Data Skeptic is looking for two interns for a part-time remote position this summer. If you're a current student in a field related to data science, check out all the details at dataskeptic.com/interns. You'll work directly with me, and the output of that internship will be turned into a future mini-episode here on Data Skeptic. So if you're interested, please check out the details. Again, it's dataskeptic.com/interns. Thanks again to Hadley and Wes for coming on the show to tell us about Feather. And until next time, I want to remind everyone to keep thinking skeptically of and with data. - For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. (upbeat music) (bouncy music)