Train/test/validation splits of large CSV files

September 28, 2018

I recently found time to take Part 1 of the Deep Learning for Coders course by fast.ai. One of the most interesting parts to me was the section on using neural networks to model structured data, using embeddings to represent categorical variables. (I’ll try to consolidate my understanding of that before discussing it in another blog post).

At my current workplace we create some pretty huge (300000 rows x 200000 columns) but fairly sparse datasets (maybe 10% coverage) each week in which the vast majority of the columns are categorical variables. We’d really like to try to ‘fill in’ this sparseness using some machine learning, but as yet it’s proven pretty tricky.

If all the columns measured the same thing then this could be a good situation for collaborative filtering, but that’s not the case here: we have columns like ‘gender’, ‘number of cars’, ‘rating of Taylor Swift from 1-5’, ‘TV shows watched in the last 30 days’ - the list is pretty enormous, and the columns are very much heterogeneous.

So I was planning on creating a fancy neural net using embeddings for these categorical columns and seeing where it took me. The first problem is, this dataset is far too big to read into memory - it’s about 1.2GB compressed, and far larger uncompressed. Even reading the first 100 rows into a Pandas DataFrame takes almost 2 minutes and over a GB of memory! Of course, I’ll start with a sample of data, but even doing that ‘fairly’ looks like it could be hard since I can’t get it into memory. There’s certainly no way I’ll be able to create a train/test split of the full dataset. It seems I’ve fallen at the first hurdle…

This irked me. It shouldn’t be so hard to create a train/test/validation split! If only there was a way to do it on the command line…

Of course, there are some ways to do this using existing Unix tools. split, for example, is perfectly capable of splitting up data from stdin into multiple files and comes with various useful options. Using split would create equal-sized splits though, and they wouldn’t be shuffled. It also wouldn’t cope well with the header (we’d ideally have that in all of the output files). xsv’s sample subcommand samples N rows from a file, but (as far as I can tell) it’s not reproducible; we’d need to iterate over the data multiple times to get multiple samples, and there’s no way to guarantee future samples don’t contain the same rows!

Maybe there’s a gap in the market.

There are some other features I think would be useful in a new tool like this:

  • Ability to specify the proportion of the file OR number of rows going into each split
  • Reproducibility using a pseudo-random number generator with a seed (roughly sketched just after this list)
  • Chunking of each split for easier parallelization further down the line (for example, we’d have a training set of 80000 rows split into 20 files of 5000 rows each, and a test set of 20000 rows split into 4 files of 5000 rows each)
  • Compression of output files, since if you’re forced into using a tool like this, it’s probably because your input is very large!
  • Speed! The files are really big, and iterating over the rows in e.g. Python is likely to be slow - we don’t want this to take hours.
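
To make the first two bullets concrete, the per-row logic could look roughly like this. It's only a sketch assuming the rand crate; assign_split and the Split struct are placeholder names I've made up, not the eventual tool's API.

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

struct Split {
    name: &'static str,
    proportion: f64,
}

// Pick a split for one row by drawing a uniform number in [0, 1) and walking
// the cumulative proportions. Seeding the RNG makes the assignment reproducible.
fn assign_split(splits: &[Split], rng: &mut StdRng) -> &'static str {
    let draw: f64 = rng.gen();
    let mut cumulative = 0.0;
    for split in splits {
        cumulative += split.proportion;
        if draw < cumulative {
            return split.name;
        }
    }
    // Guard against floating-point drift: fall back to the last split.
    splits.last().expect("at least one split").name
}

fn main() {
    let splits = [
        Split { name: "train", proportion: 0.7 },
        Split { name: "test", proportion: 0.15 },
        Split { name: "validation", proportion: 0.15 },
    ];
    let mut rng = StdRng::seed_from_u64(42);
    for row_number in 0..10 {
        println!("row {} -> {}", row_number, assign_split(&splits, &mut rng));
    }
}

One thing worth noting: assigning rows independently like this only hits the proportions in expectation, so supporting an exact number of rows per split will need a bit more thought.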

Here are some example commands showing how I’d like it to work:

# Split the compressed data in data.csv.gz into three sets of proportion 0.7, 0.15, 0.15
$ ttv data.csv.gz --split train=0.7 --split test=0.15 --split validation=0.15

# Only 2 splits, specified slightly differently. Data is uncompressed.
$ ttv data.csv --split train=0.8,test=0.2

# Read data from stdin, specify number of rows per split, custom names
$ zcat data.csv.gz | ttv - --split train-mini=1000,test-mini=200,validation-mini=200
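
I haven't settled on how to parse the arguments, but something like structopt (a derive-based wrapper around clap) could cover the interface above. The sketch below is provisional and only handles the repeated --split name=value form, not the comma-separated one; all the names here are mine, not a real ttv API.

use structopt::StructOpt;

#[derive(Debug, StructOpt)]
struct Opts {
    /// Input path, or "-" for stdin
    input: String,
    /// Splits, e.g. --split train=0.7 --split test=0.3
    #[structopt(long = "split")]
    split: Vec<String>,
}

fn main() {
    let opts = Opts::from_args();
    for spec in &opts.split {
        // Each spec should look like "name=proportion"; real code would validate this properly.
        let mut parts = spec.splitn(2, '=');
        let name = parts.next().unwrap_or("");
        let proportion: f64 = parts.next().unwrap_or("0").parse().unwrap_or(0.0);
        println!("split {} gets proportion {}", name, proportion);
    }
}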

I also think it might be handy to select columns from a file in the same pass, but we’ll save that for later!

I’ve used Rust a fair bit at work, but I’ve yet to write a command line tool with it. It’s pretty perfect for this kind of work though - the focus on speed is crucial when working with data of this size; fearless concurrency means we can use separate threads for different outputs; and safety is essential (for me at least!) because I don’t trust myself to write safe C. I’d also like to learn more about the concurrency patterns - I’ve used rayon’s parallel iterators pretty extensively, but haven’t delved into channels yet!
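
As a first mental model of the channel-based approach - very much a sketch using only std::thread and std::sync::mpsc, not a final design - one thread reads rows and decides where each belongs, and every split gets its own channel plus a dedicated writer thread:

use std::sync::mpsc;
use std::thread;

fn main() {
    let split_names = ["train", "test", "validation"];

    let mut senders = Vec::new();
    let mut writers = Vec::new();

    // One channel and one writer thread per split.
    for &name in split_names.iter() {
        let (tx, rx) = mpsc::channel::<String>();
        senders.push(tx);
        writers.push(thread::spawn(move || {
            for row in rx {
                // The real tool would append to a (possibly compressed) output file here.
                println!("{}: {}", name, row);
            }
        }));
    }

    // Stand-in for streaming rows out of a huge CSV; the real tool would pick a
    // split for each row with the seeded RNG rather than round-robin.
    for (i, row) in ["a,1", "b,2", "c,3", "d,4"].iter().enumerate() {
        let which = i % senders.len();
        senders[which].send(row.to_string()).expect("writer thread alive");
    }

    // Dropping the senders closes the channels, letting the writer loops finish.
    drop(senders);
    for handle in writers {
        handle.join().expect("writer thread panicked");
    }
}

Each writer owning its own output file should also mean compression can happen in parallel, without any locking between splits.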

My original plan of applying the fast.ai methods will have to wait. With a cargo new and a plan of attack, away I go! In the next post I’ll talk through my initial, naive MVP, what I discovered along the way, and how it can (and will) be improved. Until next time!
