Writing `ttv` - a command-line file splitter - in Rust: Part 1
September 29, 2018
In the previous post I decided that I’d like to write a command line tool to simplify the process of splitting a large, possibly compressed CSV file into multiple disjoint sets for the purpose of train/test/validation of a machine learning model. This time I’m going to talk about my initial implementation. It doesn’t yet have all the features we talked about in that post (far from it), but it’s a start.
The initial version is pretty rudimentary. It uses `clap` to parse the split specifications, then just iterates through the rows in the specified input file (it doesn’t yet support stdin or uncompressed files). For each row it uses a pseudorandom number generator (PRNG) to decide which of the splits that row should go to, then writes it (compressed) to the appropriate file. The seed for the PRNG can be set to reproduce the same split if desired.
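To make the per-row decision concrete, here is a minimal sketch of seeded, proportion-based assignment. It is not the code from `ttv`: the `SplitMix64` generator is a std-only stand-in for a real PRNG crate, and `choose_split` and `assign_rows` are hypothetical names. It assumes a non-empty list of proportions that sum to 1.

```rust
/// Tiny SplitMix64 generator: a std-only stand-in for a real PRNG crate.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Pick the index of the split a row goes to (hypothetical helper).
/// Assumes `proportions` is non-empty and sums to 1.
fn choose_split(rng: &mut SplitMix64, proportions: &[f64]) -> usize {
    let x = rng.next_f64();
    let mut cumulative = 0.0;
    for (i, p) in proportions.iter().enumerate() {
        cumulative += p;
        if x < cumulative {
            return i;
        }
    }
    // Floating-point rounding can leave x just above the final sum.
    proportions.len().saturating_sub(1)
}

/// Assign `n_rows` rows to splits, reproducibly for a given seed.
fn assign_rows(seed: u64, proportions: &[f64], n_rows: usize) -> Vec<usize> {
    let mut rng = SplitMix64(seed);
    (0..n_rows)
        .map(|_| choose_split(&mut rng, proportions))
        .collect()
}

fn main() {
    let proportions = [0.8, 0.1, 0.1]; // train / test / validation
    let assignment = assign_rows(42, &proportions, 10);
    println!("{:?}", assignment);
}
```

Running it twice with the same seed gives the same assignment, which is what makes a split reproducible.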
My main concern with my code is something I’ve fought against before. I’m still not really sure how to reconcile runtime data with Rust’s type system, particularly when the parts that care about the runtime data are nested within a struct (or a struct of structs, or…). So far I’ve been doing it using enums which list all the possibilities:
enum Splits {
    Rows(RowSplits),
    Proportions(ProportionSplits),
    // Maybe we have some more...
}

// The main struct handling splitting of files
struct Splitter {
    splits: Splits,
    // ...
}

impl Splitter {
    fn new(..., splits: &[&str]) -> Result<Self> {
        // Parse splits and turn them into a `Splits` enum
    }

    fn run(&mut self) -> Result<()> {
        // Every use of the splits has to match on the enum like this
        let pb = match &self.splits {
            Splits::Rows(r) => ProgressBar::new(r),
            Splits::Proportions(_) => ProgressBar::new_spinner(),
        };
        // ... more of the same
    }
}
The advantage is that any code that creates a `Splitter` doesn’t need to worry about what type of splits it’s using; it just passes a string slice specifying the splits. But that means that whenever the `Splitter` wants to do anything with the splits it needs to match on the enum and dispatch a different method accordingly.
We can avoid the enum and matching by making `Splitter` generic over `T` and adding trait bounds for the required methods of `T`, but it seems that doing so requires callers to first parse any strings and then create the individual split types themselves, rather than being able to just pass the string and have it parsed. Perhaps there’s a way to abstract this away in another struct/trait? Or maybe this is a case for trait objects? It’s all still a bit unclear to me. I’ve included a minimal working example at the bottom of this post. I’d love it if someone could explain to me the ‘proper’ way to do this!
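For what it’s worth, here is one possible shape for the trait-object idea. I’m not claiming it’s the ‘proper’ way. Everything in it is an assumption made for illustration: the `SplitSpec` trait, its `total_rows` method, the `name=value` spec syntax and the `progress_style` method are all hypothetical. Callers still pass plain strings, and the parsing stays hidden inside `Splitter::new`.

```rust
/// Behaviour the splitter needs from any kind of split spec (hypothetical).
trait SplitSpec {
    /// Total number of rows to expect, if known up front.
    fn total_rows(&self) -> Option<u64>;
}

struct RowSplits(Vec<u64>);
struct ProportionSplits(Vec<f64>);

impl SplitSpec for RowSplits {
    fn total_rows(&self) -> Option<u64> {
        Some(self.0.iter().sum())
    }
}

impl SplitSpec for ProportionSplits {
    fn total_rows(&self) -> Option<u64> {
        None
    }
}

/// Parse specs such as "train=800" or "train=0.8" (assumed syntax).
/// All-integer values become row splits, otherwise proportions.
fn parse_splits(specs: &[&str]) -> Result<Box<dyn SplitSpec>, String> {
    let values: Vec<&str> = specs
        .iter()
        .map(|s| s.split_once('=').map_or(*s, |(_, v)| v))
        .collect();
    let as_rows: Result<Vec<u64>, _> = values.iter().map(|v| v.parse::<u64>()).collect();
    if let Ok(rows) = as_rows {
        return Ok(Box::new(RowSplits(rows)));
    }
    let as_props: Result<Vec<f64>, _> = values.iter().map(|v| v.parse::<f64>()).collect();
    match as_props {
        Ok(props) => Ok(Box::new(ProportionSplits(props))),
        Err(e) => Err(format!("invalid split spec: {}", e)),
    }
}

struct Splitter {
    splits: Box<dyn SplitSpec>,
}

impl Splitter {
    fn new(specs: &[&str]) -> Result<Self, String> {
        Ok(Self { splits: parse_splits(specs)? })
    }

    /// Stand-in for choosing a progress bar: no match on an enum of split kinds.
    fn progress_style(&self) -> String {
        match self.splits.total_rows() {
            Some(n) => format!("bar({})", n),
            None => "spinner".to_string(),
        }
    }
}

fn main() -> Result<(), String> {
    let by_rows = Splitter::new(&["train=800", "test=200"])?;
    let by_proportion = Splitter::new(&["train=0.8", "test=0.2"])?;
    println!("{} / {}", by_rows.progress_style(), by_proportion.progress_style());
    Ok(())
}
```

The trade-off is dynamic dispatch and a heap allocation, and every operation the splitter needs has to become a method on the trait.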
Now for the good stuff:
Rust 2018 makes importing so much easier
No more `extern crate`! No more `#[macro_use]`! Just import the stuff you need, including macros, using `use`. E.g. in `main.rs`:
use std::fs::File;
use log::{info, warn}; // note no extern crate. also, these are macros!

fn main() {
    match File::create("/tmp/some_random_file") {
        Ok(_) => info!("Sweet, we opened it"),
        Err(_) => warn!("Nope, no luck"),
    }
}
Clap is great, but perhaps structopt is even better?
I really, really like being able to specify my app’s interface in a YAML file. I’m not even really sure why I like it so much, but I think it’s great. On the other hand, I thought it was a shame that I had to parse all of the arguments into the right formats ‘by hand’:
// sub_m holds the argument matches for the 'split' subcommand
let splits: Vec<&str> = sub_m.values_of("split_spec").unwrap().collect();
let seed: Option<u64> = match sub_m.value_of("seed") {
    Some(s) => Some(s.parse::<u64>()?),
    None => None,
};
Aside - this second one would be possible using combinators instead of the match using the `transpose_result` feature on Rust nightly:
let seed: Option<u64> = sub_m
    .value_of("seed")
    .map(|s| s.parse::<u64>())
    .transpose()?;
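The same combinator chain can be tried without `clap` at all. This is a small std-only sketch, assuming a toolchain where `Option::transpose` is available without a feature flag (it has since been stabilised); `parse_seed` is a hypothetical helper, and its `raw` parameter stands in for the value returned by `value_of("seed")`.

```rust
use std::num::ParseIntError;

/// Turn an optional raw string into an optional seed, surfacing parse errors.
fn parse_seed(raw: Option<&str>) -> Result<Option<u64>, ParseIntError> {
    // Option<Result<u64, _>> becomes Result<Option<u64>, _>
    raw.map(|s| s.parse::<u64>()).transpose()
}

fn main() {
    println!("{:?}", parse_seed(Some("42")));
    println!("{:?}", parse_seed(None));
    println!("{:?}", parse_seed(Some("not a number")));
}
```

A missing argument stays `None`, a valid one becomes `Some(seed)`, and a malformed one becomes an error that `?` can propagate.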
But it’s still manual. It would be nice if the arguments were parsed into the right types for us (as in Python’s `click`, when you declare parameter types).
In fact, this is doable using structopt! Here you define a single struct representing your arguments, derive `StructOpt`, and your arguments are parsed into the right types for you. We’d lose the ability to define the CLI in YAML, but transferring that to Rust code isn’t too bad.
Both pbr and indicatif are also great
These two libraries are very easy to use and provide similar functionality - a terminal progress bar. indicatif was actually built by Armin Ronacher of Flask fame (he also wrote the `redis-rs` Rust library) to replace pbr (see this reddit thread for more details). I used `pbr` originally, then realised there may be a situation where I don’t know exactly how many rows there are - the perfect situation for `indicatif`’s spinners.
I still don’t know whether to use a BufReader
I used flate2 to read/write gzip files, and used a `BufReader` on top of the `GzDecoder` to iterate over the decompressed lines. But I’m still not sure whether I should buffer the file underneath the `GzDecoder`?! I’ll have to do some benchmarking!
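Before reaching for a full benchmark, one cheap experiment is to count how many reads actually reach the underlying reader. The sketch below is std-only and doesn’t involve flate2 at all, so it doesn’t settle the `GzDecoder` question (a decoder may do its own buffering internally). `CountingReader` and `count_reads` are hypothetical names, and an in-memory byte slice stands in for the file.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Wraps a reader and counts how many times `read` is called on it.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Read every line through a `BufReader` of the given capacity and return
/// (number of lines, number of reads that reached the underlying reader).
fn count_reads(data: &[u8], capacity: usize) -> io::Result<(usize, usize)> {
    let counting = CountingReader { inner: data, calls: 0 };
    let mut reader = BufReader::with_capacity(capacity, counting);
    let mut line = String::new();
    let mut lines = 0;
    loop {
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            break; // end of input
        }
        lines += 1;
    }
    Ok((lines, reader.get_ref().calls))
}

fn main() -> io::Result<()> {
    let data = "a,b,c\n".repeat(1_000);
    let (lines, small) = count_reads(data.as_bytes(), 16)?;
    let (_, large) = count_reads(data.as_bytes(), 8 * 1024)?;
    println!(
        "{} lines: {} reads with a 16-byte buffer, {} with an 8 KiB buffer",
        lines, small, large
    );
    Ok(())
}
```

Wrapping the real file in something like `CountingReader` underneath the decoder would show how often the decoder goes back to the file, which is the number an extra buffer would be trying to reduce.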
Next steps
There are still some other goals which remain unmet. We can’t specify that we don’t want to compress output files, we can’t pipe data on stdin, and we can’t chunk outputs into smaller files. Also, it could definitely be faster!
The most obvious performance bottleneck in the current version is that everything is sequential - if writing to a file is slow (especially if it’s due to slow compression speed), we can’t read any faster. In the next post I’ll talk about a naive way to overcome this using threads. I’ll then go on to optimise that further by avoiding lock contention.
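As a rough picture of what decoupling reading from writing could look like, here is a std-only sketch with a single writer thread fed over a bounded channel. It isn’t necessarily the design the next post ends up with. `slow_write` is a hypothetical stand-in for the compress-and-write step, and the channel capacity of 1024 is an arbitrary assumption.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for the slow compress-and-write step (hypothetical).
fn slow_write(row: &str, out: &mut Vec<String>) {
    out.push(row.to_uppercase());
}

/// Send rows to a dedicated writer thread so reading can continue
/// while earlier rows are still being written.
fn split_with_writer_thread(rows: Vec<String>) -> Vec<String> {
    // Bounded channel: a slow writer applies backpressure to the reader
    // instead of letting rows pile up in memory.
    let (tx, rx) = sync_channel::<String>(1024);

    let writer = thread::spawn(move || {
        let mut out = Vec::new();
        for row in rx {
            slow_write(&row, &mut out);
        }
        out
    });

    // The "reader" side: in a real tool this would iterate over input lines.
    for row in rows {
        tx.send(row).expect("writer thread hung up");
    }
    drop(tx); // close the channel so the writer's loop ends

    writer.join().expect("writer thread panicked")
}

fn main() {
    let rows = vec!["a,1".to_string(), "b,2".to_string()];
    println!("{:?}", split_with_writer_thread(rows));
}
```

With one sender and one receiver the rows arrive in the order they were sent, so the output order matches the input.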
The code for `ttv` is on GitHub. Feature requests are welcome!
use std::io::{stdin, Error};

trait Processable {
    fn process(&self);
}

impl Processable for i64 {
    fn process(&self) {
        println!("int! {}", self);
    }
}

impl Processable for f64 {
    fn process(&self) {
        println!("float! {}", self);
    }
}

impl Processable for String {
    fn process(&self) {
        println!("string! {}", self);
    }
}
enum OneThingOrAnother {
    Int(i64),
    Float(f64),
    String(String),
}

// In general might have a load more methods, meaning we have to match for
// every one.
impl Processable for OneThingOrAnother {
    fn process(&self) {
        use OneThingOrAnother::*;
        match self {
            Int(i) => i.process(),
            Float(f) => f.process(),
            String(s) => s.process(),
        }
    }
}
// We could make this generic over T and have it created from a T,
// but that needs callers to know about T's...
struct SomeContainer {
    some_input: OneThingOrAnother,
}

impl SomeContainer {
    fn new(s: &str) -> Self {
        let some_input = match (s.parse::<i64>(), s.parse::<f64>()) {
            (Ok(i), _) => OneThingOrAnother::Int(i),
            (Err(_), Ok(f)) => OneThingOrAnother::Float(f),
            _ => OneThingOrAnother::String(s.to_string()),
        };
        Self { some_input }
    }

    fn process(&self) {
        self.some_input.process();
    }
}
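fn main() -> Result<(), Error> {
    let mut input = String::new();
    stdin().read_line(&mut input)?;
    // `read_line` keeps the trailing newline, which would make every
    // numeric parse fail, so trim before parsing.
    let input_container = SomeContainer::new(input.trim());
    input_container.process();
    Ok(())
}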