"Data science at the command line" by Jeroem Janssen
Data science is a topic that is attracting great interest today. As in any other profession, it is important to choose the right tool for the job, even more so if you are dealing with large data sets.
There are many powerful tools available that can all be used on the command line. Their combined power is truly impressive, but you need to learn how to use them to get your data science project going.
Jeroen Janssens has written a book, "Data science at the command line", that teaches you those tools and will whet your appetite for more command line.
The book is aimed not only at data scientists who have little experience with command line tools, but also at people who feel at home in the Unix/Linux world, as quite a few of the programs it introduces are not among the classic Unix tools like cat, grep, head, tail and wc. Most of them are relatively new, and some have been contributed by the author himself. Indeed, it looks as if the field of command line tools is quite dynamic. So watch this space!
To get started with the examples in the book, you can download a virtual machine that contains all the command line tools from this website (http://datasciencetoolbox.org). Personally I prefer to install all the missing software by hand onto my system. This takes no more than half an hour and is well documented at the end of the book.
After an introduction to the command line and a chapter on how to install the data science toolbox (the virtual machine that contains all the software described in the book), the reader is guided through the book following the steps of the OSEMN model. OSEMN stands for Obtaining, Scrubbing, Exploring, Modelling and iNterpreting data. The author doesn't try to introduce and discuss each of the steps in depth, but instead gives examples of how to use the command line tools for each specific goal.
Throughout the examples, Unix pipes are used heavily. A pipe lets you use the output of one program as the input for the next. Powerful constructs that represent your data workflow can be built this way.
In the chapter 'Obtaining data' four examples are described: converting spreadsheets, querying relational databases, downloading data from the internet and calling web APIs. These are the most common ways of obtaining your data. Two of the tools used here are part of csvkit (http://csvkit.readthedocs.org), a package that deserves a lot of kudos. The program in2csv converts spreadsheet data to CSV. In fact, CSV is the format used throughout the book, as it is a format that the text-based command line can digest easily. With sql2csv, another tool from csvkit, you can query your database and turn the output into CSV.
csvkit also lets you manipulate your tabular data, filter it or even query it SQL-style, which is necessary for the next step: the scrubbing. CSV is not the only format used. JSON and XML are two other major formats that can be handled well on the command line.
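To make "querying CSV SQL style" a little more concrete, here is a minimal Python sketch of the idea: load the CSV into a throwaway in-memory SQLite table and run a query against it. This is only an illustration and not how csvkit itself is implemented. The function name query_csv, the table name and the sample data are all invented for this example.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, table, sql):
    """Load CSV text into an in-memory SQLite table and run one SQL query."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first line holds the column names
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    # Every value is stored as text, exactly as it appears in the CSV.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


# Invented sample data, for illustration only.
sample = "name,score\nann,12\nbob,7\ncat,15\n"
result = query_csv(
    sample,
    "scores",
    "SELECT name FROM scores WHERE CAST(score AS INTEGER) > 10 ORDER BY name",
)
print(result)  # [('ann',), ('cat',)]
```

The CAST is needed because the sketch stores every field as text. A real tool would typically infer column types first.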
The 'Exploring Data' chapter introduces parallel processing. This demonstrates how well the command line scales, and the book would have missed the point without mentioning it. The package used is GNU parallel. With its help you can distribute your jobs across multiple servers. These can be virtual servers provided by Amazon Web Services or other providers. To compute descriptive statistics you can pipe your data into R with the help of a wrapper called Rio. Rio is one of the tools contributed by the author himself.
Using the command line doesn't mean you have to miss out on visualisations. The package ggplot2 (a plotting library for R) and the old acquaintance gnuplot can be used to produce image files. With the help of a lightweight web server, you can view the output of your visualisations in your web browser. With gnuplot it is even possible to plot your graphs as ASCII art in your terminal. (I must admit this has a certain charm, but perhaps I'm just getting a bit nostalgic here.)
In the chapter about modelling, the reader is guided through four examples of typical models. These are dimensionality reduction using Tapkee, clustering with Weka, regression with the help of scikit-learn and classification using BigML. Each of these models would deserve its own chapter, but the scope of the book is to explain how to use those tools on the command line. Each of these packages is well documented and definitely worth further reading.
At this point it is worth mentioning that you can also develop or adapt your own models and plug them into your pipeline. The main concern is making sure that they read data from standard input and write to standard output, so that they integrate with the other tools.
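As a rough idea of what such a home-grown tool might look like, here is a small Python filter that reads CSV from standard input, appends a z-score column for one numeric column, and writes CSV to standard output. It is a sketch of my own, not code from the book. The script name zscore.py, the function add_zscore and the "_z" column suffix are made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical filter: append a z-score column to CSV read from stdin."""
import csv
import statistics
import sys


def add_zscore(instream, outstream, column):
    """Copy CSV from instream to outstream, adding a '<column>_z' column."""
    reader = csv.DictReader(instream)
    rows = list(reader)  # the mean needs every row, so buffer the input
    if not rows:
        return  # nothing to standardise
    values = [float(row[column]) for row in rows]
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    fieldnames = list(reader.fieldnames) + [column + "_z"]
    writer = csv.DictWriter(outstream, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row, value in zip(rows, values):
        # A constant column has zero spread; report 0 instead of dividing by zero.
        z = (value - mean) / spread if spread else 0.0
        row[column + "_z"] = f"{z:.3f}"
        writer.writerow(row)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # Wire the filter to the standard streams so it can sit in a pipeline.
        add_zscore(sys.stdin, sys.stdout, sys.argv[1])
    else:
        print("usage: zscore.py COLUMN < input.csv", file=sys.stderr)
```

Saved as zscore.py, it could sit between csvkit tools in a pipeline such as: in2csv data.xlsx | python3 zscore.py price | csvlook (the file and column names here are placeholders).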
The final point of the OSEMN model is the interpretation of the results. The author omits this step, explaining that interpretation needs your own brain and your imagination, and that this is a job for a human. The computer has done its job here by providing you with the best information to base your decision on.
Having read the book, I can very well envisage the application of command line tools in a big data environment. One of the problems with large datasets is moving prohibitively large amounts of data around. When you've got your data stored on a server or in the cloud, it makes sense to analyse it there. Tools that can be run remotely and operated through a terminal connection come in handy here. In such a setup these tools can be complemented by IPython sessions that can be accessed through your browser, or by an RStudio server. Command line tools can also be part of automated analysis jobs.
So in a nutshell, "Data science at the command line" is an inspiring book that can serve as a data science cookbook, as the examples in the book are applicable to real work scenarios. It certainly helped me in my data science projects. The author, Jeroen Janssens, deserves great thanks not only for giving an introduction to the use of command line tools in data science, but also for making this approach possible by developing the missing tools that seamlessly connect every step of your workflow.
However, there is one tool for which I am still trying to find a use case: cowsay.