Bad Data Handbook: Mapping the World of Data Problems

http://shop.oreilly.com/product/0636920024422.do
By Q. Ethan McCallum
Ebook: $31.99 Print: $39.99

Bad data is a fact of life. Coping with bad data is a valuable, learned skill. Bad Data Handbook offers insights from over 20 authors based on their years of personal experience managing ill-defined, often chaotic and incomplete data. It begins with an exploration of what is meant by *bad data* and what checks we can perform to help us understand data quality as a prerequisite to data analysis.

Kevin Fink offers suggestions on approaching data critically, so that we understand what we’re working with before we begin manipulating it. Fink offers useful scripts in shell and Perl that can be used to inspect data and perform basic sanity checks. Paul Murrell uses R to tackle the problem of scraping data from sources formatted for human consumption into a format more amenable to algorithmic analysis. The remaining chapters continue in the same vein.

Each chapter addresses a critical concern in the data life-cycle: identifying, annotating, capturing, archiving, versioning, manipulating, analyzing, and deriving actionable information from imperfect or incomplete data. The advice offered is both powerful and immediately useful to data scientists and newcomers to the field alike. For me, it has spurred several ideas for how to approach teaching statistics.

Given the number of authors who contributed to this volume, it should come as no surprise that the tone, writing styles, and tools used vary greatly among the chapters, occasionally wandering into technical minutiae. Regardless, the book holds together remarkably well and was a pleasure to read.

Disclosure: I received a complimentary ebook copy of this book to review.