Bad Data Handbook: Mapping the World of Data Problems

http://shop.oreilly.com/product/0636920024422.do
By Q. Ethan McCallum
Ebook: $31.99 • Print: $39.99

Bad data is a fact of life. Coping with bad data is a valuable, learned skill. Bad Data Handbook offers insights from over 20 authors based on their years of personal experience managing ill-defined, often chaotic and incomplete data. We begin with an exploration of what is meant by *bad data* and what checks we can perform to help us understand data quality as a prerequisite to data analysis.

Kevin Fink offers suggestions on approaching data critically in order to ensure that we understand what we’re working with before we begin to try to manipulate it. Fink offers useful scripts in shell and Perl that can be used to inspect data and perform basic sanity checks. Paul Murrell tackles the problem of scraping data from sources formatted for human consumption into a format more amenable for algorithmic analysis using R. And on and on.

Each chapter addresses a critical concern in the data life-cycle: identifying, annotating, capturing, archiving, versioning, manipulating, analyzing, and deriving actionable information from imperfect or incomplete data. The advice offered is both powerful and immediately useful to data scientists and newcomers to the field alike. For me, it has spurred several ideas for how to approach teaching statistics.

Given the number of authors who contributed to this volume, it should come as no surprise that the tone, writing styles, and tools used vary greatly among the chapters, occasionally wandering into technical minutiae. The book holds together remarkably well, regardless, and was a pleasure to read.

Disclosure: I received a complimentary ebook copy of this book to review.

Pandora Makes Me Sad

The Problem

Listening to Pandora tests me. Their algorithm seems to be that whenever they detect that I’m listening to a song I like, they should visually or aurally interfere, thus creating the most agitating experience possible.

This evening, before I logged in, they began playing a song I like, but a version I hadn’t heard before so, to fully appreciate the music, I paused it to grab my headphones. Of course Pandora made it impossible for me to listen to the song by modally forcing me to log in (#fail). Unfortunately, their login routine isn’t sensitive to what you were listening to immediately prior to login, so they just start playing something new and different (#fail).

Solutions

  • Don’t get in the way of the user enjoying the experience.
  • Suggest logging in non-modally.
  • If the user is already listening to a song on Pandora, but not logged in, then immediately after login begin playing the same song.

The Calculus of Friendship

When a close friend sent me a copy of this book, his inscription read, in part:

it has always been about the students

In this short video, Dr. Steven Strogatz, a Cornell mathematician, reminds us that the student-teacher relationship is complex, dynamic, enduring, and often unpredictable; far from the Brave New World-style cold isolationism espoused by the so-called professionalization of education that the United States has experienced over the past 100 years.

 

Adding a keyboard shortcut for Save as PDF… in OS X

When I’m commenting on electronic documents, I find it useful to be able to quickly generate a PDF of the marked-up version of the document to return to authors for review. I annotate the document using track changes and adding comments (using the INSERT > COMMENT feature… not by adding text to the body of the document!!!), then

Save as PDF…

to keep a copy for myself and to email (or post to a course management system) for the author to review.

Unfortunately, OS X doesn’t have a built-in keyboard shortcut for Save as PDF…, but it’s easy to add one.

[Note: you can’t Save as PDF… from an Adobe Acrobat print dialog box… it would bruise their ego]

Storytelling, which I take to mean teaching

This 70-minute lecture by Charlie Kaufman (Eternal Sunshine of the Spotless Mind, Adaptation, Being John Malkovich) on screenwriting applies equally well, I think, to being an educator. Consider the following excerpt, but replace screenplay with learning (for the student perspective) or even teaching!

A screenplay is an exploration. It’s about the thing you don’t know. To step into the abyss. It necessarily starts somewhere, anywhere, there is a starting point, but the rest is undetermined, it is a secret, even from you. There’s no template for a screenplay, or there shouldn’t be. There are at least as many screenplay possibilities as there are people who write them. We’ve been conned into thinking there is a pre-established form.

While I sometimes found it difficult to distinguish quotations from his original thoughts, I found both to be engaging and inspiring.

Automatically Generating an HTML5-style Cache Manifest from the Command Line

HTML5 introduces the ability to cache content client-side so that often-used resources can be used without re-downloading them. This also enables a site to be viewed from the client when no network connection is available (i.e., offline viewing of the site).

In order for this to work, there are a few things one must do:

  1. Create a plain text file, the cache manifest, listing all of the resources that should be cached by the user agent (e.g., a web browser).
  2. Refer to that file in the manifest attribute of the opening html tag of every page that will use cached resources.
  3. Configure the web server so that the file is sent to the user agent with a specific MIME type: text/cache-manifest
  4. Regenerate the cache manifest any time you change the files in your site.

Once everything is set up properly, you can visit the site using your favorite web browser. Then, to test whether the caching has worked, you can turn off your computer’s network connection and try reloading the page.
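
As a rough illustration of steps 1 and 4, here is a minimal Python sketch that walks a site directory and builds the text of a cache manifest. It is only a sketch under stated assumptions, not necessarily the approach taken in the rest of this post: the function name build_manifest and the file name cache.manifest are hypothetical, every file under the site root is assumed to be cacheable, and the version comment is simply a hash of the file contents so that the manifest text changes whenever a file changes.

```python
import hashlib
from pathlib import Path
from urllib.parse import quote


def build_manifest(site_root, manifest_name="cache.manifest"):
    """Return the text of an HTML5 cache manifest for the files under site_root.

    Assumptions (not from the original post): every regular file under
    site_root should be cached, and the manifest itself is named manifest_name.
    """
    root = Path(site_root)

    # Collect every regular file except the manifest itself, in a stable order.
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name != manifest_name
    )

    # Hash names and contents so the version comment changes when the site does.
    digest = hashlib.sha256()
    for p in files:
        digest.update(p.relative_to(root).as_posix().encode("utf-8"))
        digest.update(p.read_bytes())

    lines = ["CACHE MANIFEST", "# version " + digest.hexdigest()[:12], "", "CACHE:"]
    # Paths are written relative to the site root and percent-encoded as URLs.
    lines += [quote(p.relative_to(root).as_posix()) for p in files]
    # Anything not listed above is fetched from the network as usual.
    lines += ["", "NETWORK:", "*"]
    return "\n".join(lines) + "\n"
```

To use it, call build_manifest on the site directory (say, a hypothetical public_html folder) and write the returned text to cache.manifest in that same directory, repeating whenever the site's files change.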

Digital Voice Recorders

Digital voice recorders can be a handy tool for dictation or recording research interviews. Here are some of the things I consider when looking for a recorder.

Connectivity: Make sure the recorder you choose has a USB port or (even better) a built-in plug. Some recorders do not allow you to transfer your recordings to your computer.
File format: The default recording file format should be something that is easily playable on your computer’s already installed software, such as QuickTime, iTunes, or Windows Media Player. WAV and MP3 work well, but many recorders use WMA, a Windows format that requires additional software on the Mac to play back.
Microphones: You generally want dual (or quad) built-in mics for stereo recording, which is invaluable in interview sessions: you can play your recordings with headphones and perceive directionality. Not all recorders record in stereo. Also look for an external mic jack, in case you ever want to use an external microphone (e.g., a lapel-clipped mic or shotgun mic).
Placement: A tripod mount screw is handy for setting up your recorder for standalone operation.

Two models I’m fond of:

Zoom H2 http://www.amazon.com/Zoom-H2-Portable-Stereo-Recorder/dp/B000VBH2IG
Olympus WS-600s http://www.amazon.com/Olympus-WS-600S-Digital-Recorder-142610/dp/B000NM8DI6

Know Your Libraries and Librarians

One of the first lessons any successful graduate student (and that should read “undergraduate student”) learns is to introduce themselves to the reference librarian who is responsible for their favorite subject areas. They can serve as guides to the existing collection, alert you to new acquisitions, and help you to acquire books that you may be interested in reading.

Know the LOC system, know which sections interest you, and know who is responsible for maintaining those sections at your institution. You’ll make a librarian’s day when you introduce yourself as being “particularly interested in the QAs” or any other category.

I always visit at least these sections:

  • K7555 – Copyright
  • LB – Theory and practice of education
  • Q – Cybernetics/Information Theory
  • QA – Computers/Programming Languages
  • TK – Electronics/Computer Engineering

History of the LOC system: http://www.loc.gov/catdir/cpso/lcc.html

The categories: http://www.loc.gov/catdir/cpso/lcco/

CS4302.01 Advanced Computing Projects

Location: Bennington College
Term(s): Spring 2012
Class size: 4

In this course, we will apply computing methods in order to develop solutions to real world problems. We will focus on problems that require computing in order to create, collect, process, or visualize data and that offer opportunities to hone our coding and software development skills. Students are invited to bring their project ideas or existing projects in need of development into the class.

Prerequisite: Permission of Instructor
Credits: 2
Time: F 2:10 – 6:00 pm
(This class meets during the first seven weeks of the term)

CS2106.01 Understanding Alan Turing

Location: Bennington College
Term(s): Spring 2012
Class size: 13

Alan Turing is a central figure in the history and theory of computing. Turing gave the first precise definition of algorithms and computability and a guideline for understanding artificial intelligence: the Turing Test. Turing played a role in the cracking of German military encryption during World War II and in the post-war development of the first digital computers. Turing lost his security clearance and was largely forgotten for the last half of the 20th century because he was homosexual. We will explore the man, his ideas, and his lasting contributions to modern computing.

Prerequisite: None
Credits: 2
Time: T/F 2:10 – 4:00 pm
(This class meets during the second seven weeks of the term)

CS2113.01 The Nature of Information

Location: Bennington College
Term(s): Spring 2012
Class size: 16

What is information? How do you measure it? Is information perishable? Is it scarce? Understanding what information is and how (and whether) it can be created, shared, manipulated, or destroyed is increasingly critical in understanding science, public policy, and civic engagement. This course will explore how our understanding of information has changed over the past 100 years and how that understanding changes how we behave individually and collectively.

Prerequisite: None
Credits: 4
Time: T/Th 10:10 – 12:00 noon

CS4120.01 Contributing to Free & Open Source Software

Location: Bennington College
Term(s): Spring 2012
Class size: 9

Most of us use free/open source software (the Web, Open Office, R, Linux) or services that rely upon FOSS (Yahoo!, Facebook, Google). In this course we will explore how these software projects are managed, the community of developers working to improve these projects, and the tools and languages they use. We will learn how to read, understand, and contribute to these projects.

Prerequisite: Permission of Instructor
Credits: 4
Time: W 2:00 – 6:00 pm

The Law of Unintended Patterns

For any matched pair of non-trivial examples
there exists (n == 1) pattern that the creator of the examples intended to highlight
but there also exist (1 < n <= infinity) unintended patterns that students will find.

It’s difficult to live-code programming examples… the conventions we use by habit often invite students to find the unintended patterns.

As an instructor, how do I get students to see the single pattern in which I’m interested, rather than the possibly infinite patterns that exist? Or, is that even the best goal? Should I, instead, be encouraging students to look beyond the first pattern they detect in order for them to appreciate the inherent complexity of interpretation?