Just Do It

I have a terrible habit of grabbing a technical book, reading the first two chapters, and then putting it down to read the first two chapters of another technical book. I’ve tried to keep myself on task with several tools: tasks on Google Calendar, chorewars.com, and now schooltraq.com.

The latest book I’m trying to finish is MongoDB in Action. I’m a few pages into chapter 3 and I’m up to my old tricks again. The only thing I can think to do is keep going, and maybe write about it a bit to remind myself not to give in. So I don’t forget, here are some reasons why I need to finish this book:

  • I need to do a better job finishing what I start, and now is the time to do it.
  • It’s about time I learned the virtues of NoSQL.
  • MongoDB has an excellent Python driver; it’s a shame not to use it.
  • SQLite (my go-to small-application database) isn’t designed for the web domain.
  • The book uses Ruby and JavaScript in its examples, two languages I want to give more attention.
  • I am starting a web inventory project, and these are topics I need to learn for it.

Well, unless I can train a capuchin monkey to bite me whenever I try to start reading something else, I will have to hold myself accountable. Time to get back to reading…

13 Years of Reading (Stats)

My Grandmother, when she was alive, was a voracious reader. No matter how thick the tome, she’d be through with it in a matter of hours. Having struggled with reading in grade school, I was skeptical of her pace; even now I won’t finish a short story without spending at least some time stopping to digest the content. She was a happy speed reader, but I knew there was no way I would be one too. Out of curiosity, I wanted to know how her pace affected her retention, so I quizzed her on past readings: authors, titles, subjects. Her recollection was very limited; not only had she forgotten plots, but authors and titles as well. While it should have been no surprise that books she read over 40 years ago were long gone, it was horrifying to think that a book I would literally spend months trying to finish could be forgotten to the point where the entire experience is lost. Billy Collins, in his poem Forgetfulness, eloquently described how what I had just realized was an inevitability:

The name of the author is the first to go
followed obediently by the title, the plot,
the heartbreaking conclusion, the entire novel
which suddenly becomes one you have never read, never even heard of,

as if, one by one, the memories you used to harbor
decided to retire to the southern hemisphere of the brain,
to a little fishing village where there are no phones.

Long ago you kissed the names of the nine Muses goodbye
and watched the quadratic equation pack its bag,
and even now as you memorize the order of the planets,
something else is slipping away, a state flower perhaps,
the address of an uncle, the capital of Paraguay.

Whatever it is you are struggling to remember,
it is not poised on the tip of your tongue,
not even lurking in some obscure corner of your spleen.

It has floated away down a dark mythological river
whose name begins with an L as far as you can recall,
well on your own way to oblivion where you will join those
who have even forgotten how to swim and how to ride a bicycle.

No wonder you rise in the middle of the night
to look up the date of a famous battle in a book on war.

No wonder the moon in the window seems to have drifted
out of a love poem that you used to know by heart.

After the conversation with my Grandmother, I became paranoid about losing what I had invested in so heavily, and I started keeping records of the books I finished. Just after reading the last page, I would append some information about the book to a spreadsheet. I have now been adding to it for 13 years, making it one of the few habits I keep.

I decided it’d be fun to visualize this small amount of data I’ve been slowly compiling, so I synced my spreadsheet with Goodreads and exported a CSV to play with. Since I wanted to delete most of the columns, which held no valuable information, I needed something quick and dirty to edit the sheet. Out of curiosity, I checked whether there was a way to get Vim to parse a spreadsheet well enough, which led me to the csv.vim plugin. After a quick install, I browsed the sheet and decided I only wanted the dates I read each book, the page counts, and the publication dates.

In favor of doing things the “hard-way” I wrote this one-liner:

for year in {1999..2012}; do
  grep "$year/" goodreads_truncated.csv |
    cut -f2 -d'"' |
    awk -v year="$year" '{pages += $1; count += 1} END {print year "\t" count "\t" pages}'
done

to give me this table:

1999 1  272
2000 6  1184
2001 15 3874
2002 25 7470
2003 19 4277
2004 31 6917
2005 7  2157
2006 6  1243
2007 3  1143
2008 5  1362
2009 1  450
2010 1  247
2011 14 3644
2012 5  1133

That is, the for loop greps for each year from 1999 to 2012 (dates are stored as YYYY/MM/DD), cut returns the field with the page count, and awk, with the year passed in as a variable, sums the pages per year, counts the books per year, and formats it all into a table.
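For comparison, the same tally can be sketched in Python with the standard csv module. The column positions below are assumptions about the export layout (date read as YYYY/MM/DD in one column, page count in another), so they would need adjusting for the real Goodreads CSV:

```python
import csv
from collections import defaultdict

def tally(path, date_col=0, pages_col=1):
    """Count books and sum pages per year read.

    The column indices are assumptions about the export layout; the
    date column is expected to hold YYYY/MM/DD strings.
    """
    books = defaultdict(int)
    pages = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            year = row[date_col].split("/")[0]
            # Skip header rows and anything that isn't a clean number.
            if year.isdigit() and row[pages_col].isdigit():
                books[year] += 1
                pages[year] += int(row[pages_col])
    return books, pages

if __name__ == "__main__":
    books, pages = tally("goodreads_truncated.csv")
    for year in sorted(books):
        print(f"{year}\t{books[year]}\t{pages[year]}")
```

Keeping the indices as parameters makes it easy to point the script at whichever columns survive the cleanup in Vim.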

Lastly, I put the data onto a Google Docs spreadsheet (without 2012) to look at and got the following chart:

I used pages despite their inconsistency because they still reveal more than a raw book count; consider Moby Dick versus The Importance of Being Earnest. The red bars are the average pages per book, so the number of red bars that can be stacked up against each blue bar gives both the books read that year and their average size. I also graphed the publication dates against the read dates, but all it showed was that I read mostly contemporary literature.

Being a slow and picky reader, I have a rather small data set, but it still reveals some interesting details about the last 13 years of my life. I attended high school between 2000 and 2004, so I had books I needed to read as part of my course work, I was beginning to read seriously for the first time, and I spent a fair amount of time waiting on buses. I took AP English between 2003 and 2004, and college was the following year, so a lot was crammed in before I had to worry about college courses. Between 2004 and 2008 my reading (free time) dropped off substantially. From 2008 on I have been working full-time as a Software Engineer, and spare time is more often spent on projects and technical books that don’t meet my cover-to-cover criteria for being added. In November 2010 I purchased a Kindle and have been trying to read like I used to.

Hopefully, in another 13 years I’ll have more data to play with and a few better ideas of what to do with it.

Statistics and SIDS

I have just finished reading Statistics: A Very Short Introduction by David J. Hand.  Overall, there wasn’t much practical information to work with, but it did provide a very basic background in general statistical theory, Bayesian networks, and hidden Markov models.  Throughout the book, Hand emphasizes the success of modern computing in the field, gives practical uses for each concept introduced, and explores a range of real-world applications of statistics.

One of the real-world cases Hand chose is the probability that a mother could lose two of her children to SIDS, aka crib/cot death. In 1996, UK resident Sally Clark lost her first son to SIDS. When her second son passed away in the same manner in 1998, she was arrested and tried for murder. Pediatrician Sir Roy Meadow testified that the chance of two children from an affluent family suffering sudden infant death syndrome was 1 in 73 million. The reasoning behind such a low probability was that the second instance of crib death was considered independent of the first: if the two incidents are independent, the probability of a single instance (1 in 8,500) can be squared to yield Meadow’s figure. However, as an article in Plus Magazine explains, it is more reasonable to use Bayes’ theorem to calculate the conditional probability of her innocence, which they conclude to be about 2/3, given the number of births and double murders each year in England and Wales. Unfortunately, it wasn’t until her second appeal that Sally Clark was exonerated and released from prison in 2003, after serving 3 years of her sentence.
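The shape of that Bayesian argument can be sketched in a few lines of arithmetic. The two rates below are purely illustrative assumptions (their 2:1 ratio is chosen only to mirror the article’s roughly-2/3 conclusion); the real figures in the Plus Magazine article come from birth and homicide statistics for England and Wales:

```python
# Two competing explanations for two infant deaths in one family.
# Both rates are illustrative assumptions, not the article's actual data.
p_two_sids = 1 / 130_000     # assumed chance a family suffers double SIDS
p_two_murders = 1 / 260_000  # assumed chance a family suffers double infanticide

# Given that two deaths occurred, and only these two explanations,
# Bayes' theorem reduces to comparing the two prior probabilities:
p_innocent = p_two_sids / (p_two_sids + p_two_murders)
print(round(p_innocent, 3))  # 0.667, i.e. roughly 2 in 3

# Meadow's figure, by contrast, squared the single-SIDS odds, which is
# only valid if the two deaths are independent events:
print(8_500 ** 2)  # 72250000, i.e. about 1 in 73 million
```

The point is structural: once two deaths are a given, what matters is how the prior probabilities of the competing explanations compare, not how improbable either explanation is in isolation.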

It was with great frustration that, mere hours after reading about Sally Clark’s ordeal, I heard a similar story play out on an NPR broadcast. In 2008, 17-year-old Nga Truong was interrogated and coerced into confessing to the murder of her own child. This time it was her brother who had passed away, due to SIDS, when she was 8. Under the same presumption that two incidents of SIDS in one family is improbable, the detectives aggressively interrogated Nga Truong while accusing her of both deaths. Truong was locked up for 2 years before her confession was suppressed on the basis of coercion. The 13-minute story and police recording are available through NPR.

Taking in both these incidents in the same day was very troubling, primarily because the police in Truong’s case were more determined to get a confession than to get the truth. Even more troubling is the fact that Sally Clark died of alcohol poisoning in 2007, having never recovered from the ordeal. While people like Donna Anthony, Angela Cannings, and Trupti Patel were all found innocent after further review of their cases, one has to wonder how many mothers have faced this turmoil without a proper defense.

Gödel, Escher, Bach Study Group

Although I have never been much of a fan of Hofstadter, his work has a loyal and faithful academic following. To be honest, I was never very tolerant of his fans gushing about how their minds were blown by his revelation-inspiring work. However, I think it’s time I put that aside and see what GEB is actually worth.

Reddit.com/r/GEB is doing a group read-through starting January 17. As the ‘community’ posts related materials as well as self-serving fodder to garner ‘karma’, it may be a good opportunity to ride the wave of peer pressure and complete the book this decade.

As has been pointed out on Reddit, MIT’s OpenCourseWare offers video lectures on GEB, albeit aimed at a high school audience. So it might be a good time to dust off my copy and try to knock this one out before the next New Year.

P.S. It’s also worth noting that anyone who intends to follow the subreddit for content should be aware that the signal-to-noise ratio tends to be low, especially if the group grows substantially. During the Stanford AI class, the professors would create posts to communicate with the students directly, but it was hardly worth subscribing due to the deluge of inane posts.