Whoops, I Seem To Have Written a Book. (A trip through Python and R Towards Truth)

Mark’s 1000th post (congratulations again!) and my own data analysis reminded me of something that I’ve been meaning to do for some time, which is work out how much I’ve written over the 151 published posts that I’ve managed this year. Now, foolish me, given that I can see the per-post word count, I started looking around to see how I could get an entire blog count.

And, while I’m sure it’s obvious to someone else who will immediately write in and say “Click here, Nick, sheesh!”, I couldn’t find anything that actually did what I wanted to do. So, being me, I decided to do it ye olde fashioned way – exporting the blog and analysing it manually. (Seriously, I know that it must be here somewhere but my brain decided that this would be a good time to try some analysis practice.)

Now, before I go on, here are the figures (not including this post!):

  • Since January 1st, I have published 151 posts. (Eek!)
  • The total number of words, including typed hyperlinks and image tags, is 102,136. (See previous eek.)
  • That’s an average of just over 676 words per post.

Is there a pattern to this? Have I increased the length of my posts over time as I gained confidence? Have they decreased over time as I got busier? Can I learn from this to make my posting more efficient?

The process was, unsurprisingly, not that simple because I took it as an opportunity to work on the design of an assignment for my Grand Challenges students. I deliberately started from scratch and assumed no installed software or programming knowledge above fundamentals on my part (this is harder than it sounds). Here are the steps:

  1. Double check for mechanisms to do this automatically.
  2. Realise that scraping 150 page counts by hand would be slow so I needed an alternative.
  3. Dump my WordPress site to an Export XML file.
  4. Stare at XML and slowly shake head. This would be hard to extract from without a good knowledge of Regular Expressions (which I was pretending not to have) or Python/Perl-fu (which I could also pretend not to have, but my Fu really is weak these days).
  5. Drag Nathan Yau’s Visualize This down from the shelf of Design and Visualisation books in my study.
  6. Read Chapter 2, Handling Data.
  7. Download and install Beautiful Soup, an HTML and XML parsing package that does most of the hard work for you. (Instructions in Visualize This)
  8. Start Python
  9. Read the XML file into Python.
  10. Load up the Beautiful Soup package. (The version mentioned in the book is loaded up in a different way to mine so I had to re-engage my full programming brain to find the solution and make notes.)
  11. Mucked around until I extracted what I wanted to while using Python in interpreter mode (very, very cool and one of my favourite Python features).
  12. Wrote an 11 line program to do the extraction of the words, counting them and adding them (First year programming level, nothing fancy).
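
The 11 line program itself isn't reproduced here, so what follows is only a rough sketch of steps 9 to 12, not the original code. It swaps Beautiful Soup for Python's built-in xml.etree.ElementTree so that it runs with nothing installed, and it assumes that each post sits in an item element with its body in a content:encoded element. The function name word_counts and the sample data are made up for illustration. A real export may also hold items that aren't published posts, which this sketch doesn't filter out.

```python
import xml.etree.ElementTree as ET

# Assumed namespace-qualified tag for the post body in the export file.
CONTENT_TAG = "{http://purl.org/rss/1.0/modules/content/}encoded"


def word_counts(xml_text):
    """Return a list of (title, word_count) pairs, one per <item>."""
    root = ET.fromstring(xml_text)
    counts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        body = item.findtext(CONTENT_TAG, default="")
        # Plain whitespace split, so typed hyperlinks and image tags
        # are counted as words too.
        counts.append((title, len(body.split())))
    return counts


# Tiny made-up export so the sketch runs without a real file.
SAMPLE = """<rss xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item><title>First</title><content:encoded><![CDATA[one two three]]></content:encoded></item>
    <item><title>Second</title><content:encoded><![CDATA[four <a href="x">five</a>]]></content:encoded></item>
  </channel>
</rss>"""

counts = word_counts(SAMPLE)
total = sum(n for _, n in counts)
print(counts, total)
```

Writing the pairs in counts out one per line, separated by commas, would give the kind of CSV file that a plotting tool can read.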

A number of you seasoned coders and educators out there will be staring at points 11 and 12, with a wavering finger, about to say “Hang on… have you just smoothed over about an hour plus of student activity?” Yes, I did. What took me a couple of minutes could easily be a 1-2 hour job for a student. Which is, of course, why it’s useful to do this, because you find out things like Beautiful Soup being called bs4 when it’s a locally installed module on OS X – which has obviously changed since Nathan wrote his book.

Now, a good play with data would be incomplete without a side trip into the tasty world of R. I dumped out the values that I obtained from word counting into a Comma Separated Value (CSV) file and, digging around in the R manual, Visualize This, and Data Analysis with Open Source Tools by Philipp Janert (O’Reilly), I did some really simple plotting. I wanted to see if there was any rhyme or reason to my posting, as a first cut. Here’s the first graph of words per post. The vertical axis is the number of words and the horizontal axis is the post number. So, reading left to right, you’ll see my development over time.

Words per Post

Sadly, there’s no pattern there at all – not only can’t we see one by eye, the correlation tests of R also give a big fat NO CORRELATION.
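
The post doesn't say which correlation test was run in R, so the following is a minimal sketch that assumes a plain Pearson correlation between post number and word count, written in Python rather than R. The function name pearson and the word counts are invented for illustration and are not the real data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up word counts in posting order (not the real figures).
words = [500, 900, 650, 700, 600, 850]
post_numbers = list(range(1, len(words) + 1))

r = pearson(post_numbers, words)
print(round(r, 3))
```

A value of r near zero is what "no correlation" looks like here: knowing the post number tells you almost nothing about how long the post is.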

Now, here’s a graph of the moving average over a 5 day window, to see if there is another trend we can see. Maybe I do have trends, but they occur over a larger time?

Moving Average versus post

Uh, no. In fact, this one is worse for overall correlation. So there’s no real pattern here at all but there might be something lurking in the fine detail, because you can just about make out some peaks and troughs. (In fact, mucking around with the moving average window does show a pattern that I’ll talk about later.)
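
For anyone wanting to muck around with the window themselves, here is a small sketch of a trailing moving average in Python. It assumes a simple unweighted mean over the last few values, which may not be exactly what was plotted above. The name moving_average and its window parameter are illustrative only.

```python
def moving_average(values, window=5):
    """Trailing moving average; returns one value per full window."""
    if window < 1 or window > len(values):
        return []
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window along: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Made-up word counts, in posting order.
words = [500, 900, 650, 700, 600, 850, 400]
print(moving_average(words, window=5))
print(moving_average(words, window=3))
```

Changing window is the whole experiment: a wider window smooths out more of the post-to-post noise but also produces fewer points.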

However, those of you who are used to reading graphs will have noticed something about the axis label for the x-axis. It’s labelled as wp$day. This would imply that I was plotting post day versus average or count and, of course, I’m not. There have not been 151 days since January the 1st, but there have been days when I have posted multiple times. At the moment, for a number of reasons, this isn’t clear to the reader. More importantly, the day on which I post is probably going to have a greater influence on me as I will have different access to the Internet and time available. During SIGCSE, I think I posted up to 6 times a day. Somewhere, this is lost in the structure of the data that considers each post as an independent entity. They consume time and, as a result, a longer post on the same day will reduce the chances of another long post on the same day – unless something unusual is going on.
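
One way to stop treating each post as an independent entity is to collapse the posts into one row per day. The sketch below assumes each post has been reduced to a (date, word count) pair. The function name per_day_totals and the dates are made up for illustration.

```python
from collections import defaultdict


def per_day_totals(posts):
    """Collapse (date, word_count) pairs into {date: (post_count, total_words)}."""
    totals = defaultdict(lambda: [0, 0])
    for day, words in posts:
        totals[day][0] += 1
        totals[day][1] += words
    # Sorted so the days come out in chronological order.
    return {day: tuple(pair) for day, pair in sorted(totals.items())}


# Made-up rows; dates are ISO strings so that sorting them is chronological.
posts = [
    ("2012-03-01", 400),
    ("2012-03-01", 250),
    ("2012-03-02", 900),
]

daily = per_day_totals(posts)
print(daily)
```

With the data in this shape, the x-axis really would be the day, and multi-post days show up as a post count rather than as extra points.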

There is a lot more analysis left to do here and it will take more time than I have today, unfortunately. But I’ll finish it off next week and get back to you, in case you’re interested.

What do I need to do next?

  1. Relabel my graphs so that it is much clearer what I am doing.
  2. If I am looking for structure, then I need to start looking at more obvious influences and, in this case, given there’s no other structure we can see, this probably means time-based grouping.
  3. I need to think what else I should include in determining a pattern to my posts. Weekday/weekend? Maybe my own calendar will tell me if I was travelling or really busy?
  4. Establish if there’s any reason for a pattern at all!

As a final note, novels ‘officially’ start at a count of 40,000 words, although they tend to fall into the 80-100,000 range. So, not only have I written a novel in the past 4 months, I am most likely on track to write two more by the end of the year, because I will produce roughly 160-180,000 more words this year. This is not the year of blogging, this is the year of a trilogy!

Next year, my blog posts will all be part of a rich saga involving a family of boy wizards who live on the wrong side of an Ice Wall next to a land that you just don’t walk into. On Mars. Look for it on Amazon. Thanks for reading!


Another month, another milestone!

That’s another month of blogging down. At some stage, I plan to measure what my output has been and try to come up with some indication of how I can improve my content. I’ll probably try to make things tighter and add some pictures, but have separate longer essays occasionally.

Only 10 more months of 1 post / day to keep to my original goal!

Thanks for reading – if you’re new, you can start at Jan 1 and work forward, if you’re a long-time reader, thanks for sticking around.

I wanted to put a picture of success or winning here but, frankly, there are only so many pictures of grumpy babies and Charlie Sheen that anybody needs. So enjoy the rapturous and simplistic text. I’ll see you tomorrow.


Walking the walk: How Mark Guzdial Nearly Created a University of Programmers

My apologies to Mark, who reads this periodically, but I’d like to introduce more people to Mark’s blog and I thought I’d frame this in terms of a teaching anecdote. Mark, no doubt, has millions of followers, but for those who have entered the CS Ed blogging community through me, you should know that oranges are not the only fruit. And there is some excellent fruit out there!

Mark has an excellent blog that, at least in part, helped to inspire my blogging activity here. There are many reasons you should read this – basically, I believe we should all be reading the edublogosphere more widely for simple reasons of immediacy and accessibility –  but the main one is that the information and discussion contained therein are well-written, easy to digest and based on a solid, authentic foundation.

My last post was about authenticity and, in many ways, I’m a very difficult student because I go along to demonstrations and talks by teaching advocates and educational specialists expecting them to really inspire me and teach me things. I go in with very high expectations and am very demanding in terms of authenticity. Occasionally, if I know the lecture theatre, I will deliberately sit in the worst place, to simulate what students would do. Now, this sounds really harsh, but if someone is going to talk to me about how to improve my teaching – then they have to be able to teach well, reach out to me, wherever I am and stop me from drifting off. (I make myself sound like an ogre – I do give people a lot of time and space to do their thing but, well, if you’ve sat through a bad teaching talk, you know what I’m talking about.)

Here’s the basic rule: If you’re going to talk the talk, you had… well, you know the rest.

I had the good fortune and pleasure of meeting Mark and Barbara the night before both of their talks, over dinner, and it was very quickly established that both talks were going to be really interesting because it was quite obvious that the speakers were knowledgeable, experienced and authentic. Both Mark and Barbara were talking within the framework of our Festival of Learning and Teaching, with Mark presenting “Introducing Computing with Media, with a Pedagogical Side Tour” and Barbara presenting “The Georgia Computes Outreach program”.

Over the course of his talk, Mark showed examples, played musical instruments, demonstrated software, did small programming exercises and, down the front of a multi-hundred seat lecture theatre filled with people from across a University, drew people in more and more. Sitting down the front, I had the opportunity to observe the crowd who were listening, avidly. Phones were away, laptops were being used for note taking and, even more amazingly, people from completely non-technical disciplines started asking programming questions. Sometimes I can’t get third-year Computer Science students to ask programming questions!

This is, basically, why you may find Mark’s blog interesting. His talk was based on things that had actually been done, or were being done, at Georgia Tech. They were authentic. His teaching techniques had obviously been well-practiced and his resources were well-used, well-prepared and worked. What he did made people think, question and wonder. He held the attention of a crowd of academics, sitting around in an average lecture theatre, from every discipline in the University, over the course of the talk, when everyone had many other things they could or should be doing.

Once again, my apologies to Mark for the semi-hagiographic tone. I had originally written this some time ago, as his talk made me think long and hard about my own teaching path and communicating my thoughts, and then he started following my blog, which meant that I shelved the post out of a combination of embarrassment and self-awareness. But, if you like my blog, Mark’s part of the reason that it’s here and, if you like this blog, I think you’ll really enjoy his.