I’ve just signed up for the Digital Humanities Winter Institute course on “Large-scale text analysis with R”. K read about it on ProfHacker and passed it on to me thinking I’d be interested. Of course, I was, but it goes well beyond learning R itself. R is a statistically focused programming package that is available for free for most platforms. It’s the statistical (and free, did I mention that?) cousin to the mathematically inclined Matlab.
I’ve spoken about R before and I’ve done a bit of work in it but, and here’s why I’m going, I’ve done all of it from within a heavily quantitative Computer Science framework. What excites me about this course is that I will be working with people from a completely different spectrum and with a set of text analyses with which I’m not very familiar at all. Let me post the text of the course here (from this website) [my bold]:
Text collections such as the HathiTrust Digital Library and Google Books have provided scholars in many fields with convenient access to their materials in digital form, but text analysis at the scale of millions or billions of words still requires the use of tools and methods that may initially seem complex or esoteric to researchers in the humanities. Large-Scale Text Analysis with R will provide a practical introduction to a range of text analysis tools and methods. The course will include units on data extraction, stylistic analysis, authorship attribution, genre detection, gender detection, unsupervised clustering, supervised classification, topic modeling, and sentiment analysis. The main computing environment for the course will be R, “the open source programming language and software environment for statistical computing and graphics.” While no programming experience is required, students should have basic computer skills and be familiar with their computer’s file system and comfortable with the command line. The course will cover best practices in data gathering and preparation, as well as addressing some of the theoretical questions that arise when employing a quantitative methodology for the study of literature. Participants will be given a “sample corpus” to use in class exercises, but some class time will be available for independent work and participants are encouraged to bring their own text corpora and research questions so they may apply their newly learned skills to projects of their own.
There are two things I like about this. Firstly, I will be exposed to a very different type of analysis, and a very different approach to it, that is going to be immediately useful in the corpus analyses that we’re planning to carry out on our own corpora. Secondly, I will have an intensive, dedicated block of time in which to pursue it. January is often a time to take leave (as it’s Summer in Australia) – instead, I’ll be rugged up in the Maryland chill, sitting with like-minded people and indulging myself in data analysis and learning, learning, learning, to bring knowledge home for my own students and my research group.
So, this is my Summer Camp. My time to really indulge myself in my coding and just hack away at analyses and see what happens.
I’ve also signed up to a group who are going to work on the “Million Syllabi Project Hack-a-thon”, where “we explore new ways of using the million syllabi dataset gathered by Dan Cohen’s Syllabus Finder Tool” (from the web site). 10 years’ worth of syllabi to explore, at a time when my school is looking for ways to be able to teach into more areas, to meet more needs, to create a clear and attractive identity for our discipline? A community of hackers looking at ways of recomposing, reinterpreting and understanding what is in this corpus?
How can I not go? I hope to see some of you there! I’ll be the one who sounds Australian and shivers a lot.
Mark’s 1000th post (congratulations again!) and my own data analysis reminded me of something that I’ve been meaning to do for some time, which is work out how much I’ve written over the 151 published posts that I’ve managed this year. Now, foolish me, given that I can see the per-post word count, I started looking around to see how I could get an entire blog count.
And, while I’m sure it’s obvious to someone else who will immediately write in and say “Click here, Nick, sheesh!”, I couldn’t find anything that actually did what I wanted to do. So, being me, I decided to do it ye olde fashioned way – exporting the blog and analysing it manually. (Seriously, I know that it must be here somewhere but my brain decided that this would be a good time to try some analysis practice.)
Now, before I go on, here are the figures (not including this post!):
- Since January 1st, I have published 151 posts. (Eek!)
- The total number of words, including typed hyperlinks and image tags, is 102,136. (See previous eek.)
- That’s an average of just over 676 words per post.
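As a quick check of that last figure, here is the arithmetic as a tiny Python sketch. The two totals come straight from the list above; the variable names are just for illustration.

```python
# Totals quoted in the list above.
total_words = 102136
total_posts = 151

# Mean words per post.
average_words = total_words / total_posts
print(f"{average_words:.1f} words per post")
```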
Is there a pattern to this? Have I increased the length of my posts over time as I gained confidence? Have they decreased over time as I got busier? Can I learn from this to make my posting more efficient?
The process was, unsurprisingly, not that simple because I took it as an opportunity to work on the design of an assignment for my Grand Challenges students. I deliberately started from scratch and assumed no installed software or programming knowledge above fundamentals on my part (this is harder than it sounds). Here are the steps:
- Double check for mechanisms to do this automatically.
- Realise that scraping 150 page counts by hand would be slow so I needed an alternative.
- Dump my WordPress site to an Export XML file.
- Stare at XML and slowly shake head. This would be hard to extract from without a good knowledge of Regular Expressions (which I was pretending not to have) or Python/Perl-fu (which I could also pretend not to have, except that my Fu really is weak these days).
- Drag Nathan Yau’s Visualize This down from the shelf of Design and Visualisation books in my study.
- Read Chapter 2, Handling Data.
- Download and install Beautiful Soup, an HTML and XML parsing package that does most of the hard work for you. (Instructions in Visualize This)
- Start Python
- Read the XML file into Python.
- Load up the Beautiful Soup package. (The version mentioned in the book is loaded up in a different way to mine so I had to re-engage my full programming brain to find the solution and make notes.)
- Muck around until I have extracted what I want while using Python in interpreter mode (very, very cool and one of my favourite Python features).
- Write an 11 line program to do the extraction of the words, counting them and adding them (First year programming level, nothing fancy).
A number of you seasoned coders and educators out there will be staring at points 11 and 12, with a wavering finger, about to say “Hang on… have you just smoothed over about an hour plus of student activity?” Yes, I did. What took me a couple of minutes could easily be a 1-2 hour job for a student. Which is, of course, why it’s useful to do this, because you find out things like the fact that Beautiful Soup is called bs4 when it’s a locally installed module on OS X – which has obviously changed since Nathan wrote his book.
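For anyone who wants to try the extraction step, here is a minimal sketch of the idea. It is not the 11 line program described above: to stay self-contained it swaps in xml.etree.ElementTree from the Python standard library in place of Beautiful Soup. It assumes a standard WordPress export in which each post is an item element with its body in a content:encoded element, and it uses a made-up two-post sample in place of a real export file. The function name word_counts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: post bodies live in <content:encoded>, in the RSS content
# module namespace used by standard WordPress export files.
CONTENT_TAG = "{http://purl.org/rss/1.0/modules/content/}encoded"


def word_counts(xml_text):
    """Return one whitespace-separated word count per <item> in the export."""
    root = ET.fromstring(xml_text)
    counts = []
    for item in root.iter("item"):
        body = item.find(CONTENT_TAG)
        text = body.text if body is not None and body.text else ""
        # Raw split: typed hyperlinks and image tags count as words too.
        counts.append(len(text.split()))
    return counts


# Made-up sample standing in for a real export file.
SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item><title>First</title><content:encoded><![CDATA[Hello world, this is a post.]]></content:encoded></item>
    <item><title>Second</title><content:encoded><![CDATA[See <a href="http://example.com">this link</a> now.]]></content:encoded></item>
  </channel>
</rss>"""

counts = word_counts(SAMPLE)
print(counts, sum(counts))
```

A real export also contains pages, drafts and attachments as item elements, so a fuller version would need to filter on post type and status before counting.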
Now, a good play with data would be incomplete without a side trip into the tasty world of R. I dumped out the values that I obtained from word counting into a Comma Separated Value (CSV) file and, digging around in the R manual, Visualize This, and Data Analysis with Open Source Tools by Philipp Janert (O’Reilly), I did some really simple plotting. I wanted to see if there was any rhyme or reason to my posting, as a first cut. Here’s the first graph of words per post. The vertical axis is the number of words and the horizontal axis is the post number. So, reading left to right, you’ll see my development over time.
Sadly, there’s no pattern there at all – not only can’t we see one by eye, the correlation tests of R also give a big fat NO CORRELATION.
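The correlation check itself was done in R, but the underlying idea is easy to sketch in Python. The following is a plain Pearson correlation between post number and word count, written out by hand. The word counts are made-up illustrative values, not the real data, and the function name pearson is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Made-up word counts, one per post, in posting order.
words_per_post = [520, 910, 430, 880, 610, 760, 450, 900]
post_numbers = list(range(1, len(words_per_post) + 1))

r = pearson(post_numbers, words_per_post)
print(f"r = {r:.3f}")
```

A value of r near zero means no linear relationship between post number and length, while values near 1 or -1 would suggest posts steadily getting longer or shorter.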
Now, here’s a graph of the moving average over a 5-day window, to see if there is another trend we can see. Maybe I do have trends, but they occur over a longer period?
Uh, no. In fact, this one is worse for overall correlation. So there’s no real pattern here at all but there might be something lurking in the fine detail, because you can just about make out some peaks and troughs. (In fact, mucking around with the moving average window does show a pattern that I’ll talk about later.)
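For reference, a simple moving average can be sketched in a few lines of Python. This is an illustration of the technique with made-up numbers, not the R code behind the graph, and the function name moving_average is hypothetical.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values, in order."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up word counts, one per post.
words_per_post = [520, 910, 430, 880, 610, 760, 450, 900]

smoothed = moving_average(words_per_post, 5)
print(smoothed)
```

Changing the window argument is the "mucking around" step: a wider window smooths out more of the post-to-post noise but returns fewer points.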
However, those of you who are used to reading graphs will have noticed something about the label on the x-axis. It’s labelled as wp$day. This would imply that I was plotting post day versus average or count and, of course, I’m not. There have not been 151 days since January the 1st, but there have been days when I have posted multiple times. At the moment, for a number of reasons, this isn’t clear to the reader. More importantly, the day on which I post is probably going to have a greater influence on me, as my access to the Internet and the time I have available will vary from day to day. During SIGCSE, I think I posted up to 6 times a day. Somewhere, this is lost in the structure of the data, which considers each post as an independent entity. Posts consume time and, as a result, a longer post on one day will reduce the chances of another long post on the same day – unless something unusual is going on.
There is a lot more analysis left to do here and it will take more time than I have today, unfortunately. But I’ll finish it off next week and get back to you, in case you’re interested.
What do I need to do next?
- Relabel my graphs so that it is much clearer what I am doing.
- If I am looking for structure, then I need to start looking at more obvious influences and, in this case, given there’s no other structure we can see, this probably means time-based grouping.
- I need to think what else I should include in determining a pattern to my posts. Weekday/weekend? Maybe my own calendar will tell me if I was travelling or really busy?
- Establish if there’s any reason for a pattern at all!
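The time-based grouping in that list could start from something like the following Python sketch, which collapses individual posts into per-day totals. The dates and word counts are invented for illustration, and the function name totals_by_day is hypothetical.

```python
from collections import defaultdict
from datetime import date

# Invented (posting date, word count) pairs, one per post.
posts = [
    (date(2000, 1, 1), 400),
    (date(2000, 1, 1), 250),
    (date(2000, 1, 2), 900),
    (date(2000, 1, 4), 300),
]


def totals_by_day(posts):
    """Map each posting date to (number of posts, total words) for that day."""
    totals = defaultdict(lambda: [0, 0])
    for day, words in posts:
        totals[day][0] += 1      # posts published that day
        totals[day][1] += words  # words written that day
    return {day: tuple(pair) for day, pair in sorted(totals.items())}


daily = totals_by_day(posts)
for day, (post_count, word_total) in daily.items():
    print(day.isoformat(), post_count, word_total)
```

From here, day.weekday() would give the weekday/weekend split, and days with no posts could be filled in with zeros if the gaps matter to the analysis.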
As a final note, novels ‘officially’ start at a count of 40,000 words, although they tend to fall into the 80,000-100,000 range. So, not only have I written a novel in the past 4 months, I am most likely on track to write two more by the end of the year, because I will produce roughly 160,000-180,000 more words this year. This is not the year of blogging, this is the year of a trilogy!
Next year, my blog posts will all be part of a rich saga involving a family of boy wizards who live on the wrong side of an Ice Wall next to a land that you just don’t walk into. On Mars. Look for it on Amazon. Thanks for reading!