I’ve just signed up for the Digital Humanities Winter Institute course on “Large-scale text analysis with R”. K read about it on ProfHacker and passed it on to me thinking I’d be interested. Of course, I was, but it goes well beyond learning R itself. R is a statistically focused programming package that is available for free for most platforms. It’s the statistical (and free, did I mention that?) cousin to the mathematically inclined Matlab.
I’ve spoken about R before and I’ve done a bit of work in it but, and here’s why I’m going, I’ve done all of it from within a heavily quantitative Computer Science framework. What excites me about this course is that I will be working with people from a completely different spectrum and with a set of text analyses with which I’m not very familiar at all. Let me post the text of the course here (from this website) [my bold]:
Text collections such as the HathiTrust Digital Library and Google Books have provided scholars in many fields with convenient access to their materials in digital form, but text analysis at the scale of millions or billions of words still requires the use of tools and methods that may initially seem complex or esoteric to researchers in the humanities. Large-Scale Text Analysis with R will provide a practical introduction to a range of text analysis tools and methods. The course will include units on data extraction, stylistic analysis, authorship attribution, genre detection, gender detection, unsupervised clustering, supervised classification, topic modeling, and sentiment analysis. The main computing environment for the course will be R, “the open source programming language and software environment for statistical computing and graphics.” While no programming experience is required, students should have basic computer skills and be familiar with their computer’s file system and comfortable with the command line. The course will cover best practices in data gathering and preparation, as well as addressing some of the theoretical questions that arise when employing a quantitative methodology for the study of literature. Participants will be given a “sample corpus” to use in class exercises, but some class time will be available for independent work and participants are encouraged to bring their own text corpora and research questions so they may apply their newly learned skills to projects of their own.
There are two things I like about this: firstly that I will be exposed to such a different type and approach to analysis that is going to be immediately useful in the corpus analyses that we’re planning to carry out on our own corpora, but, secondly, because I will have an intensive dedicated block of time in which to pursue it. January is often a time to take leave (as it’s Summer in Australia) – instead, I’ll be rugged up in the Maryland chill, sitting with like-minded people and indulging myself in data analysis and learning, learning, learning, to bring knowledge home for my own students and my research group.
So, this is my Summer Camp. My time to really indulge myself in my coding and just hack away at analyses and see what happens.
I’ve also signed up to a group who are going to work on the “Million Syllabi Project Hack-a-thon“, where “we explore new ways of using the million syllabi dataset gathered by Dan Cohen’s Syllabus Finder Tool” (from the web site). 10 years worth of syllabi to explore, at a time when my school is looking for ways to be able to teach into more areas, to meet more needs, to create a clear and attractive identity for our discipline? A community of hackers looking at ways of recomposing, reinterpreting and understanding what is in this corpus?
How can I not go? I hope to see some of you there! I’ll be the one who sounds Australian and shivers a lot.