
S+7 through NLTK

S+7

One of the earliest Oulipian procedures is Jean Lescure’s S+7. While its status as a constraint is debatable (originally called a method, sometimes referred to as a procedure), it is one of the most cited and perhaps also least understood of the Oulipo’s long list of techniques.

To begin, S stands for “substantif” (noun), but it can in theory be replaced with any other part of speech. One of the founders of the Oulipo, François Le Lionnais, pointed out that S+7 is a specific case of m±n, where m is a “meaningful” part of speech and n is any integer. Carrying out an S+7 or any of its variations should be a purely mechanical procedure. All an author needs are two things: a pre-existing text and a dictionary. The author identifies all the nouns in the text and replaces each with the noun that comes seven entries later in the chosen dictionary. The result therefore depends upon the original text and the dictionary chosen, but not much else.
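The mechanical nature of the procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `s_plus_n`, the toy dictionary, the decision to wrap around at the end of the dictionary, and the rule for nouns missing from the dictionary are all hypothetical choices, and the nouns are supplied by hand rather than detected automatically.

```python
import bisect
import re

def s_plus_n(text, nouns, dictionary, n=7):
    """Replace each listed noun with the entry n places later in the dictionary."""
    ordered = sorted(set(dictionary))
    noun_set = {noun.lower() for noun in nouns}

    def shift(match):
        word = match.group(0)
        key = word.lower()
        if key not in noun_set:
            return word  # not a noun: leave untouched
        # Index of the noun, or of the slot where it would be inserted.
        start = bisect.bisect_left(ordered, key)
        found = start < len(ordered) and ordered[start] == key
        # Assumption: a noun absent from the dictionary counts from its slot.
        step = n if (found or n <= 0) else n - 1
        replacement = ordered[(start + step) % len(ordered)]  # wrap around
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", shift, text)

# Toy dictionary (hypothetical); n=2 keeps the example small.
toy_dictionary = ["apple", "bench", "cloud", "day", "earth",
                  "forum", "god", "heaven", "light", "water"]
result = s_plus_n("God created the earth.", ["god", "earth"], toy_dictionary, n=2)
print(result)
```

Because `n` is a parameter and may be negative, the same function covers the m±n generalization for a single part of speech.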

Example

In the bench Governor created the help and the economist.
And the economist was without forum, and void; and day was upon the failure of the deep. And the Spring of Governor moved upon the failure of the weddings.
And Governor said, Let there be link: and there was link.

(generated on http://www.spoonbill.org/n+7/)

The interest of this particular S+7 and indeed most of the Oulipo’s best-loved examples is that the original text (Genesis from the Bible) is extremely recognizable. It isn’t the dictionary that led to the hilarity of the result, but rather that even with the noun substitutions, the original text is still very much audible, but with unexpected new words. While the choice of the dictionary could have created more specific substitutions, the Oulipo has not really done much experimenting with the dictionaries — they have used big ones and small ones (and in the case of one Queneau S+7, a culinary one).

Natural Language Processing

For my digital humanities project, I am making my own S+7 program using NLTK with Python. While my earlier programming efforts were difficult for a beginner, trying my hand at NLTK makes me feel like I’ve made it to another level entirely. Going through the NLTK online textbook has been very helpful and has reinforced the programming knowledge I have already gained through working on this project. Also, Natural Language Processing has helped me better understand the early constraints of the Oulipo, greatly contributing to my chapter on algebra, which includes analysis of the S+7 and its variations, as well as other methods that are based on simple substitutions, counting, or operations.

I am pleased to report that I am putting the final touches on this last program, which will allow the reader to generate a dictionary based on one author’s vocabulary (the one I am currently working with takes all the nouns from Edgar Allan Poe’s collected poetry) and substitute those nouns into a short excerpt from several other recognizable texts (Moby Dick, The Declaration of Independence, Genesis, A Tale of Two Cities, and The Raven).
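As a rough illustration of how an author-specific dictionary might be built, the sketch below assumes the text has already been tokenized and tagged, for example by NLTK's `pos_tag`, which returns (word, tag) pairs using Penn Treebank tags (`NN` for singular nouns, `NNS` for plural ones). The function name `noun_dictionary` and the hand-tagged sample are hypothetical stand-ins, not my actual program or real tagger output.

```python
def noun_dictionary(tagged_tokens):
    """Build a sorted list of distinct common nouns from (word, tag) pairs."""
    nouns = {
        word.lower()
        for word, tag in tagged_tokens
        if tag in ("NN", "NNS") and word.isalpha()
    }
    return sorted(nouns)

# Hand-tagged sample standing in for tagger output (an assumption).
tagged = [
    ("the", "DT"), ("raven", "NN"), ("tapped", "VBD"), ("at", "IN"),
    ("my", "PRP$"), ("chamber", "NN"), ("door", "NN"), (",", ","),
    ("dreaming", "VBG"), ("dreams", "NNS"), ("of", "IN"), ("the", "DT"),
    ("Raven", "NN"),
]
poe_nouns = noun_dictionary(tagged)
print(poe_nouns)
```

The resulting sorted list can then serve as the dictionary in which the S+7 substitution counts its seven entries.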

Once I have worked out the kinks in my pluralizing function (if the original noun is plural, I need the substituted one to be plural as well), I will publish the code online in my GitHub repository as well as here on CORE. While I do not believe that this code is particularly useful, the process of creating it was invaluable to me as a scholar and a programmer. I now understand the Oulipo and their computer efforts much better, as well as their elementary procedures. Programming texts that seem gimmicky, but that are hardly ever “read” (such as the Cent mille milliards de poèmes), has forced me to design new ways to read them. I have also gained new insights into the digital humanities and how they can be used not to produce an online archive or digital editions of texts (though I have created interactive, digital editions of certain texts or procedures), but rather to open eyes to the possibilities in such experimental fiction. Works written using new methods must be analyzed using new methods. In that sense, it is the intellectual process of carrying out this project, and not the product itself, that I will take with me.
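To give a sense of why the pluralizing step mentioned above is fiddly, here is a naive sketch of English pluralization. The function names, the handful of rules, and the small table of irregular nouns are all assumptions chosen for illustration; real English has many more exceptions, and this is not the function from my program.

```python
# A few irregular plurals (far from complete; an illustrative assumption).
IRREGULAR = {"man": "men", "woman": "women", "child": "children",
             "foot": "feet", "mouse": "mice"}

def pluralize(noun):
    """Apply a few common English pluralization rules to a singular noun."""
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"  # consonant + y: city -> cities
    return noun + "s"

def match_number(original_tag, replacement):
    """Pluralize the replacement when the original noun was tagged plural (NNS)."""
    return pluralize(replacement) if original_tag == "NNS" else replacement

print(match_number("NNS", "bench"), match_number("NN", "link"))
```

The `NNS` check assumes Penn Treebank tags, where plural common nouns carry that label.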

Digital Humanities Summer School

Thanks to a travel grant from the Center for Digital Humanities @ Princeton, I have just completed the intensive week-long Digital Humanities summer school at the OBVIL laboratory at La Sorbonne. OBVIL stands for the “Observatoire de la vie littéraire” or the observatory of literary life. After my Digital Oulipo project and continued work on the Oulipo Archival Project, I cannot agree more with the metaphor of an observatory. Digital humanities allow researchers to examine texts from a distance, which complements the traditional literary scholarship of “close readings.” Now more than ever, I believe humanities scholarship needs both perspectives to succeed.

In this intensive and rich program, I was able to continue to develop my skills in XML-TEI that I had been learning through the Oulipo Archival Project. Furthermore, I discovered exciting new software such as TXM, Phoebus, Médite, and Iramuteq and how they can be used to learn more about large corpuses of text. My favorite part of this program was that it was a specifically French introduction to European developments in the digital humanities, allowing me to broaden my perspective on the discipline.

Here is a brief summary of what I learned day by day. I am happy to answer any specific questions by email. Feel free to contact me if you want to know more about the OBVIL summer school, the specific tools discussed there, or just about digital humanities.

Day 1

The first day of the summer school was a general introduction to the history of digital humanities methods and how to establish a corpus to study using these digital methods. It was especially interesting for me to learn the history of these methods I have been experimenting with for months. I had no idea that the Text Encoding Initiative (TEI) had been launched in 1987, before I was even born, as a new form of “literate” programming.

Surprisingly, the most useful workshop was a basic introduction to the various states of digital texts. While I knew most of the types of digital documents already as a natural byproduct of using computers in my day-to-day life, it was useful to discuss the specific terminology (in French even!) used to describe these various forms of texts and the advantages and disadvantages of each. For instance, while I knew that some PDFs are searchable while others are not, it was still useful to discuss how to create such documents, the advantages of each, and how to move from one format to another.

Day 2

The second day of the summer school began by asking the not-so-simple question of “what’s in a word?” In the following sessions, we learned about everything from how to analyze word frequencies in texts to how to treat natural language automatically, through tokenization (segmenting text into elementary units), tagging (determining the different characteristics of those units), and lemmatization (identifying the base form of words). We then had specific workshops meant to introduce us to ready-made tools we could use to treat language automatically. We did not discuss NLTK, however, which I am currently using to program the S+7 method for my Digital Oulipo project, most likely because using NLTK requires a basic understanding of programming in Python, which was out of the scope of this short summer school.
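The pipeline described above can be sketched with nothing but the Python standard library. This is a deliberately crude illustration: the regex tokenizer, the hand-written `LEMMAS` table, and the sample sentence are assumptions made for the example, and the tagging step is left out because it requires a trained model such as the ones NLTK or TreeTagger provide.

```python
import re
from collections import Counter

def tokenize(text):
    """Segment text into lowercase word tokens (letters only, accents included)."""
    return re.findall(r"[^\W\d_]+", text.lower())

# Hand-written lemma table (hypothetical; real lemmatizers use large lexicons).
LEMMAS = {"was": "be", "is": "be", "words": "word"}

def lemmatize(token, lemmas=LEMMAS):
    """Map a token to its base form, falling back to the token itself."""
    return lemmas.get(token, token)

text = "The word was a word among words."
tokens = tokenize(text)
lemmas = [lemmatize(token) for token in tokens]
frequencies = Counter(lemmas)  # word frequencies counted on base forms
print(frequencies.most_common(2))
```

Counting lemmas rather than raw tokens is one answer to “what’s in a word?”: here “word” and “words” are treated as the same item.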

The second half of this day was an introduction to text encoding, how it works and why it is useful for analyzing large corpuses. While I was already familiar with everything covered here, it was still interesting to hear about the applications of TEI to something other than the Oulipo archive. It was especially interesting to hear about applications of TEI to highly structured texts such as 17th century French theater.

Day 3

This day was extremely technical. First we looked at co-occurrences of characters in Phèdre as an example of network graphs. Since the main technical work had been done for us, it was somewhat frustrating to be confronted with a result that we had no part in creating. As a former mathematician, I knew how to read a network graph, but many other students did not and took its spatial organization as somehow meaningful or significant. This demonstrates a potential pitfall in digital humanities research: one needs a proper technical understanding of the tools and how they function in order to interpret the results with accuracy.
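To make the point about interpretation concrete, here is a minimal sketch of how co-occurrence data for such a graph might be computed. The scene casts below are invented for illustration and do not reflect the actual distribution of scenes in Phèdre; the function name `cooccurrences` is likewise hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(scenes):
    """Count how many scenes each pair of characters shares."""
    edges = Counter()
    for cast in scenes:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(cast)), 2):
            edges[pair] += 1
    return edges

# Invented scene casts (an assumption, not the real play).
scenes = [
    ["Hippolyte", "Théramène"],
    ["Phèdre", "Œnone"],
    ["Phèdre", "Hippolyte", "Œnone"],
]
edges = cooccurrences(scenes)
for pair, weight in edges.most_common():
    print(pair, weight)
```

The data consists only of pairs and weights. Where each node lands on the page is typically decided afterwards by a layout algorithm, which is why the spatial arrangement alone should not be over-interpreted.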

In addition to network graphs, we also discussed how to use the XPath feature in Oxygen (an XML editor) to count various elements in classical theater such as spoken lines by characters, verses, or scenes in which characters take part. Once again, it was interesting to see how a computer could take over such tedious manual labor and how the results could be of interest to a scholar, but interpreting such statistical aspects of large corpuses of text is tricky work for someone whose last statistics class was in high school. This gave me the idea to create a course that would properly teach students how to use these tools and understand the results through workshops.
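The same kind of counting can be sketched outside Oxygen with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The fragment below is invented, loosely following TEI drama markup (`sp` for a speech, `l` for a verse line), with no namespace and made-up `who` values for brevity; it is an assumption for illustration, not a file from the summer school.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment in the style of TEI drama markup (no namespace).
SAMPLE = """
<div type="scene">
  <sp who="phedre"><l>First line</l><l>Second line</l></sp>
  <sp who="oenone"><l>A reply</l></sp>
  <sp who="phedre"><l>Another line</l></sp>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# Speeches per character: every <sp> element, grouped by its who attribute.
speeches = Counter(sp.get("who") for sp in root.findall(".//sp"))

# Verse lines per character: <l> children of each speech.
verses = Counter()
for sp in root.findall(".//sp"):
    verses[sp.get("who")] += len(sp.findall("l"))

# An attribute predicate, similar in spirit to count(//sp[@who='phedre']).
phedre_speeches = len(root.findall(".//sp[@who='phedre']"))
print(speeches, verses, phedre_speeches)
```

Raw counts like these are easy to produce; deciding what they say about a play is the part that calls for statistical care.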

Day 4

This was another ready-made tool workshop in which we discussed using OBVIL’s programs Médite and Phoebus to edit online texts more efficiently and find differences between different editions. This was very interesting, but probably more useful for publishing houses than for graduate students.
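For readers curious about what comparing editions involves at its simplest, here is a small sketch using Python's standard `difflib` module. It is not how Médite works internally (I do not know its algorithm); the function name `word_diff` and the two sample strings are assumptions for illustration.

```python
import difflib

def word_diff(edition_a, edition_b):
    """List the word-level spans that differ between two versions of a text."""
    a, b = edition_a.split(), edition_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replacements, insertions, deletions
    ]

changes = word_diff(
    "And God said, Let there be light",
    "And Governor said, Let there be link",
)
print(changes)
```

Each tuple reports the kind of change and the words involved in each version, which is the raw material a collation tool would then present to an editor.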

The rest of the day was meant to introduce us to textometry using TXM, but there were so many technical issues with the computers provided by the university that we spent the entire time downloading the software on our personal laptops. This was not only frustrating, but ironic. One would think that a summer school in digital humanities run mostly by computer scientists would not have such technical difficulties.

Day 5

The final day of the program (Friday the 9th was devoted to discussing our personal projects with the staff) continued the work on TXM. In fact, as my section had had such issues the previous day, I decided to switch into the other group. This was a good decision, as the head of that session was more pedagogical in his approach, assigning a series of small exercises to introduce us to TXM. By experimenting with tokenization using TreeTagger and concordance of words, we were able to begin to write a bit of code that could parse a text and find specific groups of words.
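A concordance of the kind we explored can be approximated in plain Python. The sketch below is a simplified keyword-in-context listing, not TXM's own query mechanism; the function name `concordance`, the `width` parameter, and the sample sentence are assumptions for illustration.

```python
import re

def concordance(text, keyword, width=3):
    """Return each occurrence of keyword with up to `width` tokens of context."""
    tokens = re.findall(r"[^\W\d_]+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

hits = concordance("Let there be light: and there was light.", "light", width=2)
for line in hits:
    print(line)
```

Lining up every occurrence of a word with its neighbors makes recurring groups of words visible at a glance.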

This introduction was practical and hands on, but I wish there had been more. While I now know vaguely how to use TXM to analyze texts, I do not have experience coming up with the questions that such techniques might help me answer. This seems to me the key to effective digital humanities scholarship — asking a solvable question and knowing which tools can help you resolve it.