Wednesday, October 22, 2014

Oh, the Digital Humanity!


Why, yes, I am a command-prompting computer wizard. Why, just look at my DATA! See all that Data over there? Yes. Mine. I did that. With... CODE MAGIC.

...or something.

Although this screenshot (in my opinion) makes me look like a computer wizard, the question it answers is fairly simple. I set out to find a program that could tell me which words within the text of Gogol's Dead Souls occurred most frequently in each chapter. This is called "topic modelling": looking at word frequency to build a set of topic words for different sections of a text (in this case, chapters of a book). The program that seemed most likely to get the job done was called "Mallet."
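For a sense of the simplest version of that question, here is a minimal Python sketch that just counts words per chapter. To be clear, this is plain counting, not the statistical topic model that Mallet fits, and the two "chapters" and the stop-word list are made-up stand-ins rather than text from Dead Souls.

```python
from collections import Counter
import re

# Hypothetical stand-in chapters (invented sentences, not the novel's text)
chapters = {
    "Chapter 1": "The sturgeon was served. The sturgeon was large, and the dish was hot.",
    "Chapter 2": "An egg, another egg, and a third egg sat on the dish.",
}

# Hypothetical, deliberately tiny stop-word list
STOP_WORDS = {"the", "was", "and", "a", "an", "on"}

def top_words(text, n=1):
    """Return the n most frequent non-stop-words in a piece of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [word for word, _ in Counter(kept).most_common(n)]

for title, text in chapters.items():
    print(title, top_words(text))
```

A real run would read each chapter from a file and use a much longer stop-word list, but the shape of the task is the same.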


The biggest problem that I encountered (and I encountered it with several of the DH programs and tools that I tried out, mainly Voyant and Tagxedo) was in working with Russian text. Cyrillic characters can be a problem, of course, but the structure of Russian words makes them harder to map even if the program can process Cyrillic. Because the Russian grammatical system works by changing the endings of words, most programs won't recognize two words as identical if they are used in different grammatical constructions. The word девушка, for instance, means "girl," but can be written as девушка, девушку, девушкой, девушке, or девушки depending on the grammatical context. Although these are all the same word, a program that analyzes words as entire units won't see it that way, and my data becomes useless.
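The problem is easy to reproduce in a few lines of Python. This is only a sketch of how a whole-word counter behaves, not the internals of Voyant, Tagxedo, or Mallet, and the sample sentence is made up for illustration.

```python
from collections import Counter
import re

# Invented sample sentence using four forms of the same noun
text = "Девушка видит девушку. Девушки говорят о девушке."

def count_words(text):
    """Lowercase the text and count every whole word as its own item."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

counts = count_words(text)

# Collect every counted item that is really a form of "девушка"
forms = {word: n for word, n in counts.items() if word.startswith("девушк")}
print(forms)
```

Each form is counted separately, once apiece, so a word that appears four times never looks more frequent than any other word in the sentence.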

So, is there a solution? Those of you who speak Russian and read Cyrillic will notice some unintelligible text in the top left-hand corner of my screenshot --- this is a modified version of Gogol's Dead Souls, from which I attempted to remove all of the grammatical endings. Although this seemed like a good idea, it did not solve the problem of stop-words. Of course, the stop-word lists in all of the programs I worked with were English, which did me no good when working with a Russian text. After scrubbing the text clean of endings and finding lists of Russian stop-words to add to my program (the series of folders in the upper right-hand corner) via the command prompt (lower right-hand corner), I came up with a messy pile of word-pieces that hardly represented the data that I was looking for.
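To make that kind of scrubbing concrete, here is a crude Python sketch. It is not the process behind the screenshot: the list of endings and the list of stop-words are tiny, hypothetical examples, and real Russian has far more endings than this.

```python
import re

# Hypothetical, deliberately tiny lists for illustration only.
# Longer endings come first so that "ой" is tried before single letters.
ENDINGS = ("ой", "а", "у", "е", "и")
STOP_WORDS = {"и", "в", "на", "о", "не"}

def strip_ending(word):
    """Chop off the first matching ending, keeping at least three letters."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word

def scrub(text):
    """Lowercase, drop stop-words, and strip endings from what remains."""
    tokens = re.findall(r"\w+", text.lower())
    return [strip_ending(t) for t in tokens if t not in STOP_WORDS]

print(scrub("Девушка и девушку, о девушкой не девушке"))
print(strip_ending("кофе"))
```

All four forms of девушка collapse to the same piece, which is the goal. The second line shows the cost: кофе ("coffee") never changes its ending in Russian, but blind chopping still mangles it, which is one way to end up with a pile of word-pieces. A proper stemmer or lemmatizer built for Russian would be the more careful route.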

But was time wasted? Definitely not. The squeaky-clean version of my Russian text came in handy for making word-clouds, just to see which words showed up most often within the entire novel. Interesting to see, and potentially useful, but not exactly what I was going for --- I wanted chapter-by-chapter division. After getting the hang of the Mallet program and some basic command-prompting, I just entered a full English translation of the text to be topic-mapped, and the results are the lists of English words that take up most of my screenshot. I had expected to see some food-related words up there, and was happy to see that "sturgeon," "dish," and "egg" made the cut, but I wasn't blown away by the results as a whole. It was cool to try topic-mapping, and I think this could definitely be useful once I get a better idea of what this program is capable of.

In the meantime, I'll stick to what I know. With regard to the Russian text, and finding instances of food-vocabulary, I took two minutes and came up with the gallery of images below. I literally just hit Ctrl+F to find where in the Russian text I could find different food-related words like bread, fish, sturgeon, sugar, pastry... AND (is it just my computer that is this awesome, or is it Google Chrome, or what?) when I search that way, a little side-bar shows up that highlights where exactly in the text each example is found, and I can see the distribution in little yellow highlights. This is just an example of how DH-applicable extras have already been integrated into systems that we use all the time. Sad to say, but I think this was more interesting/useful/relevant than the hours I spent learning how to code and topic-map... but I will not be discouraged.
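The same find-and-highlight trick can be sketched in a few lines of Python. This is only an approximation of what a browser's find bar does, and the sample text is an invented string of Russian food words rather than a passage from the novel.

```python
def find_offsets(text, term):
    """Return the character position of every occurrence of term, ignoring case."""
    lowered, needle = text.lower(), term.lower()
    offsets = []
    start = lowered.find(needle)
    while start != -1:
        offsets.append(start)
        start = lowered.find(needle, start + 1)
    return offsets

# Invented sample: "bread and fish, then sugar, then bread again"
sample = "хлеб и рыба, потом сахар, потом снова хлеб"

for term in ("хлеб", "рыб", "сахар"):
    hits = find_offsets(sample, term)
    # Position as a fraction of the text, like the marks on a scroll bar
    spread = [round(pos / len(sample), 2) for pos in hits]
    print(term, hits, spread)
```

Searching for a piece of a word, like рыб, also catches рыба, рыбу, and the other forms, so a plain substring search sidesteps the endings problem without any scrubbing.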


Conclusion: "Screwing around" with a DH tool is all well and good if you just want to learn the capabilities of that tool and see how it MIGHT be applicable to your research. All of the tools I tried were fun and interesting, and could have potential use in my project or in future projects. If you already know what you're looking for, however, finding the right DH tool can be a problem... but it doesn't have to be as complicated as you think!

#DH4lyfe
