In 1955, L. Sprague de Camp published a posthumous collaboration with Robert E. Howard, Tales of Conan, consisting of four non-Conan stories written by Howard that de Camp rewrote as Conan stories. This was followed in 1966 by Conan the Adventurer, which contained three stories by Howard (edited by de Camp) and one story started by Howard and finished by de Camp. This was to become the first (in order of publication) in a 12-volume series. The remaining 11 books in this series contain a mixture of stories: written by Howard and edited by de Camp; started by Howard and finished by de Camp; written by Howard (as non-Conan stories) and adapted (to Conan stories) by de Camp; and written by de Camp and Lin Carter (with the exception of one story started by Howard and finished by Carter and a sole story written by Bjorn Nyberg and L. Sprague de Camp).
In this 12 volume series, L. Sprague de Camp provided a chronological ordering for these stories (which Howard did not do) beginning with a teenage Conan and ending with a Conan in his 60s. The publication order does not match de Camp’s in-story chronological ordering; volumes 1-10 and 12 were published by Lancer between 1966 and 1968 (the first book published being volume 3 in de Camp’s chronological ordering). However, Lancer went out of business before publishing the final book, volume 11 in de Camp’s ordering. This book, Conan of Aquilonia, was published in 1977 by Prestige. Ace publications distributed and reprinted the entire 12 volume series. As such, this series is often referred to as the Lancer-Ace series.
In The Blade of Conan, de Camp stated that “In completing the unfinished Conan stories…I have tried to adhere to the style and spirit of Howard….readers may amuse themselves by guessing where, in stories like “Drums of Tombalku” and “Wolves Beyond the Border,” Howard left off and I began.”1 At this point we no longer need to guess, as we have copies of Howard’s drafts, but I have done some work in the field of unsupervised authorship attribution and thought it would be interesting to see if I could determine where the split occurred using a slightly non-traditional method – namely, mathematics and statistics.
I can almost hear the clicking as people leave this page after that last line, so I want to give you my assurance that you will not need to understand (or even tolerate) any mathematical concepts to grok this post. I will leave the gritty details out, focusing on the results.
Stylometry is the study, and often quantification, of stylistic differences in written language. Typically, stylometric analyses are used in conjunction with more traditional methods for determining style and authorship, such as literary critiques from individuals knowledgeable about works by the disputed authors. As an example of recent usage, in 2013 stylometric techniques were used in the (UK) Sunday Times’s investigation of author Robert Galbraith, first-time novelist and author of The Cuckoo’s Calling. The paper had received an anonymous tip via Twitter that Robert Galbraith was really a pen name for J.K. Rowling, and stylometric analysis lent evidence supporting this claim. The analysis was included in the materials the Times sent to Rowling’s agent when they asked, directly, whether she was the author of The Cuckoo’s Calling. Less than a day later Rowling confirmed that she was the author.2
So how does stylometry do this? Well, writing contains a lot of choices. Some of these choices are intentional and some arise from dialects, but there are other choices that are subconscious and don’t have an obvious explanation. Some of these subconscious choices remain static across an author’s writings. If we have two possible authors for an unknown piece, we can compare prior writings by these authors to the unknown piece, looking at these subconscious choices. (As a side note, we can extend the concept of “author” beyond the person doing the writing. I have shown that stylometric techniques successfully distinguish Isaac Asimov’s Foundation stories written in the ‘40s and ‘50s from those he wrote in the ‘80s and ‘90s using the same methods I will describe in a bit. So, here, the two “authors” are both Asimov.)
A couple of questions immediately arise. For example, what are the subconscious traits and how reliable are the results? This is actively being researched, but a common subconscious trait is the use of function words in writing. Function words are words with little lexical meaning, but instead serve as a type of glue for the other words. Examples are words like “of” and “the.” On their own they have little meaning, but they are used to form relationships among the other words in a sentence. Since these are often used without much thought, the idea is that they demonstrate subconscious writing trends. (Maybe one author uses “on the left” while another “to the left”.) If you look at the use of, say, the 75 most common function words and compare their uses among the contested piece and sample writings, you can make a case for authorship.
It would be easy to think that these common function words, often referred to as the Most Frequent Words (MFWs), are a terrible metric with little meaning in determining authorship, but they have been shown to be useful in this regard. It makes a kind of sense, since these words are typically chosen with little thought and indicate trends in sentence structure.
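For readers who do want a peek under the hood, here is a minimal Python sketch of what "counting MFWs" can look like. It is an illustration only, not the code behind the results in this post: the function names, the tokenizing rule, and the two toy sentences are all invented for the example, and a real analysis would use the 75 most frequent words over whole stories rather than 3 over two sentences.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (plain apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())

def most_frequent_words(texts, n):
    """Return the n most common words across all texts combined."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]

def relative_frequencies(text, vocabulary):
    """Share of the text's tokens taken up by each vocabulary word."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against empty input
    return [counts[word] / total for word in vocabulary]

# Toy sentences, invented for illustration only.
sample_a = "He turned to the left and ran to the gate of the city."
sample_b = "On the left stood the tower, and on the right the wall of the city."

mfw = most_frequent_words([sample_a, sample_b], 3)
profile_a = relative_frequencies(sample_a, mfw)
profile_b = relative_frequencies(sample_b, mfw)
```

Each text ends up as a short list of numbers (its "profile"), one per function word, and it is these profiles, not the stories themselves, that get compared.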
It is important to remember that the methods I use will (pretty much) always indicate one author in a head-to-head contest over the other, regardless of whether either author is the true author. We are simply choosing a particular measurable object (function word usage) in writing and comparing the measurements for the unknown piece to the measurements in sample writings by potential authors. Whoever differs least, “wins.” (So, there are a lot of caveats: Does the “measurable thing” really distinguish authors? Do we know the authors we choose to compare are the only ones who could have written the piece?)
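The "whoever differs least wins" idea can be sketched in a few lines of Python. The distance used here is modeled on Burrows's Delta (standardize each word's frequency, then average the absolute differences); that choice is an assumption for illustration, since the exact measure is not spelled out above. The profiles are made-up numbers, and standardizing over the candidates plus the unknown text is a simplification of what a full analysis would do with a larger reference corpus.

```python
from statistics import mean, pstdev

def z_scores(profiles):
    """Standardize each word's frequency across all profiles (column-wise)."""
    columns = list(zip(*profiles))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return [
        [(value - m) / s for value, m, s in zip(profile, means, stdevs)]
        for profile in profiles
    ]

def delta(a, b):
    """Mean absolute difference between two standardized profiles."""
    return mean(abs(x - y) for x, y in zip(a, b))

def closest_author(unknown, candidates):
    """Return (winner, scores); candidates maps author name -> frequency profile."""
    names = list(candidates)
    scaled = z_scores([candidates[name] for name in names] + [unknown])
    scores = {name: delta(scaled[-1], scaled[i]) for i, name in enumerate(names)}
    return min(scores, key=scores.get), scores

# Hypothetical frequency profiles for three function words (made-up numbers).
candidates = {
    "author_a": [0.060, 0.030, 0.020],
    "author_b": [0.045, 0.036, 0.012],
}
unknown = [0.058, 0.031, 0.019]
winner, scores = closest_author(unknown, candidates)

# A profile that resembles neither candidate still produces a "winner".
stranger_winner, _ = closest_author([0.100, 0.100, 0.100], candidates)
```

The last line is the caveat in miniature: `closest_author` always returns one of the names it was given, even for a text that neither candidate could plausibly have written.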
I’m going to brush all of those nasty little questions aside for now and just do some stuff.
To demonstrate the potential of this method, let’s examine an authorship question for which we already know the answer. This may seem like a waste of time…why do a metallurgical analysis of a penny and a nickel to tell them apart when we already know which is which? The reason is quite simple: If we want to use these methods on actual authorship questions, it is necessary to see if the methods have any merit.
This demonstration will involve the finished story “Drums of Tombalku,” with the goal of finding the “hand-off” spot in the story. This story makes an ideal example because Howard’s writing comprises the first part of the story with few changes by de Camp, while the second part of the story is the sole work of de Camp. An important initial question is whether this technique can distinguish the writing of de Camp and Carter from that of Howard. (I’ll address the issue of de Camp sans Carter in a bit.) Using the 75 MFWs, a hierarchical tree diagram was created.3 This diagram shows how the texts line up in terms of MFW usage. Here, a story by Howard is marked as H_Title and a story by de Camp and Carter is DC_Title. The diagram is fairly intuitive; similar stories are “closer” in the tree structure. Here, we clearly see two large “clusters” that do a solid job of distinguishing Howard from de Camp and Carter. The one exception is The Thing in the Crypt, which shows up in the Howard cluster. There are potential reasons for this, including the authorship, since this piece is based on a draft by Carter for his Thongor character. Discussing this one outlier could easily be its own blog post, so I’ll leave it alone.
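As a rough idea of how such a tree gets built, here is a small pure-Python sketch of agglomerative clustering: repeatedly merge the two closest groups until only one remains. It is a simplified stand-in, not the procedure used to draw the diagram. The average-linkage rule, the plain mean-absolute-difference distance, and the four made-up profiles are all assumptions chosen to keep the example short; only the H_/DC_ label prefixes come from the post.

```python
from itertools import combinations

def distance(a, b):
    """Mean absolute difference between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cluster(profiles):
    """Average-linkage agglomerative clustering.

    profiles maps a label to a numeric profile. Returns the merge tree as
    nested tuples, e.g. (("H_One", "H_Two"), ("DC_One", "DC_Two")).
    """
    def linkage(pair):
        # Average distance over every cross-pair of members.
        (_, left), (_, right) = pair
        dists = [distance(profiles[i], profiles[j]) for i in left for j in right]
        return sum(dists) / len(dists)

    # Each cluster is (tree so far, list of member labels).
    clusters = [(label, [label]) for label in profiles]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]

# Made-up profiles; the label prefixes follow the H_/DC_ naming described above.
profiles = {
    "H_One": [0.060, 0.030, 0.020],
    "H_Two": [0.061, 0.029, 0.021],
    "DC_One": [0.045, 0.036, 0.012],
    "DC_Two": [0.043, 0.038, 0.014],
}
tree = cluster(profiles)
```

With these toy numbers the two H_ texts merge with each other before either joins a DC_ text, which is the kind of grouping the diagram shows for the real stories.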
If we accept that these methods, for the most part, reliably distinguish Howard from de Camp/Carter, we can move on to the issue of finding the split in “Drums of Tombalku.” Those familiar with this piece may have already caught a potential snag – namely, that I’ve been comparing Howard to the combined authorship of de Camp/Carter, but Carter was not involved in finishing Drums. This means that we really need a sample of “pure” de Camp for comparison. Unfortunately, this is a bit difficult since I need the copy in digital format and the piece should be similar in terms of publishing date, genre, etc. I chose to use three pieces for comparison, each with faults and strengths:
1) Conan of the Isles – This is not ideal as it is de Camp and Carter.
2) Conan and the Spider God – This is credited as a solo de Camp piece (the only Conan story with this distinction), but it was written 14 years after Drums and some believe that de Camp’s wife, Catherine, helped with the writing of this book.
3) The Eye of Tandyla – This is not ideal as it is not a Conan story and was written over a decade before “Tombalku” saw publication, but it has been described as being “in the Conan tradition in every sense of the word”4 and is pure, uncut de Camp.
A stylometric procedure known as Rolling Delta is used to find authorship changes in a single text. Basically, it “rolls” through the text examining “chunks” to see which author is the most likely author for that “chunk”. As an example, if I choose to “roll” through my text with a sample size of 2000 and a step size of 100, we would compare words 1 through 2000 in Drums to samples by both authors to determine which author is closer in terms of MFW usage. Each author is assigned a number; I’ll skip explaining exactly what this number means, but it suffices to say that the author with the lower number is the most likely author (according to this method). Next, words 101 through 2100 would be examined and again the most likely author would be determined. We “roll” through the text until we reach the end. A visual is created that represents the results by plotting and connecting the output numbers, creating two “graphs”, one for each author. When the lines cross we have a switch in authorship.
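The rolling idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Stylo implementation: the score is a plain mean absolute difference of relative frequencies, each window is positioned at its midpoint (my choice for the sketch), and the "texts" are synthetic token lists built so that the true hand-off is known. All function and variable names are invented for the example.

```python
from collections import Counter

def profile(tokens, vocabulary):
    """Relative frequency of each vocabulary word in a list of tokens."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocabulary]

def distance(a, b):
    """Mean absolute difference between two profiles (a simplified score)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def rolling_scores(tokens, references, vocabulary, window, step):
    """Score every window against each reference author.

    Returns a list of (window_midpoint, {author: score}); lower means closer.
    """
    ref_profiles = {name: profile(toks, vocabulary) for name, toks in references.items()}
    results = []
    for start in range(0, len(tokens) - window + 1, step):
        chunk = profile(tokens[start:start + window], vocabulary)
        scores = {name: distance(chunk, ref) for name, ref in ref_profiles.items()}
        results.append((start + window // 2, scores))
    return results

def first_crossing(results, first_author):
    """Midpoint of the first window where first_author is no longer closest."""
    for midpoint, scores in results:
        if min(scores, key=scores.get) != first_author:
            return midpoint
    return None

# Synthetic stand-in: author_a overuses "the", author_b overuses "of".
vocabulary = ["the", "of"]
style_a = ["the", "the", "the", "of", "sword"]
style_b = ["of", "of", "of", "the", "crown"]
references = {"author_a": style_a * 40, "author_b": style_b * 40}
text = style_a * 60 + style_b * 60  # true hand-off at token 300

results = rolling_scores(text, references, vocabulary, window=100, step=10)
split = first_crossing(results, "author_a")
```

On this synthetic text the reported crossing lands near the true hand-off at token 300. Real prose is far noisier, and the window size limits how sharply any switch can be located, which is why the results for the actual story are best read as a range rather than a single word.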
Here are the results using three Howard stories to represent Howard’s writing (Tower of the Elephant, People of the Black Circle, and Red Nails) and one book to represent de Camp’s style, with that book being Conan of the Isles, Conan and the Spider God, or The Eye of Tandyla.
When looking at these graphs it is important to realize that we have great samples of Howard writing, but the samples used for de Camp are shaky. This means that rolling delta should do a pretty good job of identifying the pieces where Howard is the main author, but when de Camp is the main author, rolling delta may struggle a bit to classify the authorship. However, that is not a problem since we are mainly interested in seeing if we can identify the split. When compared to Conan of the Isles the split occurs a bit before the 10,000-word mark while for Conan and the Spider God and The Eye of Tandyla the split is a bit after the 10,000-word mark. Given the scale, it would seem that the split occurs between the 9,500- and 10,500-word mark. (In all three something appears to be happening a bit before the 10,000-word mark, even if the actual crossing occurs a bit later.)
We have copies of the manuscripts and know where the split actually occurs: Howard’s original draft is roughly 9,826 words. With de Camp’s editing, it ends up being 9,888 words in the finished piece.5 That falls inside the 9,500- to 10,500-word range suggested by the graphs.
So what does this all mean? Well, I’d argue that this supplies some evidence that unsupervised methods can distinguish de Camp from Howard. If we didn’t know where the split occurred, this would give a nice starting point for investigation. And the amazing thing is that this is only using one very simple measure, the MFW count. Those innocuous little words that we pay little attention to actually say a lot about how we write.
1. de Camp, L. Sprague. “Editing Conan.” The Blade of Conan. Ed. L. Sprague de Camp. New York: Ace Books, 1979. p. 120.
2. For those interested, a brief write-up of this can be found here: http://chronicle.com/article/The-Professor-Who-Declared/140595/
3. The statistical analyses were run and the images produced using Stylo, a free software package for the R statistical programming language.
4. As cited in “Galaxy’s 5 Star Shelf”, Galaxy Science Fiction, June 1954, p. 122.
5. These word counts are from my digital copies; if there are any transcription errors the counts could be off by a few words.