Monday, December 10, 2012

Mathematical Methods for Comparing Document Content

This post describes two different ways to calculate a rough measure of how much two documents cover the same material or the same topic. This is mainly written for those who want technical details of how the calculations work; it may not be of interest to other readers. The two methods are a "shared word estimate" and a "shared emphasis estimate".

Two Sample Lists

Imagine two very short books with the following word frequency lists: 

Document #1:
  1. fun (15)
  2. Dick (12)
  3. Jane (10)
  4. Spot (6)
  5. see (4)
  6. run (3)

Document #2:
  1. fun (28)
  2. Fred (27)
  3. Jane (25)
  4. Dick (13)
  5. catch (4)
  6. Spot (3)

(I haven't really run the numbers for any actual "Dick and Jane" early-reader books; these numbers are made up for the purposes of illustration.) The two measures that I'll calculate show the percentage of shared words on these lists, and then a more detailed comparison of their emphasis.

Calculating A Shared Word Estimate

For the shared word estimate -- a rough estimate of whether the documents cover similar material or subject matter -- we count the words shared between the two lists and compare that count to the length of the list. In this simple example, each list has 6 words, shares 4 words (fun, Dick, Jane, Spot) with the other list, and contains 2 words not found on the other list. So the shared word estimate tells us that 4/6 (67%) of the top words are the same between the two lists. The shared word estimate is crude, but it can serve as a first check on whether a more detailed comparison is in order. Here it indicates, by a calculation that a computer could run, that these two documents may be related. If you saw a book with a different top-word list, like "eggs, green, ham, am, Sam, like", you would find 0% in common and could expect that this document was not covering the same material or narrative.
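The count described above can be sketched in a few lines of Python. This is only an illustration, not code from the original analysis: the function name `shared_word_estimate` is my own, and it assumes both lists are the same length, as in the example (nothing here settles how lists of unequal length should be handled).

```python
# Top-word lists for the two sample documents (the made-up counts from above).
doc1 = {"fun": 15, "Dick": 12, "Jane": 10, "Spot": 6, "see": 4, "run": 3}
doc2 = {"fun": 28, "Fred": 27, "Jane": 25, "Dick": 13, "catch": 4, "Spot": 3}


def shared_word_estimate(list_a, list_b):
    """Fraction of the words on list_a that also appear on list_b.

    Assumption: both lists have the same length, so it does not
    matter which list's length is used as the denominator.
    """
    if not list_a:
        return 0.0
    shared = set(list_a) & set(list_b)  # iterating a dict yields its words
    return len(shared) / len(list_a)


print(f"{shared_word_estimate(doc1, doc2):.0%}")  # prints 67%
```

Only the words themselves matter here; the counts are ignored, which is exactly why this estimate is crude.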

A quick look at the shared word estimate shows that there is room for improvement, though. If the most common word is the same on both lists, there is a higher chance that the documents are on the same topic than if only the words at the bottom of the lists happen to match. A more detailed comparison that takes such things into account is in order.

Calculating A Shared Emphasis Estimate

The "shared emphasis estimate" measures not only whether both documents use the same words commonly, but also whether those words occur about as commonly: it measures emphasis as well. The first approach I tried, based on word rank (how high a word places on the list), had to be discarded because of significant problems with the validity of the result. Simply comparing the rank of each word from one list to the next did not account for the fact that some lists have near-ties at some places while others have steep drop-offs in word frequency, so the rank was not an especially clean measure of how common a word is. The longer the list, the greater the problem. Then there was the question of how much to weight the first-ranked word compared to the second-ranked, and so on down the list. Using the rank as the basis for the weight would introduce an inflexible and artificial scale. The more fitting method is to weight each word by its prevalence within the documents in question.

To determine the weight for each word, then, first a total was run of all the word-occurrences in the list. Then each individual word's usage count was turned into a percentage of that total. Here are our two sample documents again, with those calculations shown:

Document #1: total count of 50 for the words in the "common words list":
  1. fun (15): 15/50 = 30%
  2. Dick (12): 12/50 = 24%
  3. Jane (10): 10/50 = 20%
  4. Spot (6): 6/50 = 12%
  5. see (4): 4/50 = 8%
  6. run (3): 3/50 = 6%

Document #2: total count of 100 for the words in the "common words list":
  1. fun (28): 28/100 = 28%
  2. Fred (27): 27/100 = 27%
  3. Jane (25): 25/100 = 25%
  4. Dick (13): 13/100 = 13%
  5. catch (4): 4/100 = 4%
  6. Spot (3): 3/100 = 3%
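The same percentages can be reproduced with a short Python sketch. The function name `emphasis_percentages` is hypothetical, and the total is taken over the words on the list only (not over every word in the document), as in the tables above.

```python
# Top-word lists for the two sample documents (the made-up counts from above).
doc1 = {"fun": 15, "Dick": 12, "Jane": 10, "Spot": 6, "see": 4, "run": 3}
doc2 = {"fun": 28, "Fred": 27, "Jane": 25, "Dick": 13, "catch": 4, "Spot": 3}


def emphasis_percentages(counts):
    """Express each word's count as a percentage of the list's total count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 100 * n / total for word, n in counts.items()}


for name, counts in (("Document #1", doc1), ("Document #2", doc2)):
    print(name, "total:", sum(counts.values()))
    for word, pct in emphasis_percentages(counts).items():
        print(f"  {word}: {pct:g}%")
```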
To calculate the Shared Emphasis Estimate, we take each word's emphasis percentage in the first document as our starting point. Comparing it to the second document, we subtract out the difference in how much it is emphasized there to find the shared emphasis between the documents.

For example, "fun" has a value of 30% in the first document but 28% in the second. The difference in emphasis is 2%, so the shared emphasis for this first word is 30% - 2%, or 28%. The results for the other words are then added in.

A slight miscalculation: This calculation later had to be refined, because it was not actually measuring what was intended. Originally the calculation took the absolute value of the difference and subtracted that from the original amount, as shown below. Here is the calculation for each of the six words listed for Document #1 compared to Document #2, using "abs()" rather than "||" to mean absolute value:
  1. fun: 30 - abs(30-28), or 30 - 2, = 28.
  2. Dick: 24 - abs(24-13), or 24 - 11, = 13.
  3. Jane: 20 - abs(20-25), or 20 - 5, = 15.
  4. Spot: 12 - abs(12-3), or 12 - 9, = 3.
  5. see: 8 - abs(8-0), or 8 - 8, = 0. (Not a shared word.)
  6. run: 6 - abs(6-0), or 6 - 6, = 0. (Not a shared word.)
Totaling those numbers, we get 28+13+15+3+0+0 = 59% for the shared emphasis estimate. The Shared Emphasis Estimate typically will be lower than the cruder Shared Word Estimate. This is because the Shared Word Estimate takes no account of differences in emphasis: it would give full weight even to a match between one list's least-used word and the other list's most-used word.
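For readers who prefer code, here is a minimal Python sketch of this first version of the calculation. The name `shared_emphasis_original` is my own label, and a word missing from the second list is treated as 0%, matching the "Not a shared word" rows above.

```python
# Emphasis percentages for the two sample documents, from the tables above.
pct1 = {"fun": 30, "Dick": 24, "Jane": 20, "Spot": 12, "see": 8, "run": 6}
pct2 = {"fun": 28, "Fred": 27, "Jane": 25, "Dick": 13, "catch": 4, "Spot": 3}


def shared_emphasis_original(pct_a, pct_b):
    """First version: start from pct_a and subtract the absolute difference."""
    total = 0
    for word, a in pct_a.items():
        b = pct_b.get(word, 0)  # a word absent from the other list counts as 0%
        total += a - abs(a - b)
    return total


print(shared_emphasis_original(pct1, pct2))  # prints 59
```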

Updated calculation: The original calculation worked acceptably well for the simple and hand-made examples above, but when comparing actual documents some problems appeared. Consider percentages like the following for a pair of words:

List1: 2%, List2: 15%.  

The absolute value of the difference is 13%, and subtracting 13% from 2% we get -11%. So it is possible for a word to have a negative impact on the total, even when it appears in a significant way in both documents. Based on what I am intending to measure, the number that should be used is simply 2%, the smaller of the two numbers.

Or consider the following example:

List1: 2%, List2: 3%.  

The absolute value of the difference is 1%, and subtracting 1% from 2% we get 1%. But each document has at least 2% value for that word, so it is a more accurate reflection of what I'm intending to measure if the shared value is 2%.

The refinement to the calculation is to leave out the absolute value of the difference, and simply take the smaller of the two numbers for any given pair. This will be zero when the word is on one list but not the other, but it will never be less than zero. 
  1. fun: lesser of 30 or 28: 28.
  2. Dick: lesser of 24 or 13: 13.
  3. Jane: lesser of 20 or 25: 20.
  4. Spot: lesser of 12 or 3: 3.
  5. see: lesser of 8 or 0: 0 (Not a shared word.)
  6. run: lesser of 6 or 0: 0 (Not a shared word.)
Figuring the totals again: 28 + 13 + 20 + 3 + 0 + 0 = 64% for the shared emphasis estimate. For most of the pairs the result was the same (only the value for "Jane" changed, from 15 to 20), but now the "shared emphasis" is never less than the smaller of the two amounts, which is a more accurate measure of what the calculation is intended to show.
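The refined calculation is short enough to sketch in Python. As before, the function name `shared_emphasis` is only illustrative, and a word that is missing from the other list is treated as 0%.

```python
# Emphasis percentages for the two sample documents, from the tables above.
pct1 = {"fun": 30, "Dick": 24, "Jane": 20, "Spot": 12, "see": 8, "run": 6}
pct2 = {"fun": 28, "Fred": 27, "Jane": 25, "Dick": 13, "catch": 4, "Spot": 3}


def shared_emphasis(pct_a, pct_b):
    """Refined version: add up the smaller of the two percentages per word."""
    return sum(min(a, pct_b.get(word, 0)) for word, a in pct_a.items())


print(shared_emphasis(pct1, pct2))             # prints 64
print(shared_emphasis({"w": 2}, {"w": 15}))    # prints 2 (never negative)
print(shared_emphasis({"w": 2}, {"w": 3}))     # prints 2
```

One side effect of taking the smaller number is that the order of the documents no longer matters: comparing Document #2 to Document #1 gives the same 64.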

Further Refinements

Here I worked with two very basic (and fictitious) sample documents, where I had the prerogative of selecting the values used for the example. In real documents, another question is significant: how many words do we compare? Here we compared six words, but that was arbitrary. What is a sound method for determining how many words to include in the comparison?

Since this method is generally intended for longer works, my starting point is this: each word is added to the list in order of decreasing usage, with the most-used word being added first, followed by the second most-used word, and so forth. (Some structural words such as articles and conjunctions are typically filtered out during word counts.) When adding each new word to the word list, keep going so long as the current new word, if included, would have a value of 1% or more of the total. Once the next word would be less than 1% of the total, that's probably the point at which the additional comparison doesn't refine the result enough to be relevant. In this way, the number of words included in a list is not an arbitrary number, but is sensitive enough to respond to the different word usage characteristics of each document. At the same time, the measure remains objective to the point where the calculation could be done, content-blind, by a computer program.
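As a rough sketch of how a program might apply the 1% rule, consider the Python below. Several things here are my assumptions rather than settled parts of the method: the function name `top_words`, the reading of "the total" as the running total of the list including the word being tested, and the sample counts, which are invented purely for illustration. The input is assumed to have had structural words filtered out already.

```python
def top_words(counts, threshold_pct=1):
    """Add words in order of decreasing count while each new word would
    still make up at least threshold_pct percent of the list's total.

    Assumption: the total is the running total of the list, counting
    the word currently being tested.
    """
    selected = {}
    running_total = 0
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for word, n in ranked:
        new_total = running_total + n
        # Integer form of: n / new_total < threshold_pct / 100
        if 100 * n < threshold_pct * new_total:
            break
        selected[word] = n
        running_total = new_total
    return selected


# Invented counts, for illustration only.
counts = {"fun": 150, "Dick": 120, "Jane": 100, "Spot": 60, "see": 5, "run": 4}
print(list(top_words(counts)))  # prints ['fun', 'Dick', 'Jane', 'Spot', 'see']
```

In this invented example "see" is kept (5 of 435 is just over 1%), while "run" is dropped (4 of 439 is under 1%).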

I worked out the methods for calculating a Shared Emphasis Estimate by using hypothetical sample books until I had a method with an objective basis (one that could be turned into a computer program that is indifferent to the content, given the time to write the code), and that gave reasonable results.

A few potential design problems may need work. First, the 1% rule is for documents of substantial length; it is possible that it would need some amendment for shorter documents such as the mini-documents used as test cases above. I have not yet tried to compare shorter works, but the lists above suggest the problem could be real: on a short enough document, some rarely-used words would be included simply because the word total never passed 200, which is the tipping point beyond which a word used only twice falls below 1%. It's possible that more of a "bell curve" approach might eventually replace the 1% rule, as something more easily scalable to different sizes of document.

Also, there may be ties in how frequently words are used: that is, more than one word might be used at the same frequency. This is common in larger documents especially, and it can happen right at the 1% boundary. In such a cluster of words of the same frequency, the first might meet the 1% rule while the last would not, because each earlier word of that frequency has already been added to the total. It makes no sense to prefer one word over another when both have the same frequency. So if the first word of a certain frequency is included under the 1% rule, then all other words of the same frequency are included as well, even if the final percentage for those words ends up slightly under 1% once the whole group is counted.
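The tie-handling rule could be sketched in Python as follows. This is one possible reading, not a finished implementation: the name `top_words_with_ties` and the sample counts are invented for illustration, and the total is again assumed to be the running total of the list including the word being tested.

```python
from itertools import groupby


def top_words_with_ties(counts, threshold_pct=1):
    """Apply the 1% rule to the first word of each group of equal counts;
    if it passes, include every word in that group."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    selected = {}
    running_total = 0
    for n, group in groupby(ranked, key=lambda kv: kv[1]):
        # Test only the first word of this frequency against the rule.
        if 100 * n < threshold_pct * (running_total + n):
            break
        for word, _ in group:
            selected[word] = n
            running_total += n
    return selected


# Invented counts: four words tied at a count of 1, right at the boundary.
counts = {"alpha": 97, "beta": 1, "gamma": 1, "delta": 1, "epsilon": 1}
print(list(top_words_with_ties(counts)))
# prints ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
```

In this invented example, testing each word separately would stop before "epsilon" (1 of 101 is under 1%), even though it is used exactly as often as the three words before it. Grouping by count keeps all four, at the cost of each ending up slightly under 1% of the final total.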

A future area for exploration would be: how much can we tell about a document's content from this kind of analysis? For example, would a biography typically have the subject's name at the top of the word-frequency list? I would also be curious how different types of political and persuasive material would look, and what kinds of emphasis would become apparent. I also see some potential for targeted word frequencies: for example, words that appear frequently in only one portion of a document, or throughout a document but only while discussing one recurring topic.


Next we will see how the basic approach works with actual documents instead of hypothetical ones. But that will wait for another post.