## Monday, December 31, 2012

### Best of the Blogroll 2012

Here to ring out the old year are my favorite posts written by my dedicated blog-neighbors from the year 2012. Thank you all for your dedication and for continuing to write!

Some of the blogs on my blogroll have gone inactive during the year. A few others had a steady stream of solid posts but there was no one particular post that caught my eye. For those of you who have gone inactive: I hope to see you back in 2013. And to everyone on the blogroll: Thank you for blogging!

## Friday, December 28, 2012

### Comparing Mark and John with Mathematical Models

Thank you for your patience with these document comparisons. We're getting close to the point where I can show you some more interesting things that the comparisons bring out, but I wanted to at least get all the Biblical gospels into the mix before going beyond them. So for the fourth gospel, here are the results from comparing Mark with John.

The short version of the results

I did have a chance to work through the problems with the calculation and make it more sound, so that a pair of shared words will now never have a negative impact on the comparison. I'm now simply using the smaller of the two numbers for any word pair, which works out to 0 when the word isn't on both lists. I'll also be updating the previous comparisons with the corrected calculations.

Shared Word Estimate (22/48) = 46%
Shared Emphasis Estimate 54%

There is less similarity between Mark and John than we previously saw between Mark and Matthew or between Mark and Luke. In the notes on the Shared Emphasis Estimate, I'll point out where the differences are found.

Notes on the Shared Word Estimate

Again, Mark is the shorter document. It has 48 words included in the high-frequency word list, which is limited to words that would make at least a 1% difference in the total as discussed previously. Of those 48 words, only 22 are also in John's high-frequency words list calculated in the same way, which is the lowest match rate we have seen yet among the gospels. So 22/48 = 46%, rounded to the nearest whole number. Again, since the percentages involved are already effectively rounded when we leave out low-frequency words, it does not seem warranted to use a lot of decimals in the percentage.

Notes on the Shared Emphasis Estimate

With Mark and John, the highest-frequency word in both documents is "Jesus". But the differences start as early as the second word on the list, where "man" is second in Mark's list but "father" is second in John's. For the first time in our comparisons, even though John is the longer document, its high-emphasis words list is actually shorter at 44 words. This is an objective, verifiable measure of what people have long perceived about the fourth gospel: John's perceptions are more distilled or filtered, more focused -- possibly more edited, or more selective.

When we look at where the differences occur, there are some points of interest. Again, the comparison is done from the perspective of Mark's gospel; other differences would come to light when using John as the baseline. When comparing Mark's top words to John's, there are 26 that are not on John's top words list; they are listed in the order of their importance in Mark's word list: crowd, teachers, around, anyone, began, took, house, law, against, kingdom, mother, boat, hands, eat, days, lord, children, heaven, others, sitting, twelve, chief, evil, hear, James, looked. When we follow the leads that are given here, we might find fewer crowd scenes and fewer action scenes in John than in Mark.

Then there are the words on both lists that are emphasized noticeably less in John than in Mark: people and man. Again, this adds weight to the possibility that we'll find measurably fewer crowd scenes and action scenes in John.

The histories passed down about the Gospel of John mention that it was written to supplement the previously-written gospels. One way this may be seen is Mark's relatively greater emphasis on Jesus' public life, and John's relatively greater emphasis on private moments.

## Tuesday, December 25, 2012

### Christmas

Today I will content myself with some thoughts from a far abler commenter on Scripture than I am:
Joseph was of the lineage of David and had to go to Bethlehem, the city of David. ... We can see how poor Joseph must have been that he could not afford to hire some old woman or neighbor to stay with Mary and look after her while he was gone.

How unobtrusively and simply do those events take place on earth that are so heralded in heaven!
(From an advent or Christmas sermon by Martin Luther, in a book where it is not carefully sourced so I'm not sure exactly which sermon, or where to find it in his larger collected works.)

Merry and blessed Christmas to all.

## Tuesday, December 18, 2012

### Comparing Mark and Luke with Mathematical Models

I promise there is a point to these document comparisons. I haven't yet calculated all of the comparisons that I intend, but I have read the documents in question, and I have no doubt that an objective, computer-based comparison like this will turn up interesting results. In the meantime, I did notice a few things when comparing Mark with Luke that might interest the general reader.

The short version of the results

Shared Word Estimate 65%
Shared Emphasis Estimate 64%*
* The originally listed number of 53% had some problems where, for word pairs with large differences, the shared word value might be less than the smaller of the two numbers or even negative. This number should be a more solid reflection of what is shared between the two documents.

There is less similarity between Mark and Luke than we previously saw between Mark and Matthew. In the notes on the Shared Emphasis Estimate, I'll point out where the differences are found.

Notes on the Shared Word Estimate

Mark is a shorter document and has 48 words included in the high-frequency word list, which is limited to words that would make at least a 1% difference in the total as discussed previously. Of those 48 words, 31 are also in Luke's high-frequency words list calculated in the same way. So 31/48 = 65%, rounded to the nearest whole number. Again, since the percentages involved are already effectively rounded when we leave out low-frequency words, it does not seem warranted to use a lot of decimals in the percentage.

Notes on the Shared Emphasis Estimate

Again, the two highest-frequency words are the same between the two documents: "Jesus" and "man". And again Luke's list is broader than Mark's: it contains 52 words in the high-frequency list. When we look at where the differences occur, there are some points of interest.

When comparing Mark's top words to Luke's, there are 17 that are not on Luke's top words list: son, around, anyone, mother, Peter, boat, hands, eat, days, others, sitting, truth, twelve, chief, evil, James, and looked. Then there are the words emphasized noticeably less in Luke than in Mark: Jesus (though still by far the top word) and disciples. Some of the less-used words are related: Peter, twelve, James, and disciples. There seems to be noticeably less emphasis on the disciples in Luke than in Mark. That is consistent with early accounts that Luke was a companion of Paul's, showing less interaction with Jesus' disciples than is found in Mark.

I have noticed one problem with the calculations up to this point: the original calculation can cause two shared words to have a negative net effect, if the difference between the frequencies is larger than the original frequency itself. It may give more accurate results to simply use the smaller of the two frequency scores for the words in question, which may be 0 if the word is not found in the second document. At any rate I will finish up a few more sample comparisons before trying any updates to the calculation.

## Thursday, December 13, 2012

### Comparing Mark and Matthew with Mathematical Methods

Here is the first analysis of actual documents with the mathematical models discussed previously. I've taken my first document as the Gospel of Mark and the second as the Gospel of Matthew, using the word clouds linked here.

The short version of the results

Shared Word Estimate 77%
Shared Emphasis Estimate 69%*
* The originally listed number of 57% had some problems where, for word pairs with large differences, the shared word value might be less than the smaller of the two numbers or even negative. The recalculated number given above should be a more solid reflection of what is shared between the two documents, as it simply uses the lesser of the two values, which is never lower than 0.

In the notes on the Shared Emphasis Estimate, I'll mention some other things that the statistical analysis shows: with the breakdown done at this level, you can do more than estimate how much is shared. You can also identify where the differences are.

Notes on the Shared Word Estimate

Mark is a shorter document and has 48 words included in the high-frequency word list, which is limited to words that would make at least a 1% difference in the total as discussed previously. Of those 48 words, 37 are also in Matthew's high-use words list calculated in the same way. So 37/48 = 77%, rounded to the nearest whole number. (Since the percentages involved are already effectively rounded by the exclusion of low-frequency words that would chip away at the percentage, I don't think a lot of decimal points are significant in the analysis.)

Notes on the Shared Emphasis Estimate

When it comes to the detail matching on emphasis, the two highest-frequency words are the same between the two documents: "Jesus" and "man". Matthew's list is broader. It contains 53 words in the high-frequency list. So words are generally lower-frequency in Matthew than they are in Mark. This raises a question about the method, whether some sort of adjustment is in order for the relative length of the lists. It's worth considering, but my first thought is that if we're measuring relative emphasis, and the relative emphasis were the same between documents, then the word frequency lists would be the same between the documents. So my first inclination is not to adjust for different list lengths, but to consider that difference as part of an accurate reflection that the two documents have a somewhat different emphasis.

The emphasis estimate turns out to yield more information than the originally-intended measure of how much two documents are alike. It also gives some insight into what exactly is different. With that in mind, the words showing the biggest difference in emphasis are "Jesus", which is emphasized somewhat less in Matthew though it is still by far the most frequent word, and then "father" and "heaven", which are used noticeably more in Matthew than in Mark. Those three words account for about 10 percentage points of the emphasis gap between the documents. Another significant gap comes from the 11 words on Mark's list but not on Matthew's: around, began, boat, hands, days, sitting, twelve, evil, hear, James, looked. That is not to say those words don't occur in Matthew, but that they don't make the high-frequency words list as they do in Mark.

Any areas which show a difference in emphasis might be worth closer study. I find it interesting that such a practical, ordinary word as "boat" should make the high-frequency list of Mark. The early records we have about Mark say that he was writing about Jesus as told to him by one of the disciples who was a fisherman by trade. The relative emphasis on the "boat" in Mark does not prove that the source of information was a fisherman, but it is consistent with that possibility. It might indicate an area for further research, to see what kinds of information might come to light by taking a closer look at the "boat" references in Mark. The "father" and "heaven" emphasis in Matthew over Mark might also bear a closer look. Other differences (like "around" or "began") seem less promising, though it would still be best to do a quick check of the original texts to make sure that it is just a difference in narration style or something of that sort.

## Monday, December 10, 2012

### Mathematical Methods for Comparing Document Content

This post describes two different ways to calculate a rough measure of how much two documents cover the same material or the same topic. This is mainly written for those who want technical details of how the calculations work; it may not be of interest to other readers. The two methods are a "shared word estimate" and a "shared emphasis estimate".

Two Sample Lists

Imagine two very short books with the following word frequency lists:

Document #1
1. fun (15)
2. Dick (12)
3. Jane (10)
4. Spot (6)
5. see (4)
6. run (3)

Document #2:
1. fun (28)
2. Fred (27)
3. Jane (25)
4. Dick (13)
5. catch (4)
6. Spot (3)

(I haven't really run the numbers for any actual "Dick and Jane" early-reader books; these numbers are made up for the purposes of illustration.) The two measures that I'll calculate show the percentage of shared words on these lists, and then a more detailed comparison of their emphasis.

Calculating A Shared Word Estimate

For the shared word estimate -- a rough estimate of whether the documents cover similar material or subject matter -- we run a basic count of the words shared between the two lists, and compare that to the length of the list. In this simple example, each list has 6 words, shares 4 words with the other list, and contains 2 words not found on the other list. So the shared word estimate tells us that 4/6 (67%) of the common words are the same between the two lists. The shared word estimate is crude, but can be used as a first estimate of whether a more detailed comparison is in order. You can determine, mathematically or by computer analysis, that these two documents may be related. If you saw a book with a different top-words list, like "eggs, green, ham, am, Sam, like", you would find 0% in common and could expect that this document was not covering the same material or narrative.
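For readers who think better in code, here is a minimal Python sketch of the shared word estimate. The function name `shared_word_estimate` and the choice to divide by the length of the first (baseline) list are my own assumptions for illustration; in the sample above both lists have 6 words, so the choice of baseline makes no difference there.

```python
def shared_word_estimate(baseline_words, other_words):
    """Fraction of the baseline list's words that also appear on the other list.

    Hypothetical helper for illustration; dividing by the baseline list's
    length is an assumption (both sample lists here have 6 words).
    """
    baseline = set(baseline_words)
    shared = baseline & set(other_words)
    return len(shared) / len(baseline)


# Made-up sample lists from the text above.
doc1 = ["fun", "Dick", "Jane", "Spot", "see", "run"]
doc2 = ["fun", "Fred", "Jane", "Dick", "catch", "Spot"]
green_eggs = ["eggs", "green", "ham", "am", "Sam", "like"]

print(round(shared_word_estimate(doc1, doc2) * 100))        # 67
print(round(shared_word_estimate(doc1, green_eggs) * 100))  # 0
```

Using sets keeps the count indifferent to where a word falls on either list, which is exactly the crudeness of this first measure.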

A quick look at the shared word estimate shows that there is room for improvement, though. If the top, most common word is the same on both lists, there is a higher chance that they are on the same topic than if the bottom words happen to match. A more detailed comparison is in order that takes things like that into account.

Calculating A Shared Emphasis Estimate

The "shared emphasis estimate" measures not only whether both documents use the same words commonly, but considers whether those words occur about as commonly: it measures emphasis as well. Here the first approach I tried based on word rank (how high a word scores on the list) had to be discarded, as there were significant problems with the validity of the result. Simply comparing the rank of each word from one list to the next did not account for the fact that some lists have near-ties at some places, while others have steep drop-offs in word frequency, meaning that the ranking number was not an especially clean measure of the commonness of a word. The longer the list, the greater the problem that would be presented. Then there was a question of how much to weight the first-ranked word compared to the second-ranked, and so on down the list. If we used the rank as a basis for the weight, it would introduce an inflexible and artificial scale. The more fitting method is to weight each word based on its prevalence within the documents in question.

To determine the weight for each word, then, first a total was run of all the word-occurrences in the list. Then each individual word's usage count was turned into a percentage of that total. Here are our two sample documents again, with those calculations shown:

Document #1: 50 total count for the words in the "common words list":
1. fun (15): 15/50 = 30%
2. Dick (12): 12/50 = 24%
3. Jane (10): 10/50 = 20%
4. Spot (6): 6/50 = 12%
5. see (4): 4/50 = 8%
6. run (3): 3/50 = 6%

Document #2: 100 total count for the words in the "common words list":
1. fun (28): 28/100 = 28%
2. Fred (27): 27/100 = 27%
3. Jane (25): 25/100 = 25%
4. Dick (13): 13/100 = 13%
5. catch (4): 4/100 = 4%
6. Spot (3): 3/100 = 3%

To calculate the Shared Emphasis Estimate, we take each word's emphasis percentage in the first document as our starting point. Comparing it to the second document, we subtract out the difference in how much it is emphasized there to find the shared emphasis between the documents.

For example, "fun" has a 30% value in the first document, but 28% in the second. The difference in emphasis is 2%. So the shared emphasis is 30% - 2%, or 28%, based on the first word. The values for the other words are then added into the result.
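The emphasis percentages used here can be reproduced with a few lines of Python. This is only a sketch: the name `emphasis_percentages` and the dictionary layout are assumptions of mine, and the counts are the made-up ones from the sample documents.

```python
def emphasis_percentages(counts):
    """Turn each word's usage count into a percentage of the list's total.

    Hypothetical helper for illustration only.
    """
    total = sum(counts.values())
    return {word: 100 * count / total for word, count in counts.items()}


# Made-up sample counts from the text above.
doc1 = {"fun": 15, "Dick": 12, "Jane": 10, "Spot": 6, "see": 4, "run": 3}
doc2 = {"fun": 28, "Fred": 27, "Jane": 25, "Dick": 13, "catch": 4, "Spot": 3}

for name, counts in (("Document #1", doc1), ("Document #2", doc2)):
    rounded = {word: round(pct) for word, pct in emphasis_percentages(counts).items()}
    print(name, rounded)
```

Note that the total is taken over the words on the list only, not over every word in the document, which matches how the percentages above were figured.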

A slight miscalculation: This calculation had to be refined because of problems in whether it was actually measuring what was intended. Originally the calculation used the absolute value of the difference, then adjusted the original amount by that, using the calculation below. Here it shows the calculation for each of the six words listed for Document1 compared to Document2, and uses "abs()" rather than "||" to mean absolute value:
1. fun: 30 - abs(30-28), or 30 - 2, = 28.
2. Dick: 24 - abs(24-13), or 24 - 11, = 13.
3. Jane: 20 - abs(20-25), or 20 - 5, = 15.
4. Spot: 12 - abs(12-3), or 12 - 9, = 3.
5. see: 8 - abs(8-0), or 8 - 8, = 0. (Not a shared word.)
6. run: 6 - abs(6-0), or 6 - 6, = 0. (Not a shared word.)

Totaling those numbers, we get 28+13+15+3+0+0 = 59% shared emphasis estimate. The Shared Emphasis Estimate typically will be lower than the cruder Shared Word Estimate. This is because the Shared Word Estimate would give full weight to a match between the least-used word and the most-used word, and takes no account of differences in emphasis.

Updated calculation: The original calculation worked acceptably well for the simple and hand-made examples above, but when comparing actual documents some problems appeared. Consider percentages like the following for a pair of words:

List1: 2%, List2: 15%.

The absolute value of the difference is 13%, and subtracting 13% from 2% we get -11%. It is then possible for a pair of words to have a negative impact, even when it appears in a significant way in both documents. Based on what I am intending to measure, the number that should be used is simply 2%, the smaller of the two numbers.

Or consider the following example:

List1: 2%, List2: 3%.

The absolute value of the difference is 1%, and subtracting 1% from 2% we get 1%. But each document has at least 2% value for that word, so it is a more accurate reflection of what I'm intending to measure if the shared value is 2%.
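The two problem cases can be checked side by side in a short Python sketch. The function names are hypothetical labels for the original and refined per-word calculations described in this post, not code that was used to produce the posted numbers.

```python
def original_shared_value(first, second):
    """Original approach: first value minus the absolute difference."""
    return first - abs(first - second)


def refined_shared_value(first, second):
    """Refined approach: simply the smaller of the two values."""
    return min(first, second)


# (List1 %, List2 %) pairs: the two problem cases, plus "fun" from the sample.
for first, second in ((2, 15), (2, 3), (30, 28)):
    print(first, second,
          original_shared_value(first, second),
          refined_shared_value(first, second))
# 2 15 -11 2
# 2 3 1 2
# 30 28 28 28
```

The two approaches agree whenever the first value is the larger one, and differ only when the second document emphasizes the word more than the first does.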

The refinement to the calculation is to leave out the absolute value of the difference, and simply take the smaller of the two numbers for any given pair. This will be zero when the word is on one list but not the other, but it will never be less than zero.
1. fun: lesser of 30 or 28: 28.
2. Dick: lesser of 24 or 13: 13.
3. Jane: lesser of 20 or 25: 20.
4. Spot: lesser of 12 or 3: 3.
5. see: lesser of 8 or 0: 0 (Not a shared word.)
6. run: lesser of 6 or 0: 0 (Not a shared word.)

Figuring the totals again: 28 + 13 + 20 + 3 + 0 + 0 = 64% for the shared emphasis estimate. For most of the pairs the result was the same, but now the "shared emphasis" is never less than the smaller of the two amounts, which is a more accurate measure of what that calculation is intended to show.
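Putting the pieces together, the refined calculation might look something like the following in Python. Treat it as a sketch under my own naming assumptions (`emphasis_percentages`, `shared_emphasis_estimate`), using the made-up sample counts from this post.

```python
def emphasis_percentages(counts):
    """Each word's count as a percentage of the list's total count."""
    total = sum(counts.values())
    return {word: 100 * count / total for word, count in counts.items()}


def shared_emphasis_estimate(baseline_counts, other_counts):
    """Sum, over the baseline's words, of the smaller of the two percentages.

    A word missing from the other list contributes 0. Hypothetical helper.
    """
    baseline = emphasis_percentages(baseline_counts)
    other = emphasis_percentages(other_counts)
    return sum(min(pct, other.get(word, 0)) for word, pct in baseline.items())


# Made-up sample counts from the text above.
doc1 = {"fun": 15, "Dick": 12, "Jane": 10, "Spot": 6, "see": 4, "run": 3}
doc2 = {"fun": 28, "Fred": 27, "Jane": 25, "Dick": 13, "catch": 4, "Spot": 3}

print(round(shared_emphasis_estimate(doc1, doc2)))  # 64
```

Because a missing word is looked up with a default of 0, the result can never drop below zero for any word pair.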

Further Refinements

Here I worked with two very basic (and fictitious) sample documents, where I had the prerogative of selecting the values used for the example. In real documents, another question is significant: how many words do we compare? Here we compared six words, but that was arbitrary. What is a sound method for determining how many words to include in the comparison?

Since this method is generally intended for longer works, my starting point is this: each word is added to the list in order of decreasing usage, with the most-used word being added first, followed by the second most-used word, and so forth. (Some structural words such as articles and conjunctions are typically filtered out during word counts.) When adding each new word to the word list, keep going so long as the current new word, if included, would have a value of 1% or more of the total. Once the next word would be less than 1% of the total, that's probably the point at which the additional comparison doesn't refine the result enough to be relevant. In this way, the number of words included in a list is not an arbitrary number, but is sensitive enough to respond to the different word usage characteristics of each document. At the same time, the measure remains objective to the point where the calculation could be done, content-blind, by a computer program.
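Here is one way the 1% rule could be written as a content-blind routine in Python. It is a sketch based on my reading of the rule: I am assuming the "total" is the running total of the list so far, including the word being considered, and the function name and the sample counts are invented for illustration.

```python
def high_frequency_list(counts, min_percent=1):
    """Add words in order of decreasing usage while each new word would
    still be at least `min_percent` of the running total (itself included).

    Assumes structural words were already filtered out of `counts`.
    """
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    selected = []
    running_total = 0
    for word, count in ranked:
        # Integer comparison avoids floating-point trouble right at 1%.
        if count * 100 < min_percent * (running_total + count):
            break
        selected.append((word, count))
        running_total += count
    return selected


# Invented counts: "c" lands exactly at 1% (2 of 200); "d" falls below it.
sample = {"a": 150, "b": 48, "c": 2, "d": 1}
print([word for word, _ in high_frequency_list(sample)])  # ['a', 'b', 'c']
```

The loop stops at the first word that falls short, since every later word has an equal or lower count and would fall short as well.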

I worked out the methods for calculating a Shared Emphasis Estimate by using hypothetical sample books until I had a method with an objective basis (one that could be turned into a computer program that is indifferent to the content, given the time to write the code), and that gave reasonable results.

A few potential design problems may need work. First, the 1% rule is for documents of substantial length; it is possible that it would need some amendment for shorter documents such as our mini-documents used as test cases above. I have not yet tried to compare shorter works, but the lists above suggest the problem could be real and, on a short enough document, some rarely-used words would be included simply because the word totals never reached 200, which is the tipping point for excluding words used only twice. It's possible that more of a "bell curve" approach might eventually replace the 1% rule, as something more easily scalable to different sizes of document.

Also, in larger documents especially, there may be ties in how frequently words are used: that is, more than one word might be used at the same frequency. This is common enough in larger documents, and it can happen right at the 1% boundary. In such a cluster of words of the same frequency, it is possible that the first would meet the 1% rule but the last would not if we had already added in that previous word of the same frequency. In that case, it makes no sense to show a preference for one word over another when both have the same frequency. That is to say, if the first word of a certain frequency is included under the 1% rule, then all other words of the same frequency would be included on the list because of their frequency, even if the resulting final percentage for those words might be slightly under 1% when the whole group of words is included.
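The tie rule can be sketched in Python by testing each group of equally-frequent words once, using the first word of the group, and then including or excluding the whole group together. As before, the function name and the sample counts are assumptions made up for illustration.

```python
from itertools import groupby


def high_frequency_list_with_ties(counts, min_percent=1):
    """Like the plain 1% rule, but words tied at the same count are
    included or excluded together, based on the first word of the tie."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    selected = []
    running_total = 0
    for count, group in groupby(ranked, key=lambda item: item[1]):
        # Test only the first word of the tied group against the rule.
        if count * 100 < min_percent * (running_total + count):
            break
        tied = list(group)
        selected.extend(tied)
        running_total += count * len(tied)
    return selected


# Invented counts: "c" passes (2 of 199), so its tie-mate "d" is kept too,
# even though "d" alone would be 2 of 201, just under 1%.
sample = {"a": 150, "b": 47, "c": 2, "d": 2, "e": 1}
print([word for word, _ in high_frequency_list_with_ties(sample)])
# ['a', 'b', 'c', 'd']
```

Grouping works here because the list is already sorted by count, so tied words always sit next to each other.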

A future area for exploration would be: how much can we tell about a document's content from this kind of analysis? For example, would a biography typically have the subject's name at the top of the word-frequency list? I would also be curious how different types of political and persuasive material would look, and what kinds of emphasis became apparent. I'd also see some potential for targeted word frequencies: for example, words that frequently appeared only in one portion of a document, or throughout a document but only while discussing one recurring topic.

Forward

Next we will see how the basic approach works with actual documents instead of hypothetical ones. But that will wait for another post.

## Tuesday, December 04, 2012

### Can you measure how much two documents are alike?

In my day job as a programmer, I spend a certain amount of time analyzing data, and in the bigger projects there can be millions of records and over a billion individual fields being handled. And each individual field has to be handled correctly by specialized programming routines; designing and testing those is my job. What does that have to do with this blog? Habits carry over from one place to another, and at times I view documents -- for example the gospels, or systematic theology -- as another job in high-volume data analysis. (I know, some people think that sounds really dull. Regardless, it leads to fascinating places.)

I've done a number of word clouds on this blog. They are one way to do a quick, high-level overview of a document. The next question on my mind is: can you get an idea of how closely two documents cover the same material by comparing their word clouds? When you look at a word cloud, you see a graph of the important words for a document. The information used to create that chart is a list of words and a count of how often they appear. I've been looking at ways to take two lists for two documents and estimate how closely those two documents cover the same material. After a few tries that left much to be desired, I have a method which is promising and objective, with the important decisions being based on mathematical criteria rather than human judgment.

What could you gain with a comparison like that? You could get a rough answer to a question like, "How closely does the Gospel of Matthew cover the same material as the Gospel of Mark?" Or "How closely does the Gospel of John cover the same material as the Gospel of Matthew?" How about comparing Paul's letters to the gospels to see how closely they track each other? How about comparing the "alternative" gospels to the Bible's gospels? How about comparing a catechism or some writer's systematic theology to the gospels, or the New Testament, or the Bible as a whole? How about comparing the holy books of one religion to another, to get a feel for similarities and differences?

In upcoming posts I'm hoping to start exploring some of those questions and their answers. In a future post I will also give the mathematics and logic of how the comparison is done, for those interested. Below is the other major point for a general reader: the most important limits of the method.

Limits of the method

The first limit of the method comes from the fact that it is based on word counts: the content is summed up at the word level, without the phrases or thoughts or the relationships connecting them, without any sense of intent or purpose, logic or history. It would be possible for two authors to take very different approaches to the same concepts, and this particular method could not tell the difference if the authors used the same words at roughly the same frequency.

A second limit is the issue of synonyms and near-synonyms. Do we compare "elected" and "chosen" as the same? How about "predestined" and "foreordained"? "Walked" and "went"? Future development would include a way to factor in a weighted, partial match for near-synonyms or similar words during the matching process.

Another limit is the difficulty comparing documents at two different levels of detail. If one document discussed "oaks" and "pines" and "elms" and "maples", and another discussed "forests", this method would not see the "forests" for all the specific trees in the first document. The more different the level of detail, the more noticeable the problem becomes. For example, "Five teenagers get Saturday detention" might be a recognizable reference to the movie The Breakfast Club, but I seriously doubt that a word-cloud comparison of that phrase to the script would identify that they were talking about the same thing. A more fully-developed method would take into account how to move from the specific to the general, and what kind of detail would be the right match as you "zoom out" to higher and higher summary levels.

The method is also limited to checking for one particular type of relationship between documents: it shows documents that are probably covering the same general material. It does not cover other types of relationships, for example "prequel" and "sequel", or "original narrative" and "commentary".

It's likely enough that more shortcomings will show themselves as we work through a few examples. But for all the limitations, it should still be a useful estimate of how much two documents cover the same topics.