Here is the first analysis of actual documents with the mathematical models discussed previously. I've taken my first document as the Gospel of Mark and the second as the Gospel of Matthew, using the word clouds linked here.
The short version of the results
Shared Word Estimate 77%
Shared Emphasis Estimate 69%*
* The originally listed number of 57% had a problem: for word pairs with large differences, the calculated shared value could come out lower than the smaller of the two numbers, or even negative. The recalculated number given above should be a more solid reflection of what is shared between the two documents, as it simply uses the lesser of the two values, which is never lower than 0.
In the notes on the Shared Emphasis Estimate, I'll mention some other things that the statistical analysis shows: with the breakdown done at this level, you can do more than estimate how much is shared. You can also identify where the differences are.
Notes on the Shared Word Estimate
Mark is the shorter document and has 48 words in its high-frequency word list, which is limited to words that would make at least a 1% difference in the total, as discussed previously. Of those 48 words, 37 are also in Matthew's high-frequency word list, calculated in the same way. So 37/48 = 77%, rounded to the nearest whole number. (Since the percentages involved are already effectively rounded by the exclusion of low-frequency words that would chip away at the percentage, I don't think a lot of decimal places are significant in the analysis.)
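For anyone who wants to repeat the calculation, here is a minimal sketch of the arithmetic above. It assumes the denominator is always the shorter of the two lists, as it is for Mark here. The function name and the placeholder words are made up for illustration, and only the counts (48, 53 and 37) come from the analysis.

```python
def shared_word_estimate(list_a, list_b):
    """Fraction of the shorter high-frequency list that also appears in the other list."""
    a, b = set(list_a), set(list_b)
    # Assumption: the shorter list supplies the denominator.
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    return len(shorter & longer) / len(shorter)


# Placeholder words standing in for the real lists; only the counts are real.
shared = [f"shared{i}" for i in range(37)]
mark = shared + [f"mark_only{i}" for i in range(11)]        # 48 words
matthew = shared + [f"matthew_only{i}" for i in range(16)]  # 53 words

print(round(100 * shared_word_estimate(mark, matthew)))  # 77
```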
Notes on the Shared Emphasis Estimate
When it comes to the detailed matching on emphasis, the two highest-frequency words are the same in both documents: "Jesus" and "man". Matthew's list is broader: it contains 53 words in the high-frequency list, so individual words generally have a lower frequency in Matthew than they do in Mark. This raises a question about the method: whether some sort of adjustment is in order for the relative length of the lists. It's worth considering, but my first thought is that if we're measuring relative emphasis, and the relative emphasis were the same between documents, then the word frequency lists would be the same between the documents. So my first inclination is not to adjust for different list lengths, but to treat that difference as part of an accurate reflection that the two documents have a somewhat different emphasis.
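The recalculated emphasis estimate (taking the lesser of the two values for each word) can be sketched roughly as follows. This is my own illustration under some assumptions: each word's value is taken to be its percentage share within its own document's high-frequency list, a word missing from a list counts as 0, and the numbers below are invented toy values, not the real Mark and Matthew figures.

```python
def shared_emphasis(weights_a, weights_b):
    """Sum, over every word in either list, of the lesser of its two weights.

    A word missing from one list contributes 0, so the total can never go negative.
    """
    words = set(weights_a) | set(weights_b)
    return sum(min(weights_a.get(w, 0), weights_b.get(w, 0)) for w in words)


# Toy percentage shares (each document's weights sum to 100); not real data.
doc_a = {"jesus": 50, "man": 30, "boat": 20}
doc_b = {"jesus": 40, "man": 30, "father": 20, "heaven": 10}

print(shared_emphasis(doc_a, doc_b))  # 70
```

With this approach no adjustment is made for the different list lengths. A longer list simply spreads its weight over more words, which shows up as a lower shared total.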
The emphasis estimate turns out to yield more information than the originally intended measure of how much two documents are alike. It also gives some insight into what exactly is different. With that in mind, the words showing the biggest difference in emphasis are "Jesus", which is emphasized somewhat less in Matthew though it is still by far the most frequent word, then "father" and "heaven", which are used noticeably more in Matthew than in Mark. Those three words account for about 10 percentage points of the emphasis gap between the documents. Another significant gap comes from the 11 words on Mark's list but not on Matthew's: around, began, boat, hands, days, sitting, twelve, evil, hear, James, looked. That is not to say those words don't occur in Matthew, but that they don't make the high-frequency word list as they do in Mark.
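To show how the same breakdown can point to where the differences are, here is a small sketch that ranks words by the size of their emphasis gap and lists the words found only on the first document's list. The function names and the toy numbers are my own, chosen for illustration; they are not the actual Mark and Matthew values.

```python
def emphasis_gaps(weights_a, weights_b):
    """Per-word gap (second minus first), largest absolute gap first; ties sorted by word."""
    words = set(weights_a) | set(weights_b)
    gaps = {w: weights_b.get(w, 0) - weights_a.get(w, 0) for w in words}
    return sorted(gaps.items(), key=lambda kv: (-abs(kv[1]), kv[0]))


def only_in_first(weights_a, weights_b):
    """Words on the first high-frequency list that are absent from the second."""
    return sorted(set(weights_a) - set(weights_b))


# Toy percentage shares; not real data.
doc_a = {"jesus": 50, "man": 30, "boat": 20}
doc_b = {"jesus": 40, "man": 30, "father": 20, "heaven": 10}

for word, gap in emphasis_gaps(doc_a, doc_b):
    print(f"{word}: {gap:+d}")
print(only_in_first(doc_a, doc_b))
```

A negative gap means the word is emphasized less in the second document, and a positive gap means it is emphasized more.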
Any areas which show a difference in emphasis might be worth closer study. I find it interesting that such a practical, ordinary word as "boat" should make the high-frequency list of Mark. The early records we have about Mark say that he was writing about Jesus as told to him by one of the disciples who was a fisherman by trade. The relative emphasis on the "boat" in Mark does not prove that the source of information was a fisherman, but it is consistent with that possibility. It might indicate an area for further research, to see what kinds of information might come to light by taking a closer look at the "boat" references in Mark. The "father" and "heaven" emphasis in Matthew over Mark might also bear a closer look. Other differences (like "around" or "began") seem less promising, though it would still be best to do a quick check of the original texts to make sure that it is just a difference in narration style or something of that sort.