Tuesday, December 04, 2012

Can you measure how much two documents are alike?

In my day job as a programmer, I spend a certain amount of time analyzing data, and in the bigger projects there can be millions of records and over a billion individual fields being handled. And each individual field has to be handled correctly by specialized programming routines; designing and testing those is my job. What does that have to do with this blog? Habits carry over from one place to another, and at times I view documents -- for example the gospels, or systematic theology -- as another job in high-volume data analysis. (I know, some people think that sounds really dull. Regardless, it leads to fascinating places.)

I've done a number of word clouds on this blog. They are one way to do a quick, high-level overview of a document. The next question on my mind is: can you get an idea of how closely two documents cover the same material by comparing their word clouds? When you look at a word cloud, you see a graph of the important words for a document. The information used to create that chart is a list of words and a count of how often they appear. I've been looking at ways to take two lists for two documents and estimate how closely those two documents cover the same material. After a few tries that left much to be desired, I have a method which is promising and objective, with the important decisions being based on mathematical criteria rather than human judgment.
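For readers who like to see the idea in code, here is a minimal Python sketch of comparing two word-count lists. It is not the method promised for a later post: it assumes cosine similarity as a stand-in scoring rule, and the function names and the two sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each word, the raw data behind a word cloud."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(counts_a, counts_b):
    """Score two word-count lists from 0.0 (no shared words) to 1.0 (same proportions).

    Assumption: cosine similarity is used here only as a familiar example
    of an objective scoring rule, not as the author's actual method.
    """
    dot = sum(n * counts_b[word] for word, n in counts_a.items())
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with anything
    return dot / (norm_a * norm_b)


# Two made-up sample sentences that overlap in most of their words.
doc_a = "the sower went out to sow his seed"
doc_b = "a sower went out to sow"
score = cosine_similarity(word_counts(doc_a), word_counts(doc_b))
print(round(score, 3))
```

The appeal of a rule like this is that no human judgment enters once the counts are made: the same two lists always give the same score.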

What could you gain with a comparison like that? You could get a rough answer to a question like, "How closely does the Gospel of Matthew cover the same material as the Gospel of Mark?" Or "How closely does the Gospel of John cover the same material as the Gospel of Matthew?" How about comparing Paul's letters to the gospels to see how closely they track each other? How about comparing the "alternative" gospels to the Bible's gospels? How about comparing a catechism or some writer's systematic theology to the gospels, or the New Testament, or the Bible as a whole? How about comparing the holy books of one religion to another, to get a feel for similarities and differences?

In upcoming posts I'm hoping to start exploring some of those questions and their answers. In a future post I will also give the mathematics and logic of how the comparison is done, for those interested. Below is the other major point for a general reader: the most important limits of the method.

Limits of the method

The first limit of the method comes from the fact that it is based on word counts: the content is summed up at the word level, without the phrases or thoughts or the relationships connecting them, without any sense of intent or purpose, logic or history. It would be possible for two authors to take very different approaches to the same concepts, and this particular method could not tell the difference if the authors used the same words at roughly the same frequency.
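A small Python sketch makes this blind spot concrete. The two sentences are invented for illustration; they say opposite things, yet they produce exactly the same word counts, so any comparison built only on counts has to treat them as identical.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each word; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Made-up sentences: same words, same frequencies, opposite meanings.
claim = "the master forgave the servant"
reversal = "the servant forgave the master"

sentences_match = claim == reversal
counts_match = word_counts(claim) == word_counts(reversal)
print(sentences_match, counts_match)
```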

A second limit is the issue of synonyms and near-synonyms. Do we compare "elected" and "chosen" as the same? How about "predestined" and "foreordained"? "Walked" and "went"? Future development would include a way to factor in a weighted, partial match for near-synonyms or similar words during the matching process.
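One way such a weighted, partial match might look is sketched below in Python. This is an assumption for illustration, not the planned design: it uses a "soft cosine" score, in which each pair of words contributes according to a similarity weight, and the weights in the table are hand-picked and hypothetical.

```python
import math
from collections import Counter

# Hypothetical weights, chosen by hand purely for illustration.
NEAR_SYNONYMS = {
    frozenset(["elected", "chosen"]): 0.8,
    frozenset(["predestined", "foreordained"]): 0.9,
    frozenset(["walked", "went"]): 0.5,
}


def word_similarity(a, b):
    """1.0 for the same word, a table weight for near-synonyms, otherwise 0.0."""
    if a == b:
        return 1.0
    return NEAR_SYNONYMS.get(frozenset([a, b]), 0.0)


def soft_dot(counts_a, counts_b):
    """Sum count_a * count_b * similarity over every pair of words."""
    return sum(
        n_a * n_b * word_similarity(a, b)
        for a, n_a in counts_a.items()
        for b, n_b in counts_b.items()
    )


def soft_cosine(counts_a, counts_b):
    """Cosine-style score in which near-synonyms earn partial credit."""
    denom = math.sqrt(soft_dot(counts_a, counts_a) * soft_dot(counts_b, counts_b))
    return soft_dot(counts_a, counts_b) / denom if denom else 0.0


# Made-up sample sentences that differ only in near-synonyms.
doc_a = Counter("he was chosen and he went".split())
doc_b = Counter("he was elected and he walked".split())
print(round(soft_cosine(doc_a, doc_b), 3))
```

With these sample sentences the partial credit lifts the score to about 0.91, compared with 0.75 if only exact word matches counted. The hard part, of course, is deciding the weights, which brings human judgment back in.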

Another limit is the difficulty of comparing documents at two different levels of detail. If one document discussed "oaks" and "pines" and "elms" and "maples", and another discussed "forests", this method would not see the "forests" for all the specific trees in the first document. The greater the difference in level of detail, the more noticeable the problem becomes. For example, "Five teenagers get Saturday detention" might be a recognizable reference to the movie The Breakfast Club, but I seriously doubt that a word-cloud comparison of that phrase to the script would identify that they were talking about the same thing. A more fully developed method would take into account how to move from the specific to the general, and what kind of detail would be the right match as you "zoom out" to higher and higher summary levels.
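The trees-and-forests problem can be shown in a few lines of Python. The sketch below is only an illustration of "zooming out": the mapping from specific words to a broader term is hypothetical and hand-made, and a real method would need a much more careful way to build it.

```python
from collections import Counter

# Hypothetical hand-made mapping from specific words to a broader term.
BROADER_TERM = {
    "oaks": "forests",
    "pines": "forests",
    "elms": "forests",
    "maples": "forests",
}


def generalize(counts):
    """Re-count words after replacing each specific word with its broader term."""
    rolled_up = Counter()
    for word, n in counts.items():
        rolled_up[BROADER_TERM.get(word, word)] += n
    return rolled_up


detailed = Counter("oaks pines elms maples".split())
summary = Counter("forests".split())

# Words the two documents share, before and after zooming out.
shared_before = set(detailed) & set(summary)
shared_after = set(generalize(detailed)) & set(summary)
print(shared_before, shared_after)
```

Before the roll-up the two lists share no words at all; afterward all four tree names count toward "forests" and the overlap appears.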

The method is also limited to checking for one particular type of relationship between documents: it shows documents that are probably covering the same general material. It does not detect other kinds of relationships, for example "prequel" and "sequel", or "original narrative" and "commentary".

It's likely enough that more shortcomings will show themselves as we work through a few examples. But for all the limitations, it should still be a useful estimate of how much two documents cover the same topics.


Martin LaBar said...

Go for it!

Howard said...

It sounds most intriguing.

Weekend Fisher said...

Thank you all for the encouragement. I hope the tedium of the next post doesn't make you two regret your generous spirit. :)

Take care & God bless
Anne / WF