How related are two sentences?

Background

We can tell when two sentences are related. Computers, in general, can't. Computers focus on the statistics of the text. It is, as de Saussure noted, difficult to jump from the spelling of the word cat to the meaning of the word cat. This notebook demonstrates a way to think about using programs to compare the meaning of phrases. The approaches are simple and, hopefully, instructive without being simplistic. We're aiming for understanding rather than something ready for deployment.

First steps: number of words in common

A simple way to quantify similarity is to count the number of words two phrases share. (This is the basis for Jaccard similarity.) That approach is easy to compute but often inaccurate. In the following example, the first two sentences are closer in meaning to each other than to the third.


The second and third sentences have more words in common (\(\frac{2}{4}\)) than the first and second (\(\frac{2}{5}\)).
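As a concrete stand-in, here is a minimal sketch of this word-overlap measure. The three phrases below are hypothetical examples (not the notebook's original sentences), chosen so the same mismatch appears: the second and third phrases share more words even though the first two are closer in meaning.

```python
def word_overlap(a, b):
    """Jaccard-style similarity: shared words over total distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

# Stand-in phrases (hypothetical; not the notebook's original example).
phrases = [
    "the police seized crack cocaine",
    "the police found some marijuana",
    "the kids found some baseball cards",
]

print(word_overlap(phrases[0], phrases[1]))  # fewer shared words, closer meaning
print(word_overlap(phrases[1], phrases[2]))  # more shared words, further meaning
```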

The number of words two phrases have in common need not indicate how semantically similar they are. With a nod to social media we venture, but leave as an exercise to the reader to prove, that this approach is most inaccurate when comparing short pieces of text that cover many topics, e.g. tweets. To visualize these differences we make a heat map in which the color of the \(ij\) box indicates the similarity between the \(i\)th and \(j\)th phrases. This measure of similarity runs from \(0\) (completely different) to \(1\) (identical).
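A sketch of such a heat map, assuming numpy and matplotlib and reusing the `word_overlap` function and stand-in `phrases` from the sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Pairwise similarity matrix: entry (i, j) is the overlap between phrases i and j.
n = len(phrases)
similarity = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        similarity[i, j] = word_overlap(phrases[i], phrases[j])

fig, ax = plt.subplots()
im = ax.imshow(similarity, vmin=0, vmax=1, cmap="viridis")  # fixed 0-1 color scale
ax.set_xticks(range(n))
ax.set_yticks(range(n))
fig.colorbar(im, ax=ax, label="word-overlap similarity")
plt.show()
```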


A better approach: number of meaningful words in common

The idea that phrases with similar meanings share something we can count is not misguided. Representing the meaning of every possible sentence by a function, and demonstrating that one can calculate a distance between any pair of those functions, is an open question in computational linguistics. (We're working on it.) We can split the difference between counting all words and a hyperspace of functions by counting all words yet weighting some words more than others.

Our quantification of similarity needs two improvements:

1. Split each phrase into words in a way we can trust.
2. Weigh the words that carry the meaning of a phrase more heavily than the rest.

(Splitting a sentence into words is, itself, an intricate process. In social media we must consider abbreviations and emoticons. Languages such as Chinese, Arabic, or Turkish do not share the English and Romance-language approach of dividing a phrase into content words and grammar words and then separating those words with spaces.)
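A small illustration of why splitting is intricate, using a hypothetical tweet-like string: a naive whitespace split leaves punctuation and emoticons glued to words, while even a toy regular expression (not a production tokenizer) has to enumerate the token shapes it expects.

```python
import re

tweet = "omg the Mets won!!!:-) #baseball"

# Naive whitespace split keeps punctuation and the emoticon glued to the word.
print(tweet.split())
# ['omg', 'the', 'Mets', 'won!!!:-)', '#baseball']

# A toy pattern: emoticons, then hashtags, then plain word characters.
pattern = r"[:;][-']?[)(DP]|#\w+|\w+"
print(re.findall(pattern, tweet))
# ['omg', 'the', 'Mets', 'won', ':-)', '#baseball']
```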

The rest of this notebook discusses how we might make these improvements and what we learn about language from them.

Weigh "crack cocaine","marijuana", and "baseball" more heavily than the other words

Another way of saying "equal weighting" is "unweighted average". This suggests that we can weigh some words more than others by calculating similarity as a weighted average of shared words.
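As a rough sketch of what this looks like in code, here is a weighted variant of the earlier overlap function, reusing the stand-in `phrases`. The weight dictionary is hand-picked and purely illustrative; the rest of this section develops the underlying math, and later sections discuss where such weights could actually come from.

```python
def weighted_overlap(a, b, weights, default=1.0):
    """Weighted average of shared words: each word contributes its weight,
    normalized by the total weight of all distinct words in the pair."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    shared = sum(weights.get(w, default) for w in words_a & words_b)
    total = sum(weights.get(w, default) for w in words_a | words_b)
    return shared / total

# Hand-picked, purely illustrative weights: topic words count for a lot,
# grammar-ish words for very little, everything else defaults to 1.
weights = {
    "cocaine": 5.0, "marijuana": 5.0, "baseball": 5.0, "police": 3.0,
    "the": 0.1, "some": 0.1, "found": 0.5,
}

print(weighted_overlap(phrases[0], phrases[1], weights))  # ~0.20, now the larger score
print(weighted_overlap(phrases[1], phrases[2], weights))  # ~0.04
```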

The average, \(\bar{x}\), of a group of numbers, \(x_1,x_2,\ldots,x_n\), is the quotient of the sum of those numbers, \(\sum x_i\), and the count of those numbers, \(n\). The following is an example for three numbers.

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{3}(x_1 + x_2 + x_3) $$

Because \(x_i = 1 \cdot x_i\), we can rewrite each term of the sum as being multiplied by a coefficient of \(1\).

$$ \bar{x} = \frac{1}{3}(1x_1 +1x_2 + 1x_3) $$

The sum of the coefficients in front of each variable, \(x_i\), is \(3\), which is to say that the sum of the coefficients in an unweighted average is another way to count the numbers in the data being analyzed. We generalize an unweighted average to a weighted average by allowing the coefficients to take different values, as the following example demonstrates. In a weighted average, the sum of the coefficients still provides the normalization factor, but it no longer equals the cardinality of the data.

$$ \bar{x} = \frac{1}{4}(1x_1 + \mathbf{2}x_2 + 1x_3) $$

This equation calculates a weighted average of our three values. That average gives the middle value twice the importance of either flanking value. We now introduce dot product notation (see the dots?), which lets us work more compactly with larger amounts of data.

$$ \bar{x} = \frac{1}{4}\begin{pmatrix}1\\2\\1\end{pmatrix}\cdot\begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix} = \frac{\mathbf{w}\cdot\mathbf{x}}{\mathbf{w}\cdot\mathbf{1}} $$

In this equation the first column vector is a vector of weights, often called a kernel. To compute a dot product we take the sum of the products of the corresponding entries in two sequences of numbers. Dot products pop up in many applications, from theoretical physics to finance, which is why Excel has a SUMPRODUCT function. The denominator of the last term on the right-hand side of this equation is a way to take the sum of all elements of one vector: \(\mathbf{1}\) denotes a vector of ones with the same length as \(\mathbf{w}\), so \(\mathbf{w}\cdot\mathbf{1}\) is simply the sum of the weights.
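A quick numeric check of this equation, assuming numpy and three arbitrary example values:

```python
import numpy as np

x = np.array([2.0, 4.0, 12.0])  # the three values x_1, x_2, x_3
w = np.array([1.0, 2.0, 1.0])   # the kernel: the middle value counts twice

unweighted = x.mean()                                   # (2 + 4 + 12) / 3 = 6.0
weighted = np.dot(w, x) / np.dot(w, np.ones_like(w))    # (2 + 8 + 12) / 4 = 5.5

print(unweighted, weighted)
```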

Finding the values for the kernels: Ontologies

To use a kernel we must first figure out the weights. We expect that this will require referring to a list of the similarities of meanings of words in various contexts. The relationship between a word and any of its possible meanings is arbitrary, in both formal speech and slang. A cat is not more similar to a car than to a bat simply because the word denoting the animal begins with a 'c' rather than a 'b'. This is not sophistry. Recognizing that we will need a list of meanings suggests that to adequately compare the meaning of sentences we need to quantify the effect of context on meaning. It also suggests that the rate-limiting process when analyzing large collections of phrases will be how fast we can crawl a list of meanings.

Lists of the meanings of words are formally called ontologies. A widely used ontology in natural language processing is WordNet (Wiki). WordNet groups English words by similarity of meaning. (N.B. WordNet was not created to be an ontology, but it works as one.)
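A minimal sketch of querying WordNet, assuming the nltk package is installed and its WordNet corpus has been downloaded; `path_similarity` is just one of several relatedness scores WordNet exposes, and taking the first synset for each word is a simplification.

```python
import nltk
nltk.download("wordnet", quiet=True)  # fetch the WordNet data once
from nltk.corpus import wordnet as wn

# Take the first (most common) sense of each word.
cocaine = wn.synsets("cocaine")[0]
marijuana = wn.synsets("marijuana")[0]
baseball = wn.synsets("baseball")[0]

# Path similarity runs from 0 to 1 and reflects distance in WordNet's hypernym graph.
print(cocaine.path_similarity(marijuana))
print(cocaine.path_similarity(baseball))
```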

Constructing a measure from an ontology