6.4   Vocabulary Similarity

The simplest way to examine similarities in vocabulary usage between two entities, be they two characters in a novel or two novelists, is to take the total number of words used by each and see what percentage of them overlap. Unfortunately this technique, while very simple to use, makes no allowance for the relative importance certain words might have for a person. Presumably words of high importance would be used more frequently, and frequency of usage must therefore be taken into account so as not to distort the results. Fortunately, there is a statistical technique for testing correlation between variables that can take frequency of usage into account. This technique, Pearson's product-moment formula, gives a measure called Pearson's r 88. By way of illustration, consider two variables X and Y. If high values for X are usually accompanied by high values for Y, the correlation coefficient will be positive, approaching +1.0 as the relationship strengthens. If high values for X are usually accompanied by low values for Y, the correlation coefficient will be negative, approaching -1.0. If there is no relationship between values for X and values for Y, the correlation coefficient will be close to 0. When word lists for six characters are to be correlated, the situation becomes more complicated. Here the variables are the characters themselves, and the values are how often each character uses each word in the list of all the words used by all the characters. As Gustav Herdan has shown 89, Pearson's r can be successfully applied to test for similarities of vocabulary between any number of characters or authors.
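The calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used in the study: the function names `pearson_r` and `frequency_vectors` are hypothetical, the tokenisation (lower-casing and splitting on whitespace) is an assumption, and the two short sample texts are invented for the example rather than taken from The Waves.

```python
from collections import Counter
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation for two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


def frequency_vectors(texts):
    """Map each speaker to word counts over the combined vocabulary.

    `texts` maps a speaker name to that speaker's text. Tokenisation here
    is an assumption: lower-case, split on whitespace.
    """
    counts = {name: Counter(text.lower().split()) for name, text in texts.items()}
    # The list of all the words used by all the speakers, in a fixed order.
    vocabulary = sorted(set().union(*counts.values()))
    return {name: [c[word] for word in vocabulary] for name, c in counts.items()}


# Invented sample texts, for illustration only.
sample = {
    "speaker_a": "the sea the waves the light on the sea",
    "speaker_b": "the sea the waves a door in the light",
}
vectors = frequency_vectors(sample)
r = pearson_r(vectors["speaker_a"], vectors["speaker_b"])
print(f"r = {r:+.2f}")
```

A word that one speaker never uses counts as zero for that speaker, so every vector has one entry per word in the combined vocabulary and the vectors can be compared position by position.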

For The Waves this technique of analysis was applied to the total vocabulary of all the characters, both section by section and for the whole book, in two different ways. First, the vocabulary sets of the different characters for the whole text were compared to provide information about differentiation of characters. Second, the vocabulary of each character was compared section by section to provide evidence about development of characters. The results were highly significant: even the lowest of these scores for Pearson's r might be expected to occur by chance only once in every ten thousand times. In the correlation between characters for each section, and for the whole book, the lowest score was +0.94 and the highest score was +0.98. This means that the vocabulary sets of the characters are very similar indeed. In the correlation between sections for each character, and for the whole book, the lowest score was +0.76 and the highest score was +0.98. This means that the vocabulary of the characters changes very little over time. How are we to interpret these results?

The most satisfactory explanation of the scores for correlation between characters is that all the characters are drawing on a common vocabulary set, which in turn could well imply, or be used to describe, a common pool of experience. This idea of a common experience is something that many of the critics have suggested in one form or another, but without any conclusive proof. The common vocabulary set is strong evidence, hitherto unavailable, for this view. Perhaps the best way of interpreting the scores for correlation between sections is that the characters display a remarkable internal consistency in their use of language; that is, while the other measures might change, the vocabulary set stays fairly constant. This finding ties in well with suggestions made by various critics that each character draws on a unique set of symbols and images in his/her speech. These images identify the characters, act as a binding element across the book, and even provide a limited channel of communication, whereby characters can pick up and use other characters' images. This idea of a more or less constant vocabulary set through the book has been observed by other critics who have commented on the unnatural sophistication of the characters' speech in the early chapters, when the characters are meant to be children.

Another way of using Pearson's r is to sum the correlation scores for each character. Characters who correlate highly with other characters will have a high sum of correlation scores. When this technique is applied, Bernard has the highest sum. One way of looking at this is that Bernard's use of language is closest to the common set on which all the characters draw. As he is the character who sums up their lives in the last chapter of the book, weaving them into a coherent and structured whole in which their thoughts and actions are given some lasting meaning, it is not surprising that his use of language reflects theirs, in this respect at least.
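The summing of correlation scores can be sketched as follows. Again this is a hedged illustration and not the study's own code: `pearson_r` and `correlation_sums` are hypothetical names, and the three frequency vectors are invented numbers chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_sums(vectors):
    """Sum, for each speaker, the r scores against every other speaker."""
    sums = {name: 0.0 for name in vectors}
    # Each unordered pair is correlated once and credited to both members.
    for a, b in combinations(vectors, 2):
        r = pearson_r(vectors[a], vectors[b])
        sums[a] += r
        sums[b] += r
    return sums


# Invented word-frequency vectors over a four-word vocabulary.
toy_vectors = {
    "speaker_p": [1, 2, 3, 4],
    "speaker_q": [2, 4, 6, 8],
    "speaker_s": [4, 3, 2, 1],
}
toy_sums = correlation_sums(toy_vectors)
# The speaker with the highest sum is closest to the others overall.
closest = max(toy_sums, key=toy_sums.get)
print(toy_sums, closest)
```

With six characters each sum has five terms, so it can range from -5 to +5; dividing by five would give a mean correlation with the same ranking.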


