Computational Techniques Used
The most common way to get a text into machine-readable form is to type it into the computer (and thus onto disk) using a visual display unit. From there, the text can be transferred onto magnetic tape for long-term storage until it is required. One can sometimes bypass this process by finding a version of the text that already exists in machine-readable form. (This usually means that someone has had it typed in for their own research purposes.) This latter procedure was followed in this case, the text being obtained on magnetic tape from The Oxford Archive of Texts in Machine-Readable Form.
Once the text has been obtained in a form that the computer can 'read', the next stage is to ensure that it is free from errors. In this case the proofreading involved two stages. First, a list of all words used in the book (produced by one of the computer programs used for its analysis) was scanned as a check against obvious mistypings. Second, the book as a whole was proofread conventionally as a double-check.
Next, some decisions had to be made as to what extra-textual information was to be reproduced on the computer. 'Extra-textual' in this context means the type of thing that is implicit in the original medium of presentation (the book) but needs to be made explicit for the computer analysis. This includes such things as line division, paging, partition into chapters and so on. It is necessary to include these for the purposes of both input (because they are part of the structure of the text, and constitute one of its levels) and output (so that the results of the study can be translated back into terms that the reader can understand, and so that the critic can use the results of the study to refer accurately to the book).
Consequently, special characters were interpolated into the stored image of the text. A carriage return, detectable by the computer but not visible on the screen, was used to mark the end of each line. At the position in the text corresponding to the top of each page, the page number, prefixed by a star, was typed thus - *008. At the start of each chapter, the chapter number was preceded by another character - chapter nine became @9. One final addition was necessary to make possible many of the analyses required. At the start of each soliloquy, the quotation mark was replaced by the first two letters of the character's name preceded by a '#'. This character was chosen because it did not occur naturally anywhere in the text. For example, #BE What a glorious day ", said Bernard. This convention was necessary because to a computer all strings of characters are read in the same way, and "said Bernard" would convey nothing.
This meant that the computer program could accurately determine who was speaking at any given moment. As the chapters do not contain any text that is not spoken by the characters (with the exception of 'Said X'), this form of marking was sufficient. For other texts, more sophisticated techniques would need to be devised, and in the case of deliberate ambiguity, editorial intervention would be required. The preludes were assigned to a character called the Narrator, as these passages are the closest thing to narrative in The Waves. All of these additional pieces of information act as another level of encoding, parallel with the main text.
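The marking conventions described above can be illustrated with a short sketch. It is written in Python purely for illustration (the thesis programs were written in Pascal and are not reproduced here). The function name `parse_marked_text`, the rule that only lines containing words are counted, and the second sample line are assumptions made for the example. The starred number is treated as a page number, since it sits at the top of each page. The sketch does not remove the 'said X' tags, because the text does not say how these were handled.

```python
import re

# One alternative per marker type, plus a final alternative for ordinary words.
TOKEN = re.compile(
    r"\*(?P<page>\d+)"                       # *008 : top of a page
    r"|@(?P<chapter>\d+)"                    # @9   : start of a chapter
    r"|#(?P<speaker>[A-Z]{2})"               # #BE  : start of a soliloquy
    r"|(?P<word>[A-Za-z]+(?:'[A-Za-z]+)*)"   # an ordinary word
)


def parse_marked_text(text):
    """Return a list of (word, speaker, chapter, page, line) tuples."""
    speaker = chapter = page = None
    line_no = 0
    records = []
    for raw_line in text.splitlines():       # each carriage return ends a line
        counted = False
        for m in TOKEN.finditer(raw_line):
            if m.group("page") is not None:
                page, line_no = int(m.group("page")), 0   # restart line count
            elif m.group("chapter") is not None:
                chapter = int(m.group("chapter"))
            elif m.group("speaker") is not None:
                speaker = m.group("speaker")
            else:
                if not counted:              # assumption: count only lines with words
                    line_no += 1
                    counted = True
                records.append(
                    (m.group("word").lower(), speaker, chapter, page, line_no)
                )
    return records


# The first soliloquy line is the example given in the text; the second is invented.
sample = (
    "*008\n"
    "@9\n"
    '#BE What a glorious day ", said Bernard.\n'
    '#SU So it is ", said Susan.\n'
)
for record in parse_marked_text(sample):
    print(record)
```

Because the speaker code persists until the next '#' marker, every word between two markers is attributed to the same character, which is the property the marking scheme relies on.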
The edition of the text used was the Hogarth Press Uniform Edition. Several other editions have appeared in recent years, all with their own pagination, leading to inconsistency in critical references to the text. Because the uniform edition was used for this study, all references made in this thesis can easily be found in the standard text and are thus not tied to the vagaries of a short-lived edition.
5.2 The Programs and their Output
The bulk of the computer programs that form the basis of this thesis were written by the author in Pascal. An explanation of the overall methodology used, and a detailed description of the programs can be found in Appendix A, Section 2.
All the programs were checked for accuracy by hand. Two pages of test-data were devised to simulate all the possible conditions that might occur in the text. The programs were then run, and the results obtained compared with those calculated by hand for the same test data. This of course does not guarantee the accuracy of the results for the text as a whole, but it does provide some measure of confidence in the results produced by the programs.
Central to the computer analysis was a table of information derived from the text by one of the programs. This main data table contained the following information for every word in the text:
- Count - the number of times the word occurs;
- Frequency - the count divided by the total number of words in the text;
- Length - of the word in characters;
- and a sequence of references to every occurrence of the word, in text order.
Each reference contained:
- Sequence Number - a unique number giving the position of the word in the text;
- Character ID - a two letter code showing who spoke the word on this occasion;
- Section - another two letter code giving the chapter or text division in which the word occurred. (Preludes and Chapters were distinguished by the first character of this code. Thus P1 is the first prelude, and C1 the first chapter);
- Page Number;
- Line Number.
As the book is only articulated by a limited number of persons, the program was also designed to construct seven sub-tables, one for each of the six characters and the Narrator, in addition to the main table for the whole book. This made it much easier to examine the usage of any one character in isolation, or to compare characters as required. Other programs then operated on one or more of these tables to produce the required analyses.
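The shape of the main table and its sub-tables can be sketched as a pair of record types. This is an illustrative Python sketch only, not the Pascal data structures described in Appendix A. The names `Reference`, `Entry`, `build_table` and `build_subtables` are hypothetical, the sample occurrences are invented, and it is an assumption that frequencies in a sub-table are taken relative to that character's own word total.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    sequence: int    # position of the word in the whole text
    speaker: str     # two-letter character code
    section: str     # e.g. "P1" for a prelude, "C1" for a chapter
    page: int
    line: int


@dataclass
class Entry:
    count: int = 0
    frequency: float = 0.0
    length: int = 0
    references: list = field(default_factory=list)


def build_table(occurrences):
    """Build word -> Entry from (sequence, word, speaker, section, page, line)."""
    table = {}
    total = 0
    for seq, word, speaker, section, page, line in occurrences:
        entry = table.setdefault(word, Entry(length=len(word)))
        entry.count += 1
        entry.references.append(Reference(seq, speaker, section, page, line))
        total += 1
    for entry in table.values():
        entry.frequency = entry.count / total   # count / total words in this table
    return table


def build_subtables(occurrences):
    """Build one table per speaker code, keeping the original sequence numbers."""
    by_speaker = {}
    for occ in occurrences:
        by_speaker.setdefault(occ[2], []).append(occ)
    return {spk: build_table(occs) for spk, occs in by_speaker.items()}


# Invented occurrences, for illustration only.
occurrences = [
    (1, "i", "BE", "C1", 8, 1),
    (2, "see", "BE", "C1", 8, 1),
    (3, "a", "BE", "C1", 8, 1),
    (4, "ring", "BE", "C1", 8, 1),
    (5, "i", "SU", "C1", 8, 2),
    (6, "see", "SU", "C1", 8, 2),
]
main_table = build_table(occurrences)
sub_tables = build_subtables(occurrences)
print(main_table["i"].count, sub_tables["BE"]["i"].frequency)
```

Keeping the references in text order means that any later program can recover where and by whom a word was used without rereading the text.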
Because the information contained in the table and sub-tables covered the whole book, it was necessary to break it down into some more readily digestible form. This was done in two different ways.
First, a program was written which accepted either the main table or any of the sub-tables as input, and proceeded to extract each word, and the number of times it was used. From this the program calculated each word's relative frequency, and printed this, together with the words and the counts, in a list. At the end of the list were printed the number of types (or distinct words), the number of tokens (or word occurrences), and the ratio of the natural logarithms of the two for the whole list. (See section 6.3 for discussion of this measure).
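The summary figures printed at the end of each list can be illustrated as follows. This is a Python sketch under stated assumptions: the function name `word_list` is hypothetical, and the ratio is taken here as ln(types) / ln(tokens), since the passage does not state which logarithm is the numerator (section 6.3 discusses the measure itself).

```python
import math
from collections import Counter


def word_list(words):
    """Return (rows, types, tokens, log_ratio) for a sequence of words.

    Each row is (word, count, relative frequency).
    """
    counts = Counter(words)
    tokens = sum(counts.values())    # word occurrences
    types = len(counts)              # distinct words
    rows = [(word, n, n / tokens) for word, n in counts.items()]
    # Assumed order of the ratio: ln(types) / ln(tokens).
    # It is undefined when there are fewer than two tokens.
    if tokens > 1:
        log_ratio = math.log(types) / math.log(tokens)
    else:
        log_ratio = float("nan")
    return rows, types, tokens, log_ratio


rows, types, tokens, log_ratio = word_list("the waves broke on the shore".split())
for word, count, freq in rows:
    print(f"{word:10s} {count:3d} {freq:.4f}")
print(types, tokens, round(log_ratio, 4))
```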
Secondly, a breakdown by sections was produced together with the same attendant information. In this case the number of times a word was used was given as a percentage of the number of words used in that particular section, rather than relative to usage for the whole book. This provided a truer picture of the usage in that section, by avoiding the levelling effect caused by a comparison with overall usage. The total potential number of lists that could thus have been created was sixty-three - seven persons (six characters and the Narrator) multiplied by nine sections. The true number was in fact only fifty-three, as Bernard alone speaks in chapter nine and not all the characters speak in every chapter.
These lists, one for each person and section, made possible the examination of changes in word usage for any character from section to section, and the comparison of characters at any stage of the book. These lists of words were printed out in two ways: first, they were sorted alphabetically to enable quicker location of any given word; and secondly, they were sorted by frequency of usage to enable the most frequently used words for each character and each section to be easily located, as well as to examine the relative importance of various words. The alphabetical lists in particular simplified the plotting of changes in pronoun use.
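The per-person, per-section lists and their two print orders can be sketched as below. This is an illustrative Python sketch with hypothetical function names and invented sample data. Breaking ties alphabetically in the frequency ordering is an assumption. A list is created only for a person and section that actually contain words, which is how fewer lists than the potential maximum would arise.

```python
from collections import Counter, defaultdict


def section_lists(occurrences):
    """Map (speaker, section) -> rows of (word, count, percent of that list's words)."""
    groups = defaultdict(Counter)
    for word, speaker, section in occurrences:
        groups[(speaker, section)][word] += 1
    lists = {}
    for key, counts in groups.items():
        total = sum(counts.values())
        lists[key] = [(w, n, 100.0 * n / total) for w, n in counts.items()]
    return lists


def alphabetical(rows):
    """Sort rows by word, for quick location of a given word."""
    return sorted(rows, key=lambda row: row[0])


def by_frequency(rows):
    """Sort rows with the most frequent word first (ties alphabetical, an assumption)."""
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Invented occurrences: (word, speaker code, section code).
occurrences = [
    ("i", "BE", "C1"), ("see", "BE", "C1"), ("i", "BE", "C1"), ("a", "BE", "C1"),
    ("we", "BE", "C9"),
    ("i", "SU", "C1"),
]
lists = section_lists(occurrences)
print(len(lists))
print(alphabetical(lists[("BE", "C1")]))
print(by_frequency(lists[("BE", "C1")]))
```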
Some information could not be derived from the main table. A special program was designed to go through the text once and obtain the length of every sentence in words, and the length of every soliloquy, both in words, and in sentences. These lists were then broken down both by character, and by section. Using these lists it was possible to determine the mean sentence and soliloquy length for each character and section.
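A sketch of how such lengths might be gathered follows. It is written in Python for illustration and is not the author's program. Treating '.', '!' and '?' as sentence boundaries is an assumption, as are the function names and the invented sample soliloquies.

```python
import re
from statistics import mean

SENTENCE_END = re.compile(r"[.!?]+")            # assumed sentence terminators
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def soliloquy_lengths(text):
    """Return (words per sentence, total words, number of sentences)."""
    sentences = [s for s in SENTENCE_END.split(text) if WORD.search(s)]
    sentence_words = [len(WORD.findall(s)) for s in sentences]
    return sentence_words, sum(sentence_words), len(sentences)


def mean_lengths(soliloquies):
    """Map speaker -> (mean sentence length in words,
    mean soliloquy length in words, mean soliloquy length in sentences)."""
    buckets = {}
    for speaker, text in soliloquies:
        sent, words, n = soliloquy_lengths(text)
        b = buckets.setdefault(speaker, {"sent": [], "words": [], "n": []})
        b["sent"].extend(sent)
        b["words"].append(words)
        b["n"].append(n)
    return {
        spk: (mean(b["sent"]) if b["sent"] else 0.0, mean(b["words"]), mean(b["n"]))
        for spk, b in buckets.items()
    }


# Invented soliloquies, for illustration only.
sample = [
    ("BE", "I see a ring. It quivers."),
    ("BE", "Look!"),
    ("SU", "I see a slab."),
]
print(mean_lengths(sample))
```

The same grouping could be keyed on section instead of speaker to give the breakdown by section.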
As well as all these lists, the computer was also used for statistical analysis of the data. These analyses were performed with the aid of SPSS (Statistical Package for the Social Sciences), a collection of programs designed to perform a wide range of statistical analyses. Most basic of the analyses possible with this package are summary statistics - the mean, the variance, the range of the input data, the standard error, the kurtosis of the curve (assuming a standard distribution), the minimum, the maximum, the skewness of the curve, and the standard deviation. Of all these, the mean or average was of most immediate interest. This measure was obtained for two groups of data. Firstly, it was used to determine the average word length for all the characters and all the sections from the main data table. Secondly, it was used to determine the average sentence and soliloquy lengths for all the characters and sections using data from the program mentioned above.
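For readers without access to SPSS, the summary statistics listed above can be approximated with a short Python sketch. This is not SPSS and is not guaranteed to reproduce its output: skewness and kurtosis are computed here from simple population moments (with kurtosis reported as excess over 3), which is an assumption and may differ from the package's small-sample formulas. The function name `summary` and the sample values are invented.

```python
import math
from statistics import mean, stdev, variance


def summary(values):
    """Summary statistics for a list of at least two non-identical numbers."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)                               # sample standard deviation
    m2 = sum((x - m) ** 2 for x in values) / n       # population central moments
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return {
        "mean": m,
        "variance": variance(values),                # sample variance (n - 1)
        "std_dev": sd,
        "std_error": sd / math.sqrt(n),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "skewness": m3 / m2 ** 1.5,                  # moment-based (assumption)
        "kurtosis": m4 / m2 ** 2 - 3.0,              # excess kurtosis (assumption)
    }


# Invented word lengths, for illustration only.
word_lengths = [2, 4, 4, 4, 5, 5, 7, 9]
for name, value in summary(word_lengths).items():
    print(f"{name:10s} {value:.4f}")
```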
Another of the routines provided by SPSS gives a measure of the correlation between two variables, based on lists of the values taken by the two variables at the same time. Two types of correlation were performed: correlations between characters for each section, and correlations between sections for each character. The first was of interest for the light it threw on the question of differentiation, and the second for development.
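The idea of correlating two paired lists can be illustrated with Pearson's product-moment coefficient. The passage does not say which coefficient was requested from SPSS, nor exactly which values were paired, so both the choice of Pearson's r and the sample data below (relative frequencies of the same five words for two characters) are assumptions made for the sketch.

```python
import math


def pearson(xs, ys):
    """Pearson's r for two equal-length lists of paired values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least two pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sxy / math.sqrt(sxx * syy)


# Invented relative frequencies of the same five words for two characters.
character_a = [0.050, 0.031, 0.022, 0.010, 0.004]
character_b = [0.047, 0.035, 0.018, 0.012, 0.006]
print(round(pearson(character_a, character_b), 3))
```

A value near 1 would indicate that the two characters use these words in similar proportions, and a value near 0 that their usage is unrelated.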
©Andrew Treloar, 2017. W: http://andrew.treloar.net/ E: email@example.com
Last modified: Monday, 11-Dec-2017 14:42:24 AEDT