Appendix A

Broadly speaking, there are three main techniques for getting a literary work into machine-readable form. The first, more popular in the early days of this type of research but seldom used nowadays, was to punch the text onto 80-column cards, one card per line of text. These cards were then read into the computer. The second, and the preferred method nowadays, is to type the text directly into the computer using a conventional VDU. From there, the text can be edited as required and then either transferred onto magnetic tape for long-term storage or kept on disk until it is required. The third is optical character recognition, where current developments in optical character readers (OCRs) deserve mention. Such units used to be both extremely expensive and limited in scope. Recently, Kurzweil has produced a device costing some $25,000 which will read a variety of typefaces and styles. This is already cost-effective and, as the price inevitably drops, should revolutionize the field of text entry. Software is also starting to appear which will take a scanned page and interpret it as text.

Owing to the steadily increasing interest worldwide in this type of investigation, there is a growing number of clearing-houses in America, Britain, Europe and elsewhere from which one can obtain texts in machine-readable form. These have already been input by someone else, either specifically for acquisition by the clearing-house or because an earlier researcher has worked on them. These texts are sent on magnetic tape, in a standard form, and are read into the computer at their destination. This was the procedure in the case of The Waves, the tape being obtained from The Oxford Archive of Texts in Machine-Readable Form.

Once one has input the text, the next stage is to ensure that it is free from errors. There are four main methods for doing this. The first involves examining the text using a VDU and an editor (preferably screen-based), comparing it closely with a copy of the book. This is very tedious and time-consuming, and requires great care on the part of the examiner, as simple mistakes are very hard to detect. The second method is to write a program to scan the text and print a list of all the words used in it and their locations. This list is then examined for any obvious mistypings. This method is quicker than the first, as it does not require one to read the whole text, but it fails to catch mistypings which look like other words. For instance, omission of the plural ending '-s' on a noun will change the word, yet the result will not look out of place in the final list.
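The second method can be illustrated with a short sketch. This is not the program used in the study; it is a minimal Python example, and the function name `build_word_index`, the use of line numbers as 'locations', and the two-line sample (with a deliberately seeded mistyping, 'sae') are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_word_index(text):
    """Map each distinct word (lower-cased) to the line numbers where it occurs.

    Hypothetical helper: 'location' is taken to mean the line number,
    counting from 1. Words may contain internal apostrophes or hyphens.
    """
    index = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", line):
            index[word.lower()].append(line_no)
    return dict(index)


# Illustrative sample only; 'sae' is a seeded mistyping of 'sea'.
sample = (
    "The sea was indistinguishable from the sky .\n"
    "The sae was slightly creased ."
)

index = build_word_index(sample)
for word in sorted(index):
    print(word, index[word])
```

In the printed list 'sae' stands out as a non-word and can be traced to line 2. A mistyping such as 'wave' for 'waves' would not stand out, since it is itself a valid word, which is the limitation noted above.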

The third method is to use a combination of the first two. First one scans the word list to catch the obvious errors, and then one examines the whole text, both to check for things that do not show up in the list and as an additional safeguard.

The fourth method is by far the most certain, and also the most expensive. Two typists are employed, each to type one version of the text. If possible they should also use different terminals, so as to guard against the foibles of a particular keyboard, such as a sticking key. A file-comparison utility, found under most operating systems, is then employed to compare the two versions and print a list of all the differences between them. This list is then used to correct one of the versions of the text. As the chances of both typists making exactly the same mistake, at exactly the same place, are extremely small, this method produces a high degree of accuracy, of the order of 99.999%.

Recently, a fifth method has appeared, developed at the University of Michigan for proofreading the Old Testament [93]. This technique is basically a variation on the first method, but a particularly clever one. It has been discovered that proofreaders make more mistakes when checking a relatively error-free text than when the text contains many errors. The pages of perfect text lull them into complacency. The new technique involves artificially placing errors in the text at random, and making a list of their locations. The optimum rate of induced errors has been found to be about one per page, on average.

A careful note is then taken of the errors detected by the proofreaders, and a count made of which ones are genuine and which induced. Then the ratio of detected induced errors to total induced errors is calculated. If the proofreaders are finding three quarters of the induced errors, then presumably they are also finding the same proportion of the genuine errors. In this way it is easy to calculate the probable number of genuine errors remaining in the text. Also, by keeping the proofreaders on their toes, the detection rate of genuine errors is improved.

It was hoped that the fourth method discussed above had been employed in the preparation of the copy of the text used in this study, but the appearance of a few errors in the early pages soon dashed this expectation. Therefore the third method, the most accurate after the fourth (which would have required typing another copy of the text), was used to proofread the text. The text is still not completely error-free, but it is close (perhaps 99.9% accurate).
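The arithmetic behind the induced-error estimate can be made concrete. The sketch below is an illustration only, not the Michigan procedure itself: the function name `estimate_remaining_errors` and the sample counts are hypothetical, and it assumes, as the text does, that genuine errors are detected at the same rate as induced ones.

```python
def estimate_remaining_errors(induced_total, induced_found, genuine_found):
    """Estimate how many genuine errors are still undetected.

    Assumes genuine errors are found at the same rate as induced ones.
    Returns (detection_rate, estimated_genuine_errors_remaining).
    """
    if induced_total <= 0:
        raise ValueError("at least one error must have been induced")
    if not 0 < induced_found <= induced_total:
        raise ValueError("induced_found must be between 1 and induced_total")

    # Proportion of the deliberately planted errors that were caught.
    detection_rate = induced_found / induced_total
    # If the same proportion of genuine errors was caught, scale up.
    estimated_genuine_total = genuine_found / detection_rate
    return detection_rate, estimated_genuine_total - genuine_found


# Hypothetical counts: 40 errors planted, 30 of them caught,
# and 12 genuine errors caught along the way.
rate, remaining = estimate_remaining_errors(40, 30, 12)
print(rate, remaining)  # 0.75 4.0
```

With three quarters of the induced errors found, the 12 genuine errors detected imply about 16 in total, so roughly 4 are estimated to remain.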

Next, some decisions had to be made as to what extra-textual information was to be reproduced on the computer. 'Extra-textual' in this context means the type of thing that is implicit in the original medium of presentation (the book) but needs to be made explicit to the computer. This includes such things as line division, paging, partition into chapters and so on. It is necessary to reproduce these so that the results of the study can be translated back into terms that the reader can understand, and so that the critic can use the results of the study to refer accurately to the book. Consequently, special characters were interpolated into the stored image of the text (for a sample showing the conventions used, see section A.1.1 below). A carriage return was placed to mark the end of each line. At the position in the text corresponding to the top of each page, the page number, prefixed by a star, was typed thus: *008.

The pages had already been marked by whoever had previously analyzed the text, although the convention was different. All I had to do was alter it to fit my program's 'expectations'. At the start of each chapter, the chapter number was preceded by another character: chapter nine became @9. Again, all I had to do was alter the convention used. One final addition was necessary to make possible many of the analyses required. At the start of each soliloquy, the quotation mark was replaced by the first two letters of the character's name preceded by a '#'. For example, #BE What a glorious day ", said Bernard. This meant that the program used for analysis could easily recognize who was speaking [94]. This was necessary because to a computer all strings of characters 'look' alike, and "said Bernard" would convey nothing to it. The use of '#', a character that does not occur naturally in the text, meant that the computer could be instructed to take special action whenever it encountered it. Finally, to enable examination of the use of punctuation if desired, punctuation marks which are not normally separated from the bulk of the text (such as full stops and commas) were delimited by spaces. The fully prepared text was then backed up on tape in case of an accident, such as a disk crash.
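To show how a program might take 'special action' on these markers, here is a minimal Python sketch. It is not the analysis program used in the study. The function name `annotate_tokens` is hypothetical, and the sketch assumes that every marker (*page, @chapter, #speaker) appears as its own space-delimited token and that a speaker tag stays in force until the next one.

```python
def annotate_tokens(text):
    """Attach chapter, page, line and speaker to every ordinary token.

    Assumed conventions (space-delimited tokens):
      *008  page marker     @9  chapter marker     #BE  speaker tag
    Returns a list of (chapter, page, line_no, speaker, token) tuples.
    """
    chapter = page = speaker = None
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            if token.startswith("*") and token[1:].isdigit():
                page = int(token[1:])        # top of a new page
            elif token.startswith("@") and token[1:].isdigit():
                chapter = int(token[1:])     # start of a new chapter
            elif token.startswith("#") and len(token) > 1:
                speaker = token[1:]          # start of a new soliloquy
            else:
                records.append((chapter, page, line_no, speaker, token))
    return records


# Illustrative input only, in the style of the test data in A.1.1.
sample = (
    '@1 *008\n'
    '#B I see a ring , " said Bernard .\n'
    '#S I see a slab , " said Susan .'
)

records = annotate_tokens(sample)
print(records[0])   # (1, 8, 2, 'B', 'I')

# Count tokens (words and punctuation marks) per speaker.
counts = {}
for _, _, _, who, _ in records:
    counts[who] = counts.get(who, 0) + 1
print(counts)       # {'B': 9, 'S': 9}
```

Because punctuation is delimited by spaces, a plain whitespace split yields punctuation marks as separate tokens. In this sketch the narrator's attribution ("said Bernard") is counted under the current speaker; a fuller program would need a rule for separating it out.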

A.1.1     Test Data

This was the data used to test the programs I wrote to analyze the text. The data is drawn from The Waves, but does not correspond literally to a particular section.



#NA The sun had not yet risen . The sea was indistinguishable from the sky , except that the sea was slightly creased as if a cloth had wrinkles in it . Gradually as the sky whitened a dark line lay on the horizon dividing the sea from the sky and the grey cloth became barred with thick strokes moving , one after another , beneath the surface , following each other , pursuing each other , perpetually .


The blind stirred slightly , but all within was dim and unsubstantial . The birds sang their blank melody outside . "


#B I see a ring , " said Bernard , " hanging above me . It

quivers and hangs in a loop of light . "

#S I see a slab of pale yellow , " said Susan , " spreading

away until it meets a purple stripe . "

#R I hear a sound , " said Rhoda , " cheep , chirp ; cheep ,

chirp ; going up and down . "

#N I see a globe , " said Neville , " hanging down in a drop



against the enormous flanks of some hill . "

#J I see a crimson tassel , " said Jinny , " twisted with gold

threads . "

#L I hear something stamping , " said Louis . " A great

beast's foot is chained . It stamps , and stamps , and stamps . "




#NA The sun had not yet risen . The sea was indistinguishable from the sky , except that the sea was slightly creased as if a cloth had wrinkles in it . Gradually as the sky whitened a dark line lay on the horizon dividing the sea from the sky and the grey cloth became barred with thick strokes moving , one after another , beneath the surface , following each other , pursuing each other , perpetually .




#R The grey-shelled snail draws across the path and flattens

the blades behind him , " said Rhoda .

#L And burning lights from the window-panes flash in

and out of the grasses , " said Louis .

#N Stones are cold to my feet , " said Neville . " I feel each one , round or pointed , separately . "

#J The back of my hand burns , " said Jinny , " but the

palm is clammy and damp with dew . "

#B Now the cock crows like a spurt of hard , red water

in the white tide , " said Bernard .

#S Birds are singing up and down and in and out all round

us , " said Susan .

©Andrew Treloar, 2015.

Last modified: Monday, 18-Sep-2017 03:29:59 AEST