A.2.2     Producing the Data Table

The most time-consuming part of any study of a book-length text is the need for repeated passes through the text each time some new piece of information is required. This is no less true for the computer. As computer time costs money, any reduction in this time is desirable 96 . To this end I wrote one large program, called TABBUILD 97 (TABle BUILD), which takes the whole text as input, makes one pass through it, and produces a table on disc containing almost all the information required by the other programs. These other programs then use this data table, rather than the text, as input. This saves considerable time, because the data table can be organized for quick access.

The main data structure used by TABBUILD is a binary tree, with each distinct word type as a node. Each node carries a linked list of token references, kept in order of occurrence. The table is derived from the text as follows.

The program proceeds through the text reading in one input token at a time, where a token is defined as everything between two spaces (line ends are regarded as spaces for this purpose). Each time it reads a token, it searches the tree to see whether it has encountered that word before (the binary tree structure was chosen because it provides near-optimal search times for a data structure in memory). If the token is a new type, it is added to the tree along with a reference giving information about its first appearance. If it is a word type already encountered, the program adds another reference for this occurrence to the end of the appropriate linked list. TABBUILD proceeds in this way until the end of the text. When this is reached, TABBUILD writes out the tree as a direct-access file of fixed-length records, each 40 characters long.
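The single pass described above can be sketched in Python. This is only an illustration under stated assumptions, not TABBUILD itself: the names `Node`, `insert`, `build` and `in_order` are hypothetical, a Python list stands in for the linked list of references, each reference is reduced to its sequence number, and the alphabetical output order is assumed rather than taken from the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One word type; refs holds its occurrences in text order."""
    word: str
    refs: list = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, word, seq):
    """Record one occurrence of word and return the root of the tree."""
    if root is None:
        return Node(word, [seq])
    node = root
    while True:
        if word == node.word:
            node.refs.append(seq)          # known type: append a reference
            return root
        branch = "left" if word < node.word else "right"
        child = getattr(node, branch)
        if child is None:                  # new type: add a node
            setattr(node, branch, Node(word, [seq]))
            return root
        node = child


def build(text):
    """One pass over the text; split() treats line ends as spaces."""
    root = None
    for seq, token in enumerate(text.split(), start=1):
        root = insert(root, token, seq)
    return root


def in_order(node):
    """Yield (word, refs) pairs in alphabetical order (an assumption)."""
    if node is not None:
        yield from in_order(node.left)
        yield node.word, node.refs
        yield from in_order(node.right)


tree = build("the waves broke\non the shore the waves")
for word, refs in in_order(tree):
    print(word, refs)
```

Because references are appended as the tokens are read, each list is already in order of occurrence and never needs sorting.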

Each word takes up at least two records, and potentially many more. A '*' denotes the start of a new word type. The following information for each word takes the first forty characters: Count - the number of times it occurs; Frequency - the count divided by the total number of words in the text; Length - in characters. The references to every occurrence of the word take the remaining records, two per record. Each reference contains: Sequence Number - a unique number giving the position of the word in the text viewed as a sequence of input tokens; Character ID - a two-letter code showing who spoke the word on this occasion; Section - another two-letter code giving the chapter or text division in which the word occurred (for The Waves, a distinction is made between Preludes and Chapters; thus P1 is the first prelude, and C1 the first chapter). The final two pieces of location information in each reference are Page Number and Line Number.
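As a rough illustration of this layout, the sketch below packs one word into 40-character records in Python. The text gives the record length, the fields, and the two-references-per-record rule; everything else here is an assumption made for the example: the field widths, the presence of the word itself in the header, the speaker codes ("BE", "RH", "NA"), and the names `Ref` and `encode_word`. It uses the '/' separator and '$' terminator that the following paragraph describes.

```python
from dataclasses import dataclass

RECORD_LEN = 40  # fixed record length given in the text


@dataclass
class Ref:
    seq: int        # position in the text, counted in input tokens
    speaker: str    # two-letter character ID (codes here are hypothetical)
    section: str    # two-letter section code, e.g. "P1" or "C1"
    page: int
    line: int

    def encode(self):
        # Assumed widths: 6 + 2 + 2 + 4 + 3 = 17 characters.
        return (f"{self.seq:06d}{self.speaker:<2.2}{self.section:<2.2}"
                f"{self.page:04d}{self.line:03d}")


def encode_word(word, refs, total_tokens):
    """Return the header record plus the reference records for one word."""
    count = len(refs)
    # Assumed header layout: '*', word (20), count (6), frequency (9), length (3).
    header = (f"*{word:<20.20}{count:06d}"
              f"{count / total_tokens:9.7f}{len(word):03d}")
    records = [header.ljust(RECORD_LEN)]
    for i in range(0, count, 2):
        body = "/".join(r.encode() for r in refs[i:i + 2])
        # '/' carries the list into the next record; '$' closes it.
        body += "$" if i + 2 >= count else "/"
        records.append(body.ljust(RECORD_LEN))
    return records


refs = [
    Ref(12, "BE", "C1", 7, 3),
    Ref(340, "RH", "C2", 41, 18),
    Ref(900, "NA", "P3", 102, 1),
]
records = encode_word("waves", refs, total_tokens=1000)
for record in records:
    print(repr(record))
```

With three references the word occupies three records: the header, one full reference record, and a final record holding the last reference and the terminator.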

The references are separated by a '/', and the list of references for a particular word type is terminated by a '$'. The program also generates a keyed-access index to enable fast direct access to the main data file for the purpose of (among other things) a future concordance-generator program. This is necessary because the main data file is so large 98 , and a sequential search for each word would be prohibitive in terms of both human and computer time. Because the book is articulated by only a limited number of voices, I designed TABBUILD to construct seven additional data files and indices as well, one for each of the six characters and the narrator. This makes it much easier to examine the usage of any one character in isolation, or to compare characters as required.
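The idea behind the keyed-access index can be sketched as follows, again in Python and again under assumptions: the index is modelled as a dictionary from word to the record number of its header, the word is assumed to occupy columns 2 to 21 of the header, and `build_index`, `fetch` and the sample records are hypothetical. The text does not describe the index format TABBUILD actually used.

```python
import io

RECORD_LEN = 40  # fixed record length given in the text


def build_index(data):
    """Map each word to the record number of its '*' header record."""
    index = {}
    for recno in range(len(data) // RECORD_LEN):
        record = data[recno * RECORD_LEN:(recno + 1) * RECORD_LEN]
        if record.startswith(b"*"):
            # Assumed layout: the word sits in columns 2-21 of the header.
            index[record[1:21].decode("ascii").strip()] = recno
    return index


def fetch(f, index, word):
    """Seek straight to a word's header and read up to the '$' terminator."""
    f.seek(index[word] * RECORD_LEN)
    records = [f.read(RECORD_LEN).decode("ascii")]   # header record
    while True:
        record = f.read(RECORD_LEN).decode("ascii")
        if not record:
            raise ValueError("reference list is missing its '$' terminator")
        records.append(record)
        if record.rstrip().endswith("$"):
            return records


def rec(s):
    """Pad a string to one fixed-length record."""
    return s.ljust(RECORD_LEN).encode("ascii")


# A tiny hand-made data file; only the word field of each header is filled in.
data = b"".join([
    rec("*shore"), rec("000006NAP10001004$"),
    rec("*waves"), rec("000002NAP10001001/000008BEC10003012$"),
])
index = build_index(data)
found = fetch(io.BytesIO(data), index, "waves")
print(index)
print(found[1].rstrip())
```

Fixed-length records are what make this work: a record number converts to a byte offset by a single multiplication, so a lookup touches only the records belonging to the requested word.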


©Andrew Treloar, 2015. W: http://andrew.treloar.net/ E: andrew.treloar@gmail.com

Last modified: Monday, 18-Sep-2017 03:30:01 AEST