PDP-10 Archive: 43,50270/spell.doc from decuslib10-03

Trailing-Edge - PDP-10 Archives - decuslib10-03 - 43,50270/spell.doc

There are 6 other files named spell.doc in the archive. Click here to see a list.

 SPELL: Spelling Check and Correction Program
1.0 INTRODUCTION
SPELL is a program designed to read text files and check
them for correctness of spelling. In addition to the
spelling check, the program provides a means for correcting
words that it thinks are misspelled. This program was
written by Ralph E. Gorin of Stanford University Artificial
Intelligence Laboratory. It has been augmented by William
Plummer and Jerry Wolf of BBN and Marshall Abrams of NBS. 
In its normal mode of usage, SPELL reads through an input
text file, asks the user about each word it does not
recognize, and creates an output file in which corrections
have been made. Provisions exist for:
 1. Loading, incrementally augmenting, and dumping
 special dictionaries. Such dictionary files are
 ordinary text files which may be listed and edited.
 Arbitrarily many dictionaries may be loaded,
 subject only to availability of (virtual) main
 memory. 
 2. Training modes where SPELL scans an input file and
 makes a list of all words it does not recognize.
 Such a list can be used as an auxiliary dictionary.
 3. Termination of spelling checking part way through a
 file and a way of picking up where you left off in
 a later session.
Other features of the program are:
 4. SPELL will read either SOS, TECO or E/TV files.
 The corrected output will be written in the same
 mode, except E/TV directories must be deleted.
 Dictionary files may be SOS, TECO or E/TV format.
 5. An exception file may also produced. This file
 contains all words SPELL did not recognize (and
 their contexts), all corrections, plus all words
 which SPELL recognized by stripping off prefixes
 and/or suffixes. This last class of words appears
 marked with "[" or "]" to denote prefix- or suffix-
 removal. E.g., FOOING] [PREFOOED] The
 affix-stripping algorithms are not foolproof (e.g.,
 CHOSES] ), so this gives a quick way to scan for
 the exceptions which may slip through.
 6. SPELL keeps track of all word spellings which were
 corrected by the user. Subsequent occurrences of
 such are automatically corrected by the program.

 Page 2
 This is reported to the user by a typeout of
 "MISSPELLING ==> CORRECTION".
 7. When a word is corrected, the output file will be
 rewritten with either upper case, lower case, or
 mixed (first letter upper, the remainder in lower),
 depending on the cases of the first two letters in
 the original word. Note: this will be incorrect
 in some cases (e.g., McCarthy).
2.0 USING SPELL
2.1 Starting SPELL
Type the command R SPELL (under TENEX, type "SPELL" to the
EXEC). All typeins to SPELL must be terminated by carriage
return. (In the TENEX version, editing by ^A, ^Q, and ^R
may be done.) (TOPS-10 accepts ^U, DEL, and, depending on
MONGEN parameters, BS.)
2.2 Augmenting The Built-in Dictionary
The non-TENEX version will first attempt to load the default
private dictionary file WORDS.DIC into dictionary 1 with
incremental insertion set (see Incremental Insertion below).
Next, SPELL will ask: "Do you want to augment the
dictionary?" If you wish to use only the main dictionary
(plus WORDS.DIC) presently in memory, type <cr>. You can
then skip the next paragraph. 
2.2.1 Private Auxiliary Dictionary - If in fact you have an
additional auxiliary dictionary (of specific terms,
infrequently used words, etc.) you wish to use, type "Y"
<cr>. You will then be asked for the name of the dictionary
file.
2.2.1.1 Incremental Insertions - After typing the file name
you will be given the option of marking the new entries as
incremental insertions. If the new entries are marked as
incremental then they will be included in an incremental
dump of the dictionary. To have the new entries marked as
incremental, type "I" <cr>; otherwise, type <cr>. (If any
of the words in your auxiliary dictionary are already in the
main dictionary then no second copy of the word will be

 Page 3
made. Hence, if your words are marked as incremental then
in a subsequent incremental dump, any words that were
already in the dictionary will not be dumped.)
2.2.1.2 Additional Dictionaries - After loading an
auxiliary dictionary the program will type the new total
number of words in the dictionary (and, except under TENEX,
the amount of core used). You will then have an opportunity
to save the new core image (normally you won't do this).
You will again be asked, "Do you want to augment the
dictionary?", thus allowing you to enter a number of
auxiliary dictionaries (limited only by the availability of
{virtual} core). 
2.2.1.3 Dictionary Format - The format of the file is one
dictionary entry (word) on a line; words must be composed
of alphabetic characters or apostrophe and less than 40
letters long. The dictionary entries need not be in
alphabetical order. A misspelling-correction pair may occur
one line in the form "MISSPELLING>CORRECTION".
2.3 Switches
You will then be given an opportunity to specify zero or
more switch options. The meanings of the switches are:
 T Training mode. SPELL will treat the input file as
 a training set rather than a file to be corrected.
 All words in the file which are unfamiliar to
 SPELL will be entered in the dictionary as
 incremental insertions. After SPELL finishes
 reading the file, the user has an opportunity to
 dump all the words that were inserted in this
 manner. The resulting list of words may be
 edited, and any words which are incorrect may be
 deleted. Then this file can be used as an
 auxiliary dictionary while correcting the original
 source file.
 This feature is provided for the purpose of easing
 the problem of creating a specialized dictionary
 of jargon and infrequently used words.
 Q Q-Training mode. In this mode, all words in the
 source file that are unfamiliar to SPELL will be
 added to the dictionary; the difference is, if
 any "new" word is "close to" some old word, the
 new word will be output to the exception file.
 The exception file will contain only such words.

 Page 4
 In this way, the spelling checker calls to your
 attention the fact that these words may be
 misspelled.
 N No suffix removal. This switch suppresses the
 attempt to remove suffixes to recognize a
 correctly spelled root word. SPELL will then find
 many more questionable words, but it will work
 more correctly than the heuristic affix removal. 
 A No prefix removal. This switch suppresses the
 attempt to remove prefixes to recognize a
 correctly spelled root word.
 U Accept Upper case mode. In this mode, all words
 that are written entirely in upper case will be
 inserted in the dictionary. This is useful when a
 manuscript file contains jargon terms that are
 written in upper case, and text-processor (e.g.,
 PUB, TJ6, or RUNOFF) commands in upper case.
 Before reading the input file, SPELL will ask for
 a dictionary number to use for all words inserted
 this way (see "How to Use Multiple Dictionaries").
 P Pickup mode. After specifying input and output
 file names, you will be asked to specify a page
 and line number for pickup. The effect is to
 suspend spelling checking until the page and line
 specified. When a user has a partially corrected
 file, this mode will enable him to skip over the
 portion of the file that has already been
 corrected. The input file will be copied without
 checking to the output until the page and line
 specified, at which point spelling checking
 begins.
2.4 File To Be Corrected
Next you will be asked for the name of the file that you
want to check for spelling errors. File names are specified
in the usual format of "name.ext[prj,prg]" where name is the
filename, ext is the file extension, and [prj,prg] is the
name of the file owner, which may be omitted if the file is
on the present user's disk area. If you omit the file name
then you will immediately enter the exit sequence (see
below).

 Page 5
2.5 Output File
You will be next asked to name the output file. Enter a
file name, or just a <cr> if you wish to use the default
(see next paragraph).
2.5.1 Default Output File(s) - In the TENEX version there
are always output and exception files; by default the
output file will have the same name and extension as the
input, and the exception file will have the extension
"EXCEPTIONS". In non-TENEX versions the default output is
to rename the input file with extension .BAK and to place
the corrected text in the (previous) input file name. The
exception file may be omitted. 
2.5.2 Exception File - The exception file, should you chose
to make one, will contain each line on which an error was
found, the indication of the page and line number, and the
suspect word. Words accepted via the affix removal
heuristics will also appear in the exception file. 
2.6 Checking And Correcting
After you have specified all the files, the program will
respond with "Working..." and start checking the input file
for spelling errors. 
2.6.1 Choices When An Unknown Word Is Found - When the
spelling checker encounters a word that isn't in the
dictionary, it will type the page and line number, the line
in which the word occurs, and the word itself.
In general, when a word is found that is not in the
dictionary, a brief message will be typed to remind you of
the possible choices. In the special case where the program
finds precisely one possible correction for the word, you
will be given the choice of typing C to accept the "guess"
or any other option. The options are:
 C The "guess" is correct. Correct the word in the
 text. Enter misspelling in dictionary so that it
 will be corrected if it appears again.
 A Accept this word, this one time. 
 I Accept this word and insert it in the dictionary
 so that subsequent occurrences of this word will

 Page 6
 be recognized and accepted. Words that are
 inserted this way are marked as incremental
 insertions and they may be dumped to form an
 auxiliary dictionary.
 R Replace this word. Type "R" <cr> and the program
 will ask you for the replacement word. If the
 replacement word is not already in the dictionary,
 the program will give you an opportunity to insert
 it.
 If the misspelling is due to an omitted space
 between words, use the "R" command to retype the
 words with the space.
 If the replacement word contains no spaces or
 illegal characters, you will be asked if you wish
 to add this replacement to the list of
 misspelling-correction pairs. If you do add the
 replaced word as a misspelling then all subsequent
 occurrences of that word will be replaced with the
 replacement string. 
 X Accept this word and finish. The word will be
 accepted. Then the remainder of the input file
 will be copied without checking to the output
 file.
 W Save my incremental insertions. After you type
 "W" <cr> you will be asked for a file name. Then
 an incremental dump of the dictionary will be
 written into the file. After the dump is complete
 you may then decide what to do with the excepted
 word.
 L Load an auxiliary dictionary. The present word is
 accepted and you will be asked for the name of the
 dictionary file to load. This is useful if you
 encounter a jargon term but forgot to load the
 appropriate dictionary. 
 D Display the line and offending word again. The
 line that is displayed will not have any
 corrections shown in it. If a line has more than
 one error the line will only be typed once.
 Subsequent errors on that line will cause only the
 particular word to be typed, unless this command
 is used.
 S If this choice is offered then the spelling
 checker has discovered several words that could be
 possible corrections of this word. If you type
 "S" <cr> then you will enter a mode where you can
 look at the words that were found by the program
 and (optionally) select one of the words from the

 Page 7
 list.
 When you enter this selection submode, the first
 word in the list of possible corrections will be
 typed followed by an asterisk. Then you have the
 following choices:
 C<cr> Use this word as the Correction.
 <cr> Show the next possible choice. When you
 exhaust the choices you are returned to the
 outer mode, and asked again.
 ^<cr> Back up in the list.
 <esc> Escape from this submode and return to the
 outer command mode.
Note that when you make a correction via the C command or by
selection from the list presented by the S command, that
correction is entered in the misspelling-correction list and
subsequent occurrences of the same misspelling will be
corrected automatically. 
2.7 Finished Processing The File
When the input file is exhausted, all files are closed, the
program types "Finished.", and the exit sequence is entered.
Except on TENEX, you will be asked if you want the default
dictionary WORDS.DIC dumped. Answer "yes" or "no" (actually
only a single letter answer is required). The user then has
several options:
E Exit now.
S Save this core image.
C Go back and correct another file.
A Augment the dictionary, set new switches, and correct
 another file.
D Complete dump of the dictionary. This will create a
 very large file, and it is not usually recommended.
I Incremental dump of the dictionary. All the words that
 were read in from a private dictionary and marked with
 an I plus those inserted while running the program are
 dumped to a file. The user specifies a file name (the
 default is WORDS.DIC). This incremental file is in a
 format suitable for editing or for use as an auxiliary
 dictionary. The words in this file are in alphabetical
 order.

 Page 8
X This command is used to get a trace count of the
 program. It is for diagnostic purposes only, and is
 displayed as a possible choice only if the program has
 been assembled as a debugging program.
3.0 HOW TO USE MULTIPLE DICTIONARIES
SPELL has a set of features whereby the user can cause the
creation of several disjoint incremental dictionaries. In
this way, the user may collect several dictionaries of
special terms. Internally, all dictionary entries are
considered equivalent as regards word searches. The
distinction between dictionaries becomes relevant when doing
incremental dumps (the I command during the exit sequence or
the W command while in the middle of execution). When an
incremental dump is requested, the user may specify a
number, e.g., W9, which selects the particular incremental
dictionary to be dumped. In this example, dictionary 9 will
be dumped. 
3.1 Dictionaries 0 And 1
Dictionary 0 is the main dictionary. Words cannot be added
to this dictionary, except by reading an auxiliary file. In
general, words that are inserted incrementally are marked as
being in dictionary 1. All words that are incremental
insertions in the dictionary will be marked in dictionary 1,
unless the user specifies otherwise. 
3.2 Specifying A Dictionary
The following places are where the user may specify which
dictionary to add to:
 1. When loading an auxiliary dictionary, if the user
 responds with "In" to the question about marking
 new entries as incremental, then the new entries
 will be marked in dictionary number n (where n is
 interpreted as decimal and should be less than 31).
 2. After a word has been rejected, type "In" to insert
 the word in dictionary number n.
 3. After replacing a word, if the replacement is not
 in the dictionary, then type "In" to insert the
 replacement into dictionary n.

 Page 9
When requesting an incremental dump, the user may specify
the particular dictionary to dump. This is allowed in two
cases:
 1. After some word has been rejected, the command "Wn"
 will cause dictionary number n to be dumped.
 2. During the exit sequence, the command "In" will
 cause dictionary number n to be dumped.
In all five cases above, if n is either 0 or omitted, then
it will be taken as being 1. 
3.3 Caution!
There is no provision in SPELL for remembering which
dictionary numbers have been used. Therefore, it remains
the individual user's responsibility to remember the numbers
of all the dictionaries that he creates. (Forgetting the
number will mean that the forgotten dictionary can not be
dumped incrementally. The words in a forgotten dictionary
will still be available, but the only way to actually get
them dumped out is to dump the entire dictionary). 
3.4 Hint:
In the course of correcting a file, it is likely that you
will be asked about words which you wish to have accepted
during this file, but which you don't wish to have saved in
your incremental dictionary(s). In these cases, simply
insert them in a "throwaway" incremental dictionary which
you don't bother to dump when you're finished. 
4.0 ABNORMAL CONDITIONS
While the program is running it is possible that certain
abnormal conditions may obtain. The usual response of the
program is to type some sort of error message. The
following is a list of the error messages in SPELL, with an
indication of the severity of the error. 
 Illegal dictionary entry: <word>
 This error occurs if an entry in a dictionary file
 exceeds 40 (decimal) characters. The word is

 Page 10
 ignored.
 0 LENGTH WORD AT HASHCP
 Somebody just asked to compute the hash address of
 an empty word. The program continues, but there
 is a possibility of error.
 HASHING ERROR
 Somebody asked for the hash address of a word that
 doesn't begin with letters or apostrophe as the
 first two characters. This is a fatal error; the
 program halts.
 DEVICE DATA ERROR (OUTPUT)
 This message means that while writing a file,
 something screwed up. The program halts.
 DEVICE ERROR (INPUT)
 The input file is screwed up in some way. The
 program halts.
 Internal confusion in the spelling checker.
 Called from location <loc>.
 The spelling checker has discovered a (possible)
 bug in itself. The program halts, but the user
 may type CONTINUE. Please note the location
 mentioned and the circumstances that evoked the
 message.
 Dictionary number too large, Maximum is 30.
 This message means that the user attempted to
 select for insertion or dumping a dictionary
 beyond the range of allowed numbers. The user
 will get another chance to do the right thing.
 Unrecognized switch.
 The user asked for an unknown switch. He must
 repeat the entire switch specification again. 
The following messages occur only in the non-TENEX version:
 Illegal Character in Scan.
 This is a message from the routine that reads file
 names. You will be asked to try retyping the
 name.
 File not Found. <filename>
 The indicated file could not be found. The user
 gets to specify some other file. 
 Enter failed on: <file name>
 An enter uuo failed while trying to select the
 indicated file for output. The user may specify
 another name. 
 Open Failed on Device <dev>:
 You asked for a device that doesn't exist or isn't
 available to the program. You will get a chance
 to ask for something else. 
 Insufficient Core Available.
 SPELL requires more core while expanding the
 dictionary but none is available. The program
 exits.

 Page 11
5.0 INTERNAL WORKINGS
5.1 Data Structures - Hashing Function
The data structure is the heart of the program, and any
efficiency in the program operation is due primarily to this
choice of data structure. The data structure is basically a
hash coding scheme where dictionary entries are accessed by
both their alphabetic order and by their length. There is a
base table that contains 26 * 26 * 10 halfwords; this table
gives anchors for 6760 chains. Each chain contains exactly
all words with the same two first letters and some given
length. To be precise, the hashing function is:
(L1*26+L2)*10+min(WL-2,9), where L1 and L2 are numeric
representations of the first and second letters (A=0, B=1,
... Z=25, and apostrophe also is 25), and WL is the length
of the word in characters.
This scheme was chosen since it provides both an efficient
way to probe the dictionary and a quick way to select a
small subset of all words that are close to a given input
word.
5.2 Data Structures - Dictionary Entry Format
Entries are added to the appropriate hash chain by the
INSERT subroutine. Entries are added to the head of the
chain, saving the time and effort of searching to the end of
the chain. This scheme means that the last item entered on
a chain is the first item seen by a search. The format of
the entry is given by:
Word 0: xwd flags,nextlk
Word 1: 5 bit representation
Word 2: 5 bit representation
 ...
There are precisely 1+ceiling(WL/7) machine words used for
each dictionary entry. WL is the length of the entry in
characters. Nextlk is the pointer to the next entry in the
list, or zero if this is the last in the chain. The left
side contains flags; bits 13-17 specify the incremental
dictionary number (0 for main, 1-30 for incremental
dictionaries, and 31 for misspellings). One can imagine
that bits 5-12 could be used to store semantic information
about the entry. The unused bytes in the last word of an
entry must be zero, since they are used to stop the routine
that converts the five bit to 7 bit. 
Misspellings are entered in the main dictionary with the

 Page 12
incremental dictionary number set to decimal 31. If a word
is marked as a misspelling then the word preceding the flags
and link contains a pointer to the flag and link word of the
correction. Misspellings may be deleted before the full
dictionary is dumped, or whenever SPELL asks about saving
the core image. The space obtained by deleting misspellings
is not reutilized at present, although that feature could be
added without great difficulty. 
5.3 Spelling Correction Heuristics
There are four kinds of errors that the program attempts to
correct:
 1. one wrong letter.
 2. one missing letter.
 3. one extra letter.
 4. two transposed letters.
For a wrong letter in the third or subsequent character, all
words that are candidates must exist on the same chain that
the suspect word hashes to. Hence, each entry on that chain
is inspected to determine if the suspect differs from the
entry by exactly one character. This is accomplished by an
exclusive-or (XOR) between the suspect and the dictionary.
Then a JFFO instruction selects the first non zero byte in
the XOR. This byte is zeroed and if the result is all zero
then the dictionary word differs from the suspect in only
one letter. All such words are listed at CANDBF, where they
can be inspected later. 
For a wrong letter in the first or second character, the
program tries varying the second letter through all 26
possible values, searching for an exact match. Then all 26
possible values of the first letter are tried, after setting
the second letter to its original value. This means that 52
more chains are searched for possible matches. 
To correct transposed letters, all combinations of
transposed letters are tried. There are only WL-1 such
combinations, so it is fairly cheap to do that. 
To correct one extra letter, wl copies of the word are made,
each with some letter removed. Each of these is looked up
in the dictionary. This takes WL searches. 
To correct one missing letter, WL+1 copies of the word are
made, each time inserting a null character in a new position
in the suspect. The null character is never part of any
word, so the suspect word augmented by an embedded null can
be thought of as a word with one wrong letter (the null)
then the algorithm for matching one wrong letter is used.

 Page 13
If the first character is omitted, all 26 possible first
characters are tried. Also, 26 more words are formed by
varying the second character in case that had been omitted.
6.0 ASSEMBLY & LOADING INSTRUCTIONS
There are three assembly time switches, TENEX, STANSW and
SANSW. If STANSW is set then there are SIXBIT ppn's and the
SWAP UUO; if STANSW is zero, then normally there are octal
ppn's except if SANSW is set, in which case, there are
decimal ppn's. If the TENEX switch is set it overrides the
others and TENEX style file names are used. Compile the
program using MACRO or FAIL and load it. The non-TENEX
versions will produce a sharable high segment and a
non-sharable low segment. When you start the program the
first time after loading, it will demand a dictionary. This
dictionary will be read in as dictionary zero, which is the
built-in dictionary for future use. In the non-TENEX
versions dictionary zero is read into the sharable high
segment. Use the largest dictionary you have which meets
your storage limitations. The file SPELLD.ALL is
recommended.
When dictionary zero is loaded, SPELL will ask "Do you wish
to save this core image?" Answer "Yes" and save the
resulting core image. On non-TENEX systems, do an SSAVE to
produce a sharable high segment containing dictionary zero.
There are various other assembly switches, but they default
to reasonable settings, and you meddle with them at your
peril. 
7.0 POSSIBLE EXPANSIONS
No program that is still in use is complete. The following
paragraphs include suggestions of possible future work in
this area. 
For non-paged systems, the main dictionary (0) should be in
the sharable segment and the private additions in the
non-sharable segment. This would require adjusting the hash
chains to point across the gap.
The dictionary should be expanded to include all suffixes
for every word. There is a feature that strips suffixes for
the purpose of finding the stem of the word in the
dictionary, but this heuristic is error prone and
incompatible with later attempts to correct the word.

 Page 14
If semantic information were included in the dictionary, it
could help guide the selection of a correction. 
Either of these would require a major restructuring of the
program, since the dictionary would no longer fit in core.
Probably, the dictionary should be kept on the disk, with a
data structure similar to the one used in core, but arranged
to keep each hash chain in the minimum number of disk pages,
so that searches through the dictionary can be made
efficient.