                  A User Manual for Link Grammars
                         Davy Temperley

CONTENTS

1. The logic and notation of link grammars
2. Other general features of the parser
3. Special features of the dictionary
4. Coordinating conjunctions
5. Post-processing


1.THE LOGIC AND NOTATION OF LINK GRAMMARS

THE BASIC IDEA.  Think of words as blocks with connectors coming
out. There are different types of connectors; connectors may also
point to the right or to the left. A left-pointing connector connects
with a right-pointing connector of the same type on another word. The
two connectors together form a "link".  Right-pointing connectors are
labeled "+", left-pointing connectors are labeled "-". Words have
rules about how their connectors can be connected up, that is, rules
about what would constitute a valid use of that word. A valid sentence
is one in which all the words present are used in a way which is valid
according to their rules, and which also satisfies certain global
rules.

WORD RULES. A simple dictionary entry would look like this:

	blah: A+;

This means that if the word "blah" is used in a sentence, it must form
an "A" link with another word; that is, there must be another word to
the right of it with an "A-" connector. Otherwise the sentence is not
valid. The expression following the colon is the "linking requirement"
for the word.

A word may have more than one connector that has to be connected. This
would be notated as

	blah: A+ & B+;

A word may have a rule that either one of two (or one of several)
connectors can be used, but exactly one must be used. In the
dictionary, we notate this as

	blah: A+ or B-;

This means that if the word can make either an "A" link to the
right, or a "B" link to the left, its use in the sentence is
valid; but it must make one or the other, and it can not make both.

These rules can be combined. For example, consider the following
notation:

	blah: A+ or (B- & C+);

This means that the word must make either an "A" link to the
right, or a "B" connection to the left and a "C" link to the
right. No other combination will be valid.

Such expressions can be nested without limit, such as

	blah: (A+ or B-) & ((C- & A+ & (D- or E-)) or F+);

Some connectors are optional; this is notated with curly brackets.
For example:

	blah: A+ & {B+};

This means the word must make an "A" link to the right, and it can
make a "B" link to the right but does not have to. Curly brackets can
also be put around complex expressions, like

	blah: (A+ or B+) & {C- & (D+ or E-)};

A word can also make an indefinite number of links of the same type
to other words. For this, we use the "multi-connector" expression
"@". For instance, the word below could make any number of F links to
words to the right (but is not required to make any).

	blah: (A+ or B+) & {C- & (D+ or E-)} & {@F+};

(If a word has "@A+", with no curly brackets, it is required to make
at least one A+ link to the right; any others are optional.)

The ordering of elements in the "connector expression" is important.
What that dictates is the relative closeness of the words that are
being connected to. The further to the left the connector name, the
closer the connection must be. For example,

	blah: A+ & B+;

This means that "blah" must make an "A" link to the right and a "B"
link to the right, and the word it makes the "A" link with must be
closer than the word it makes the "B" link with.

This only pertains, however, to connections in the same direction. For
connectors pointing in opposite directions, the ordering is
irrelevant. Therefore

	blah: A+ & B-;

means exactly the same thing as

	blah: A- & B+;

For that matter,

	blah: A- & B+ & C+ & D-;

means exactly the same thing as

	blah: B+ & C+ & A- & D-;

For "or" expressions, such as "A+ or B+", the ordering of the elements
is irrelevant.

A dictionary entry thus consists of a word, followed by a colon,
followed by a connector expression, followed by a semi-colon.  The
dictionary consists of a series of such entries. Any number of words
can be put in a list, separated by spaces; they will then all possess
the linking requirement that follows:

	blah blee blay: A+;

A connector name must consist of one or more capital letters (any
number may be used), followed by "+" or "-".

GLOBAL RULES.  As well as these "word rules", which are specified in
the dictionary, there are a few other global rules which control how
words can be connected.

First of all, links can not cross. For example, the following way of
connecting these four words (connecting "cat" to "dog" and "horse" to
"fish") would be illegal. The parser simply will not find such
linkages.

                       +------------+
                 +---- | -----+     |
                 |     |      |     |
		cat   horse  dog  fish

This is the "crossing-links" (or "planarity") rule. Secondly, all 
the words in a sentence must be indirectly connected to each
other. Therefore the following way of connecting these four words
would be illegal (if it was the entire linkage).

                 +-----+     +----+
                 |     |     |    |
		cat  horse  dog  fish

This is the "connectivity" rule.  A valid sentence is therefore one
which can be linked up in a way that a) all the words are used in a
way that satisfies their linking requirements; and b) the crossing-
links and connectivity rules are not violated.

RUNNING THE PARSER. To run the parser, you must have all the parser
files (all the ones in the ftp directory ending in ".c", ".o", and
".h", as well as "Makefile") in the same directory. You must also have
a dictionary in the directory, containing the words that can be used
and their connector expressions.  Then run the parser by typing "parse
[dictionary]" (whatever the name of your dictionary file is). The
parser will then ask you to type in sentences. It will tell you if the
sentence you type in has a valid linkage or linkages, given the
dictionary it is using, and it will output the linkage that it finds,
showing the words that are linked together and the type of link
between them. So, in the following linkage, the parser has found a "D"
link between "the" and "dog", and an "S" link between "dog" and "ran".

         +-D-+-S-+
         |   |   |
	the dog ran

To exit the parser, type "!quit".

The parser will do an "exhaustive" search for linkages; it will
generate all valid linkages. It will begin by displaying the
lowest-cost linkage it finds. Other linkages may then be seen, one at
a time, by typing RETURN.  (The ordering of the output is determined
by the cost system: see "COST SYSTEM".)

LINK GRAMMAR IN RELATION TO OTHER SYSTEMS. The structure assigned to a
sentence by a link grammar is rather unlike any other grammatical
system that we know of (although it is certainly related to dependency
grammar).  Rather than thinking in terms of syntactic functions (like
subject or object) or constituents (like "verb phrase"), one must
think in terms of relationships between pairs of words. In the
sentence below, for example, there is an "S" ("subject") relation
between "dog" and "has"; a "PP" (past-participle) relationship between
"has" and "gone"; and a "D" (determiner) relation between "the" and
"dog".

       +-----Ds-----+          
       |     +---A--+-Ss-+-PP-+
       |     |      |    |    |
      the black.a dog.n has gone 

Where possible, we try to give link-types names that have mnemonic
significance in this way.

It may be seen, however, that parts of speech, syntactic functions,
and constituents may be recovered from a link structure rather
easily. For example, whatever word is on the left end of an "S" link
is the subject of a clause (or the head word of the subject phrase);
whatever is on the right end is the finite verb; whatever is on the
left-end of a D link is a determiner; etc.. As for constituents, the
verb phrase of a clause contains whatever words are on the right end
of an S link (that is, whatever words can be traced through the link
diagram by starting at the S link and tracing to the right); the
subject phrase consists of words on the left end of the S link (but
not tracing through links such as "C" and "W" which lead out of the
clause), and so on. Therefore, if it was desirable to identify
traditional syntactic elements like constituents and syntactic
functions from the linkage structure, we think it would be quite
possible to do so.

If you wish to get a better of understanding of this particular
grammar, we suggest the following. Try inputting sentences to the
parser that seem simple or basic to you. See what kind of links the
parser assigns to them (probably things like S, D, O, MV, A, and
P). Look these up in the "Guide to Links" file, and try to understand
how they work. Then continue with more complex sentences. (A simple
version of the grammar is explained in our "Link Grammar Tech Report",
available over ftp. However, some aspects of this simplified grammar
are now obsolete.

2. OTHER GENERAL FEATURES OF THE PARSER

CONNECTOR SUBSCRIPTS.  In general, a connector may only link to
another one with the same name, i.e., the same string of capital
letters.  However, there is another way of controlling how connectors
may link to each other, using connector subscripts.  A subscript is a
lower-case letter following a connector-name, like "Ss+". An "Ss+"
connector can connect with an unspecified "S-" connector, or an "Ss-"
connector, but not with an "Sp-" connector.

Connector types may have multiple subscript characters, such as
"Spa+". An "Spa+" can connect with an "S-", an "Sp-", or an "Spa-",
but not with an "Ss-" or an "Ssa-" or an "Spb-".

An "*" subscript type is a "wildcard" that can connect with anything.
Therefore, an "S*+" is exactly the same as an "S+". An "S*a+" can
connect with an "S-", an "Ss-", an "Sp-", or an "Ssa-", but not with
an "Ssb-".

MACROS. It is possible to define a single symbol as a longer connector
expression, and then use that symbol to refer to the longer expression
in the dictionary. To do this, simply choose a name for the longer
expression, and surround it with angle brackets (<>). Then treat it
like a word in the dictionary; list the name, then a colon, then the
connector expression that it should stand for. For example, we define
"<noun-main-s>" in the dictionary as follows:

	<noun-main-s>: (Ss+ & <CLAUSE>) or SIs- or Js- or Os- or 
		({[Bsj+]} & Xd- & Xc+ & MX-);

We then use this symbol in many other actual word definitions.

We use many of these macros in the dictionary, to reduce 
redundancy; there are many connector expressions that are used over
and over in longer expressions. Here are a few common ones:

	<noun-main-...>: the "main" connectors for nouns,
	used to link them to the rest of the sentence (as
	subject, object, etc).
		
	<noun-sub-...>: the "sub" connectors for nouns,
	used to link them to modifiers like prepositional
	phrases and relative clauses.

	<verb-...>: These macros are for verbs; they distinguish
	different forms of the same verb. That is, they contain
	connector types - like S-, PP-, etc. - that distinguish
	different forms of the same verb. <verb-s> is for singular
	verbs, <verb-pp> for past participles, <verb-sp,pp> for
	forms which are both simple past and past participle, etc..

	<vc-...>: These macros are for verb complements; they 
	stand for different complement expressions. Some verbs
	can connect to a direct object, using O+; some can connect to 	
	an infinitive verb, using TO+; and so on. 

WORD-FILES. The most basic way to write the dictionary is to list all
the words in a particular category, followed by a colon, followed by
their connector expression. There is another way, however. One can put
all the words in a category in a file, choose a name for the file, and
put that filename in the dictionary in place of the list of
words. When listed in the dictionary, the filename must be preceded by
a slash (/).
	
Here are the word files that are in use at the moment:

words.n.1	singular countable (i.e. not mass) nouns
words.n.2.s	plural nouns ending in "s"
words.n.2.x	plural nouns not ending in "s"
words.n.3	mass nouns
words.n.4	nouns that may be mass or countable
words.n.c.1	singular nouns that are sometimes capitalized	
words.n.c.2     plural nouns that are sometimes capitalized
words.n.p	proper names that are also ordinary words when
			not capitalized
			(see "CAPITALIZATION" for explanation)

	(In the following verb files, the final number indicates the 
	verb form. ".1" is for infinitive-plural forms, ".2" is for 
	singular forms, ".3" is for simple-past / past-participle 
	forms, ".4" is for present participles, ".5" is for gerunds. 
	On intransitive verbs, the present participle and gerund 
	expression are combined into a single dictionary entry.)

words.v.1.(1-4) intransitive verbs
words.v.1.p	special two-word passives ("lied_to_", "paid_for")
words.v.2.(1-5) optionally transitive verbs
words.v.4.(1-5) transitive verbs
words.v.5.(1-4) intransitive verbs that may form two-word verbs
			with particles like "up" and "out"
words.v.6.(1-5) optionally transitive verbs that may form
			two-word verbs
words.v.8.(1-5) transitive verbs that may form two-word verbs
words.v.10.(1-5)verbs that may be used in quotation expressions,
			like "said" ("John is here, he said").

words.adj.1	ordinary adjectives, with no special complements
words.adj.2	ordinary comparative adjectives (e.g. "bigger")
words.adj.3	ordinary superlative adjectives (e.g. "biggest")
words.adv.1	ordinary manner adverbs ("quickly", "angrily")
words.adv.2	ordinary clausal adverbs ("fortunately")
words.adv.3     adverbs like "chemically"

words.y		common year numbers ("1990", etc.)
words.s		US state names and abbreviations

WORD SUBSCRIPTS. A single word can be given several different
dictionary entries. To do this, the entries must be distinguished
by giving the words different subscripts. Words may be followed
by a subscript such as ".n". For example:

	run.n: A+ or B+...
	run.v: C+ or D+...

The parser starts at the right end of every string of characters. 
Any sequence of letters to the right of the right-most period in the
string will be considered the subscript. (Periods at the end of a 
string are simply considered part of the string; see "PUNCTUATION"
below.) 

In searching for linkages, the parser will consider each entry for
the word as a different word, and will generate all linkages found
for all entries. The subscript is shown in the display, thus indicating
which entry the parser chose for a particular linkage.

The main word subscripts we use are ".n" for nouns, ".v" for verbs, ".a"
for adjectives, and ".e" for adverbs.

THE COST SYSTEM.  We have a system for assigning a cost to a
linkage. This allows the parser to express preferences among the
linkages it finds. The cost system uses square brackets ("[" and
"]"). If a connector, or a series of connectors, is surrounded by
square brackets, it is assigned a cost. The amount of cost is equal to
the number of square brackets on each side: [A+] will receive a cost
of 1; [[A+]] will receive a cost of 2; etc.. The parser uses this cost
as a criterion for deciding which linkage to output first; it outputs
them in order of cost (i.e., lowest cost first).

(Given several linkages of the same cost level, the parser has certain
heuristics for choosing the best parse, i.e., the one to output
first. It prefers the linkages in which the total length of the links
is lowest; and, given a linkage with conjunctions, it prefers a
linkage where the length of the conjoined word-lists are similar.)

THE TWO-STAGE SYSTEM.  Related to the cost system is a two-stage
system. In working on expanding the parser's coverage, we noticed that
there were certain constructions which came up now and then but which
were relatively rare. Thse constructions could be added to the
dictionary; but then the parser was spending a lot of time searching
for rare constructions, even when perfectly common constructions could
be found.  It seemed sensible to devise a stage system, where the
parser first only considers common constructions; if none can be
found, it then considers less common ones. For this we use the cost
system, explained above. If a connector or connector expression has a
cost of 2 or more (i.e. if it is surrounded by 2 or more pairs of
square brackets), it is considered a stage 2 connector, and is only
considered at stage 2. The link index explains which uses of
connectors that are stage 2.

THE NULL-LINK SYSTEM.  Another recent addition to the system is the
"null-link" system. This effectively allows robust parsing: that is,
it allows the parser to assign some structure to a sentence even when
it cannot fully interpret it.  Basically, if the parser cannot parse a
sentence normally (that is, if it cannot find any valid linkages), it
tries ignoring one word in the sentence. It finds all the linkages it
can, ignoring just one word (some linkages may ignore one word, some
may ignore another). This is "null link stage 1". Failing that, it
then attempts to find linkages ignoring 2 words. This is "null link
stage 2". Failing that, it will continue to increment the number of
null links, until it finds some valid linkages; it will then output
all the linkages found at this stage, and stop. There maybe some
cases where it cannot find a valid linkage even if it ignores _all_
the words in the sentence; in this case, it simply gets to "null link
stage N" (where N is the number of words in the sentence), and then
gives up. 

(Note that the null-link systems respects post-processing. It keeps
incrementing the number of null-links until it finds linkages that
pass post-processing. For example, if linkages are found at null-link
stage 0, but they all fail post-processing, the parser will decide
that no valid linkage has been found and will proceed to null-link
stage 1.)

The null-link system can be turned on or off by typing the command
"!null".

BATCH-MODE. It is possible to make a file of sentences, and run them
through the parser all at once. Simply create a file, with one
sentence on each line. When running the parser, type

	parse [dictionary name] -batch < [filename]

The user can indicate which files in a batch file should be rejected,
and which should be accepted. (Batch-mode does not work with
null-links; that is, it assumes that null-links have been turned off.)
If a sentence should be rejected, precede the sentence with "*" in the
file. If it should be accepted, precede it with nothing. If it should
be accepted at stage 2 only, precede it with "-".  After processing a
batch file, the parser will then print the number of errors in the
file: i.e., the number of sentences on which its judgments differ with
the judgments indicated by the symbols.

When running batch-mode, the parser will ordinarily output only the
number of errors it makes. Thus it is primarily useful for checking
sentences, to see if a particular dictionary produce the desired
results (or to make sure that no earlier work has been broken by
recent changes).  If one begins the batch file with the command
"!echo", the parser will also output the sentences as it parses them,
as well as display information for the sentences on which its
judgments disagree with the user's. (In "!echo" mode, the right and
left walls will not be shown.)

COMMANDS AND VARIABLES. It is possible to modify the running of the
parser in various ways, while running it, by typing in certain
commands. The basic commands can be seen by typing "!help". Others are
listed under "!variables".  Many of these are self-explanatory.  For
example, "!display" changes the width of the parser display; "!null"
turns the null-link system on or off.

A few features deserve special mention. Words can be added to the
dictionary while the parser is running. If you type "!blah=bleck", the
word "blah" will be added to the dictionary under the same category as
the word "bleck". If you make changes, and then want to save them to
your dictionary files, type "!save" before typing the usual "!quit".
If you do not wish to save them, exit by typing control-C.
(Unfortunately, saved changes can only be made to word-files, not to
the main dictionary.)

One useful display feature is "!bad". The usual running of the parser
is that it will display the linkages in order of cost (lowest-cost
ones first); additional linkages after the first one can be seen one
at a time by typing RETURN. If the "!bad" variable is toggled, the
parser will, instead, simply output all the linkages it finds, including
those that fail post-processing. 

3. SPECIAL FEATURES OF THE DICTIONARY 

CAPITALIZATION. The parser respects capitalization: that is, the use
of upper- and lower-case letters. If a string is listed in the
dictionary beginning with a capital letter, then a word that is
inputted will only match it if it has the same capitalization.  (The
same with strings with capital letters in the middle, although this is
probably of little use.) However, there are a few special cases here.

There is a general category in the dictionary called "CAPITALIZED_WORDS".
This is the default category for words whose first character is 
capitalized. Any such word which is inputted which is not explicitly
listed in another category will be assigned to this category. This
is of course useful, since most capitalized words are names which are 
grammatically all the same.

A special situation occurs with words at the beginning of the
sentence.  If a sentence-initial word has an uncapitalized first
letter, it is treated in the normal manner. If it is capitalized, the
parser will first look to see if it is listed in the dictionary as a
either a capitalized word or an uncapitalized word.  If not, it will
then assign it to the generic "CAPITALIZED_WORDS" category. (If the
word is listed both as a capitalized word and an uncapitalized one,
the parser will try to use it in both ways.  Because there are certain
words which are also common names, like "Will" and "Rob", we have
created a special category for them, so that when they are used
sentence-initially, they will be recognized as possible names.)

HYPHENATED EXPRESSIONS. The dictionary also contains a special
category called "HYPHENATED_WORDS". If a string contains a hyphen, and
it is not listed in the dictionary, the parser will assign it to the
category "HYPHENATED_WORDS". This is, again, useful, since hyphenated
words are used somewhat "productively", and it would be very difficult
to list them all.

NUMBER EXPRESSIONS. The dictionary contains a category "NUMBERS".
Any numerical expression -- that is, a string consisting entirely
of numerical characters -- will be assigned to this category unless
it is explicitly listed elsewhere in the dictionary. (The string may
also contain a period, i.e., a decimal point.)

THE UNKNOWN WORD. Recently we have installed a dictionary feature
called the unknown word. A category can be defined using the string
"UNKNOWN-WORD.x", where x is any subscript. If a word beginning with a
lower-case letter is typed in that is not recognized, it will be
assigned to that category. The word is then displayed with a
question-mark in brackets, like "blah" below:

	  +-----Wd----+             
	  |    +---D--+--Ss--+-Pp+  
	  |    |      |      |   |  
	///// the blah[?].n is here 

Several different unknown word categories may be generated, labeled
with different subscripts: for example, corresponding to nouns, verbs,
and adjectives and adverbs. (These are the four categories we use,
labeled .n, .v, .a, and .e, respectively.)  The parser will search for
all linkages that can be found using each entry. If it only finds a
linkage for the "noun" category, then the output will show the unknown
word labeled ".n": in effect, the parser is then guessing that the
word is a noun.

THE SORTING OF UNKNOWN STRINGS. Notice that the parser must make
decisions about how to sort unknown strings into these categories, and
the ordering of the decisions is important. At the moment, the parser
proceeds as follows:

	1. Look up string in the dictionary as is. If there is an
		exactly matching string in the dictionary, use that.
	2. If the string contains a hyphen, assign to the category
		"HYPHENATED-WORDS".
	3. If the string begins with a capital letter, assign it to
		the category "CAPITALIZED-WORD".
	4. If the string consists entirely of numbers, assign it
		to the category "NUMBERS".
	5. If the string fits none of these descriptions, assign it
		to the category "UNKNOWN-WORD".
	6. If none of these alternatives are available, the parser will
		say: "the word "[whatever]" is not in the dictionary".
	
(Step 4 is mutually exclusive with steps 2 and 3;  therefore their 
relative ordering is unimportant.)

The category "UNKNOWN-WORD" can simply be removed from the dictionary.
In this case, the parser will proceed as shown above, except it will
skip step 5. 

THE WALL(S).  It proved to be useful to imagine that there was a dummy
word at the beginning of every sentence. We call this "the wall". The
wall has a linking requirement like any other word; it is listed in
the dictionary under "LEFT-WALL". If this entry is included in the
dictionary, the wall will be automatically inserted at the beginning
of every sentence. Because of the connectivity rule, it is then
necessary for the wall to be linked to the rest of the sentence in
order for the sentence to be valid.

Recently, we installed a "right-hand wall", which is similar
to the original wall at the other hand of the sentence. This is only
needed for certain punctuation phenomena. In most sentences, we use
a special "RW" connector to simply connect the left hand wall to the 
right hand one. The right-wall's dictionary entry is "RIGHT-WALL".

Either wall can be deactivated by simply removing the "LEFT-WALL"
or "RIGHT-WALL" entry from the dictionary. (The right-wall can also
be hidden from the display in cases where the usual "RW" link is
being used; to do this, type "!walls".)

IDIOMS. A string of words can be defined as a single dictionary 
entry. To do this, simply join the words together with underbars:

	a_la_mode: A+ or B-;

Most idioms can be interpreted either as a single "idiom" or as a
string of words (for example, "in question"). In this case, the parser
will find all linkages with both interpretations.

PUNCTUATION.  The parser is capable of handling a variety of
punctuation symbols. There are two issues to be discussed here.
One is the listing of symbols in the dictionary; the other is the
way they are "read" by the parser when they are used in sentences.

Punctuation symbols can be listed in the dictionary just like words,
and given ordinary linkage expressions. The same is true for strings
containing multiple punctuation symbols or a mixture of letters and
punctuation. The problem here is that certain punctuation symbols are
also used as the "syntax" of the dictionary: colons, semi-colons,
ampersands, etc.. Our solution to this is as follows: when listing
these special characters, or a string containing them, one must put
them in quotation marks:

	";": A+ or B-;
	"+": C+ or D-;

(The special characters that must be treated this way are precisely
those which are used in the dictionary in a "syntactic" way: "(", 
")", "{", "}", "[", "]", "@", "%", "&", "*", "+", "-", "/", "<",
">".)

When punctuation symbols are used in sentences, they will be used
in linkages according to the connector expressions listed in the
dictionary, in the normal way. There is a difference, however. It
may be noted that although many punctuation symbols are similar to
words in the ways they are used, they are often not separated from
preceding or following words by spaces. To handle this, before the
parser begins parsing the sentence, it goes through and inserts
spaces before or after certain punctuation symbols (but only when
the symbol occurs at the end of a string). It will then parse the
sentence as normal; since the punctuation symbol now has spaces on
either side, it will naturally be treated as an independent word.
The symbols that are "stripped" in this way are as follows:

	"Right-stripped characters" (when these occur at the right
	end of a string, the parser will insert a space to the left): 

	,  .  !  ?  %  )  '  's

	"Left-stripped characters" (when these occur at the left end
	of a string, the parser will insert a space to the right):

	$  (

Note that the "right-stripped" list includes the string "'s". This
is useful for possessives ("John's") and contractions ("he's", 
"it's"). A further point: the parser will perform this stripping
process on a string repeatedly, if necessary, perhaps stripping
off several characters. For example, it will convert the first 
sentence below into the second:

	John, a professor (who got a raise of 5%), is here

	John , a professor ( who got a raise of 5 % ) , is here

There is one difference between right-stripped and left-stripped
characters. Right-stripped characters can also be used at the
right end of a string defined in the dictionary; for example, one
could actually define "Mrs." or "it's" as ordinary dictionary 
entries. Before stripping off right-stripped characters, the
parser will make sure that the string is not present in the
dictionary as it appears.

If any of these symbols are used in the middle of strings, they
will be treated just like any other symbols. (And if they are used
in sentences in undefined strings, they will be treated as some 
kind of unknown string.) (Note that periods are special, however;
if the string ends with a period followed by a letter, this will
be interpreted as the word subscript, not part of the string. See
WORD-SUBSCRIPTS above.)

One exceptional case is quotation marks. Quotation marks may not be
defined in the dictionary; and they are simply ignored when they are
used in sentences.  This is sufficient to handle most uses of quotes;
generally, the presence of quotes does not affect the well-formedness
of sentences, and it is often only subtlely affects meaning. However
there are a few constructions, such as the last pair of sentences
below, which seem to be only correct when quotes are included.

	"John is leaving," she said
	John is leaving, she said
	I am meeting with my "advisor" today
	I am meeting with my advisor today
		
	She said, "John is leaving".
	?She said, John is leaving.

We are unable to control such usages at the moment.

4. COORDINATING CONJUNCTIONS

Coordination constructions do not fit naturally into the framework of
link grammars.  We have devised a method for automatically
transforming the given link grammar into another one that captures the
desired phenomena. This system is hard-wired in, and cannot easily be
modified by the user. However, it has proven to be effective in
handling the vast majority of uses of conjunctions.  Our discussion
will focus on the word "and", although the ideas apply to the use of
"or", "either-or", "neither-nor", "both-and", and "not only - but".

USES OF CONJUNCTIONS. We begin by proposing a simple definition of the
use of "and" within the framework of link grammars.  Then we'll
mention a few problems with the definition, and suggest an
improvement.  The second definition is the one used in our system.  It
has drawbacks, but on balance it has proven to be remarkably
effective.

Given a sequence S of words containing the word "and", a
"well-formed and list" L is a contiguous subsequence of words
satisfying these conditions:

1. There exists a way to divide L into components (called "elements"
of the well-formed "and" list) such that each element is separated
from its neighboring elements by either a comma or the word "and" (or
both).  (The comma and the "and" are not part of the element.)  The
last pair of elements must be separated by "and" (or a comma followed
by "and"). For example, in "The dog, cat, and mouse ran", "dog",
"cat", and "mouse" are the elements of the well-formed "and"
list "dog, cat, and mouse".

2. Each of the sequences of words obtained by replacing L (in S) by
one of the elements of L is a sentence of the link grammar.

3. There is a way of choosing a linkage for each of these sentences
such that the set of links outside of the "and" list elements are
exactly the same in all of the sentences, and the connectors joining the
sentence with its "and" list element are the same.  In other
words, if we cut the links connecting the element to the rest of the
sentence, remove that element from the sentence, and replace it by one
of the other elements, then the cut links can be connected to the
element so as to create a valid linkage.

The sequence S is grammatical if each instance of "and" in it is
part of a well-formed and list.  

For example, consider the sentence "We ate popcorn and watched
movies on TV for three days." The the phrase "ate popcorn and
watched movies on TV" is a well-formed "and" list because it can be
divided into elements "ate popcorn" and "watched movies on
TV ", which satisfy all of the conditions above.  The following two
linkages show this.  Note that in both cases the element is attached
to the rest of the sentence with an "S" to the left and an "MV"
to the right.

          +-------------------MVp-------------------+-----Jp-----+     
  +-Wd-+Sp+--Os--+                                  |     +--Dmc-+     
  |    |  |      |                                  |     |      |     
///// we ate popcorn.n                            for.p three days.n  

                              +---------MVp---------+-----Jp-----+  
  +-Wd-+----------Sp----------+---Op---+--Mp-+Js+   |     +--Dmc-+  
  |    |                      |        |     |  |   |     |      |  
///// we                   watched movies.n on TV for.p three days.n


There is a major problem with this definition.  It contains no
requirement that the words of an element of an "and" list be connected
to each other, nor be related in any way (other than being
contiguous).  This allows many clearly ungrammatical sentences to be
accepted, and generates numerous spurious linkages of correct
sentences.  For example, it would imply that "I like the beer John and
wine Harry drank" is a valid sentence.

We have two techniques to limit the set of sentences deemed
grammatical by this rule.  The first is to simply restrict the types
of connectors that can connect the element of the "and" list to the
rest of the sentence. The list of connectors allowed to do this is
contained in the list "ANDABLE-CONNECTORS" in the dictionary. (If a
connector type is included in this list, this means, in effect, that
several of them may be joined to a connector of the opposite type. So,
including "S-" on this list allows "John ran and skipped".)

The second method is to restrict the definition of a well-formed "and"
list.  Say that a well-formed "and" list is a "strict and list" if it
also satisfies the following condition: Each element must be connected
to the rest of the sentence through exactly one of its words.  (It may
use many connectors.)

This is the system that we have implemented. This logic of dealing
with conjunctions is reflected in the parser's output. A
sentence with conjunctions is outputted showing the sentence
split up into several sub-sentences:

                                  +---J--+
          +--------S----------+-MV+  +-D-+
          |                   |   |  |   |
        John, Dave, and Fred ran in the park

                                  +---J--+
               +---S----------+-MV+  +-D-+
               |              |   |  |   |
        John, Dave, and Fred ran in the park

                                  +---J--+
                          +S--+-MV+  +-D-+
                          |   |   |  |   |
        John, Dave, and Fred ran in the park

This implementation of conjunctions is "hard-wired" in, and cannot be
easily modified.  However, it covers the vast majority of uses of
coordinating conjunctions. It allows a wide variety of connector types
to be used with "and": nouns (whether as subject, object,
prepositional object, etc.), determiners, verbs, adjectives and
participles, prepositions, and adverbs. It also covers nested "and"
structures, like "The people and their sons and daughters were there."
And as well as handling "and", the system also handles the
conjunctions "or", "but", "either-or", "neither-nor", "both-and", and
"not only - but".

SOME PROBLEMS AND EXCEPTIONS.  There are a few problems and exceptions
to be discussed. Some of these are handled by the current system;
others are not.  One problem is sentences like the following:

        I gave Bob a doll and Mary a gun.  
        I know the problem Moscow created and failed to solve.

The former will be rejected since in "I gave Bob a doll", "gave" is
linked to both "Bob" and to "doll".  Thus, "Bob a doll" cannot be an
element of a strict "and" list.  In the second sentence, "Moscow"
needs to connect to "failed" and "problem" must connect to "solve", so
"problems to solve" cannot be an element of a strict "and" list.  This
phenomenon does occur (although rarely) in formal English, so we would
like to solve it.  The problem remains in our current system.

SUBSCRIPTS.  How should subscripts (on the connector names) be dealt
with?  When two or more connectors with different subscripts are
combined with "and", they may only connect to a connector that may
connect to all of them.  For example, consider the following
dictionary:

a             Ds+ 
the           D+ 
those         Dm+ 
cats dogs     Dm- 
cat dog       Ds- 

Among the determiners above only "the" can grammatically be allowed to
modify the "and" list "cats and dog".  This is because the only
connector which matches Dm- and Ds- is D+, not Ds+ or Dm+. This is the
solution we implement.

There is an exception to be handled here, however. The system we've
described so far would accept "the dog and cat runs", while rejecting
"the dog and cat run".  Both of these judgements are wrong because in
English whenever two singular subjects are " anded" together they
become plural.  We have incorporated this exception: "Ss+" connectors,
when "anded" together, may connect to an "Sp-", but not to an "Ss-".

Another problem arises with embedded clauses.  Consider the following
linkage of the sentence "I think John and Dave ran".

+-S-+--C--+-----S------+
|   |     |            |
I think John and Dave ran

                  +-S--+
                  |    |
I think John and Dave ran

This linkage is a combination of the following two sentences "I think
John ran" and "Dave ran".  This linkage should clearly be rejected.
Intuitively, the problem with this linkage is that the same "S" link
(the one between "and" and "ran") is being used to mean something that
"I think" (" John ran") and also something that is just a fact (" Dave
ran").  We have devised a system for detecting such patterns, using
post-processing.  As mentioned above, we handle conjunction sentences
by expanding the original sentence into several subsentences.  We then
compute the domain structure of the resulting linkage of each
sentence.  Finally, the original linkage is deemed incorrect if the
nesting structure of a pair of links descending from the same link
("e.g." the "S" links in the two sentences above) do not have the same
domain ancestry (are contained in the same set of domains). The
sentence then receives an "and" structure violation.

A few other problems should be mentioned. Sometimes adverbs are
used with conjunctions:

        He talked to Steve and, apparently, Fred
        *He talked to, apparently, Fred

As the second sentence shows, it is the conjunction that makes the
first adverbial use valid. We have no good way of handling this
construction. Secondly, there are some special uses of punctuation
with conjunctions. Sometimes, a comma is inserted both before and
after the final element in an "and" list (ex.1). And sometimes
semi-colons may be used instead of commas, particularly when the
and-list elements themselves contain commas (ex. 2).

        1. John, and Steve, are coming
        2. John; my advisor, Steve; and several other people are coming

Thus our system still needs some work in the area of conjunctions.

5. POST-PROCESSING

THE LOGIC OF POST-PROCESSING.  Besides conjunctions, there are certain
phenomena in English which the parser is incapable of dealing with in
its basic form. To solve these problems, we developed a
post-processing system, based on a concept we call "domains".  A
domain is a subset of the links that make up a sentence. After a
linkage has been found, the post-processing mechanism goes through the
linkage and divides the sentence up into domains based on the kind of
links that are present in the sentence. It then further divides the
links into "groups": sets of links which share a particular domain
membership.  It then applies rules which may declare the linkage
invalid based on the combinations of links present in a given group.

THE DOMAIN STRUCTURE.  A domain is started by a certain type of link;
we call this the "root link". Different types of links start different
types of domains. In most cases, the domain of a link consists of all
the links in the sentence that can be reached from the right end of
the root link, without tracing to the left through the root link
itself.  For example, assume that the C link in the following sentence
begins a domain. This domain will include the Sp link and the I, but
not the Ss and the O.

             +--C---+
         +-Ss+O-+   +Sp(e)+I(e)+        
         |   |  |   |     |    | 
	He told me they would  go

(The letters in parentheses indicate that the Sp and I link are in an
e-type domain.)  In general, this means that the domain of a link will
include all the links to the right of the root link. But not
always. In the example below, the C link starts a domain; but this
only includes the Ss and O links, not the Xc, CO, or Sp.

          +---------CO---------+
          +------Xc---------+  |
          +-C-+Ss(s)+O(s)+  |  +Sp+
          |   |     |    |  |  |  |
	After he   saw  us  ,  we left

In the case below, the domain started by the C link actually extends
back to the left of the root link, to contain the B link as
well. (This follows naturally from the way domains are defined. The B
link can be reached from the right end of the C link, without tracing
back through the C.)

         +---------Bsw(e)------+
         |   +---I---+         |
         |   +SI+    +-C--+S(e)+
         |   |  |    |    |    |
	Who do you think you  saw

If one wants a certain link type to start a domain, it must be
included in the list "domain_starter_links" in the file
"post-process.c". It must also be included in the following list,
("CDNP CDNP_array"), along with the name of the domain-type that it
starts.

As we saw above, it is possible for domains to be traced back to the
left of the root link. This tends to be prevented, however, by what we
call "restricted links". If a link is restricted, this means that if a
domain is being traced through it to the left, it will be traced no
further than the restricted link. So, for example, Bsw is a restricted
link. This means that, in the sentence above, the "e" domain started
by the C will contain the Bsw; but if there are any links which can be
traced from the left end of the Bsw (i.e., another link coming out of
the word "who"), they will not be included in the domain.

There are a few other complications in the way domains are generated.

1. The root link of a domain may or may not be included in the domain
it starts. Root links will not be included, unless they are listed in a
special list in post-process.c, "domain_contains_links".

2. Some domain-types are "bounded domains". This means that they
are not allowed to extend to the left of their root link at all
(even with a "restricted link"). If they do extend in this way
in a linkage, the linkage will be declared invalid. (See "C" in the 
Guide to Links for an explanation of how this is used.)

3. As well as ordinary domains, there are two special kinds. One kind
is "urfl" domains. As well as including everything that can be reached
from the right end of the root link, these domains include everything
that can be reached from the left end of the root link, tracing to the
right, underneath the root link (but not over it). In the example
below, the TOo link starts an "urfl" domain; as well as including the
"I" link, as a normal domain would, this domain also includes the O.

          +---TOo--+      
      +-S-+O(x)+   +I(x)+
      |   |    |   |    | 
      I asked him to   go

The final kind of domain is "urfl-only". These include ONLY links that 
can be reached from the left of the root link, tracing to the right
underneath the root link. In the case below, the S link starts an
"urfl-only" domain.

         +-------S-------+     
         +---O(d)---+    |
         +     +D(d)+    +-O+  
         |     |    |    |  |  
      playing the piano is fun 

"Urfl" and "urfl-only" domains are defined by the link-types that
start them. Any link that is added to the "urfl_domain_starter_links"
list will start an "urfl" domain; any link that is added to the
"urfl_only_domain_starter_links" will start an "urfl-only" domain.

Domains may be nested; a link may therefore be in several domains at once.
The domain membership of a given link can be shown in the following way:

  +----------------------RW---------------------+
  +--W-+-S+-P-+-MV-+-C+-S-+-C+-S+--P--+         |
  |    |  |   |    |  |   |  |  |     |         |
///// he got mad when I said I was leaving RIGHT-WALL 

               /////          RW      <---RW---->  RW        RIGHT-WALL
 (m)           /////          Wd      <---Wd---->  Wd        he
 (m)           he             Ss      <---Ss---->  S         got
 (m)           got            Pa      <---Pa---->  Pa        mad
 (m)           mad            MV      <---MVs--->  MVs       when
 (m)           when           Cs      <---Cs---->  C         I
 (m) (s)       I              Sp*i    <---Sp*i-->  S         said
 (m) (s)       said           Ce      <---Ce---->  C         I
 (m) (s) (e)   I              Sp*i    <---Spii-->  S*i       was
 (m) (s) (e)   was            Pg      <---Pg---->  Pg        leaving


(In the structure shown here, the domain structure is strictly
hierarchical; every domain that is partially inside another is
completely inside that domain. There is no a priori reason why domain
structure should always be strictly hierarchical; but we believe that,
given the current grammar, it always will be strictly hierarchical, at
least in sentences that are otherwise judged as acceptable by the
parser.)

GROUPS.  The domain structure is really a means to creating a more
useful kind of structure. This is the "group".  A "group" of links is
the set of links that have the same domain membership. In the above
example, then, the "Spii" and "Pg" are part of the same group. The
"Ce" and "Sp*i" are in another group. The Ce and Pg are not in the
same group.  Groups correspond roughly to subject-verb expressions -
groups of links that are part of a clause, but not part of any
subordinate or dependent clauses within that clause. For example, in
the above case, "He got mad" is one subject-verb expression; "I said
..." is another, "I was leaving" is a third.

POST-PROCESSING RULES.  The domain structure thus divides a sentence
into groups of links.  This then allows us to enforce constraints on
the link-types that are in a subject-verb expression. This is useful
in cases where there are constraints on the combinations of links that
can be present in a clause, but the links may be separated (i.e., they
may not all connect to the same word), making the constraints
difficult to enforce using link logic. We do this by using certain
kinds of rules. One is the "contains_one" rule. This says that if a
group contains a link of a certain type, it must contain exactly one
link of another type. Another kind of rule is the "contains_none"
rule.  This says that if a group contains one kind of link, it may not
contain any of a certain kind of link.

In each case, we have a "triggering" link-type: a link that triggers
the rule, and enforces a certain constraint. We also have a
"criterion" link-type: a link that defines the constraint (whether it
is "must contain X" or "may not contain X"). The triggering link for a
rule must be contained in a line of the form

    PP(!contains_one("X", Blah),"Warning!");

Where "X" is the triggering link-type, and "Warning" is a message that
will be outputted when the rule is violated. "Blah" is a symbol for
the criterion link or links for the rule. These are then listed in a
line of the form:

    static char * Blah[] = {"Y", "Z", NULL};

where "Y" and "Z" are the criterion links for the rule. A contains_one
rule means, "the group must contain exactly one of all the link-types
on this list". A contains_one rule means, "the group may not contain
any of the link-types on this list".

LINK-TYPE MATCHING IN POST-PROCESSING.  Link-type matching in
post-processing requires some explanation. In the dictionary,
subscripts are used to create sub-categories of connector-types. "Ss+"
will link with "Ss-" and "S-", but not "Sp-". The character "*" is
used as a wild-card; it will match to any character. An unsubscripted
connector name, like "S+", can thus be regarded as equivalent to
"S***...+".

The post-processor also requires a system for matching connector
types. While the linkage stage is looking at link-types on connector
expressions, post-processing is looking at the resulting link-types
that are formed when a linkage is complete.  (If an "Ss+" has linked
together with a "S*a-", what the post-processor sees is "Ssa".) It is
then comparing them to link-names listed in "post-process.c" (as
domain-starting links, as triggering links for rules, as criterion
links for rules, etc.). The link-type matching system used in
post-processing is similar to the linkage-level one, but it is a
little different. As mentioned above, from the dictionary's point of
view, any empty subscript places are considered to be filled with
"*"'s. Post- processing does the same thing. If it sees an "S" as a
link-type, it regards this as "S***...". But, rather than regarding
"*" as a wild-card, the post-processor regards it as just another
character. From the p.p.'s point of view, an "S" in a linkage does
_not_ match with an "Ss" in post-process.c; the "S" is regarded as an
"S*", and this is distinct from "Ss". Nor does an "S" in
post-process.c match with an "Ss" in a linkage. However, there is also
a wild-card character in post-processing; this is "#". An "S#" in
post-process.c does match with a "Ss" in the dictionary, as well as
with an "S". In short: in both the dictionary and post-process.c, an
empty subscript place will be regarded as a "*". But in the
post-process.c, the "*" is _not_ a wild-card character; the "#" is the
wild-card.

The specific uses of post-processing are fully explained in the
"Guide to Links".  For example, see "SF: filler-it"; "SI", and "MV:
comparatives".
