
                           Viterbi Decoder
                           ---------------
                      Linas Vepstas March 2013

This directory contains code for a prototype Viterbi decoder for Link
Grammar. It barely functions and is currently very slow.

Motivation
==========
The goal of providing this decoder is to present a flexible, powerful
interface for implementing high-level semantic algorithms on top of
the the low-level link-grammar syntactic parser, and, in particular,
for steering the parse based on high-level semantic knowledge.  This
allows the parser to move beyond being merely a syntactic parser,
and to become fully integrated with general semantic artificial
intelligence.

A less abstract list of expected benefits include:

 * Incremental parsing: the ability to obtain partial results after
   providing partial sentences, a word at a time.
 * Less sensitivity to sentence boundaries, allowing longer,
   run-on sentences to be parsed far more quickly.
 * Mitigation of the combinatorial explosion of parses.
 * Allow gramatically broken/incorrect chat dialog to be parsed;
   in general, to do better with slang, hip-speak.
 * Enable co-reference resolution and anaphora resolution across
   sentences (resolve pronouns, etc.)
 * Enable annotation of the parse graph with word-sense data, entity
   markers.
 * Allow richer state to be passed up to higher layers: specifically,
   alternate parses for fractions of a sentence, alternative reference
   resolutions.
 * Allow a plug-in architecture, so that plugins, employing higher-
   level semantic (AGI) algorithms can provide parse guidance and
   parse disambiguation.
 * Eliminate many of the hard-coded array sizes in the code.
 * Fix the word-count problem during spell-guessing. So, for
   example, if the mis-spelled word "dont" shows up in the input, it
   could be issued as one word ("done") or two ("do n't"). The
   current suffix-stripping/word-issuing algo cannot deal with this
   correctly. This is also an issue for the Russian dictionary, where
   the stem+suffix processing can generate variable word counts.


The data structures used to implement this resemble those of the
OpenCog AtomSpace.  All data classes inherit from a class called Atom
(which is an atomic predicate, in the sense of mathematical logic).
Atoms are typed; the two core types are Links and Nodes. Thus, all data
is represented in the form of a "term algebra" (aka the "Free Theory",
in the sense of model theory).  This structure allows all data to be
represented as (hyper-)graphs, which in turn makes the implementation
of graph algorithms easier to implement.  All these theoretical
considerations provide a natural setting for storing Viterbi state
information.  Put differently, this provide a generic, uniform way of
holding the various partly-finished parses, and effecting state
transformations on them.  

Making the internal state directly visible allows low-level syntactic
algorithms, as well as high-level, semantic algorithms to control parsing.
In other words, the intended use of the Viterbi decoder is to provide
a framework for parsing that should make it possible to integrate 
tightly (and cleanly) with high-level semantic analysis algorithms.
Thus, reference and anaphora resolution can be done using the same
graph structure as used for parsing; it should also allow graphical
transformations, such as those currently implemented in RelEx.

Since all of the data is represented dynamically (at run-time) by
these (hyper-)graphs composed of atoms, developing custom algorithms
to manipulate the parse becomes easy: there are no strange compile-time
structures to master.  All algorithms can access the data in a uniform,
common way.

One may argue that Viterbi is a more natural, biological way of working
with sequences.  Some experimental, psychological support for this can
be found here:
http://www.sciencedaily.com/releases/2012/09/120925143555.htm
per Morten Christiansen, Cornell professor of psychology.

Why "Viterbi"?
==============
The parser implemeneted here is called a "Viterbi decoder" because it is
inspired by (and vaguely resembles) the Viterbi algorithm famous from signal
processing.  A characteristic feature of that algorithm is that it maintains
a set of states in parallel. As each new bit is received, some of the states
become inherently inconsistent (e.g. because some checksum is violated),
while other new states become possible.  Once some certain number of bits 
have been received, the ones that can be consistently interpreted with the 
checksum constraints can be output. The process then repeats with each new bit
streaming in.  Likewise, the code here keeps a set of states in parallel;
each state is tied to some possible pending output that is consistent with
that state.  As each new word comes in, some of the states become invalid
and are discarded (because the word cannot link to that state). When all
states agree on some linkage, that linkage can be confidently output. 
As words flow in, the process is repeated.  If there are ambiguities, then
multiple parse alternatives can be output (this is rather unlike the
original Vioterbi algorithm, whose goal is to generate only one output,
the single most likely output).  For language parsing, however, it is
useful to convey parse alternatives to higher-order processes which can
make a final; determination (analogous to having a higher-order checksum
functioning within the Viterbi decoder).

I like this analogy because it is vaguely biological as well: or perhaps
I should say "neural net-ish".  The multiple, provisional states that
are kept around are sort of like the activation states of a feed-forward
artificial neural network.  But this is not very deep: the feed-forward
neural net looks like a Hidden Markov Model (HMM), and the Viterbi
algorithm is essentially an HMM algorithm.  No surprise!


Status
======
Currently, the parser can correctly parse many short sentences.  It
currently runs very slowly, as no pruning algorithms have yet been
implemented.  No probabilities or weights have yet been implemented.


Using the parser
================
To use this parser, you must first configure with the --enable-viterbi
flag:

   configure --enable-viterbi

Then, in the client, switch to the viterbi parser with the ! command
!viterbi  (If confused, try !help and !var for general help).

The vitest.cc file contains unit tests. Currently, it consists of 49
tests, and all of them pass.  More tests will be added.


Historical dates
================
Idea conceived: April 2008
Coding started: October 2012
First successful long sentence parse: March 2013
