Wednesday, September 14, 2011

Markov Fun

I recently implemented a Markov chain for an app I'm developing. Markov chains, as applied to k-grams, predict the next character based on the previous k-1 characters. The goal is to build a Markov model of natural language and use it for word prediction as the user enters text.

My implementation works by first "training" on a given text. For this purpose I used a speech I found online, given by Robert Sedgewick to his incoming intro CS class in 2001. During training, the code parses the text and records how frequently each unique sequence of k-1 characters is followed by each possible next character. This "frame" is called a k-gram. For instance, in the given text there are three distinct 6-grams starting with the 5-character sequence "come ": "come u" (once), "come t" (twice), and "come o" (twice). A k-gram is based on characters; an n-gram, words.
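The training pass can be sketched as follows (a minimal Python illustration; the app's actual code isn't shown here, and the `train` function name is mine):

```python
from collections import defaultdict

def train(text, k):
    """Count how often each (k-1)-character context is followed
    by each possible next character in the training text."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - k + 1):
        context = text[i:i + k - 1]          # the previous k-1 characters
        counts[context][text[i + k - 1]] += 1  # the character that follows
    return counts

counts = train("come up, come to, come to, come on, come on", 6)
# counts["come "] now records the observed successors:
# "u" once, "t" twice, "o" twice
```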

The chain is initialized with a seed to start predicting from. Below, I used "Welco". The code then builds a linear directed graph connecting states, where a state is just a k-gram, so neighboring states differ by a one-character slide. Each state can be seen as a reference frame, keeping a "snap-shot" of the most recent k characters. Thus, the Markov chain keeps a history of each "snap-shot".

Succeeding states are chosen pseudo-randomly based on the frequencies gathered during parsing. Once a specified number of states has been generated, the code stops and outputs the resulting text.

Here are 500 states of 7-grams (character-level k-grams):

Welcome to life.

As a research effort to the online, and we can absorb all knowledge.
However, we do not even having a wire OUT of people that what happens next is computer to
presenting industry)
a method for transferring knowledge.
However, we do not even any Internet.  And a Memex-like machine, and the personal computing,
but you need technology.  Let me elaborate.

GETTING INFORMATION INTO THE BRAIN

How about the process of transfer happen when we read": isn't that are
parents and protect informa

And 100 states of trigrams (word-level 3-grams, i.e. n-grams):

Welcome to Princeton.  Make the most out of their prowess in raising children?
 Probably a combination of all.  Education is not a matter of course.  Most professors also
 frequently build new courses, synthesizing new material to be shared by everyone!
 
 But I've just argued that machines can't speed up our ability to
 understand books.  We can't expect machines by themselves to
 understand the book.  The researchers there were very productive:
 within a period of about 5 years, a small slice of what's known
 or optimistic because we can now.
 
 In research, the ratio of the

As the reader may notice, the word-level trigram sounds grammatically more natural. But is it actually more accurate and/or precise at predicting the next gram in the original text? Further tests will determine which model can give my app the best word prediction possible.
