Friday, September 16, 2011

Markov Statistical Fun

Following up on my last post, here are some statistics from experiments with my Markov implementation. I wrote code that compares the true next state of the text against the model's top 1, 3, 5, or 7 predicted states (ranked by transition probability estimated from the training data) to measure the performance of different Markov models.
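My implementation isn't shown here, but the core of it can be sketched in a few lines of Python. The `train` and `predict` names below are hypothetical, not my actual code: `train` maps each character k-gram in the training text to a counter of the characters that followed it, and `predict` returns the n most frequent successors as the top-n prediction set.

```python
from collections import Counter, defaultdict

def train(text, k):
    """Map each k-gram in the training text to a Counter of its successors."""
    model = defaultdict(Counter)
    for i in range(len(text) - k):
        model[text[i:i + k]][text[i + k]] += 1
    return model

def predict(model, gram, n):
    """The n most probable next characters after `gram`, best first."""
    return [c for c, _ in model[gram].most_common(n)]
```

A prediction then counts as a success if the text's actual next character appears anywhere in that top-n list.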

My training data is 20 years' worth of State of the Union Addresses (Bush Sr., Clinton, Bush Jr.): 895,163 characters (86 unique) and 154,934 words (4,310 unique). The testing data is a State of the Union Address by Carter: 26,610 characters (68 unique) and 4,592 words (1,543 unique).
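Counts like these are easy to reproduce. A minimal sketch (treating words as whitespace-separated tokens, which may not match exactly how my counts handled punctuation):

```python
def corpus_stats(text):
    """Total and unique counts for characters and whitespace-separated words."""
    words = text.split()
    return {
        "chars": len(text),
        "unique_chars": len(set(text)),
        "words": len(words),
        "unique_words": len(set(words)),
    }
```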

Here is a plot of frequencies of unique k-grams:

The log y scale may be deceiving. For instance, the testing text has only 4% as many unique 10-grams as the training text.

Here are the success rates for all opportunities to predict the next state:

The left half of the chart shows character k-grams; the right half, word k-grams.

Here are the success rates over all attempts to predict the next state (an attempt is any position where the current text's k-gram is known to the trained model):
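The two denominators can be computed in one pass. A sketch, assuming the model is a map from k-grams to Counters of their successors (as a hypothetical training step would produce): the rate over opportunities divides hits by every position in the text, while the rate over attempts divides only by positions whose k-gram the model has seen.

```python
from collections import Counter

def success_rates(model, text, k, n):
    """Return (rate over all opportunities, rate over attempts only) for
    top-n prediction of the next character in `text`."""
    hits = attempts = opportunities = 0
    for i in range(len(text) - k):
        opportunities += 1          # every position is an opportunity
        gram = text[i:i + k]
        if gram in model:           # an attempt: the model knows this k-gram
            attempts += 1
            top_n = [c for c, _ in model[gram].most_common(n)]
            if text[i + k] in top_n:
                hits += 1
    return (hits / opportunities if opportunities else 0.0,
            hits / attempts if attempts else 0.0)
```

The gap between the two rates grows with k, since larger k-grams are more often missing from the model entirely.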

It appears that as k gets larger, more unknown k-grams are encountered in the testing text. This makes sense, as each increment of k multiplies the number of possible k-grams exponentially: for characters alone there are 26^k, not counting spaces and punctuation. In practice, the English language realizes far fewer of these possibilities. It would be interesting to study how many unique k-grams actually occur in spelled English.
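For a rough sense of that gap, the possible and observed counts can be compared directly. A small sketch (26**k counts lowercase letters only, matching the estimate above):

```python
def possible_kgrams(alphabet_size, k):
    """Number of possible k-grams over an alphabet, e.g. 26**k for letters."""
    return alphabet_size ** k

def observed_kgrams(text, k):
    """Number of unique k-grams actually occurring in the text."""
    return len({text[i:i + k] for i in range(len(text) - k + 1)})
```

Even a huge corpus will observe only a vanishing fraction of the 26^k possibilities once k is past 5 or so, which is exactly why longer k-grams from the testing text are so often unknown to the model.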

As the charts show, offering more predictions leads to better results, though reasonable results are already achieved with 3 options. The character k-grams also make the best predictions. We saw earlier that a character 6-gram was good enough to produce readable sentences, so in this case using character k-grams is clearly the better choice!
