Per my last post, here are some statistics from experiments with my Markov implementation. I wrote code that compares the true next state of the text against the top 1, 3, 5, or 7 candidate states, ranked by transition probability from the training data, to measure the performance of different Markov models.
My training data is 20 years' worth of State of the Union Addresses (Bush Sr., Clinton, Bush Jr.): 895,163 characters (86 unique) and 154,934 words (4,310 unique). The testing data is a State of the Union Address by Carter: 26,610 characters (68 unique) and 4,592 words (1,543 unique).
Here is a plot of frequencies of unique k-grams:
The log y scale may be deceiving: for instance, the testing text has only 4% as many unique 10-grams as the training text.
Here are the success rates for all opportunities to predict the next state:
The left part specifies character k-grams; the right, word n-grams.
Here are the success rates for all attempts to predict the next state (an attempt is any time the current state of the text is known to the trained model):
It appears that as k gets larger, more unknown k-grams are encountered in the text. This makes sense, since the number of possible k-grams grows exponentially in k (e.g., 26^k for characters, not counting spaces and punctuation). In practice, though, the English language offers far fewer possibilities. It would be interesting to study how many unique, correctly spelled English k-grams there actually are.
As seen in the charts, offering more predictions leads to better results, though reasonable results are already achieved with 3 options. Also, the character k-grams are best at prediction. We saw earlier that a character 6-gram was good enough to produce readable sentences, so in this case, using character k-grams is much better!
Friday, September 16, 2011
Markov Statistical Fun
Wednesday, September 14, 2011
Markov Fun
I recently implemented a Markov chain for an app I'm developing. Markov chains, as applied to k-grams, predict successive characters based on the previous k-1 characters. The goal is to build a Markov model of natural language and execute word prediction as a user enters in text.
My implementation works by first "training" on a given text. I found a speech online given by Robert Sedgewick to his incoming intro CS class in 2001, which I used for this purpose. During training, the code parses the given text and determines the frequency with which each unique sequence of k-1 characters is followed by another character. This "frame" is called a k-gram. For instance, in the given text there are three distinct 6-grams beginning with the first 5 characters, "come ": "come u", "come t" (twice), and "come o" (twice). A k-gram is based on characters; an n-gram, on words.
The chain is initialized with a seed to start predicting from. Below, I used "Welco". The code then builds a linear directed graph connecting states: a state is just a k-gram, so neighboring states differ by sliding one character. Each state can be seen as a reference frame keeping a "snap-shot" of the most recent k characters; thus the Markov chain keeps a history of each "snap-shot".
Succeeding states are chosen pseudo-randomly based on the previously parsed frequencies. Once the requested number of states has been generated, the code stops generating new states and outputs the resulting text.
Here's 500 states of 7-grams (k-gram):
Welcome to life. As a research effort to the online, and we can absorb all knowledge. However, we do not even having a wire OUT of people that what happens next is computer to presenting industry) a method for transferring knowledge. However, we do not even any Internet. And a Memex-like machine, and the personal computing, but you need technology. Let me elaborate. GETTING INFORMATION INTO THE BRAIN How about the process of transfer happen when we read": isn't that are parents and protect informa
And 100 states of trigrams (n-gram):
Welcome to Princeton. Make the most out of their prowess in raising children? Probably a combination of all. Education is not a matter of course. Most professors also frequently build new courses, synthesizing new material to be shared by everyone! But I've just argued that machines can't speed up our ability to understand books. We can't expect machines by themselves to understand the book. The researchers there were very productive: within a period of about 5 years, a small slice of what's known or optimistic because we can now. In research, the ratio of the
As the reader may notice, the word n-gram output sounds grammatically more natural. However, is it actually more accurate and/or precise at predicting the next gram in the original text? Further tests will be done to determine which model can provide my app the best word prediction possible.
Thursday, September 1, 2011
Buffering can be slow
I wanted to share a case where buffering before writing to disk wasn't necessarily the best option.
The task is to read an input FASTA sequence file and filter out contigs shorter than N. Regard the first iteration of the code:
/* assumed context (not shown in the post): FILE *fi, *fo; size_t sequence_len = 0;
   char line[LINE_MAX + 1], last_read[FILENAME_MAX * FILENAME_MAX]; */
if (N > LINE_MAX) {
fprintf(stderr,"N is too large! Max N is %i\n",LINE_MAX);
exit(EXIT_FAILURE);
}
while (fgets(line, sizeof line, fi) != NULL) {
switch (line[0]) {
case '>': // read header
/* if this is a new read, check to see if the last made the cutoff */
if (sequence_len >= N) fprintf(fo,"%s",last_read);
/* start recording the read */
strcpy(last_read, line);
sequence_len = 0;
break;
default: // read sequence
/* update the total length and total sequence of the read*/
sequence_len += strlen(line)-1; // -1 for newline character
strcat(last_read,line);
break;
}
}
/* recover last read */
if (sequence_len >= N) fprintf(fo,"%s",last_read);
The algorithm parses the input line by line. If the line is a read header (starts with '>'), start buffering the read; if the line is sequence, buffer the read some more. Then, on the next read header, check whether the last read was long enough; if so, dump it to output.
Now this poses one big problem: if a read is super long, I could overflow my buffer. This is unlikely, since the current buffer size is FILENAME_MAX^2 (1025^2) characters. The main way to combat this is to... not buffer!
The nature of the algorithm means at least some buffering needs to be done. But here is the new code:
while (fgets(line, sizeof line, fi) != NULL) {
switch (line[0]) {
case '>': // read header
/* start recording the read */
strcpy(last_read, line);
sequence_len = 0;
has_written = 0;
break;
default: // read sequence
/* update the total length and total sequence of the read*/
sequence_len += strlen(line)-1; // -1 for newline character
if (has_written < 1) strcat(last_read,line); // write to buffer if necessary
if (sequence_len >= N) { // if already long enough, start writing
if (has_written < 1) { // if nothing has been written for this read yet, print the buffered header too
fprintf(fo,"%s",last_read);
has_written = 1;
}
else fprintf(fo,"%s",line); // else, just print the line to output
}
break;
}
}
This has three huge advantages:
1) I no longer buffer the entire read, only up to the cutoff. This means I save an order of memory.
2) I am calling strcat much less. This may seem negligible, but each strcat must first scan to find the end of the destination string, so repeated concatenation is quadratic in the read length. Thus, I am saving myself an order of running complexity.
3) It's cleaner. I am no longer doing any kind of retroactive logic, so I also don't need the special case after the while loop to recover the last read.
Writing to disk so often seems contrary to other tests I've done, but strcat was clearly the bottleneck here, and eliminating it more than made up for the increased disk writing: the second implementation is 75% faster and much cleaner.