Wednesday, October 5, 2011

SOAPdenovo: free(): invalid pointer:

I struggled with SOAPdenovo for 3 hours today. It was crashing at pregraph, shortly after parsing files for the first time. I narrowed it down to one file:

SOAPdenovo31mer pregraph -s test_05.cfg -o test_05 -p 24

Version 1.3: released on Nov 23th, 2009

In test_05.cfg, 1 libs, max seq len 1195, max name len 256

24 thread created
read from file:
 test_1.fa
time spent on hash reads: 10s, 129244 reads processed
[LIB] 0, avg_ins 10000, reverse 1 
22311858 nodes allocated, 12022388 kmer in reads, 12022388 kmer processed
*** glibc detected *** SOAPdenovo31mer: free(): invalid pointer: 0x00000000089b0203 ***
======= Backtrace: =========
[0x44cd16]
[0x450857]
[0x40472d]
[0x400316]
[0x434e93]
[0x440250]
[0x4001b9]
======= Memory map: ========
00400000-004d4000 r-xp 00000000 08:01 25826017                           /usr/local/bin/SOAPdenovo
006d4000-006d6000 rw-p 000d4000 08:01 25826017                           /usr/local/bin/SOAPdenovo
006d6000-00750000 rw-p 00000000 00:00 0 
026d3000-08a5e000 rw-p 00000000 00:00 0                                  [heap]
41810000-41811000 ---p 00000000 00:00 0 
.....

Turns out the problem wasn't my input, but my configuration settings. The library had reverse_seq=1 for a single-read fasta file. Setting reverse_seq=0 fixed my problem.
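For context, the culprit section of test_05.cfg looked roughly like the sketch below. The file name and insert size come from the log above; the remaining fields and their exact names are my reconstruction from the SOAPdenovo 1.x documentation, so treat them as illustrative:

```
max_rd_len=1195
[LIB]
avg_ins=10000
reverse_seq=1
asm_flags=3
f=test_1.fa
```

Changing reverse_seq=1 to reverse_seq=0 (i.e. telling the assembler the reads are in forward orientation) was the entire fix.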

Friday, September 16, 2011

Markov Statistical Fun

Per my last post, I offer some statistics from experiments with my Markov implementation. I wrote code that compares the true next state of the text against the top 1, 3, 5, or 7 candidate states, ranked by transition probability learned from the training data, to measure the performance of different Markov models.
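To make the experiment concrete, here is a small C++ sketch of how such a top-k evaluation can be scored. This is illustrative, not my actual test code; the function name and structure are my own:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Count, over a test text, how often the true next character appears among
// the top-k most frequent successors learned from a training text.
// k is the gram length; topk is how many candidate successors we check.
double topk_success_rate(const std::string& train, const std::string& test,
                         std::size_t k, std::size_t topk) {
    // context (k-1 characters) -> successor character -> frequency
    std::map<std::string, std::map<char, int>> freq;
    for (std::size_t i = 0; i + k <= train.size(); ++i)
        freq[train.substr(i, k - 1)][train[i + k - 1]]++;

    std::size_t attempts = 0, hits = 0;
    for (std::size_t i = 0; i + k <= test.size(); ++i) {
        auto it = freq.find(test.substr(i, k - 1));
        if (it == freq.end()) continue;  // unknown k-gram: not an attempt
        ++attempts;
        // sort candidate successors by descending frequency
        std::vector<std::pair<int, char>> cand;
        for (const auto& p : it->second) cand.push_back({p.second, p.first});
        std::sort(cand.rbegin(), cand.rend());
        char truth = test[i + k - 1];
        for (std::size_t j = 0; j < topk && j < cand.size(); ++j)
            if (cand[j].second == truth) { ++hits; break; }
    }
    return attempts ? double(hits) / attempts : 0.0;
}
```

Running it for each k, with topk in {1, 3, 5, 7}, over the training and testing texts produces success rates like those charted below.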

My training data is 20 years' worth of State of the Union Addresses (Bush Sr., Clinton, Bush Jr.). There were 895,163 characters (86 unique) and 154,934 words (4,310 unique). The testing data is a State of the Union Address by Carter. There were 26,610 characters (68 unique) and 4,592 words (1,543 unique).

Here is a plot of frequencies of unique k-grams:

The log y scale may be deceiving. For instance, the testing text has only 4% as many unique 10-grams as the training text.

Here are the success rates for all opportunities to predict the next state:

The left part specifies character k-grams; the right, word n-grams.

Here are the success rates for all attempts to predict the next state (an attempt is any time the current text's state is known to the trained model):

It appears that as k gets larger, more unknown k-grams are encountered in the text. This makes sense, as each increment of k grows the space of possibilities exponentially: for characters, 26^k, not counting spaces and punctuation. In practice the English language uses far fewer of these. It would be interesting to study how many unique spelled English k-grams there actually are.

As seen in the charts, offering more predictions leads to better results. However, reasonable results are already achieved with 3 options. Also, the character k-grams are best at predictions. We saw previously that a character 6-gram was good enough to produce readable sentences, so in this case, using character k-grams is much better!

Wednesday, September 14, 2011

Markov Fun

I recently implemented a Markov chain for an app I'm developing. Markov chains, as applied to k-grams, predict each successive character based on the previous k-1 characters. The goal is to build a Markov model of natural language and perform word prediction as a user enters text.

My implementation works by first "training" on a given text. For this purpose I used a speech I found online, given by Robert Sedgewick to his incoming intro CS class in 2001. During training, the code parses the text and records the frequency at which a unique set of k-1 characters is followed by another character. This "frame" is called a k-gram. For instance, in the given text there are three distinct 6-grams starting with the first 5 characters, "come ": "come u", "come t" (twice), and "come o" (twice). A k-gram is based on characters; an n-gram, on words.

The chain is initialized with a seed to start predicting from. Below, I used "Welco". The code then builds a linear directed graph connecting states. A state is just a k-gram, so neighboring states differ by sliding one character. Each state can be seen as a reference frame keeping a "snap-shot" of the most recent k characters; thus the Markov chain keeps a history of these snap-shots.

Succeeding states are chosen pseudo-randomly based on the previously parsed frequencies. When a set number of states has been reached, the code stops generating new states and outputs the resulting text.
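The training and generation steps just described can be sketched as follows. This is illustrative C++, not the app's actual implementation; note the seed must be at least k-1 characters long:

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <string>

// Train character k-gram frequencies on `train`, then repeatedly slide a
// (k-1)-character window ("snap-shot") and append each next character
// pseudo-randomly, weighted by the observed frequencies.
std::string markov_generate(const std::string& train, std::size_t k,
                            std::string seed, std::size_t n_states) {
    // context (k-1 characters) -> successor character -> frequency
    std::map<std::string, std::map<char, int>> freq;
    for (std::size_t i = 0; i + k <= train.size(); ++i)
        freq[train.substr(i, k - 1)][train[i + k - 1]]++;

    std::mt19937 rng(42);  // fixed seed so runs are repeatable
    std::string out = seed;
    for (std::size_t s = 0; s < n_states; ++s) {
        std::string ctx = out.substr(out.size() - (k - 1));
        auto it = freq.find(ctx);
        if (it == freq.end()) break;  // dead end: context never seen
        int total = 0;
        for (const auto& p : it->second) total += p.second;
        std::uniform_int_distribution<int> dist(0, total - 1);
        int r = dist(rng);
        for (const auto& p : it->second) {  // roulette-wheel selection
            if ((r -= p.second) < 0) { out += p.first; break; }
        }
    }
    return out;
}
```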

Here are 500 states of character 7-grams (k-grams):

Welcome to life.

As a research effort to the online, and we can absorb all knowledge.
However, we do not even having a wire OUT of people that what happens next is computer to
presenting industry)
a method for transferring knowledge.
However, we do not even any Internet.  And a Memex-like machine, and the personal computing,
but you need technology.  Let me elaborate.

GETTING INFORMATION INTO THE BRAIN

How about the process of transfer happen when we read": isn't that are
parents and protect informa

And 100 states of word trigrams (n-grams):

Welcome to Princeton.  Make the most out of their prowess in raising children?
 Probably a combination of all.  Education is not a matter of course.  Most professors also
 frequently build new courses, synthesizing new material to be shared by everyone!
 
 But I've just argued that machines can't speed up our ability to
 understand books.  We can't expect machines by themselves to
 understand the book.  The researchers there were very productive:
 within a period of about 5 years, a small slice of what's known
 or optimistic because we can now.
 
 In research, the ratio of the

As the reader may notice, the word n-gram output sounds grammatically more natural. However, is it actually more accurate and/or precise at predicting the next gram in the original text? Further tests will be done to ensure that an optimized model can provide my app with the best word prediction possible.

Thursday, September 1, 2011

Buffering can be slow

I wanted to share a case where buffering before writing to disk wasn't necessarily the best option.
The task is to read an input fastA sequence file and filter out contigs shorter than N bases. Consider the first iteration of the code:

if (N > LINE_MAX) {
  fprintf(stderr,"N is too large! Max N is %i\n",LINE_MAX);
  exit(EXIT_FAILURE);
}

while (fgets(line, sizeof line, fi) != NULL) {
  switch (line[0]) {
    case '>':	// read header
      /* if this is a new read, check to see if the last made the cutoff */
      if (sequence_len >= N) fprintf(fo,"%s",last_read);

      /* start recording the read */
      strcpy(last_read, line);
      sequence_len = 0;

      break;
    default:	// read sequence
      /* update the total length and total sequence of the read*/
      sequence_len += strlen(line)-1;	// -1 for newline character
      strcat(last_read,line);
      
      break;
  }
}

/* recover last read */
if (sequence_len >= N) fprintf(fo,"%s",last_read);

The algorithm parses the input line by line. If the line is a read header (starts with '>'), start buffering the read. If the line is sequence, buffer some more. On the next read header, check whether the last read was long enough; if so, dump it to the output.

Now this poses one big problem: if a read is super long, I could overflow my buffer. This is unlikely, since the current buffer size is FILENAME_MAX^2 (1025^2) characters, but the surest way to combat it is to... not buffer!

The nature of the algorithm means at least some buffering needs to be done. But here is the new code:

while (fgets(line, sizeof line, fi) != NULL) {
  switch (line[0]) {
    case '>':	// read header

      /* start recording the read */
      strcpy(last_read, line);
      sequence_len = 0;
      has_written = 0;

      break;
    default:	// read sequence
      /* update the total length and total sequence of the read*/
      sequence_len += strlen(line)-1;	// -1 for newline character
      if (has_written < 1) strcat(last_read,line);	// write to buffer if necessary

      if (sequence_len >= N) {	// if already long enough, start writing
        if (has_written < 1) {		// if it's never written before for this read, print header too.
          fprintf(fo,"%s",last_read);
          has_written = 1;
        }
        else fprintf(fo,"%s",line);		// else, just print the line to output
      }

      break;
  }
}

This has three huge advantages:
1) I no longer buffer the whole read, so memory use drops from the length of the longest read to the length of a single line.
2) I call strcat much less. This may seem negligible, but strcat must scan to the end of the destination string on every call, so repeatedly concatenating lines onto a growing buffer is quadratic in the read length. Dropping it saves an order of running complexity.
3) It's cleaner. I am no longer doing any retroactive logic, so I also don't need the weird special case after the while loop.

Writing to disk so often seems contrary to other tests I've done, but strcat was clearly the bottleneck here, and removing it more than made up for the extra disk writes: the second implementation is 75% faster and much cleaner.

Thursday, August 11, 2011

How to install the ABySS assembler on OSX

ugh.
After venting all my tribulations in an unpublished blog post, here is how you install ABySS 1.2.7 on OS X. Keep in mind you need Xcode installed.

- download 1.2.7
- make a directory and enter it:
mkdir MAKE
cd MAKE
- run configure. You can set the installation path with --prefix, if desired (the default prefix is /usr/local). Then make and install:
../configure --disable-openmp --prefix=/usr/local
make AM_CXXFLAGS=-Wall
sudo make install
- add ABySS to your path (in .profile).

Booya! Now ABySS finally works! I'm not sure if this will work with MPI; see this wiki for details.

Thursday, July 28, 2011

Validate program installation with C++ and system()

Currently I am writing an analysis script. This script outputs a graph generated by a third-party program called gnuplot. To make the script more robust, I wanted it to abort at startup if gnuplot is not installed. Hence, the post.

Requirements:
I need something that will quietly find out if the command exists and then kill the script.

Discussion:
I've been doing a lot of 'man 3' reading and scouring the web for a quick and easy way to check if a program is installed. I tried to stay away from system(), as other sources deemed it very dangerous and even *EVIL*, so I thought I had to avoid it absolutely.

I tried using several of the standard C exec* functions after calling fork(), and wrote my own function that buffered popen() results. However, this seemed like too much code that must be redundant with existing libraries. I hesitated to use popen(), as it is basically system(), only buffered (both just fork and run the command through the shell), which led me to question the security of popen() as well. Turns out, it didn't really matter.

Finally an answer:
The issue with system() is that it *can* be very dangerous. The child process runs with the same user privileges as the parent, so with admin rights one could accidentally run something like system("rm -rf /"). Problems arise when the programmer is imprudent with the argument, especially if it is built from untrusted input. Under a strict security doctrine, that means system() should never be used, since it could be misused. However, I finally concluded, with my software developer sense, that system() really is fine, especially with a static string as an argument. That being said, here are the lines:

if (256 == system("hash gnuplot 2>&-")) {	// 256 is the wait status for exit code 1
  cout << "Gnuplot must be installed to run script. Aborting.\n";
  exit(EXIT_FAILURE);
}
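Since `hash` is a shell builtin whose behavior can vary between shells, a slightly more portable variant is possible. This is a sketch using POSIX `command -v` and the wait-status macros; the helper name is my own:

```cpp
#include <cstdlib>
#include <string>
#include <sys/wait.h>

// Check whether a command is on PATH. `command -v` is specified by POSIX,
// so it works in any standard /bin/sh, and WEXITSTATUS decodes the wait
// status that system() returns instead of comparing against the raw 256.
bool command_exists(const std::string& cmd) {
    std::string sh = "command -v " + cmd + " >/dev/null 2>&1";
    int status = std::system(sh.c_str());
    return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

With this helper, the startup check becomes a readable one-liner, e.g. `if (!command_exists("gnuplot")) exit(EXIT_FAILURE);`.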

Enjoy!

Tuesday, July 19, 2011

Hello World

Hi!
My name is Ian Schillebeeckx and I am a young programmer interested in Algorithmic and Architectural Optimization and Android coding. I am currently interning at a gene sequencing company, Cofactor Genomics, where I am refactoring scripts in their existing pipeline. For fun, I code for the Android platform *slides glasses up nose*.

Through my own interest and my brother's encouragement, I've decided to finally start a blog and publish some of the trials and tribulations of my work and hobby. Hopefully my experiences can be of use to others later!

Anyways, here's the inaugural block of code:

#include <iostream>
using namespace std;

int main() {
 cout << "!!!Hello World!!!" << endl;
 return 0;
}