Monday, August 20, 2012

Back to life with Arduino

In standard blog fashion, I've gone silent with posts. My personal life and "workaphilia" have led me to focus my energies on other (analog) pursuits. I'm now out of industry and in the protective cocoon of academia. I start my doctorate in a week and have picked up a new project that will mesh with my sailing interests and, hopefully, eventually extend to my degree.

We have great electronics, the Velocitek and Tactic, to give real-time stats on boat speed and heading. However, they are pretty closed systems, so it is difficult to get data off of these tools. Furthermore, the sensors have (at least at output) low sampling rates (GPS is about 2 Hz). In addition, there isn't any software to directly compare individual boat performances across a fleet.

Thus, I've set out to develop low-cost, open-source hardware to measure various dynamic properties of the boat. I am currently focusing on boat position and axial orientation (GPS and gyroscope). My goal is twofold: a) to empirically measure two-boat testing and, eventually, b) to generate enough data for a machine learning application. I still need to think of a name for my project...

For my first prototype, I've ordered the Arduino Uno, a triple-axis gyroscope, and a GPS breakout. I'm excited to have a great gyro at ~30 kHz (!) and a GPS at 10 Hz. These high sampling rates should allow me to do some smoothing and get accurate recordings. My brother is already proving to be an awesome asset, as he has been playing around with an Uno, building a camera slider.

I've quickly gotten the gyro talking to the Uno and producing output. I'm waiting for my GPS unit to arrive, but I have started developing a library to help keep my onboard software organized. Hopefully, I'll have all the hardware talking and logging within a week or two. I'm already thinking about adding WiFi shields to get real-time info out. From there, I'll build a second sensor nest.

Our sailing season ends in two months, so I'll be on a crunch to get something out. By mid-September, I'll be very happy to have two devices talking to an Android device. Then I can start writing a higher-level API (Android side) to start making two-boat comparisons!

Saturday, February 25, 2012

AGBT 2012

Like half of all blog posts on the internet, I have to start by apologizing for neglecting to update the blog regularly. No reason to get into details, so I'll go straight to the post.

Last week, while I was learning to tango in Buenos Aires, my colleagues presented a poster we put together detailing the many hours we've spent establishing a procedure for doing hybrid de novo assemblies with Illumina and 454 sequencing data. The experience was great and perhaps my first front-to-back computational research project. We achieved some synergistic results and have since included the procedures in our own analysis offerings. We are receiving some flattering attention from the Broad Institute as well as from across the web. Below is the abstract, and here is the poster.


Evaluation of Strategies for De Novo Assembly of Genomes and Transcriptomes Using Combined Illumina and Roche 454 Sequencing Data.

Jon R. Armstrong, Jarret I. Glasscock, Ian Schillebeeckx, and Navish Dadighat.
Cofactor Genomics, St. Louis, MO

The emergence of next-generation sequencing platforms has made low-cost sequencing an attractive approach for de novo assembly of genomes and transcriptomes. The two most widely used next-generation platforms for de novo assembly are the Illumina and the Roche 454, and each system has particular strengths and weaknesses. The Illumina generates short reads (100 bp) while the Roche 454 FLX+ produces read lengths close to Sanger (800 bp); however, the price per base on the Roche 454 machine is approximately 50x more costly than on the Illumina platform. Few studies exist which employ assembly of both read types [1, 2], and the methods used are not explicitly described. If strategies were determined to assemble data from both platforms, capitalizing on the strengths of each system while minimizing the weaknesses and cost, researchers could apply them during future de novo assembly projects. To this end, we evaluated multiple strategies for de novo assembly of combined Illumina and Roche 454 sequencing data, originating from several genomes and transcriptomes of varying size and origin. Our analysis shows that assembly of the Roche 454 reads, prior to combining with Illumina raw reads, produces the best assembly metrics for genomes and transcriptomes; however, depending on the type of assembly, it may be detrimental to assemble Illumina reads prior to combining with Roche 454 data. We will also present the types of assemblers used and workflows specific to genome and transcriptome assemblies.

1. Reinhardt JA, Baltrus DA, Nishimura MT, Jeck WR, Jones CD, Dangl JL. De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res. 2009 February; 19(2): 294–305.

2. Nowrousian M, Stajich JE, Chu M, Engh I, Espagne E, Halliday K, Kamerewerd J, Kempken F, Knab B, Kuo HC, Osiewacz HD, Pöggeler S, Read ND, Seiler S, Smith KM, Zickler D, Kück U, Freitag M. De novo assembly of a 40 Mb eukaryotic genome from short sequence reads: Sordaria macrospora, a model organism for fungal morphogenesis. PLoS Genet. 2010 Apr 8;6(4):e1000891.

Wednesday, October 5, 2011

SOAPdenovo: free(): invalid pointer:

I struggled with SOAPdenovo for three hours today. It was crashing at pregraph, shortly after parsing the files for the first time. I narrowed it down to one file:

SOAPdenovo31mer pregraph -s test_05.cfg -o test_05 -p 24

Version 1.3: released on Nov 23th, 2009

In test_05.cfg, 1 libs, max seq len 1195, max name len 256

24 thread created
read from file:
 test_1.fa
time spent on hash reads: 10s, 129244 reads processed
[LIB] 0, avg_ins 10000, reverse 1 
22311858 nodes allocated, 12022388 kmer in reads, 12022388 kmer processed
*** glibc detected *** SOAPdenovo31mer: free(): invalid pointer: 0x00000000089b0203 ***
======= Backtrace: =========
[0x44cd16]
[0x450857]
[0x40472d]
[0x400316]
[0x434e93]
[0x440250]
[0x4001b9]
======= Memory map: ========
00400000-004d4000 r-xp 00000000 08:01 25826017                           /usr/local/bin/SOAPdenovo
006d4000-006d6000 rw-p 000d4000 08:01 25826017                           /usr/local/bin/SOAPdenovo
006d6000-00750000 rw-p 00000000 00:00 0 
026d3000-08a5e000 rw-p 00000000 00:00 0                                  [heap]
41810000-41811000 ---p 00000000 00:00 0 
.....

Turns out the problem wasn't my input but my configuration settings: the library had reverse_seq=1 for a single-read fasta file. Setting reverse_seq=0 fixed my problem...
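For reference, the corrected library section looked roughly like this. I don't have the exact file handy, so the surrounding fields here are illustrative; check the SOAPdenovo documentation for the full key list:

```
max_rd_len=1200

[LIB]
avg_ins=10000
reverse_seq=0
asm_flags=3
f=test_1.fa
```

The key line is reverse_seq=0: with a plain forward-read fasta library, there is nothing to reverse-complement.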

Friday, September 16, 2011

Markov Statistical Fun

Per my last post, I offer some statistics from experiments with my Markov implementation. I wrote code that compares the true next state of the text against the top (by transition probability) 1, 3, 5, or 7 candidate states from the trained model, to see how different Markov models perform.
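As a rough sketch of the evaluation logic, here is how one top-m check could look in C. This is an illustrative reconstruction, not my actual code (the function name topm_hit is made up): it rescans the training text for the current context, tallies successor frequencies, and asks whether the true next character ranks among the top m (ties are counted in the character's favor).

```c
#include <string.h>

/* Return 1 if `truth` is among the top-m most frequent characters that
   follow the (k-1)-gram `ctx` in `train`, 0 otherwise. */
int topm_hit(const char *train, const char *ctx, int k1, char truth, int m) {
    int counts[256] = {0};
    int n = (int)strlen(train);

    /* tally every character that follows an occurrence of ctx */
    for (int i = 0; i + k1 < n; i++)
        if (strncmp(train + i, ctx, k1) == 0)
            counts[(unsigned char)train[i + k1]]++;

    if (counts[(unsigned char)truth] == 0) return 0;  /* never observed */

    /* a hit if fewer than m characters occur strictly more often */
    int better = 0;
    for (int c = 0; c < 256; c++)
        if (counts[c] > counts[(unsigned char)truth]) better++;
    return better < m;
}
```

Sliding this over the test text and averaging the hits gives success rates like the ones plotted below.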

My training data is 20 years' worth of State of the Union addresses (Bush Sr., Clinton, Bush Jr.): 895,163 characters (86 unique) and 154,934 words (4,310 unique). The testing data is a State of the Union address by Carter: 26,610 characters (68 unique) and 4,592 words (1,543 unique).

Here is a plot of frequencies of unique k-grams:

The log y scale may be deceiving. For instance, the testing text has only 4% as many unique 10-grams as the training text.

Here are the success rates for all opportunities to predict the next state:

The left part specifies character k-grams; the right, word n-grams.

Here are the success rates for all attempts to predict the next state (an attempt is any time the current text's state is known to the trained model):

It appears that as k gets larger, more unknown k-grams are encountered in the text. This makes sense, as each increment of k multiplies the number of possibilities: for characters alone, 26^k, not counting spaces and punctuation. In practice, though, the English language realizes far fewer of these. It would be interesting to study how many unique English k-grams are actually spelled.

As seen in the charts, offering more predictions leads to better results, though reasonable results are already achieved with 3 options. Also, the character k-grams predict best. We saw earlier that a character 6-gram was good enough to produce readable sentences, so in this case character k-grams are much better!

Wednesday, September 14, 2011

Markov Fun

I recently implemented a Markov chain for an app I'm developing. Markov chains, as applied to k-grams, predict successive characters based on the previous k-1 characters. The goal is to build a Markov model of natural language and perform word prediction as a user enters text.

My implementation works by first "training" on a given text. I found a speech online, given by Robert Sedgewick to his incoming intro CS class in 2001, which I used for this purpose. During training, the code parses the given text and determines the frequency at which a unique sequence of k-1 characters is followed by a given next character. This "frame" is called a k-gram. For instance, in the given text, there are 3 distinct 6-grams starting with the first 5 characters, "come ": "come u", "come t" (twice), and "come o" (twice). A k-gram is based on characters; an n-gram, on words.

The chain is initialized by giving it a seed to start predicting from. Below, I used "Welco". The code then builds a linear directed graph connecting states together. A state is just a k-gram, so neighboring states differ by sliding one character. Each state can be seen as a reference frame keeping a "snap-shot" of the most recent k characters; thus, the Markov chain keeps a history of these "snap-shots".

Successive states are chosen pseudo-randomly based on the previously parsed frequencies. When a set number of states has been reached, the code stops generating new states and outputs the resulting text.
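A minimal sketch of the sampling step in C (my implementation differs; the function name next_char is made up): rather than storing an explicit frequency table, this version rescans the training text for the current (k-1)-character context and reservoir-samples one of its successors, which is equivalent to choosing the next character by frequency.

```c
#include <stdlib.h>
#include <string.h>

/* Sample the character following context `ctx` (length k1 = k-1) by
   scanning `text` for every occurrence of ctx; each successor replaces
   the current pick with probability 1/count (reservoir sampling), which
   yields a frequency-weighted choice in a single pass. */
char next_char(const char *text, const char *ctx, int k1) {
    int n = (int)strlen(text);
    int count = 0;
    char pick = '\0';
    for (int i = 0; i + k1 < n; i++) {
        if (strncmp(text + i, ctx, k1) == 0) {
            count++;
            if (rand() % count == 0)
                pick = text[i + k1];
        }
    }
    return pick;  /* '\0' if the context never appears */
}
```

Generation then slides the window: append the sampled character to the output, drop the first character of the context, and repeat until enough states have been produced.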

Here's 500 states of 7-grams (k-gram):

Welcome to life.

As a research effort to the online, and we can absorb all knowledge.
However, we do not even having a wire OUT of people that what happens next is computer to
presenting industry)
a method for transferring knowledge.
However, we do not even any Internet.  And a Memex-like machine, and the personal computing,
but you need technology.  Let me elaborate.

GETTING INFORMATION INTO THE BRAIN

How about the process of transfer happen when we read": isn't that are
parents and protect informa

And 100 states of tri(3-)grams (n-gram):

Welcome to Princeton.  Make the most out of their prowess in raising children?
 Probably a combination of all.  Education is not a matter of course.  Most professors also
 frequently build new courses, synthesizing new material to be shared by everyone!
 
 But I've just argued that machines can't speed up our ability to
 understand books.  We can't expect machines by themselves to
 understand the book.  The researchers there were very productive:
 within a period of about 5 years, a small slice of what's known
 or optimistic because we can now.
 
 In research, the ratio of the

As the reader may notice, the word trigram output sounds grammatically more natural. However, is it actually more accurate and/or precise at predicting the next gram in the original text? Further tests will be done to ensure that an optimized model can provide my app the best word prediction possible.

Thursday, September 1, 2011

Buffering can be slow

I wanted to share a case where buffering before writing to disk wasn't necessarily the best option.
The task is to read an input fastA sequence file and filter out contigs shorter than N bases. Consider the first iteration of the code:

if (N > LINE_MAX) {
  fprintf(stderr,"N is too large! Max N is %i\n",LINE_MAX);
  exit(EXIT_FAILURE);
}

while (fgets(line, sizeof line, fi) != NULL) {
  switch (line[0]) {
    case '>':	// read header
      /* if this is a new read, check to see if the last made the cutoff */
      if (sequence_len >= N) fprintf(fo,"%s",last_read);

      /* start recording the read */
      strcpy(last_read, line);
      sequence_len = 0;

      break;
    default:	// read sequence
      /* update the total length and total sequence of the read*/
      sequence_len += strlen(line)-1;	// -1 for newline character
      strcat(last_read,line);
      
      break;
  }
}

/* recover last read */
if (sequence_len >= N) fprintf(fo,"%s",last_read);

The algorithm parses the input, line by line. If the line is a read header (starts with '>'), then start buffering the read. If the line is sequence, buffer the read some more. Now on the next read header, see if the last read was long enough; if so, dump to output.

Now this poses one big problem: if a read is super long, I could overflow my buffer. This is unlikely, since the current buffer size is FILENAME_MAX^2 (1025^2) characters, but the main way to combat it is to .... not buffer!

The nature of the algorithm means at least some buffering needs to be done. But here is the new code:

while (fgets(line, sizeof line, fi) != NULL) {
  switch (line[0]) {
    case '>':	// read header

      /* start recording the read */
      strcpy(last_read, line);
      sequence_len = 0;
      has_written = 0;

      break;
    default:	// read sequence
      /* update the total length and total sequence of the read*/
      sequence_len += strlen(line)-1;	// -1 for newline character
      if (has_written < 1) strcat(last_read,line);	// write to buffer if necessary

      if (sequence_len >= N) {	// if already long enough, start writing
        if (has_written < 1) {		// if nothing has been written for this read yet, print the header too
          fprintf(fo,"%s",last_read);
          has_written = 1;
        }
        else fprintf(fo,"%s",line);		// else, just print the line to output
      }

      break;
  }
}

This has three huge advantages:
1) I no longer buffer the whole read, only the current line. This means I save an order of magnitude in memory.
2) I call strcat much less. This may seem negligible, but concatenating involves finding the end of the destination string each time, so I save myself an order of running time.
3) It's cleaner. I am no longer doing any retroactive logic, so I also don't need the awkward special case after the while loop.

Writing to disk so often seems contrary to other tests I've done, but strcat was clearly the bottleneck here, and removing it more than made up for the increased disk writing: the second implementation is 75% faster and much cleaner.
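The strcat cost is the classic repeated-rescan trap: each call walks the whole destination to find its end, so appending N lines costs O(N^2) character reads in total. If buffering were still needed, one fix (a sketch, not the code above; append_at is a made-up helper) is to track the length yourself and memcpy at the known end:

```c
#include <string.h>

/* Append `src` to `dst` at known offset `len` and return the new length.
   Copying to a known position avoids strcat's scan of the destination.
   The caller must ensure dst has room for len + strlen(src) + 1 bytes. */
size_t append_at(char *dst, size_t len, const char *src) {
    size_t n = strlen(src);
    memcpy(dst + len, src, n + 1);  /* +1 copies the terminating '\0' */
    return len + n;
}
```

Each append is then proportional to the chunk being added, not to everything buffered so far.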

Thursday, August 11, 2011

How to install the ABySS assembler on OSX

Ugh. After venting all my tribulations in an unpublished blog post, here is how you install ABySS 1.2.7 on OS X. Keep in mind you need Xcode installed.

- download 1.2.7
- make a directory and enter it:
mkdir MAKE
cd MAKE
- run configure. You can set the installation path with --prefix, if desired (otherwise it installs to the default location). Then make and install:
../configure --disable-openmp --prefix=/usr/local
make AM_CXXFLAGS=-Wall
sudo make install
- add ABySS to your path (in .profile).

Booya! Now ABySS finally works! I'm not sure if this will work with MPI; see this wiki for details.