The task is to read an input FASTA sequence file and filter out contigs shorter than N bases. Regard the first iteration of the code:
if (N > LINE_MAX) {
    fprintf(stderr, "N is too large! Max N is %i\n", LINE_MAX);
    exit(EXIT_FAILURE);
}
while (fgets(line, sizeof line, fi) != NULL) {
    switch (line[0]) {
    case '>': // read header
        /* if this is a new read, check to see if the last one made the cutoff */
        if (sequence_len >= N) fprintf(fo, "%s", last_read);
        /* start recording the read */
        strcpy(last_read, line);
        sequence_len = 0;
        break;
    default: // read sequence
        /* update the total length and total sequence of the read */
        sequence_len += strlen(line) - 1; // -1 for the newline character
        strcat(last_read, line);
        break;
    }
}
/* recover the last read */
if (sequence_len >= N) fprintf(fo, "%s", last_read);
The algorithm parses the input line by line. If the line is a read header (it starts with '>'), start buffering the read. If the line is sequence, keep buffering. On the next read header, check whether the last read was long enough; if so, dump it to output.
Now this poses one big problem: if a read is super long, I could overflow my buffer. This is unlikely, since the current buffer size is FILENAME_MAX^2 (1025^2) characters. The main way to combat this is to .... not buffer!
The nature of the algorithm means at least some buffering needs to be done. But here is the new code:
while (fgets(line, sizeof line, fi) != NULL) {
    switch (line[0]) {
    case '>': // read header
        /* start recording the read */
        strcpy(last_read, line);
        sequence_len = 0;
        has_written = 0;
        break;
    default: // read sequence
        /* update the total length and total sequence of the read */
        sequence_len += strlen(line) - 1; // -1 for the newline character
        if (has_written < 1) strcat(last_read, line); // buffer only until we commit
        if (sequence_len >= N) { // if already long enough, start writing
            if (has_written < 1) { // first write for this read: print the buffered header too
                fprintf(fo, "%s", last_read);
                has_written = 1;
            }
            else fprintf(fo, "%s", line); // else, just print the line to output
        }
        break;
    }
}
The rewrite has three huge advantages:
1) I no longer buffer the whole read, only up to the first N characters of it. That saves a large amount of memory on long contigs.
2) I am calling strcat much less. This may seem negligible, but each strcat has to rescan the destination to find its end, so repeatedly concatenating onto a growing read is quadratic in the read length. Skipping it once the cutoff is reached saves an order of running complexity.
3) It's cleaner. I am no longer doing any kind of retroactive logic. As a result, I also don't need the weird special case after the while loop.
Writing to disk so often seems contrary to other tests I've done, but strcat was clearly the bottleneck here, and removing it more than made up for the increased disk writing: the second implementation is 75% faster and much cleaner.