Before we see how to improve the speed in our Python code (would it be possible?), let’s go back to our GenBank parsing algorithm. Last time we saw how to extract the proteins from the CDS entries along the file. This time we will see how to get the DNA from the file. Usually DNA sequence is stored at the bottom of the GenBank file, in a long sequence (if the file is a genome, for instance). Basically, what we need to do is to get the start and end position of each gene (or any annotated piece) and cut the long sequence at the bottom. This will require a couple of changes in our previous script, basically to get the DNA and to cut it. Previously, we had this to read the file
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
Notice that we stop when we reach the bottom DNA sequence. We need to go further, but there is another issue: DNA sequence stored in a GenBank file are stored in blocks of nucleotides and the initial position in each line is also there. So we need to join these blocks and remove the number from the beginning of the line. In the end we would have this
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
|
We added
a couple of things. Mainly a flag boolean variable (is_seq
) that tells
the program that we have reached “DNA level” and a routine to join the
blocks from the second one to the end. sequence
is list at first,
appending blocks of actual DNA sequence and numbers obtained from the
split
method. We use a “blank” split
, what means it will break the
string in the spaces. Our join
method above joining list elements from
the element 1 and forward. Element 0 in this case would be the numbers
indicating sequence position. The function to parse the entries and the
class to store the information also need to be changed. Previously, our
class was simpler, with less objects. This time we need to extract more
information.
1 2 3 4 5 6 7 |
|
We removed the sequence object, but added a start, end and a complement, for the cases where the gene is on the reverse complement. The function to extract the info looks like
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
We will discuss the parsing function in the next entry and wrap up both codes, and add them to the repository.