We move one on our search for overrepresented motifs in DNA sequences. I was preparing this entry when the comments to part 1 arrived. Because of that we will modify our previous code to include some suggestions and then we will change t a little bit to output the actual values we want. As Titus pointed out in his comment, when we merge the sequences we generate errors, because we artificially including motifs that are not supposed to be there. But in the end we will see that we are no after the actual number of times a motif appears in the sequence and what matters is the motif quorum. The main focus of the part one was to show the decrease in code length from C++ to Python, introduce generator functions and yield and also show the nice permutation function. Turns out by using the approach of the previous entry there is a big drawback in code execution, which is around 60 times slower than C++.
So we need to find another approach, and Mike showed us how (with some small modifications)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This time the counting is done with a
defaultdict initialized with integers (all zeroes), and
instead of generating all possible combinations (which was cool and
fast, but in the end made our script slower) a window with the input
length is slid along the each sequence and the key of the default
dictionary is the motif and the value is the actual count incremented as
we determine each motif. Checking this code performance using Linux’s
time, we get 0.2 seconds, an improvement of 5 fold on the C++ code and
300 fold on our previous Python code (Thanks Mike and everyone else for
the suggestions). What is different here is that the output is not
ordered and only the seen motifs are printed in the end. We will see
next time how to get the quorums.