We use our last code as a starting point in order to generate some real information from our simulated sequence sets. Our approach here will be the same: functions to do all the work for us and a very simple main code. We also reuse some code with applied before to count the nucleotides. Let’s see the code, discussion just after it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Well, not many
new things here. The main one might the be the assignment of the
count, which receives initially a value of
0.0, which in
this case is a float. This variable is used the calculate the sequence
identity which is a percentage, or a relative value. If we had
count with zero, our division
always be zero, due to the fact that count would have been an integer.
The ideal input for this script would have equal minimum and maximum
sequence lengths, as it the simulated set would look more like an
alignment with no indels. Something like:
As is pointed out in BPB, this example is more an indication that we are able to use our Python skills to actually make some real code, with some real output. One issue with this example is the fact that we only calculate sequence identity of two sequences at a time. It would be ideal to have sequence identity between all simulated sequences. This would require much more code, of course a good educational step, but it is something that can be easily obtained with classes and we will see this later on. With this entry, we finished our Section 4 and we will start Section 5 with Python’s dictionaries, moving to FASTA files and classes next.