Zientzilaria

Covering the bioinformatics niche and much more

Still Simulation

| Comments

We use our last code as a starting point in order to generate some real information from our simulated sequence sets. Our approach here will be the same: functions to do all the work for us and a very simple main code. We also reuse some code with applied before to count the nucleotides. Let’s see the code, discussion just after it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
#!/usr/bin/env python

import random
import sys

def simulate_sequence(length):
    dna = ['A', 'C', 'G', 'T']
    sequence = ''
    for i in range(length):
        sequence += random.choice(dna)
    return sequence

def nucleotide_percentage(sequence):
    print str(sequence.count('A')) + ' As ',
    print str(sequence.count('C')) + ' Cs ',
    print str(sequence.count('G')) + ' Gs ',
    print str(sequence.count('T')) + ' Ts'

def sequence_identity(set):
    iden = []
    count = 0.0
    for x in range(len(set)-1):
    print str(x), str(x+1)
        for n in range(len(set[x])):
            if set[x][n] == set[x+1][n]:
                count += 1
        iden.append(count/len(set[x]))
        count = 0.0
    return iden

setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])

sequenceset = []
for i in range(setsize):
    rlength = random.randint(minlength, maxlength)
    sequenceset.append(simulate_sequence(rlength))

identity = sequence_identity(sequenceset)

for i in range(len(sequenceset)):
    print sequenceset[i]
    if i < len(sequenceset)-1:
        print 'sequence identity to next sequence : ' + str(identity[i])
    nucleotide_percentage(sequenceset[i])
    print

Well, not many new things here. The main one might the be the assignment of the variable count, which receives initially a value of 0.0, which in this case is a float. This variable is used the calculate the sequence identity which is a percentage, or a relative value. If we had initialized count with zero, our division count/len(set(x) would always be zero, due to the fact that count would have been an integer. The ideal input for this script would have equal minimum and maximum sequence lengths, as it the simulated set would look more like an alignment with no indels. Something like:

1
$>python simulation.py 10 30 30

As is pointed out in BPB, this example is more an indication that we are able to use our Python skills to actually make some real code, with some real output. One issue with this example is the fact that we only calculate sequence identity of two sequences at a time. It would be ideal to have sequence identity between all simulated sequences. This would require much more code, of course a good educational step, but it is something that can be easily obtained with classes and we will see this later on. With this entry, we finished our Section 4 and we will start Section 5 with Python’s dictionaries, moving to FASTA files and classes next.