Zientzilaria

Covering the bioinformatics niche and much more

A Script to Simulate DNA Sequence Sets

| Comments

In Beginning Perl for Bioinformatics the chapter that covers simulated mutations on a DNA sequence is quite verbose and the code examples employ some subroutines to do what we have done on the last post. Basically the code example that generates a random DNA sequence is the last one on the chapter, but it was the first one we covered. Also this code example has a twist that our code from the last post does not have, which is it allows you to generate a set of sequences with different length instead of one sequence with fixed length that our script does. So, the script that generates a user defined simulated DNA sequence set is

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

#!/usr/bin/env python

import random
import sys

def simulate_sequence(length):
    dna = ['A', 'C', 'G', 'T']
    sequence = ''
    for i in range(length):
        sequence += random.choice(dna)
    return sequence

setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])

sequenceset = []
for i in range(setsize):
    rlength = random.randint(minlength, maxlength)
    sequenceset.append(simulate_sequence(rlength))

for sequence in sequenceset:
    print sequence

Simple and efficient. First we define a function that generates a simulated DNA sequence from the four nucleotides, again using the random.choice. The sequence length is based on the parameter received by the function. The sequence length is another random number defined in the main body of the script. This random number is generated by random.randint with a range based on the arguments given by the user when running the script. And finally, the number of sequences to be simulated is define by the first parameter. Seventeen lines. Next we will see how to draw some scientific information about the sequences, such as sequence identity and nucleotide frequency.