Zientzilaria

Covering the bioinformatics niche and much more

Section 2: A Nice Summary Script

| Comments

Closing section two, let’s use everything we saw before and write a nice script that will read a sequence file (DNA) and report us of any “errors” and the number of different nucleotides. It is very simple, but a good exercise. Depending on your needs you can easily modify it to check for other characteristics of sequences, even change it to read amino acids sequences. We start with the code, comments coming after it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
#!/usr/bin/env python

import sys
import re

fileentered = True
while fileentered == True:
    filename = raw_input('Please enter a file to check: ')
    if len(filename) >= 1:
        try:
            seqlist = open(filename, 'r').readlines()
        sequence = ''.join(seqlist)
        sequence = sequence.replace('\n', '')
            totalA = sequence.count('A')
            totalC = sequence.count('C')
            totalG = sequence.count('G')
            totalT = sequence.count('T')
            otherletter = re.compile('[BDEFHIJKLMNOPQRSUVXZ]')
        extra = re.findall(otherletter, sequence)
        output = open(filename+'.count', 'w')
        output.write('Count report for file ' + filename + '\n')
        output.write('A = ' + str(totalA) + '\n')
        output.write('C = ' + str(totalC) + '\n')
        output.write('G = ' + str(totalG) + '\n')
        output.write('T = ' + str(totalT) + '\n')
        if len(extra) > 0:
            output.write('Also were found ' + str(len(extra)) + ' errors\n')
            for i in extra:
            output.write(i + ' ')
        else:
        output.write('No error found')
            output.close()
        print 'Result file saved on ' + filename + '.count'
        except:
            print 'File not found. Please try again.'
    else:
        sys.exit()

The main change here is that we use a while loop to control de program flow. Notice that we import sys and re. Basically we ask for an user input, the filename, and depending on the input given we process the file or exit the program. The early exit is done with the sys.exit method which is a shortcut to get out of the script processing. Very handy. If the input is valid we try to open it. If the operation is successful, great, we read the file, count the nucleotides and use a quite scary regular expression to search all the “errors” in our sequence. The regex is compiled with the pattern '[BDEFHIJKLMNOPQRSUVXZ]' which means “match any character in this range”. Take a closer look at the pattern and you will see that is every letter in the alphabet, except for ACGT. And when searching for these “errors” instead of using re.search we use re.findall, which conveniently returns a list with all the substrings found. That’s even more handy. On the final part of the script we take care of the output, opening a file called .count where we print the counts and the errors, if they actually exist. One thing I left from the previous post, is that we need to close the file opened to write. In some cases if the file is not properly closed, errors might occur. So always remember to close the files used for writing/appending, using .close().