Zientzilaria

Covering the bioinformatics niche and much more

Splitting a Multiple Sequence FASTA File, Making It Better


In the last entry we saw how to split a multiple-sequence FASTA file with our FASTA module. That script saved one sequence per file, and each file name was identical to the original except for a number added to the end. Given the file mysequences.fa, the script would save mysequences.fa1, mysequences.fa2, …, mysequences.faN. This works, but we can do better. Ideally we would ask the user for a prefix or a filename for the output. Short of that, it is already an improvement to put the number before the file tag (the extension), and we can do that in a couple of lines. First we need to separate the tag from the actual filename. We could take a longer route: use the find method to locate the dot and then slice the string into the name and the tag.
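
For comparison, the longer route with find might look like the sketch below. The file name is hard-coded purely for illustration, and the sketch assumes the name contains a dot.

```python
# Sketch of the "longer route": locate the dot with find, then slice.
name = 'mysequences.fa'   # example file name, hard-coded for illustration

dot = name.find('.')      # index of the first dot (-1 if there is none)
filename = name[:dot]     # everything before the dot
tag = name[dot + 1:]      # everything after the dot

print(filename + ' ' + tag)
```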

But we can use split instead:

file = sys.argv[1]
temp = file.split('.')
filename = temp[0]
tag = temp[1]
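
To make the result of split concrete, here is a small sketch with hard-coded example names (they are not part of the script). It also shows a caveat: a name with more than one dot yields more than two items, so temp[1] is no longer the tag, and a name without a dot yields a single item, so temp[1] would raise an IndexError.

```python
# Example names are hard-coded for illustration only.
one_dot = 'mysequences.fa'.split('.')
two_dots = 'my.sequences.fa'.split('.')
no_dot = 'mysequences'.split('.')

print(one_dot)    # ['mysequences', 'fa']
print(two_dots)   # ['my', 'sequences', 'fa']
print(no_dot)     # ['mysequences']
```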

We split the original string at the dot: the first item of the returned list is the file name and the second item is the tag. Now we just need to change the output code a little and we are done. The final script is:

#!/usr/bin/env python
import sys
import fasta
file = sys.argv[1]
temp = file.split('.')
filename = temp[0]
tag = temp[1]
sequences = fasta.read_fasta(open(file, 'r').readlines())
count = 1
for i in sequences:
  #this is the only different line 
  output = open(filename+'_'+str(count)+'.'+tag, 'w')
  output.write(i.name+'\n')
  output.write(i.sequence)
  count += 1
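
As a possible next step, the name-building logic could be moved into a small helper that also accepts the user-supplied prefix mentioned at the start. The sketch below is only one way to do it: output_name and its prefix parameter are hypothetical names, not part of our FASTA module. It swaps split for os.path.splitext from the standard library, which cuts at the last dot only and returns an empty extension when the name has no dot.

```python
import os.path


def output_name(path, count, prefix=None):
    """Build the name of the count-th output file (hypothetical helper)."""
    # splitext cuts at the last dot only; ext keeps the dot ('.fa'),
    # or is '' when the name has no extension.
    root, ext = os.path.splitext(path)
    if prefix is not None:
        # A user-supplied prefix replaces the original file name.
        root = prefix
    return root + '_' + str(count) + ext


print(output_name('mysequences.fa', 1))          # mysequences_1.fa
print(output_name('my.sequences.fa', 2))         # my.sequences_2.fa
print(output_name('mysequences.fa', 3, 'gene'))  # gene_3.fa
```

In the script, the open call inside the loop could then use output_name(file, count), with the prefix read from a second command-line argument if one is given.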