Zientzilaria

Covering the bioinformatics niche and much more

Alternative Methods to Split a FASTA File - Updated (Again)

| Comments

As Daniel didn’t enlightened us on how to use csplit, I am posting several ways on how to split a multiple sequence FASTA file. This post gets out of our focus (if you haven’t noticed, our focus here is Python, and maybe suffers from the NIH effect. Not invented here. We will be back with our normal programming after this. We start with our “widely” available option csplit:

csplit -kf seq file '%^>%' '/^>/' '{*}'

where k tells the program to keep files in case of an error, f sets a prefix for the output files (otherwise xx00 is used) and two regex patterns are used: between %s is the one to skip and between /s is the one to copy up to but not including the line. In {} is the number of times we want to have the previous patterns repeated, a * meaning as many time as possible. Easy, eh? Now, we can check how perl fares. I am not a perl monger so I googled it and I found an one-liner. Quoting the original site: Split a multi-sequence FastA file into individual files named after their description lines.

perl -ne 'BEGIN{$/=">"}if(/^\s*(\S+)/){open(F,">$1")||warn"$1 write failed:$!\n";chomp;print F ">", $_}'

and I am not daring to explain this one. But cariaso did and also sent a better version of it

1
2
3
4
5
6
7
8
9
10
# This magic variable makes perl read lines that end with > 
# instead of a newline n $/=">"; while (<>) { 
  \# foreach line in the input files 
  if (/\^\>(w+)/) {
      \# grab the first word of text 
      open(F,"\>$1") ||
      # open a file named that word
       warn "$1 write failed:$!n"; chomp;
      # strip off the \> at the end print F "\>", $\_; \#
print >text to the file } }

You can also do in Python, without any OO programming in 5 lines:

1
2
3
4
5
file = open('myfile').readlines()
str = ''.join(file)
temp = str.split('>')
for i in temp:
  print '>' + i

that can be run with no difficulties from the interactive console. And which cariaso also corrected and improved

1
2
3
4
5
6
fileobj = open("myfile.fasta")
ignore = fileobj.read(1)
text = fileobj.read()
records = text.split("\>")
for i in records:
  print '\>' + i

along the offer of a nicer approach

1
2
3
4
5
6
7
8
9
10
11
12
def eachfasta(fileobj):
  sofar = fileobj.readline()
  for line in fileobj:
      if '>' == line (bracket zero close bracket):
          yield sofar
          sofar = line
      else:
          sofar += line
          yield sofar
fileobj = open("myfile.fasta")
for i in eachfasta(fileobj):
  print i

Also Joe wanted to dip in and sent his approach:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
# f6smod.py 10-26-07 jto 
# purpose: modify program by Michael Cariaso, given in Paolo Nuin's
blog:
# http://python.genedrift.org 
import sys
fileobj = open('fasta.seq')
ignore = fileobj.read(1)
# The above line removes the first char from the file object. 
# It can also be used to check that the first char is a '>' 
if ignore != '>':
  print "The first character in the supposed FASTA file is: " + ignore
  print "not '>', so sys.exit() is being invoked."
  sys.exit()
text = fileobj.read() records = text.split('\>')
# Here, rather than use the for loop to just print out the sequences, it 
# is used to store them in a list. After that they can be printed out, or 
# stored in separate files, or or further split into header line and sequence 
# (using the carriage return at the end of the header file).
seqlist = []
listcount = 0 \# store each header/sequence in a list 
for i in records:
  i = '>' + i
  seqlist.append(i)
  listcount += 1
# print the list 
for seq in seqlist:
  print seq \# Split into header line and sequence, and make the sequence a single string. 
for seq in seqlist:
  splitCR = seq.split('\n')
  print "header: " + splitCR[0]
  sequence = ''.join(splitCR[1:])
  print "sequence: "
  print sequence

OK, so three five six more options available. I am pretty sure sed and awk would be able to do it glued by a bash script, but I am far of being an expert in sed or awk. If you know how to do it, leave a comment.