Zientzilaria

Covering the bioinformatics niche and much more

Finding Motifs: IUPAC and RegEx, an Approach

| Comments

After a long delay, we are back. Before entering in the next topic, Restriction Enzymes, let’s take a look how to create a regex pattern from user input and the dictionary of IUPAC code for nucleotides. We will use the same dictionary from the previous entry

1
2
3
iupacdict = {'M':'[AC]', 'R':'[AG]', 'W':'[AT]',
'S':'[CG]', 'Y':'[CT]', 'K':'[GT]', 'V':'[ACG]', 'H':'[ACT]',
'D':'[AGT]', 'B':'[CGT]', 'X':'[ACGT]', 'N':'[ACGT]'}

and the same consensus sequence of the GATA3 binding site NNGATARNG This consensus sequence will be provided as a script parameter, along with the filename

1
2
3
4
import sys
import re
sequencefile = open(sys.argv[1], 'r').readlines()
motif = sys.argv[2]

First line reads the sequence file from the user input and second line stores the input motif in a string. Now we have to get the motif string and check for each letter (IUPAC code) and get the correspondent set of nucleotides. For this we will use a loop and a the method get method of Python dictionaries. This method, as its name implies, gets the value of the key in parentheses. Like this

1
2
for n in motif:
  iupacdict.get(n)

If we combine both excerpts above and run using as input the GATA3 model the result would like

1
[ACGT] [ACGT] [AG] [ACGT]

which is five nucleotides short of the motif length. How to correct it? Pretty simple we just add to the dictionary the “regular” nucleotide codes

1
2
3
4
iupacdict = {'A':'A', 'C':'C', 'G':'G', 'T':'T',
'M':'[AC]', 'R':'[AG]', 'W':'[AT]', 'S':'[CG]', 'Y':'[CT]', 'K':'[GT]',
'V':'[ACG]', 'H':'[ACT]', 'D':'[AGT]', 'B':'[CGT]', 'X':'[ACGT]',
'N':'[ACGT]'}

This would solve our “problem” without adding a line of code to the final script. Fastforwarding to creating the regex

1
2
3
4
5
6
7
8
9
mregex = ''
for n in motif:
  mregex += iupacdict.get(n)
  
print mregex #just to check 

tosearch = re.compile(str(mregex))
for i in tosearch.findall(sequencefile):
  print i

Simple and quick. The final output is not really elaborated, but we can improve it. Next time.