Zientzilaria

Covering the bioinformatics niche and much more

The Regular Expression


As mentioned above, regular expressions in Python are provided by the re module, which is the interface to the regular expression engine. The first thing we have to do is tell the interpreter which expression to use and what to do with it. Let’s start with a DNA sequence.

myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'

How do we transcribe it to RNA? Transcription creates a single-stranded RNA molecule from the double-stranded DNA; basically, the final result is a similar sequence with all T’s changed to U’s. So our regular expression has to find all T nucleotides in the above sequence and then replace them. Regular expressions in Python can be compiled into a RegexObject (called a pattern object in Python 3), which offers all the regular expression operations as methods. In our case we need to search and replace, which can be done with the sub() method. According to Python’s Regular Expression HOWTO, sub() “returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string by the replacement replacement. If the pattern isn’t found, string is returned unchanged.” Here, string and replacement are the names of the method’s arguments.

Let’s put everything above into real code. First we need to compile a new RegexObject that will search for all thymines in our sequence. This can be achieved with:

regexp = re.compile('T')

Simple as that. This line of code tells the Python interpreter that our “regular expression” matches every T in our string. Now we have to replace those Ts with Us. In order to do that we just tell the interpreter:

myRNA = regexp.sub('U', myDNA)

Let’s look at the last two lines of code. On the first line we created a new RegexObject, regexp (it could have any name, like any variable), by compiling a pattern that matches every T in our string. On the second line, we assigned our soon-to-be-created RNA sequence to a new string (remember that strings in Python are immutable) and used the sub() method to replace the Ts in our original DNA string with Us. Putting it all together, our transcription code will be

#!/usr/bin/env python
import re
myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'
regexp = re.compile('T')
myRNA = regexp.sub('U', myDNA)
print(myRNA)
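
To check the behaviour of sub() described above, here is a minimal sketch that wraps the same pattern in a small function. The names THYMINE and transcribe are hypothetical and chosen only for illustration. The sketch assumes nothing beyond the standard re module and shows three things: the original string is left untouched, a string without any T comes back unchanged, and the optional count argument limits how many matches are replaced.

```python
import re

# Hypothetical names for illustration; the pattern is the same 'T' used above.
THYMINE = re.compile('T')


def transcribe(dna):
    """Return a new string in which every T of dna is replaced by U."""
    return THYMINE.sub('U', dna)


myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'
myRNA = transcribe(myDNA)

print(myRNA)                 # the transcribed sequence
print(myDNA)                 # the original string is not modified
print(transcribe('ACGACG'))  # no T present, so the string is returned unchanged

# count=1 replaces only the leftmost match.
print(THYMINE.sub('U', myDNA, count=1))

# The module-level function compiles the pattern for us and gives the same result.
print(re.sub('T', 'U', myDNA) == myRNA)
```

For a single fixed character like this one, the plain string method myDNA.replace('T', 'U') would give the same result; the compiled pattern starts to pay off once the expression is more than a literal.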