After a long delay, we are back. Before entering in the next topic, Restriction Enzymes, let’s take a look how to create a regex pattern from user input and the dictionary of IUPAC code for nucleotides. We will use the same dictionary from the previous entry
1 2 3 |
|
and the same consensus sequence of the GATA3 binding site NNGATARNG This consensus sequence will be provided as a script parameter, along with the filename
1 2 3 4 |
|
First line reads the sequence file from the
user input and second line stores the input motif in a string. Now we
have to get the motif string and check for each letter (IUPAC code) and
get the correspondent set of nucleotides. For this we will use a loop
and a the method get
method of Python dictionaries. This method, as
its name implies, gets the value of the key
in parentheses. Like this
1 2 |
|
If we combine both excerpts above and run using as input the GATA3 model the result would like
1
|
|
which is five nucleotides short of the motif length. How to correct it? Pretty simple we just add to the dictionary the “regular” nucleotide codes
1 2 3 4 |
|
This would solve our “problem” without adding a line of code to the final script. Fastforwarding to creating the regex
1 2 3 4 5 6 7 8 9 |
|
Simple and quick. The final output is not really elaborated, but we can improve it. Next time.