We come to the penultimate part of our restriction enzyme site finder.
Only a couple of pieces of the puzzle are still missing. First,
the most important: the function that searches for the sites, using
regex patterns.
We use the IUPAC dictionary created previously to translate the nucleotide entries in the restriction enzyme file. The function also receives three values: the input name of the selected enzyme, the dictionary with all the enzymes and their sites, and the sequence to search. We could easily remove one of those, but let’s leave it there.
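To make the inputs concrete, here is a minimal sketch of the data the function works with. The names IUPAC, ENZYMES, enzyme_name and sequence are our own assumptions for illustration, and so is the choice to store each IUPAC code as a ready-made regex fragment; the dictionaries built earlier in this series may be laid out differently.

```python
# Assumed layout: each IUPAC nucleotide code maps to a regex fragment.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

# Assumed layout: enzyme name -> recognition site written in IUPAC codes.
ENZYMES = {
    "EcoRI": "GAATTC",   # no ambiguous letters
    "HincII": "GTYRAC",  # Y and R make this site degenerate
}

# The three values the function receives (illustrative values only).
enzyme_name = "HincII"
sequence = "AAGTCGACTTGTTAACGGGTCAACAA"

site = ENZYMES[enzyme_name]
print(site)                 # the site still in IUPAC letters
print(IUPAC["Y"], IUPAC["R"])  # the regex fragments for the ambiguous letters
```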
First we get the site
from the dictionary and initialize an empty string to receive the pattern
and an empty list to receive the positions. We will see why we don’t need
an empty list to store the found sites. We then iterate over the site
and build the pattern from the dictionary value for each letter of the
site (the letter is the dictionary key). Once the pattern is built, we
compile the regex and use findall to find every occurrence of the site in
the sequence. As we have already seen, findall returns a list with all
the matches for that particular regex in the string we are searching.
This is pretty handy because some enzymes have degenerate restriction
sites, so the matched strings can differ from one another. And because
findall already returns a list, we don’t need a pre-initialized empty
list for the sites.
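The pattern-building step can be sketched as follows. This is only an illustration under our own assumptions: the small IUPAC table holds just the letters needed here, and the site and sequence are made up so that a degenerate site matches three different strings.

```python
import re

# Assumed subset of the IUPAC table: code -> regex fragment.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]",
}

site = "GTYRAC"  # degenerate site (HincII)
sequence = "AAGTCGACTTGTTAACGGGTCAACAA"  # made-up test sequence

# Build the regex one letter at a time, as described in the text.
pattern = ""
for letter in site:
    pattern += IUPAC[letter]

regex = re.compile(pattern)

# findall returns the matched strings as a new list.
sites = regex.findall(sequence)

print(pattern)  # GT[CT][AG]AC
print(sites)    # ['GTCGAC', 'GTTAAC', 'GTCAAC']
```

Note that findall only reports non-overlapping matches, scanning the sequence from left to right.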
Then we use
finditer to find the exact position of each one of
the sites. Each item it yields is a match object whose span is a tuple
with the start and end positions. In
our case we only need the start position, so in a small loop we iterate
over the temporary variable and append the start value to the positions
list. We have two integers, begin and end, that receive
the values from i.span(), but we only use begin. The function then
returns two lists as a tuple: one for the sites and one for the
positions. If our programming is correct, both lists should have the
same size and be ready to generate the output.
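Putting the steps together, the function could look roughly like the sketch below. The name find_sites, the module-level IUPAC and ENZYMES tables and the test sequence are hypothetical stand-ins, not the original listing. Positions are 0-based, as returned by span().

```python
import re

# Assumed subset of the IUPAC table: code -> regex fragment.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]",
}

# Assumed enzyme table: name -> site in IUPAC codes.
ENZYMES = {"EcoRI": "GAATTC", "HincII": "GTYRAC"}


def find_sites(enzyme_name, enzymes, sequence):
    """Return (sites, positions) for one enzyme in a sequence."""
    site = enzymes[enzyme_name]
    pattern = ""
    positions = []

    # Translate each IUPAC letter of the site into its regex fragment.
    for letter in site:
        pattern += IUPAC[letter]

    regex = re.compile(pattern)

    # findall gives the matched strings directly as a list.
    sites = regex.findall(sequence)

    # finditer yields match objects; span() is a (start, end) tuple.
    for i in regex.finditer(sequence):
        begin, end = i.span()
        positions.append(begin)  # only the start is needed

    return sites, positions


sites, positions = find_sites("HincII", ENZYMES, "AAGTCGACTTGTTAACGGGTCAACAA")
print(sites)      # ['GTCGAC', 'GTTAAC', 'GTCAAC']
print(positions)  # [2, 10, 18]
```

Since findall and finditer walk the sequence in the same way, the two lists line up item by item, which is what the output step relies on.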