Zientzilaria

Covering the bioinformatics niche and much more

Obtaining Overrepresented Motifs in DNA Sequences, Part 11

| Comments

After a long hiatus we are (almost) back on track in order to get our scripts to determine overrepresented motifs in DNA sequences. Last time we checked we defined the “best” factorial function in Python

1
2
3
4
5
def fac_01(n):
    result = 1
    for i in xrange(2, n+1):
        result *= i
    return result

and Andrew Dalke sent a couple of links pointing out to a binomial calculation function, one of them is below

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
\# This file contains the Python code from Program 14.10 of
# "Data Structures and Algorithms with Object-Oriented Design Patterns in Python" 
# by Bruno R. Preiss. 
## Copyright (c) 2003 by Bruno R. Preiss, P.Eng. All rights reserved. 
## http://www.brpreiss.com/books/opus7/programs/pgm14_10.txt # 

def binom(n, m):
  b = [0] * (n + 1)
  b[0] = 1
  for i in xrange(1, n + 1):
      b[i] = 1
      j = i - 1
      while j > 0:
          b[j] += b[j - 1]
          j -= 1
  
  return b[m]

There is a similar implementation in the Python Cookbook online and it is clearly more convenient using this function than actually coding to calculate an identical value using factorials. But anyway, let’s see how a function that calculates one binomial coefficient from the Hypergeometric Distribution (HD) by using the factorial function. Later we benchmark both methods. Each binomial coefficient can be expanded as image!%7D) and the HD has three of them. From the Wikipedia “In probability theory and statistics, the hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.” In the next post we will define each term of the HD for the motifs case. In each binomial expansion we would have to calculate three factorial values, nine in total. With the binomial function, only three values need to be calculated. So, using the factorial function we would need to code something like this in order to calculate one of the binomials

1
2
3
4
5
6
7

#let's say the motif quorum in the foreground is 7
#and the total number of sequences is 3000 
#we won't touch the other required values 
fore = 7 total = 3000
hd = fac_01(3000) / (fac_01(7) * fac_01(2993)
print hd

Next, we will benchmark and see if there is an advantage to either method.