Covering the bioinformatics niche and much more

Still on Merging Pfam Alignments ...

One of the things I like about Python and the Python community is the constant effort to make code simple and clear. Tal left a comment on the last post, about merging Pfam alignment sequences, suggesting another approach to our problem. The code is below:

def merge_seqs(data1, data2):
    from itertools import chain, groupby
    format = "%s-%s->%d\n%s%s"
    flist = []
    keyfunc = lambda it: it.name[it.name.find('|') + 1 : it.name.find('/')]
    for it, g in groupby(sorted(chain(data1, data2), key=keyfunc), keyfunc):
        values = list(g)
        if len(values) == 2:
            jname, jseq = values[0].name, values[0].sequence
            kname, kseq = values[1].name, values[1].sequence
            flist.append(format % (jname, kname, len(jseq), jseq, kseq) )

    return flist

The code also uses the itertools module, importing chain and groupby. We already saw chain in the previous post, but groupby is new to us here. groupby was introduced in Python 2.4; it is a function that returns consecutive keys and groups from an iterable. A Python iterable is any object that can return its elements one at a time, for instance in a for loop, while the iterator is the object that actually produces those elements as the loop advances. So, in our case groupby will return each sequence name, as extracted by the lambda function defined just before the loop, together with the group of records that share that name. Usually groupby has this syntax

groupby(iterable[, key])
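To see what this call returns, here is a minimal sketch using a made-up list of words (not data from the post). It assumes nothing beyond the standard library and shows that groupby only groups items that sit next to each other, which is why the input is usually sorted by the same key first.

```python
from itertools import groupby

# Made-up example data; the key is simply the first letter of each word.
words = ["apple", "avocado", "banana", "apricot", "blueberry"]
first_letter = lambda w: w[0]

# Without sorting, a new group starts every time the key changes,
# so the key "a" appears more than once.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, first_letter)]

# Sorting by the same key first places equal keys next to each other,
# so each key appears exactly once.
sorted_groups = [
    (k, list(g))
    for k, g in groupby(sorted(words, key=first_letter), first_letter)
]

print(unsorted_groups)
print(sorted_groups)
```

Each group is itself an iterator, so it is turned into a list here before it is used, just as the code above does with list(g).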

The key is optional, and in our case it is the lambda function. Another function new to us that uses the same lambda function is sorted. As its name hints, sorted returns a new sorted list built from the items of an iterable. The key in this case is not the sorting algorithm itself; it is a function called on each item to produce the value used in the comparisons. Sorting matters here because groupby only groups consecutive items with the same key, so records with the same name must sit next to each other. Basically, in the code above, a lambda function extracts the desired region from each sequence name. chain reads both input lists in one pass, sorted orders the combined records by that region, and groupby then returns each key with its group of records: one record when the sequence appears in only one list, two when it appears in both. After this we just need to check the number of returned values and we have our list of matching sequences.
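The whole pipeline can be tried on toy data. This is only a sketch: the Record type, the sequence names and the sequences themselves are hypothetical stand-ins for whatever the parser from the previous post produces, and the key function is the same slice as the lambda above, written as a named function.

```python
from collections import namedtuple
from itertools import chain, groupby

# Hypothetical stand-in for the parsed alignment records (name + sequence).
Record = namedtuple("Record", ["name", "sequence"])

def keyfunc(rec):
    # Same slice as the lambda in the post: text after '|' (if any) up to '/'.
    return rec.name[rec.name.find("|") + 1 : rec.name.find("/")]

# Invented names and sequences, for illustration only.
data1 = [Record("A4_HUMAN/10-14", "MKVLA"), Record("B2_MOUSE/3-7", "GGSTP")]
data2 = [Record("A4_HUMAN/40-44", "TTPQR"), Record("C9_YEAST/1-5", "LLNDE")]

# chain joins both lists, sorted puts equal keys side by side,
# groupby then yields one (key, group) pair per distinct key.
combined = sorted(chain(data1, data2), key=keyfunc)
grouped = {
    key: [rec.name for rec in group]
    for key, group in groupby(combined, keyfunc)
}

# Keys with exactly two records are the sequences present in both inputs.
paired = [key for key, names in grouped.items() if len(names) == 2]
print(paired)
```

Because Python's sort is stable, records with the same key keep the order in which chain delivered them, so the record from the first list comes first in each pair.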