Zientzilaria

Covering the bioinformatics niche and much more

Hands on Code: Sequences and Strings - Part I

As pointed out in Beginning Perl for Bioinformatics, a large percentage of bioinformatics procedures deal with strings, especially DNA and amino acid sequence data. As is widely known, DNA is composed of four different nucleotides (A, C, T and G), and proteins can contain up to 20 different amino acids. Each of these elements has a letter of the alphabet assigned to it. In the DNA case, some letters stand for more than one nucleotide that can be present at a given sequence position, such as N for any nucleotide (click here for more).

So, just as the amino acid is the basic building block of proteins (AKA polypeptides), strings containing sequences are our most basic building block, the one on which all the bioinformatics magic will work. In Perl, a string is usually stored in a scalar variable, written with a dollar sign in front of the variable name, like this: $sequence. Python is dynamically typed, meaning variable types are determined by the interpreter at run time. This means that the value after the equal sign tells the interpreter what type of variable you are creating. So in Python, if you want to store a DNA sequence, you can just enter

mydna="ACGTACGTACGTACGTACGTACGT"
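To see the dynamic typing at work, here is a minimal sketch that asks the interpreter which type it picked for each variable. The name seq_length is only an illustration and does not appear elsewhere in this post.

```python
# The interpreter infers each variable's type from the value assigned to it.
mydna = "ACGTACGTACGTACGTACGTACGT"
print(type(mydna).__name__)       # str

# seq_length is a hypothetical name used only for this illustration.
seq_length = len(mydna)           # len() returns the number of characters
print(type(seq_length).__name__)  # int
print(seq_length)                 # 24
```

No type was declared anywhere: the quoted value made mydna a string, and the result of len() made seq_length an integer.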

A quick note: Python can be used interactively, from the interpreter's command line, or by running previously saved scripts. I will try to use the latter in the code examples.

OK, we are ready to create our first Bioinformatics Python Hello World script. Let’s take the sequence above and print it on the screen. The first line tells the operating system which interpreter to use and where to find it:

#! /usr/bin/env python

Next we will create the variable myDNA and assign the corresponding sequence to it:

myDNA = "ACGTACGTACGTACGTACGTACGT"

And finally, we will print the contents of the variable to the screen:

print(myDNA)

As mentioned before, Python mandates that blocks of code be indented, but our final script has no nested blocks, so no indentation is needed:

#!/usr/bin/env python
myDNA = "ACGTACGTACGTACGTACGTACGT"
print(myDNA)

The first line tells your operating system that this is a Python script and to run it with the Python interpreter that /usr/bin/env finds on your PATH; line two declares a variable called myDNA and assigns the sequence string to it; and the last line simply outputs this variable to the screen. That simple! To try this (extremely simple) script, copy and paste the code above into your favourite text editor and save the file with a .py extension (recommended but not necessary), for example as code_01.py. Then, as long as you have Python installed, just open a shell and type on the command line:

> python code_01.py
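Going back to the DNA alphabet described at the start of this post, here is a small sketch of how the same string could be checked for letters other than A, C, G and T. The helper is_unambiguous_dna and the constant DNA_ALPHABET are hypothetical names made up for this illustration; treat it as one possible approach rather than the standard way to do it.

```python
#!/usr/bin/env python
# Hypothetical extension of the script above; is_unambiguous_dna and
# DNA_ALPHABET are illustrative names, not part of the original script.

DNA_ALPHABET = set("ACGT")  # the four unambiguous nucleotide letters


def is_unambiguous_dna(sequence):
    """Return True if every character of sequence is A, C, G or T."""
    # Upper-case first so that "acgt" is accepted as well.
    return set(sequence.upper()) <= DNA_ALPHABET


myDNA = "ACGTACGTACGTACGTACGTACGT"
print(is_unambiguous_dna(myDNA))        # True
print(is_unambiguous_dna("ACGTNACGT"))  # False, N is an ambiguity code
```

Comparing sets keeps the check to a single line: the sequence passes only if the set of its letters is a subset of the allowed alphabet.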