AlexKnyshov

my GitHub Web Site

Python tutorial 1: Reverse complementing a sequence

Legend

Code block
Output block
Console command block

Tools needed

  • Python 2
  • Biopython
  • Bash

Reverse complementing a sequence

Here is a simple script to reverse complement a sequence using Biopython. I am typing in the sequence for simplicity.

from Bio.Seq import Seq #load the module
testseq = Seq("ATGCATGC") #input sequence
newseq = testseq.reverse_complement() #revcom
print newseq #print the result
GCATGCAT

We can make the script more helpful by loading the sequence from the file and outputting the sequence into another file. We will use the module SeqIO, and because the sequence objects will be created by SeqIO, there is no need for Seq module anymore.

from Bio import SeqIO
fname = "test.fas" #this is our input file name
fhandle = open(fname, "rU") #open file for reading
seqs = SeqIO.parse(fhandle, "fasta") #parse the sequences
fnew = open(fname+".revcom.fas", "w") #open file with modified file name for output
for seq in seqs: #convert sequences within a for loop and write them in the output file
    seq.seq = seq.seq.reverse_complement()
    SeqIO.write(seq, fnew, "fasta")
fnew.close() #close the output file
fhandle.close() #close the input file

You could shorten the script by using with or even giving the parse and write commands file names directly so the biopython is taking care of file opening and closing. For writing however we need to make sure each write call doesn’t erase the previous output:

from Bio import SeqIO
fname = "test.fas" #this is our input file name
seqs = SeqIO.parse(fname, "fasta") #parse the sequences
with open(fname+".revcom.fas", "w") as output_handle: #convert sequences within a for loop and write them in the output file
  for seq in seqs:
    seq.seq = seq.seq.reverse_complement()
    SeqIO.write(seq, output_handle, "fasta")

It could be even better to be able to supply the input file name as a command line parameter, so the script is more generic and can be integrated into other workflows. For this I use the module sys to parse the command line arguments, script file name is the first argument (positional index 0), input file name thus will have index 1.

from Bio import SeqIO
import sys
fname = sys.argv[1] #this is our input file as a command line parameter
seqs = SeqIO.parse(fname, "fasta") #parse the sequences
with open(fname+".revcom.fas", "w") as output_handle: #convert sequences within a for loop and write them in the output file
  for seq in seqs:
    seq.seq = seq.seq.reverse_complement()
    SeqIO.write(seq, output_handle, "fasta")

If you try to run it interactively you will get an error message. Instead, I save the above shown script to a file script1.py and run in terminal like so

python script1.py test.fas

Sometimes it is beneficial to output data to standard output such that you have a greater flexibility. We will need to modify our script as follows and save it to our script1.py file:

from Bio import SeqIO
import sys
fname = sys.argv[1] #this is our input file as a command line parameter
seqs = SeqIO.parse(fname, "fasta") #parse the sequences
for seq in seqs:
    seq.seq = seq.seq.reverse_complement()
    SeqIO.write(seq, sys.stdout, "fasta")

Here I indicate sys.stdout as a handle meaning it will be output to console. In this situation I do not need with statement. Then I run the script as before and see the output:

python script1.py test.fas
>t1
aacaacattaaataactgatcatttccaatagttgacaggatttcttaattcaatacgaa
>t2
acaacattaaataactgatcatttccaatagttgatccaggatttccgaataattcatct
>t3
aatgcgtgagatgttacaataacgttaaatagttggtcatttccaatagttgatccaggg
>t4
gatgttacaataacgttaaatagttggtcatttccaatagttgatccagggttttcatct
>t5
tagttgatcattcccaatagttgttccaggatttcttaattcaattcggataattcatct
>t6
ttactataaaaaaaattattgtaaatgcgtgtgatgtaacaacaacattaaatagttgat
>t7
caattacattatataattgatcatttccaatggttgatccactattcgaatgattcatct
>t8
gatgttacaattacattaaataattggtcactcccaatagttgatcctggatttccatct
>t9
acaattacattgaataactgatcatttctaagggttgatcccggatttcttaattcttct
>t10
tgtgctgttacaattacattgtaaatttgatcatttccgataaatgggcctaattcatct
>t11
catgtgctgttacaattacattgtaaatttgatcatttccgattaattcaatacgaataa
>t12
aaatttgatcatttccgataaatgggcctgggtgtcttaattcaatacgaataatcatct
>t13
cattgtaaatttgatcatttccgataaatgggcctgggtgtcttaattcaatacgaataa
>t14
ttgtagatttgatcatttctaataattggtcctggtcttcttaattcaattcgaataatt

To actually record the output to the file, I do the following:

python script1.py test.fas > test.fas.revcom.fas

Finally, if I have multiple files in the folder, I could run the script in a shell for loop. I am masking the files by extension instead of just putting asterisk, so that other non target files in my folder do not get processed:

for f in *.fas; do python script1.py $f > $f.revcom.fas; done

Done

Last edited on 25 February, 2019