my GitHub Web Site
|
|
|
Here is a simple script to reverse complement a sequence using Biopython. I am typing in the sequence for simplicity.
from Bio.Seq import Seq #load the module
testseq = Seq("ATGCATGC") #input sequence
newseq = testseq.reverse_complement() #revcom
print newseq #print the result
GCATGCAT
We can make the script more helpful by loading the sequence from the file and outputting the sequence into another file. We will use the module SeqIO, and because the sequence objects will be created by SeqIO, there is no need for Seq module anymore.
from Bio import SeqIO
fname = "test.fas" #this is our input file name
fhandle = open(fname, "rU") #open file for reading
seqs = SeqIO.parse(fhandle, "fasta") #parse the sequences
fnew = open(fname+".revcom.fas", "w") #open file with modified file name for output
for seq in seqs: #convert sequences within a for loop and write them in the output file
seq.seq = seq.seq.reverse_complement()
SeqIO.write(seq, fnew, "fasta")
fnew.close() #close the output file
fhandle.close() #close the input file
You could shorten the script by using with
or even giving the parse
and write
commands file names directly so the biopython is taking care of file opening and closing. For writing however we need to make sure each write call doesn’t erase the previous output:
from Bio import SeqIO
fname = "test.fas" #this is our input file name
seqs = SeqIO.parse(fname, "fasta") #parse the sequences
with open(fname+".revcom.fas", "w") as output_handle: #convert sequences within a for loop and write them in the output file
for seq in seqs:
seq.seq = seq.seq.reverse_complement()
SeqIO.write(seq, output_handle, "fasta")
It could be even better to be able to supply the input file name as a command line parameter, so the script is more generic and can be integrated into other workflows. For this I use the module sys to parse the command line arguments, script file name is the first argument (positional index 0), input file name thus will have index 1.
from Bio import SeqIO
import sys
fname = sys.argv[1] #this is our input file as a command line parameter
seqs = SeqIO.parse(fname, "fasta") #parse the sequences
with open(fname+".revcom.fas", "w") as output_handle: #convert sequences within a for loop and write them in the output file
for seq in seqs:
seq.seq = seq.seq.reverse_complement()
SeqIO.write(seq, output_handle, "fasta")
If you try to run it interactively you will get an error message. Instead, I save the above shown script to a file script1.py
and run in terminal like so
python script1.py test.fas
Sometimes it is beneficial to output data to standard output such that you have a greater flexibility. We will need to modify our script as follows and save it to our script1.py
file:
from Bio import SeqIO
import sys
fname = sys.argv[1] #this is our input file as a command line parameter
seqs = SeqIO.parse(fname, "fasta") #parse the sequences
for seq in seqs:
seq.seq = seq.seq.reverse_complement()
SeqIO.write(seq, sys.stdout, "fasta")
Here I indicate sys.stdout as a handle meaning it will be output to console. In this situation I do not need with
statement. Then I run the script as before and see the output:
python script1.py test.fas
>t1
aacaacattaaataactgatcatttccaatagttgacaggatttcttaattcaatacgaa
>t2
acaacattaaataactgatcatttccaatagttgatccaggatttccgaataattcatct
>t3
aatgcgtgagatgttacaataacgttaaatagttggtcatttccaatagttgatccaggg
>t4
gatgttacaataacgttaaatagttggtcatttccaatagttgatccagggttttcatct
>t5
tagttgatcattcccaatagttgttccaggatttcttaattcaattcggataattcatct
>t6
ttactataaaaaaaattattgtaaatgcgtgtgatgtaacaacaacattaaatagttgat
>t7
caattacattatataattgatcatttccaatggttgatccactattcgaatgattcatct
>t8
gatgttacaattacattaaataattggtcactcccaatagttgatcctggatttccatct
>t9
acaattacattgaataactgatcatttctaagggttgatcccggatttcttaattcttct
>t10
tgtgctgttacaattacattgtaaatttgatcatttccgataaatgggcctaattcatct
>t11
catgtgctgttacaattacattgtaaatttgatcatttccgattaattcaatacgaataa
>t12
aaatttgatcatttccgataaatgggcctgggtgtcttaattcaatacgaataatcatct
>t13
cattgtaaatttgatcatttccgataaatgggcctgggtgtcttaattcaatacgaataa
>t14
ttgtagatttgatcatttctaataattggtcctggtcttcttaattcaattcgaataatt
To actually record the output to the file, I do the following:
python script1.py test.fas > test.fas.revcom.fas
Finally, if I have multiple files in the folder, I could run the script in a shell for loop. I am masking the files by extension instead of just putting asterisk, so that other non target files in my folder do not get processed:
for f in *.fas; do python script1.py $f > $f.revcom.fas; done
Done
Last edited on 25 February, 2019