August 6, 2019
FROM BLAST TO BEAST
Building Phylogenetic Trees from Sequence Data
The NCBI BLAST web service is awesome, but it can be difficult to
set your exact algorithm parameters and parse the output to use in
evolutionary analyses. In this tutorial, I’ll be taking you from a
single sequence of interest all the way through construction of a
phylogenetic tree using BEAST2. I’m not an expert on this by any
means, but this is just the tutorial I would have liked to read as I
started on this project. Please contact me with any feedback.
(Note: This tutorial is for nucleotide sequences. For protein
sequences, see the section at the end.)
System Requirements
I’m running Mac OS Mojave with Python 3 and BEAST2. I use Atom with
the Hydrogen package so I can easily run lines as in a Jupyter
notebook.
Installing Biopython
I’ll be using the Biopython package to parse my
BLAST results and automate things. Install with:
> pip install biopython
And import with:
from Bio import ... as ...
Running BLAST online with Biopython
The next step is to create a FASTA file or a string with the
sequence you want to search in it. Read the file in (or assign the
string to) a variable. We call qblast() in the Bio.Blast.NCBIWWW
module, which takes the type of search, the online database, and the
sequence:
from Bio.Blast import NCBIWWW
result = NCBIWWW.qblast(“blastn”, “nt”, sequence)
In this case, we’re running a BLASTn search on the nucleotide
database. The other abbreviations can be found in the BLAST guide
(think blastp, tblastx, etc.). You can also change the format of the
results. The default format is XML, which we will use below.
Parsing BLAST output
Once we have our BLAST results, we save the XML file. This is mostly
so we have a file to go back to and don’t need to run the BLAST search
again in case of disaster.
with open ('results.xml', 'w') as save_file:
blast_result = result.read()
save_file.write(blast_result)
Then, we will import the SearchIO and SeqIO submodules and use them
to parse the XML file and create a FASTA file. (Note: There is also a
BLAST-specific XML parser in biopython, but it creates a BLAST record
class, which isn’t what we want here.)
from Bio import SearchIO
from Bio import SeqIO
blast_result = SearchIO.read('results.xml', 'blast-xml')
records = []
for hit in blast_result:
records.append(hit[0].hit)
SeqIO.write(records, 'results.fa', 'fasta')
Converting using SeqMagick
Unfortunately, Biopython doesn’t seem to have an easy way to create a
Nexus file, which is a tree file for BEAST. Instead, we can use the
package seqmagick. (Documentation here.) To install, just use:
> pip install seqmagick
Seqmagick has a helpful convert function which automatically parses
several file extensions including .fa (FASTA) and .nex (Nexus). For
Nexus files, the –alphabet flag is needed to specify DNA, RNA, etc.
We run the following line in Terminal:
> seqmagick convert --alphabet dna "results.fa" "results.nex"
If everything goes well, you’ll now have a nice clean nexus file
like the one below. Make sure all the aligned sequences are the same
length.
Downloading and installing BEAST
(Note: Most of this section I learned from the book Bayesian
Evolutionary Analysis with BEAST 2, by Alexei J. Drummond and Remco R.
Bouckaert. I’d recommend reading at least the first 5 chapters for
anyone unfamiliar with Bayesian analysis for trees.)
BEAST can be downloaded from the website here. The installation is
simple, but once its complete, make sure you also have the Bayesian
Evolutionary Analysis Utility, or BEAUti, installed as well.
Setting model parameters with BEAUti
First, load the Nexus file you created using seqmagick into BEAUti by
selecting File > Import Alignment. In my case, I had only one
partition; however, for protein sequences multiple partitions
(coding/noncoding, etc.) may appear. If you are working with multiple
sample dates, you can include that information in the Tip Dates tab;
however, phage and bacterial evolution rates are slow enough that
we will treat all samples as contemporaneous.
Select the Site Models tab and set the Gamma Category Count parameter
to 4 (an appropriate range is usually 4-6). Now select a model from
the Subst. Model dropdown menu that fits the data. In this case, I’ll
be using HKY.
To set the clock model, go to the corresponding tab and select the
model that best fits your data. If you’re not sure how clock-like the
rate of evolution is, run the analysis with a random local clock
model, which considers each branch in the tree with a separate clock
rate.
Select the Priors tab. In this case, I will be using a Yule model as
my data comes from different species. This is a pretty complex
process which I won’t cover in detail here, but the BEAST book and
tutorials are helpful!
The MCMC tab determines the general parameters for the BEAST run. The
chain length gives the number of steps the Monte Carlo algorithm will
make before finishing. I set this to 1,000,000 to reduce the run
time. (Note: If you’re not sure about your prior choices, doing a
run with a chain length as short as 10,000 can give you an idea of
the results, especially if you’re running on an older computer.)
Save the file as “[filename].xml”. We are now ready to run the file
through BEAST.
Running BEAST
Congrats, we’ve almost made it to the end of this tutorial! Now just
open BEAST, select the XML file you created, and hit run. No need to
change the random number seed etc. A log window will pop up and run
for a few minutes. Once it’s done, you’ll see a list of parameters
like this:
To visualize the tree, you can use DensiTree or FigTree, both of
which are available online or come with the BEAST download. Just open
the program and import the .trees file BLAST spit out, and you’ll get
something like this:
I like checking my results in DensiTree quickly to see if the
parameters I chose were reasonable. For more information on
visualizing trees, see the DensiTree manuals here.
I won’t go into the complete analysis of BEAST results, but Tracer
(found here)
is useful for examining the underlying distributions and
more specific descriptive statistics about your tree. Good luck!
Note on protein sequences
Protein sequences can be downloaded from BLASTp or BLASTx results as
an XML, converted to a FASTA with the script above, and converted into
a multiple alignment using COBALT (here). COBALT Nexus files can then
be directly inputted into BEAUTi. Yes, this is much easier than the
nucleotide route. Yes, I found this after doing all the above work
and was very frustrated. But hey! We did some learning along the
way.
built by charlotte merzbacher