August 6, 2019

FROM BLAST TO BEAST

Building Phylogenetic Trees from Sequence Data

The NCBI BLAST web service is awesome, but it can be difficult to set your exact algorithm parameters and parse the output to use in evolutionary analyses. In this tutorial, I’ll be taking you from a single sequence of interest all the way through construction of a phylogenetic tree using BEAST2. I’m not an expert on this by any means, but this is just the tutorial I would have liked to read as I started on this project. Please contact me with any feedback. (Note: This tutorial is for nucleotide sequences. For protein sequences, see the section at the end.)

System Requirements

I’m running Mac OS Mojave with Python 3 and BEAST2. I use Atom with the Hydrogen package so I can easily run lines as in a Jupyter notebook.

Installing Biopython

I’ll be using the Biopython package to parse my BLAST results and automate things. Install with:

> pip install biopython

And import with:

from Bio import ... as ...

Running BLAST online with Biopython

The next step is to create a FASTA file or a string with the sequence you want to search in it. Read the file in (or assign the string to) a variable. We call qblast() in the Bio.Blast.NCBIWWW module, which takes the type of search, the online database, and the sequence:

from Bio.Blast import NCBIWWW
result = NCBIWWW.qblast(“blastn”, “nt”, sequence)

In this case, we’re running a BLASTn search on the nucleotide database. The other abbreviations can be found in the BLAST guide (think blastp, tblastx, etc.). You can also change the format of the results. The default format is XML, which we will use below.

Parsing BLAST output

Once we have our BLAST results, we save the XML file. This is mostly so we have a file to go back to and don’t need to run the BLAST search again in case of disaster.

with open ('results.xml', 'w') as save_file:
    blast_result  = result.read()
    save_file.write(blast_result)

Then, we will import the SearchIO and SeqIO submodules and use them to parse the XML file and create a FASTA file. (Note: There is also a BLAST-specific XML parser in biopython, but it creates a BLAST record class, which isn’t what we want here.)

from Bio import SearchIO
from Bio import SeqIO

blast_result = SearchIO.read('results.xml', 'blast-xml')
records = []
for hit in blast_result:
    records.append(hit[0].hit)
    SeqIO.write(records, 'results.fa', 'fasta')

Converting using SeqMagick

Unfortunately, Biopython doesn’t seem to have an easy way to create a Nexus file, which is a tree file for BEAST. Instead, we can use the package seqmagick. (Documentation here.) To install, just use:

> pip install seqmagick

Seqmagick has a helpful convert function which automatically parses several file extensions including .fa (FASTA) and .nex (Nexus). For Nexus files, the –alphabet flag is needed to specify DNA, RNA, etc. We run the following line in Terminal:

> seqmagick convert --alphabet dna "results.fa" "results.nex"

If everything goes well, you’ll now have a nice clean nexus file like the one below. Make sure all the aligned sequences are the same length.

Downloading and installing BEAST

(Note: Most of this section I learned from the book Bayesian Evolutionary Analysis with BEAST 2, by Alexei J. Drummond and Remco R. Bouckaert. I’d recommend reading at least the first 5 chapters for anyone unfamiliar with Bayesian analysis for trees.)

BEAST can be downloaded from the website here. The installation is simple, but once its complete, make sure you also have the Bayesian Evolutionary Analysis Utility, or BEAUti, installed as well.

Setting model parameters with BEAUti

First, load the Nexus file you created using seqmagick into BEAUti by selecting File > Import Alignment. In my case, I had only one partition; however, for protein sequences multiple partitions (coding/noncoding, etc.) may appear. If you are working with multiple sample dates, you can include that information in the Tip Dates tab; however, phage and bacterial evolution rates are slow enough that we will treat all samples as contemporaneous.

Select the Site Models tab and set the Gamma Category Count parameter to 4 (an appropriate range is usually 4-6). Now select a model from the Subst. Model dropdown menu that fits the data. In this case, I’ll be using HKY.

To set the clock model, go to the corresponding tab and select the model that best fits your data. If you’re not sure how clock-like the rate of evolution is, run the analysis with a random local clock model, which considers each branch in the tree with a separate clock rate.

Select the Priors tab. In this case, I will be using a Yule model as my data comes from different species. This is a pretty complex process which I won’t cover in detail here, but the BEAST book and tutorials are helpful!

The MCMC tab determines the general parameters for the BEAST run. The chain length gives the number of steps the Monte Carlo algorithm will make before finishing. I set this to 1,000,000 to reduce the run time. (Note: If you’re not sure about your prior choices, doing a run with a chain length as short as 10,000 can give you an idea of the results, especially if you’re running on an older computer.)

Save the file as “[filename].xml”. We are now ready to run the file through BEAST.

Running BEAST

Congrats, we’ve almost made it to the end of this tutorial! Now just open BEAST, select the XML file you created, and hit run. No need to change the random number seed etc. A log window will pop up and run for a few minutes. Once it’s done, you’ll see a list of parameters like this:

To visualize the tree, you can use DensiTree or FigTree, both of which are available online or come with the BEAST download. Just open the program and import the .trees file BLAST spit out, and you’ll get something like this:

I like checking my results in DensiTree quickly to see if the parameters I chose were reasonable. For more information on visualizing trees, see the DensiTree manuals here.

I won’t go into the complete analysis of BEAST results, but Tracer (found here) is useful for examining the underlying distributions and more specific descriptive statistics about your tree. Good luck!

Note on protein sequences

Protein sequences can be downloaded from BLASTp or BLASTx results as an XML, converted to a FASTA with the script above, and converted into a multiple alignment using COBALT (here). COBALT Nexus files can then be directly inputted into BEAUTi. Yes, this is much easier than the nucleotide route. Yes, I found this after doing all the above work and was very frustrated. But hey! We did some learning along the way.

built by charlotte merzbacher