Hal: a program for semi-automated phylogenomic analyses


B. Robbertse, J. Reeves, C. Schoch and J.W. Spatafora

Dept. Botany and Plant Pathology

Oregon State University


Hal version 1.1. The current version of Hal is a set of scripts written in Perl v5.8.4 that takes all-versus-all BLASTP text output (from any number of proteomes) as input to identify orthologs, extract orthologous sequences and concatenate them into a super alignment. These steps are achieved by linking existing programs such as Tribe-MCL and ClustalW with Perl scripts. A flow diagram of the method is presented in Figure 1; a more detailed description, along with a phylogenomic analysis of ascomycete fungi, will be published soon. Additional modified scripts were also developed to parse sets of TBLASTN and BLASTX searches in order to extract orthologous sequences from unannotated genomes and EST libraries. Figure 2 is an example of the phylogenetic relationships among 17 taxa inferred from a super alignment produced by Hal from annotated genomes. Several fungal genomes are being sequenced, and it is our goal to expand the number of taxa in the analysis to incorporate all sequenced genomes in the kingdom Fungi.


Identification of orthologs. Identification of orthologs begins with an all-versus-all BLASTP search (Altschul et al., 1990) performed on annotated genomes. BLAST results are then analyzed with the program Tribe-MCL (Enright et al., 2002; van Dongen, 2000), which uses Markov Clustering (MCL): it creates a similarity matrix from E-values and then clusters proteins into related groups. The main parameter that influences the size of a cluster in Tribe-MCL is the inflation value, which can be adjusted from 1.1 (producing fewer clusters with more members) to 5.0 (producing more clusters with fewer members). By using a range of settings for the inflation value (1.1 to 5.0) and filtering for clusters that consist of exactly one protein from each proteome in the analysis, we are able to find putative orthologs. Only clusters of orthologs whose best hits are to proteins within the same cluster are considered for phylogenetic analysis. The protein sequences used for the phylogenetic analysis are those reported for each protein when compared to its best hit in the BLASTP results.
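As a rough illustration of the filtering step described above, the following Python sketch keeps only clusters that contain exactly one protein from each proteome and whose members all have their best BLASTP hit inside the same cluster. It is not Hal's Perl code. The data structures (a list of clusters, a protein-to-proteome map and a best-hit map), the protein names and all function names are assumptions made for the example, as is the reading that every member's best hit must fall inside the cluster.

```python
def is_single_copy(cluster, proteome_of, proteomes):
    """True if the cluster holds exactly one protein from every proteome."""
    seen = [proteome_of[p] for p in cluster]
    return len(seen) == len(proteomes) and set(seen) == set(proteomes)


def best_hits_stay_inside(cluster, best_hit):
    """True if each member's best (non-self) hit is another cluster member."""
    members = set(cluster)
    return all(best_hit.get(p) in members for p in cluster)


def putative_orthologs(clusters, proteome_of, best_hit):
    """Apply both filters and return the surviving clusters."""
    proteomes = set(proteome_of.values())
    return [
        c for c in clusters
        if is_single_copy(c, proteome_of, proteomes)
        and best_hits_stay_inside(c, best_hit)
    ]


# Hypothetical toy input: three proteomes (A, B, C) and two clusters.
proteome_of = {"A1": "A", "A2": "A", "A3": "A",
               "B1": "B", "B2": "B", "C1": "C", "C2": "C"}
clusters = [["A1", "B1", "C1"],          # one protein per proteome
            ["A2", "A3", "B2", "C2"]]    # two proteins from proteome A
best_hit = {"A1": "B1", "B1": "A1", "C1": "A1",
            "A2": "A3", "A3": "A2", "B2": "A2", "C2": "A2"}

orthologs = putative_orthologs(clusters, proteome_of, best_hit)
```

In this toy input only the first cluster survives, because the second contains two proteins from the same proteome. In practice the same filter would be applied to the clusters obtained at each inflation value.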


Alignments and model assignment. Using Perl scripts (http://spatafora.science.oregonstate.edu/node/view/145), identified sequences are extracted from the BLASTP report, concatenated and parsed into FASTA format, and each cluster of orthologous protein sequences is aligned with ClustalW (Thompson et al., 1997). All aligned orthologous clusters are concatenated into a single alignment and converted into PHYLIP format. ProtTest v1.2.6 (Abascal et al., 2005) is used to estimate the best model of evolution for each set of orthologs and for the concatenated super alignment. Gap-containing columns are excluded from phylogenetic analyses.
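The concatenation and gap-removal steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hal's implementation: each per-cluster alignment is assumed to be a dictionary mapping a taxon name to its aligned sequence, "-" is assumed to be the gap character, and the function names, taxon names and sequences are made up for the example. The output follows the sequential PHYLIP layout with names padded to 10 characters.

```python
def concatenate(alignments, taxa):
    """Join per-cluster alignments end to end, one row per taxon."""
    return {t: "".join(aln[t] for aln in alignments) for t in taxa}


def drop_gap_columns(alignment, gap="-"):
    """Remove every column in which at least one taxon has a gap."""
    taxa = list(alignment)
    columns = zip(*(alignment[t] for t in taxa))
    kept = [col for col in columns if gap not in col]
    return {t: "".join(col[i] for col in kept) for i, t in enumerate(taxa)}


def to_phylip(alignment):
    """Format as sequential PHYLIP: header line, then 10-character names."""
    length = len(next(iter(alignment.values())))
    lines = [f" {len(alignment)} {length}"]
    for name, seq in alignment.items():
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical toy alignments for two ortholog clusters and three taxa.
aln1 = {"taxA": "MK-L", "taxB": "MKAL", "taxC": "MRAL"}
aln2 = {"taxA": "GTW", "taxB": "G-W", "taxC": "GSW"}

super_alignment = concatenate([aln1, aln2], ["taxA", "taxB", "taxC"])
clean = drop_gap_columns(super_alignment)
phylip_text = to_phylip(clean)
```

Here the seven concatenated columns shrink to five once the two gap-containing columns are removed.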


Phylogenetic analyses. Tree construction can be performed using standard phylogenetic algorithms, including maximum likelihood (ML) with PhyML (Guindon and Gascuel, 2003), maximum parsimony (MP) in PAUP* v10b (Swofford, 2002), and neighbor joining (NJ) distance analyses using PHYLIP v3.65 (Felsenstein, 1981). The model used in ML analyses is chosen as either the best-fit model for the super alignment or the most frequent best-fit model among the individual ortholog sets. Support for the nodes of the phylogenetic tree is estimated using 100 bootstrap replicates.
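Two of the ideas in this paragraph are easy to show in miniature: picking the most frequent best-fit model across ortholog sets, and building nonparametric bootstrap replicates by resampling alignment columns with replacement. The Python sketch below is illustrative only. In the analyses described here the bootstrap is carried out by the phylogenetics programs themselves, and the function names, model labels and toy alignment are placeholders assumed for the example.

```python
import random
from collections import Counter


def most_frequent_model(best_models):
    """Return the model chosen most often across individual ortholog sets."""
    return Counter(best_models).most_common(1)[0][0]


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one replicate)."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {t: "".join(seq[i] for i in picks) for t, seq in alignment.items()}


# Hypothetical per-ortholog model choices (placeholder labels).
best_models = ["WAG+G", "JTT+G", "WAG+G", "WAG+I+G"]
chosen = most_frequent_model(best_models)

# Hypothetical gap-free toy alignment; 100 replicates as in the text.
alignment = {"taxA": "MKLGW", "taxB": "MKLGW", "taxC": "MRLGW"}
rng = random.Random(1)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
```

Each replicate has the same taxa and the same length as the original alignment. A tree would then be inferred from every replicate, and the percentage of replicate trees containing a given node would be reported as its support.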


Figure 1.  A flowchart showing the steps taken to produce a super alignment by linking existing programs (grey ovals) with Perl scripts (yellow boxes), which pass input/output files (white boxes) to the next step.  The parsing of BLASTP output into a super alignment is fully automated and performed by the program Hal, which takes approximately 4 hours to complete the task.


Figure 2.  A tree showing phylogenetic relationships inferred from a super alignment of 781 orthologous amino acid sequences from 17 fungal taxa, analyzed by maximum parsimony. Numbers at the nodes are bootstrap proportions from 100 nonparametric bootstrap replicates.



Abascal, F., et al., 2005. ProtTest: selection of best-fit models of protein evolution. Bioinformatics. 21, 2104-2105.

Altschul, S. F., et al., 1990. Basic local alignment search tool. J Mol Biol. 215, 403-10.

Enright, A. J., et al., 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575-84.

Felsenstein, J., 1981. PHYLIP: Phylogeny inference package (version 3.2). Cladistics. 5, 164-166.

Guindon, S., Gascuel, O., 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52, 696-704.

Swofford, D. L., 2002. PAUP*: Phylogenetic analysis using parsimony (*and other methods). Sinauer Associates, Sunderland, Massachusetts, U.S.A.

Thompson, J. D., et al., 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25, 4876-82.

van Dongen, S., 2000. Graph clustering by flow simulation. University of Utrecht, Utrecht.