Science and technology

Evolution

Evolution — of species, and of the gene families encoding the proteins that do the work of the cell — is at the center of life. By modeling evolution explicitly in the analysis of gene families, we can predict changes in protein function and structure. I used these techniques — called phylogenomics — in the annotation of the human genome reported in Science. (I’m one of over a hundred co-authors on that paper, but my algorithms played a primary role in the precision of the predicted functions provided in the Celera Genomics annotation system.)

The field of phylogenomics involves two related challenges: reconstructing phylogenies (evolutionary trees) for protein superfamilies (such as globins and 7-transmembrane receptors, with multiple variants in individual genomes) and reconstructing species phylogenies (displaying the evolutionary relationships between species) using multiple genes from each genome. My lab primarily worked on the first challenge, as it was directly related to the problem of predicting gene (and protein) function and structure from the protein sequence alone.

Finally, we extended these phylogenomic approaches to include information from protein 3D structure, an approach we called structural phylogenomics.

Machine learning techniques for protein 3D structure and function prediction

When I was a PhD student at UC Santa Cruz, protein 3D structure prediction from the amino acid sequence alone was the hottest challenge for computer scientists like me who wanted to do something relevant in biology. I was fortunate to be in a group headed by a brilliant computer scientist, Dr. David Haussler, who realized we could take methods developed for speech recognition and apply them to the analysis of protein sequences. The machine learning methods we developed in the Haussler lab used hidden Markov models (HMMs) to construct statistical models of protein families and domains, with success at the Second Critical Assessment of Protein Structure Prediction (CASP) competition.

It was during my work on protein structure prediction that I began working on the problem of reconstructing the evolutionary histories of protein superfamilies, and on machine learning techniques to identify functional subfamilies (US Patent No. 6128587A).

I continued this work after completing my PhD, first at Celera Genomics, where the SCI-PHY algorithm I developed for my PhD thesis was used in the functional annotation of the human genome. As a professor at Berkeley, I extended these methods and hosted the PhyloFacts webservers and databases for the scientific community. These included FlowerPower for clustering proteins, the SATCHMO and SATCHMO-JS algorithms for constructing multiple sequence alignments and phylogenetic trees simultaneously using HMMs, PHOG for orthology prediction, subfamily HMM construction, and algorithms for protein functional site prediction (INTREPID, ResBoost and Discern).

Figure: PhyloFacts database construction pipeline (individual algorithms are shown in red text)

**Protein superfamily tree from PhyloFacts database**

Phylogenetic (evolutionary) tree and multiple sequence alignment for beta-adrenergic receptors, produced using the SATCHMO algorithm

Predicting protein functional sites

To paraphrase George Orwell’s Animal Farm proclamation that some animals are more equal than others, some positions in proteins are more important than others. We can use information from the protein structure as well as conservation patterns across related proteins in other species to identify positions where small changes can cause dramatic shifts in function. My lab developed several methods for this task, including INTREPID, ResBoost and Discern.
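To make the idea of conservation patterns concrete, here is a minimal sketch in Python that scores each column of a multiple sequence alignment by Shannon entropy, so that fully conserved columns score 1.0. This is a generic textbook measure and not the INTREPID, ResBoost or Discern algorithms, which use phylogenetic and structural information that this sketch ignores. The function name `column_conservation` and the toy alignment are hypothetical, made up for illustration.

```python
import math
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Return one conservation score in [0, 1] per alignment column.

    Assumption: a simple entropy-based score, normalized by the maximum
    entropy of a 20-letter amino acid alphabet. Gaps are ignored.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)  # entropy of a uniform amino acid column
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != gap]
        if not residues:
            scores.append(0.0)  # an all-gap column carries no signal
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


# Hypothetical toy alignment: columns 0 and 2 are perfectly conserved.
toy_alignment = ["MKHDL", "MRHDV", "MKHEL", "MQH-I"]
scores = column_conservation(toy_alignment)

# Column indices ordered from most to least conserved.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

Highly conserved columns are only candidates for functional importance. A score like this cannot distinguish positions conserved for folding from those involved in catalysis or binding, which is one reason to combine conservation with the tree and with structural information.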