The total number of different proteins existing today is estimated at a trillion. Although this may seem a vast number, the actual diversity of proteins in nature is rather limited. Many proteins share detectable similarity in sequence and structure because they arose by amplification, recombination, and divergence from a basic complement of autonomously folding modules, referred to as domains. Indeed, sequence comparison of modern proteins shows that they fall into only about ten thousand domain families, which, based on structural similarity, can be grouped further into a thousand folds. Many of these folds were already established at the time of the Last Universal Common Ancestor, a hypothetical primordial organism from which all life on Earth descended.
We are broadly interested in understanding the events that led to the emergence of these first folds and to their diversification into the many functional protein families we recognize today. To track these events, we use sensitive sequence analysis tools to establish correlations between the sequence similarity and the structural similarity of today’s proteins. Many of the tools we use are integrated into the MPI Bioinformatics Toolkit (http://toolkit.tuebingen.mpg.de), a one-stop, integrative resource for protein bioinformatic analysis, which we develop and maintain.