Researchers apply computational power to their hunt for noncoding regulatory sequences
By Jeremy L. Peirce
Scientists know that the regulatory elements that guide and control gene expression, for the most part, lie not within coding sequences but outside and between them. Now researchers are taking their search for these sequences genome-wide. And with hundreds of completed genomes in hand, and still more in the works, a full comprehension of regulation at a genomic level has become increasingly plausible.
Understanding noncoding elements is necessary to understand cellular and developmental processes at a molecular level. "Eventually one wants maps of the genome that show which sites are active in which cells, and how they [the sites] change as the cells differentiate," says Ian Dunham, senior investigator at the Wellcome Trust Sanger Institute, Cambridge, UK. For the moment, though, these crucial snippets of genetic information remain elusive prey.
Phylogenetic footprinting, a method that sifts functional regulatory elements from nonfunctional DNA, has become an increasingly popular tool. The name harkens back to DNAse footprinting, a low-throughput experimental technique used to detect functional transcription factor binding sites (TFBS). In DNAse footprinting, protein-bound regions are protected from DNAse digestion, creating a "footprint" in a sequencing gel. In the phylogenetic equivalent, regulatory elements are protected from random drift across evolutionary time by selection. Such sequences reveal themselves by their unexpectedly high similarity to their orthologs, which implies slower evolution.
Before the advent of readily available genomic sequences and computational techniques, investigators often defined regulatory regions using DNAse footprinting and so-called promoter bashing. Promoter bashing involves fusing a series of truncated promoter fragments to a reporter gene, introducing the constructs into cells, and evaluating changes in expression. Newer approaches promise to largely eliminate these laborious procedures.
According to Wyeth Wasserman, associate professor of medical genetics at the University of British Columbia, phylogenetic footprinting is one of two informatic approaches at researchers' disposal, the other being module detection. Phylogenetic footprinting, he says, "can eliminate about 90% of false predictions while keeping most of the true ones." But module detection is even better, he adds. "If you know which transcription factors you're interested in, and you have enough data to know what they bind to, there are now good methods to look at clusters of binding sites [modules] and tell which ones are most likely to be real. We can eliminate about 99% of false positives this way."
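The clustering idea behind module detection can be illustrated with a toy example. The sketch below is only an assumption about the general approach, not the statistical models Wasserman's group uses: it takes a list of already-predicted binding sites and reports groups of at least `min_sites` sites that fall within `max_span` bases of one another. The function name, the parameters, and their default values are all hypothetical.

```python
def find_site_clusters(sites, max_span=200, min_sites=3):
    """Group predicted binding sites into candidate modules.

    sites: list of (position, factor_name) tuples (hypothetical input format).
    Returns a list of clusters, each a list of sites spanning <= max_span bases.
    """
    ordered = sorted(sites)
    clusters = []
    left = 0
    for right in range(len(ordered)):
        # Advance the left edge until the window spans at most max_span bases.
        while ordered[right][0] - ordered[left][0] > max_span:
            left += 1
        if right - left + 1 >= min_sites:
            candidate = ordered[left:right + 1]
            # If the window grew from the same leftmost site, extend that cluster.
            if clusters and clusters[-1][0] == candidate[0]:
                clusters[-1] = candidate
            else:
                clusters.append(candidate)
    return clusters
```

Clusters that begin at different sites are reported separately even when they overlap. A real module-detection method would also weigh how well each site matches its factor's binding profile, which this sketch ignores.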
Though studying combinations of binding sites generally is preferable to phylogenetic footprinting, Wasserman says, it's not always tractable. "The challenge is that we seldom have enough data to make those models." Phylogenetic footprinting, on the other hand, can be applied in the absence of any knowledge of the biology involved, so it is more widely applicable.
FUNCTION AND CONSERVATION
Phylogenetic footprinting involves aligning orthologous sequences (those from equivalent genes in different species) to find noncoding regions, each containing one or more TFBS, that have withstood the ravages of evolutionary time. Since TFBS are short and often degenerate, they are difficult to identify directly. Instead, longer conserved regions are identified and examined. The working assumption at this resolution is that functional elements should reside in conserved regions.
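The search for longer conserved regions can be sketched in a few lines. This is a minimal illustration, assuming the two orthologous sequences have already been aligned by some other tool and are supplied as equal-length strings with "-" for gaps. The window size, the identity threshold, and the function name are arbitrary choices for illustration, not values taken from any of the tools discussed here.

```python
def conserved_windows(aln_a, aln_b, window=20, min_identity=0.8):
    """Return merged (start, end) alignment-column spans that pass the threshold.

    aln_a, aln_b: two rows of a pairwise alignment, equal length, '-' for gaps.
    Spans are half-open: columns start .. end-1.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    # A column is a match only when both rows carry the same non-gap base.
    match = [a == b and a != "-" for a, b in zip(aln_a.upper(), aln_b.upper())]
    spans = []
    for start in range(0, n - window + 1):
        identity = sum(match[start:start + window]) / window
        if identity >= min_identity:
            end = start + window
            # Merge with the previous span when the windows overlap or touch.
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], max(spans[-1][1], end))
            else:
                spans.append((start, end))
    return spans
```

Published methods typically replace the fixed threshold with a model of the background substitution rate, so that "conserved" means "more similar than expected for these species."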
This assumption may not apply equally to all systems, however. Using phylogenetic footprinting to examine the well-studied regulatory regions of fly early-patterning genes, Eric Siggia of Rockefeller University in New York found that "simply filtering by interspecies conservation will give an incomplete account of experimentally known binding sites." Since only slightly more binding sites than expected by chance were also part of conserved regions, Siggia concludes that the effectiveness of phylogenetic footprinting at comprehensively identifying regulatory sequences "may depend very much on the system, and it remains to be shown that it gets all the regulation for a gene."1
Choosing the genomes to be compared is a major consideration in designing phylogenetic footprinting experiments. More evolutionarily distant species tend to share less nonconserved sequence, so comparisons between them have more power to detect conservation. "If you're looking for a pattern that has been well conserved between all or many of the organisms, you would be more surprised if two distant species shared the pattern than if two closely related species shared the pattern," says Martin Tompa of the University of Washington, Seattle.
A balance between shared biology and evolutionary distance is necessary when choosing species for comparison. "If you are looking for conserved regions, you have to be confident that the organisms have a shared regulation," says Wasserman. "If you're doing a chicken-human comparison of an organ that's just not present in the chicken, even if you can find the right ortholog it may not be a great thing to do."
But, Wasserman continues, "Once you're sure about the orthologs and regulation, you want to maximize the distance that will allow you to get a good alignment. Right now that's a by-eye problem. I suspect in the not-so-distant future that will be much more quantitative, and you will throw in all the genomes." He concludes, "The bioinformatics has to catch up with the number of genomes."
So is there an optimal set of species to compare? Evidently not. According to Wasserman, sequence diversity also varies within genomes. "Genes are evolving at different rates, so there's no global statement that a pair of genomes are appropriate or inappropriate for comparison," he says. "You have to go by local characteristics."
Dario Boffelli, staff scientist at the Lawrence Berkeley National Laboratory in California, mentions another concern. "Conserved expression can be achieved in a number of different ways. After the split from their last common ancestor, each lineage may accumulate different changes and compensatory changes, but [the regulatory regions] may do the same thing. So human and mouse can be doing the same thing [i.e., using similar regulatory strategies], but with sequences you can't identify by conservation."
Comparing more closely related genomes could solve the problem. Boffelli and colleagues have developed a technique2,3 that partly overcomes the limitations accompanying closer comparisons. They used their method, phylogenetic shadowing, to compare multiple primates. "The basic ideas underlying phylogenetic shadowing and phylogenetic footprinting are very similar," Boffelli explains. "The idea specific to shadowing is the focus on species with a close phylogenetic relationship and on using a more sophisticated model [of conservation]." Shadowing assumes that important sequences will be strongly conserved among closely related species and eliminates less well-conserved regions.
Total tree length, a measure of the evolutionary distance between compared genomes and indirectly of experimental power, is approximately "the same between human and mouse and between human and primates with seven primates," Boffelli notes. The potential power of the technique is limited, however, by the overall diversity of primates. "If you sequence more you only see a very small increase," Boffelli explains. "Five to seven [sequences] capture about 85% of the variation."
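Total tree length is simply the sum of all branch lengths in the phylogenetic tree relating the compared sequences. The sketch below shows the calculation on a nested-tuple tree. The tuple layout and the function name are assumptions made for illustration, and any branch lengths passed in would have to come from a real phylogenetic analysis.

```python
def total_tree_length(node):
    """Sum the branch lengths of a tree given as (name, branch_length, children).

    branch_length is the length of the branch leading to this node;
    children is a (possibly empty) list of nodes in the same format.
    """
    name, length, children = node
    return length + sum(total_tree_length(child) for child in children)


# Placeholder tree with made-up branch lengths, purely to show the call.
toy_tree = ("root", 0.0, [
    ("species_A", 0.1, []),
    ("ancestor", 0.05, [
        ("species_B", 0.2, []),
        ("species_C", 0.3, []),
    ]),
])
toy_length = total_tree_length(toy_tree)
```

Adding a species that sits on a long branch of its own raises the total far more than adding a close relative of a species already sampled, which is consistent with Boffelli's remark that further primate sequences yield only a small increase.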
While that number of sequences is fairly easy to acquire for individual genes, full genomes are another story. The chimpanzee genome is available and the macaque genome is expected soon. But according to Boffelli, "the chimp is too close to human to be of any use for shadowing." He adds, "What we are really missing are new-world monkey genomes. Since these are the most distant from human, they contribute the most to the analysis."
In addition to facilitating research on primate-specific genes, working with closely related species simplifies modeling. "Using closely related species means the [phylogenetic] trees have much higher reliability, and consequently all the mutation rate estimations are much more accurate," says Boffelli. This allows precise evaluation of the likelihood that a particular region has been conserved by evolutionary pressure, a difficult task in more distant comparisons. Boffelli contends that conserved functions are more likely to be detected as conserved sequence in such comparisons, because closely related species have had less opportunity for divergence and compensatory change.
MORE IS BETTER
Even with an extensive collection of genomes, says Boffelli, "you can only compare the conserved biology. You often define the types of things you can discover based on the kind of organisms you study." For biology conserved between humans and other mammals, plenty of sequence diversity can be tapped. Stanford geneticist Gregory Cooper and colleagues4 estimate that approximately four times the diversity available in the human, mouse, and rat genomes--a goal reachable using fewer than 20 mammalian species, by Boffelli's estimation--would give phylogenetic footprinting experiments single-nucleotide resolution.
Elliott Margulies, a National Human Genome Research Institute (NHGRI) research fellow, and other researchers5 agree that more genomes will improve results. "We have done analyses that show you still make incremental gains in specificity for detecting sequences under purifying selection out to 16 and 17 species," Margulies notes.
Recently a group led by David Haussler, a Howard Hughes Medical Institute investigator at the University of California, Santa Cruz, used phylogenetic footprinting to identify 481 long, highly conserved regions in the human genome.6 Measuring 200 to 800 base pairs, these regions are perfectly conserved among humans, mice, and rats.6 Wasserman calls the observation "tremendously interesting."
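Finding perfectly conserved stretches in an existing multiple alignment amounts to scanning for long runs of columns in which every species carries the same base. The following sketch assumes the alignment is supplied as equal-length uppercase strings with "-" for gaps. It is an illustration of the idea only, not the Haussler group's pipeline, and the function name is hypothetical. The default minimum length of 200 echoes the lower bound quoted above.

```python
def perfect_runs(aligned, min_len=200):
    """Return half-open (start, end) column spans identical in every row.

    aligned: list of equal-length alignment rows, '-' marking gaps.
    Only runs of at least min_len gap-free, fully identical columns are kept.
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all alignment rows must have equal length")
    runs = []
    run_start = None
    # Iterate one column past the end so a run reaching the last column is closed.
    for col in range(length + 1):
        identical = (
            col < length
            and aligned[0][col] != "-"
            and all(row[col] == aligned[0][col] for row in aligned)
        )
        if identical:
            if run_start is None:
                run_start = col
        else:
            if run_start is not None and col - run_start >= min_len:
                runs.append((run_start, col))
            run_start = None
    return runs
```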
These long stretches do not fit conventional biological paradigms of conservation. More than half of the ultraconserved regions are not associated with genes at all; the others often overlap both coding and noncoding regions, according to Haussler. Prior wisdom held that noncoding conserved regions are generally much shorter, because protein-binding sites tend to comprise only a few base pairs. "That's the mystery," Haussler says. "Why would they be conserved at such a high level over such a long [evolutionary] distance?" Indeed, 29 regions were entirely conserved between human and chicken, which are thought to have last shared a common ancestor an estimated 300 million years ago.
However, as postdoctoral fellow and first author Gill Bejerano notes, only the cores of some ultraconserved regions were present in fish, and Haussler's group "[was] not able to show the existence of ultraconserved regions for anything simpler than fish, including fly, sea squirt, and Caenorhabditis elegans." This, say both Haussler and Bejerano, suggests that the identified regions are particular to vertebrates, though other lineages may have their own analogous sequences. Bejerano says the Haussler group is collaborating with others to determine if any of these sequences function as distal enhancers of genes important in development.
ENCODING FUNCTIONAL ELEMENTS
Recently NHGRI launched an initiative, the Encyclopedia of DNA Elements (ENCODE), to identify all functional elements in 1% (30 million base pairs) of the human genome.7 Not surprisingly, phylogenetic footprinting plays a large role in the work. According to program director Peter Good, the consortium is designed such that "all investigators agree to work on the entire ENCODE region, rather than cherry-picking regions ... and agree to the rapid data-sharing requirements." The focus on a common fragment of the genome is intended to help facilitate exhaustive study, collaboration, and resource development, including extensive sequencing efforts.
ENCODE will take advantage of both available sequence and sequence generated specifically for the project. "For the ENCODE regions we will have the best sequence available, so comparative genomics will be important," Good says. According to Margulies, "Our group plans to sequence [the ENCODE regions in] roughly four species per year to a comparative-grade level of finishing." In addition, the NHGRI recently announced a sequencing initiative that will include nine mammalian and nine nonmammalian genomes, largely selected based on their usefulness for comparative genomics.8 The mammalian group will include animals ranging from the rabbit to the African savannah elephant.
Good has high hopes. "When ENCODE is complete, the consortium members will have identified a lot of interesting biology and will have learned how to identify many interesting functional elements. What NHGRI is looking for is a path forward for how we're going to do this for the other 99% [of the genome]."
That path likely will continue to involve phylogenetic footprinting and related techniques, and given the speed and increasingly modest cost of sequencing new genomes, their role will only grow stronger over time.5 As Margulies observes, "With every new genome that becomes available, we are making unprecedented gains in decoding the functions of vertebrate genomes."
A SIDE OF CHIPS
Ultimately, however, any element predicted by a computer must be verified at the lab bench. Indeed, the ENCODE initiative sets aside funds for large-scale and exploratory applications of experimental techniques for rapidly identifying functional elements. According to Wasserman, these wet-lab techniques are important for biological validation and for gathering tissue-specific and temporal information that phylogenetic footprinting cannot distinguish. "The ultimate goal is to have a set of known regulatory regions that direct expression to the cell type of interest," Wasserman says.
One popular approach to gaining such validation is called ChIP-chip (or ChIP-on-chip). Blending chromatin immunoprecipitation (ChIP) and genomic microarrays (or "DNA chips"), the technique, says Xiang-Dong Fu of the University of California, San Diego, "provides a snapshot of the protein-binding status of DNA at a particular moment, potentially for the entire genome." This snapshot, he adds, captures proteins related to chromatin structure as well as those engaged in regulatory tasks.
With the ChIP-chip method, proteins are reversibly crosslinked to DNA, the DNA is fragmented, and a factor-specific antibody is added to precipitate the protein-DNA complexes. After precipitation, the crosslinks are reversed, and the formerly protein-bound DNA is released, fluorescently labeled, and applied to a genomic microarray to map its position.
The resulting data depends on the antibody used. Antibodies to "sites of histone modifications can give us a good idea of the locations of many elements," Dunham explains. In contrast, "using antibodies for specific transcription factors can mark a limited set of elements but with much more information about function."
"ChIP-chip and other methods put some experimental data on top of the bioinformatic analyses," says Dunham. "Essentially it would be nice to identify regions that are conserved between species and [that] can be shown to have a function such as binding a particular set of proteins in vivo." Experimental results also provide feedback for improving computational algorithms. "In the end," Good agrees, "You have to do wet-lab validation of any computational approach. The wet-lab result validates the computational approach and is also important for improving [it]."
Jeremy L. Peirce (email@example.com)
1. E. Emberly et al., "Conservation of regulatory elements between two species of Drosophila," BMC Bioinformatics, 4:57, 2003.
2. D. Boffelli et al., "Phylogenetic shadowing of primate sequences to find functional regions of the human genome," Science, 299:1331-3, 2003.
3. I. Ovcharenko et al., "eShadow: a tool for comparing closely related sequences," Genome Res, 14:1191-8, June 1, 2004.
4. G.M. Cooper et al., "Quantitative estimates of sequence divergence for comparative analysis of mammalian genomes," Genome Res, 13:813-20, 2003.
5. T.M. Powledge, "How many genomes are enough?" The Scientist Daily News, Nov. 17, 2003, available online at www.biomedcentral.com/news/20031117/07
6. G. Bejerano et al., "Ultraconserved elements in the human genome," Science, 304:1321-5, May 28, 2004.
7. L. Pray, "Post-genome project launches," The Scientist Daily News, March 5, 2003, available online at www.biomedcentral.com/news/20030305/02
8. "NHGRI adds 18 organisms to sequencing pipeline," National Institutes of Health news release, available online at www.genome.gov/12511858
A wide variety of software tools are related to phylogenetic footprinting, and a number of approaches can integrate other forms of data to improve predictions. Martin Tompa of the University of Washington, Seattle, has developed software known as FootPrinter (http://bio.cs.washington.edu/software.html), which uses available sequences, the phylogenetic tree relating them, and an algorithm for de novo motif discovery to search directly for short, conserved motifs.
ConSite (www.phylofoot.org/consite), developed by Wyeth Wasserman's group at the University of British Columbia, is a Web-based program that identifies conserved regions and uses known binding-site characteristics to identify active sites. Another tool from his group, oPOSSUM (http://sonoma.cmmt.ubc.ca/cgi-bin/oPOSSUM/opossum), helps integrate footprinting with microarray data. Microarrays can help to identify coexpressed, and therefore possibly coregulated, genes, Wasserman explains, and oPOSSUM helps to identify types of binding sites overrepresented among these genes.
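One common way to ask whether a binding site is overrepresented in a set of coexpressed genes is a one-sided hypergeometric test. The sketch below shows that calculation as an assumption about the general kind of statistic involved; it is not a description of how oPOSSUM itself scores sites, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, genes_with_site, cluster_size, cluster_with_site):
    """Probability of seeing at least cluster_with_site site-bearing genes by chance.

    total_genes:       number of genes in the background set
    genes_with_site:   background genes carrying the predicted binding site
    cluster_size:      number of coexpressed genes drawn from the background
    cluster_with_site: coexpressed genes carrying the site
    """
    upper = min(genes_with_site, cluster_size)
    # Count the ways to draw k site-bearing genes and (cluster_size - k) others.
    favorable = sum(
        comb(genes_with_site, k) * comb(total_genes - genes_with_site, cluster_size - k)
        for k in range(cluster_with_site, upper + 1)
    )
    return favorable / comb(total_genes, cluster_size)
```

A small value suggests that the site occurs in the coexpressed set more often than random sampling from the background would explain. Testing many binding-site types at once would call for a multiple-testing correction, which is omitted here.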