The Yield of Information from the Human Genome
David Nelson, Sapporo, Japan, July 26, 2002
Last modified July 18, 2 PM.

It is a pleasure to be here on my first visit to Japan. Many thanks to Dr. Kamataki for inviting me, and to all the organizers who worked so hard to make this meeting a success. I thought I would open with a picture [slide 1] from one of my favorite books: The Tale of Genji, about a mythical Japanese prince from 1000 years ago. This picture could be from one of the earliest scientific meetings, as Prince Genji presents his paper to an attentive audience.

Two years ago, on the other side of the world in Stresa, I spoke on the new rough draft version of the human genome [slide 2 Science cover], just announced in June of that year. At that time only 21% of the human genome was in finished form and 66% was in draft condition, a less than optimal state. [slide 3] The first genome assembly had about 120,000 gaps. Today, the Human Genome Sequencing Center at the Baylor College of Medicine shows that 78% is in finished form. BLAST searches of the human genome assembly at NCBI give only 2044 sequences covering 2.9 billion base pairs, so most of the gaps have been filled in. The genome has progressed pretty far in two years. It is time to look at the human genome again and see what is new.

In Stresa I reported there were 53 full length P450 genes that probably made functional proteins. I predicted there might be one or two more. Today there are 57 full length P450 genes, so I was off by a little bit. The four new P450s are [slide 4] CYP2W1, CYP4A22, CYP20 and CYP26C1. The dates on the slide refer to when I became aware of the sequences, not their first appearance in the database.

2W1 is only 42% identical to CYP2D6, so it is a borderline CYP2 member. However, it does have 9 exons with the same intron-exon boundaries as other mammalian CYP2s. The exceptions to this gene structure are CYP2R1 and 2U1, which have only 5 exons.
The 2W1 sequence is also seen in mouse, where it is 77% identical to human.

4A22 is 95% identical to 4A11. The gene sequence for 4A22 was at first assumed to be the gene that matched the 4A11 mRNA sequences that have been known for a long time. Three different gene sequences were identified for 4A22; they all matched each other but differed from the mRNA for 4A11. At this point it became clear that 4A22 was a new gene. No mRNA could be found that matched the 4A22 gene, and no gene could be found that matched the 4A11 mRNA. That changed in April this year when the 4A11 gene [slide 5] was deposited in GenBank as accession AL731892. The most recent revision to that accession was on June 27, less than one month ago, and the accession is now complete. 4A11 is the last human P450 gene discovered, even though the mRNA was known much earlier.

[slide 6] CYP20 was found in human pheochromocytoma cells by the Chinese National Human Genome Center at Shanghai about two years ago. It is the most recent new vertebrate P450 family to be discovered. I did not find it by BLAST searching for new P450s, because it has only 23% sequence identity to CYP3A4 over 437 aa. CYP20 is 27% identical to a sponge P450. The heme signature is short by one amino acid, but the RYG, EXXR, WXXP and PERF motifs are present. CYP20 does not have the usual conserved I-helix motif AGX(D,E)T, so its substrate may carry its own oxygen. The ortholog is found in cow (90% identical), mouse (82%) and Fugu (59%), so it is more than 420 million years old.

[slide 7] 26C1 was discovered during a BLAST search at the Stresa MDO 2000 meeting. That demonstrates that interesting things can happen at these meetings, and science is not only done in the lab. CYP26C1 is related to the retinoic acid metabolizing P450s CYP26A1 and CYP26B1.

Some human P450s are not represented in the human EST database, or they have only one or two ESTs.
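As an aside on the CYP20 motifs mentioned above, the following is a minimal Python sketch of how one might scan a protein sequence for four of the named motifs. The regular expressions are simply a literal reading of the motif names (X taken as any single residue), and the sequence `toy`, the dictionary `MOTIFS` and the function `find_motifs` are hypothetical names invented for this illustration. It is not the method used for the searches described in this talk.

```python
import re

# Motif names as given in the talk; "X" is read as "any one residue".
# These regular expressions are an illustrative translation only,
# not a curated sequence profile.
MOTIFS = {
    "AGX(D,E)T": r"AG.[DE]T",
    "EXXR": r"E..R",
    "WXXP": r"W..P",
    "PERF": r"PERF",
}


def find_motifs(protein):
    """Return {motif name: [0-based start positions]} for one protein sequence."""
    hits = {}
    for name, pattern in MOTIFS.items():
        # Wrapping the pattern in a lookahead reports overlapping matches too.
        hits[name] = [m.start() for m in re.finditer("(?=" + pattern + ")", protein)]
    return hits


# Hypothetical toy peptide, invented for this example (not a real P450).
toy = "MKLAGSDTLLETLRLWSVPGGPERFAA"
for name, positions in find_motifs(toy).items():
    print(name, positions)
```

Short patterns such as EXXR will often match by chance in a full length protein, so in practice a hit probably only means something when it falls in the expected region relative to the other motifs. An empty list for AGX(D,E)T would correspond to the situation described for CYP20.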
[slide 8] 2A13 has only one EST in the UniGene database, from lung cystic fibrosis epithelial cells. A paper from Ding's lab showed that CYP2A13 mRNA is expressed at the highest level in the nasal mucosa, followed by the lung and the trachea.

2C19 has no ESTs in UniGene. A BLAST search with 1000 bp from the 3 prime UTR found no candidates for a 2C19 EST. This sequence was cloned in Joyce Goldstein's lab as a single clone from 83 full length P450 clones in a liver cDNA library, so it was a rare cDNA even when trying to clone P450 mRNAs.

4A22 is not listed in UniGene since it is too new. However, BLAST searches of the human ESTs showed no hits. A BLAST search with 1000 bp from the 3 prime UTR found 4A11 sequences (T95288, AI261826, AV690226) but no 4A22 ESTs.

26C1 is also not listed in UniGene yet. BLAST searches of the human ESTs showed no hits, and a search with 1000 bp from the 3 prime UTR also found none.

27C1 has 2 ESTs, from astrocytoma and testis, so at least we know that this gene is expressed. We don't really know yet if 4A22 or 26C1 are expressed genes.

There are 4.5 million human ESTs in dbEST release 062802 from June 28, 2002. The absence or very low representation of a P450 in this database indicates low levels of expression in the tissue libraries represented, or expression during limited time windows. It is possible that some of these genes, like 26C1, are important transiently during development.

The assembly of the human genome and the creation of several genome browsers, such as Map Viewer at NCBI, Ensembl and the UC Santa Cruz browser, make it possible to map the locations of human genes. I have done exhaustive BLAST searches of the human genome assembly and mapped all the P450 genes and pseudogenes on ideograms of the human chromosomes. I will show you the locations of these genes in the next five slides.

[slide 9] The 57 functional P450 genes are shown in red and the 47 pseudogenes in blue.
The cluster of CYP4A, 4B, 4X and 4Z sequences on chromosome 1 has 5 functional genes and 3 pseudogenes. 46A4P and 2J2 are outside this block, but I could not show that since there is limited room. The 4Z2P sequence is a full length pseudogene with only one stop codon, in exon 8, so it is probably a very recently formed pseudogene.

On chromosome 2, notice the five 4F pseudogenes. For some unknown reason the 4F subfamily has generated about 15 pseudogene fragments that appear on 5 different chromosomes. These are small pieces, not full length genes.

Chromosome 3 has CYP51P1, one of three CYP51 pseudogenes, all on different chromosomes. 2D31P is the only 2D pseudogene outside the 2D locus on chromosome 22.

Chromosome 4 is interesting because it has 4V2 mapped to three different locations. The Santa Cruz browser has all three locations and the Ensembl browser has the lower two. Two of these are probably errors, which shows that the mapping and genome assembly process is still imperfect.

[slide 10] Chromosome 5 does not have any P450s, so it is not included here. There are some CYP2 sequences mapped to chromosome 5, but they were sequenced in the same lab as the chromosome 19 clones, where the real CYP2 gene cluster is located, so this is apparently a mislabeled clone.

The CYP39A1 gene is mapped to the centromere by Map Viewer, but this is probably incorrect. Ensembl maps it to the top of 6p12.3, so Map Viewer is probably off in its locations for some genes.

Chromosome 7 has the 3A cluster with 4 active genes and 3 pseudogenes. P450 clusters always seem to have pseudogene fragments interspersed with the whole genes.

[slide 11] Chromosome 10 has the 2C subfamily cluster.

[slide 12] Chromosomes 16 and 17 have no P450s, and 18 has only one small pseudogene fragment. Chromosome 19 is very full of P450s in two different clusters: the 4F cluster with 6 functional genes and the CYP2 cluster, also with 6 functional genes. Both clusters have their share of pseudogene pieces.
[slide 13] The X chromosome has one 2C pseudogene fragment. There are five of these scattered outside the 2C cluster. About half of all the P450 pseudogenes are of 2F or 2C origin.

The mouse genome is not as complete as the human, and I have not tried to map the mouse P450s on the genome assembly yet, but I can make some general remarks about them. There are 84 known full length mouse P450s. 27C1 is also expected because it is found in humans and Fugu, and 4Z1 is expected since it is found in human. That is 29 more functional genes than seen in humans.

[slide 14] The CYP2 cluster in humans has 6 functional genes. Mice have 12 in those same subfamilies. Humans have one 2D6 gene while mice have seven 2d genes. Humans have 2J2 while mice have at least five 2js. We have four 3As; mice have 6. Humans have two 4As and mice have 4. Humans have six 4Fs; mice have 9.

                 Human   Mouse
CYP2 cluster       6      12
CYP2C cluster      4      10
CYP2D              1       7
CYP2J              1       5
CYP3A              4       6
CYP4A              2       4
CYP4F cluster      6       9
Total             24      53   (29 more)

Not shown in the slide is the 2C cluster. Humans have four 2Cs while mice have at least 9. This accounts for 28 extra sequences seen in mouse as compared to human. Aside from expansion of these three families, mouse and human are very similar. The same families and subfamilies are present.

Where species comparisons get more interesting is in comparing human to Fugu, the Japanese pufferfish [slide 15 of Fugu]. The Fugu genome has been assembled and it is nearly complete. I have searched this genome for all the P450s I could find, and I have named them. There are 71 non-overlapping contigs of P450 sequences assembled from the Fugu genome. 45 of these are complete P450 genes and one is a nearly intact pseudogene. Eight more are missing at most one or two exons. The next three slides show a comparison of the human and Fugu P450s side by side.
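As a quick cross-check of the human and mouse counts listed above, here is a short Python sketch that tallies them. The numbers are copied from the counts given in this talk; the variable names are invented for the example, and the second calculation simply substitutes the "at least 9" mouse 2C genes quoted in the text for the 10 in the table.

```python
# (human, mouse) functional gene counts, copied from the counts listed above.
counts = {
    "CYP2 cluster": (6, 12),
    "CYP2C cluster": (4, 10),
    "CYP2D": (1, 7),
    "CYP2J": (1, 5),
    "CYP3A": (4, 6),
    "CYP4A": (2, 4),
    "CYP4F cluster": (6, 9),
}

human_total = sum(h for h, _ in counts.values())
mouse_total = sum(m for _, m in counts.values())
print(human_total, mouse_total, mouse_total - human_total)

# Same tally, but with the lower bound of 9 mouse 2C genes quoted in the text.
alt = dict(counts)
alt["CYP2C cluster"] = (4, 9)
alt_diff = sum(m - h for h, m in alt.values())
print(alt_diff)
```

With 10 mouse 2C genes the difference is 29, which matches 84 known mouse genes plus the two expected ones (27C1 and 4Z1) minus the 57 human genes. With 9 it is 28, the figure quoted in the text.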
Ray finned fishes and tetrapods like ourselves diverged 420 million years ago, so it is interesting to see how similar we are to fish at the level of our P450s.

This first slide [slide 16] shows the CYP2 family. It is the most diverged of all the P450 families. Note that only CYP2R1 and 2U1 are conserved as subfamilies. These genes have only 5 exons, as compared to 9 exons in typical CYP2s. The other subfamilies shown here have diverged so far that they are no longer recognizable as belonging to a common fish and mammalian subfamily. The conservation of 2R1 and 2U1 argues that they may be acting on conserved endogenous substrates, while the other genes act more on exogenous substrates or on mammalian or fish specific endogenous substrates.

[slide 17] This next panel shows the 3, 4 and 5 families, along with 1, 17 and 21. We can see more lines drawn connecting subfamilies in this section. Notice that fish have a 1C subfamily and a 3B subfamily not seen in mammals. The 4F subfamily has only one member in Fugu, probably the ancestral condition, with expansion of the 4Fs in the mammals. 4A, B, Z and X may be derived from the 4T subfamily in fish. This will be more apparent when I show you a tree of these sequences.

[slide 18] The third panel shows the most conserved sequences. These are cholesterol, steroid, bile acid and retinoid metabolizing P450s (except for CYP20, which has no known function). There is a one to one relationship between these sequences except for CYP39. CYP39 does not exist in Fugu or in any other fish genome searched so far. CYP39 seems to be the only mammalian innovation in P450s in 420 million years. It catalyzes the same reaction as CYP7B1 to make 7 alpha hydroxylated bile acids. The 7B subfamily is also missing in fish, though there is a 7C subfamily that could have an equivalent role.

The relationships between these sequences are shown in more detail in this phylogenetic tree [slide 19] of 60 human, 54 Fugu and 8 other fish sequences.
The tree can be viewed as a small number of major branches that I have named clans. These sequences always cluster together on trees and probably share a common ancestor over 400 million years ago. Some of these clans are seen in invertebrates, so they are older than 600 million years.

In the top half of the tree we see alternation of red and blue branches. This indicates the 1:1 correspondence between Fugu and human sequences that we saw in the last slide. This begins to change in the 4 clan, where there are clusters of red human sequences and only 2 blue sequences. This shows an expansion in human. The 4A, B, X and Z subfamilies seem to originate from the fish 4T sequences. The 3 clan has expanded in both fish and human, but after they diverged. The 2 clan shows the greatest expansion, with clusters of red and blue sequences that formed after fish and tetrapods diverged.

The next level of genome comparison, as we step back on the evolutionary time scale, is the urochordate Ciona, the sea squirt. Two genomes of Ciona species are being sequenced [slide 20, 21], Ciona savignyi and Ciona intestinalis, shown here. In the larval stage they have a tadpole appearance, with a notochord. The two genomes are only about 70% identical to each other, which is less than the identity between mouse and human.

I have assembled over 800 sequence fragments from Ciona into about 200 contigs so far, with 18 full length genes assembled. This is a slow process because the similarity to known P450s is much lower than in the fish, and because the genome is not assembled yet, so the reads are all out of order and unlinked, which makes assembly difficult. Ciona has 68 unique heme signatures, so this is probably the approximate number of P450s. A majority are related to the CYP2s, but their intron-exon structure does not preserve even one boundary. The complete Ciona P450 set will afford a better view of the evolution of the P450 family in the deuterostome line and in the chordates.
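As a rough illustration of how counting unique heme signatures can give an estimate of gene number from unassembled fragments, here is a minimal Python sketch. The pattern F..G.R.C.G is a simplified form of the P450 heme-binding signature and is an assumption of this example, as are the toy fragments and the names `HEME` and `unique_signatures`. It is not the procedure actually used for Ciona.

```python
import re

# Simplified heme-signature pattern (an assumption for this sketch):
# F, two residues, G, one residue, R, one residue, C, one residue, G.
HEME = re.compile(r"F..G.R.C.G")


def unique_signatures(fragments):
    """Collect the distinct heme-signature strings found in translated fragments."""
    seen = set()
    for frag in fragments:
        for match in HEME.finditer(frag):
            seen.add(match.group(0))
    return seen


# Hypothetical translated fragments, invented for this example.
fragments = [
    "PFSAGKRICLGE",      # contains FSAGKRICLG
    "LPFSAGKRICLGEGL",   # same signature again, from an overlapping read
    "FGLGPRSCIGQ",       # contains FGLGPRSCIG
    "MKLVTAAL",          # no signature
]

signatures = unique_signatures(fragments)
print(len(signatures), sorted(signatures))
```

Overlapping reads of the same gene collapse to one entry, which is why the count of unique signatures approximates the number of genes. A strict pattern like this one would miss a signature that is short by one amino acid, as described for CYP20, and could split one gene in two if a read carries a sequencing error, so the result is only an approximate count.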
I opened with one wood block print and I will close with another. [slide 22] Here is a curious human lifting the veil at the edge of the known and peering out into the inner workings of the universe. I feel we are at this point today with the genome projects going forth. In a very short time we should have a detailed view of the molecular evolution of life on earth. It is a great time to be a biologist.