MSCI814 Module 9.
Comparative Genomics/Human and Chimpanzee
David Nelson March 4, 2004
Bioinformatics home page (Under construction)
NCBI sequence viewer for retrieving a Genbank sequence
EMBOSS translator For translating nucleotide sequence to protein
Human P450 Blast server For comparing a sequence against human P450s
Do-it-yourself WU Blast server For BLASTing a new sequence against your own set of sequences
UCSC bioinformatics server For genome browsers; use the BLAT search of chimp
The human genome has been sequenced by both public and private groups to better than 10X coverage (meaning every base on average has been sequenced 10 times or more). The most recent build of the human genome has 498 contigs, which means there are 498 pieces of continuous ordered sequence, with a few small regions of uncalled bases indicated by runs of NNNNNNN. Humans have 23 pairs of chromosomes, which gives 24 distinct chromosomes (22 autosomes plus X and Y), so we should have 24 contigs if the whole genome were completely done. The region around the centromeres is often repetitive and difficult to sequence, so this is one place where many of the human chromosomes are not finished. The 498 contigs less the 24 complete chromosomes leaves 474 gaps in the human genome. This is not bad considering that the publications in Nature and Science in Feb. 2001 had about 120,000 gaps remaining.
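As a quick check of the gap arithmetic in the paragraph above, here is a minimal Python sketch. The function name gaps_remaining is made up for illustration; the numbers are the ones quoted in the text.

```python
def gaps_remaining(n_contigs, n_chromosomes):
    """Gaps left if each chromosome should ideally be a single contig."""
    return n_contigs - n_chromosomes

# 22 autosomes plus X and Y give 24 distinct chromosomes
N_CHROMOSOMES = 22 + 2

current_gaps = gaps_remaining(498, N_CHROMOSOMES)
print(current_gaps)
```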
On Dec. 10, 2003, the Chimpanzee genome was released. The chimp genome has been sequenced to 4X coverage, which leaves a lot of uncovered regions, but the sequence could be aligned to the human genome. This meant that almost every piece of chimp sequence could be correctly placed and oriented by using human as a guide. The UCSC Genome browser has the aligned sequence available so it can be viewed in alignment with the human sequence.
An article in PNAS, Oct. 15, 2002, vol. 99, 13633-13635, showed that the figure often quoted for human-chimp sequence identity (about 98.5%) is misleading, because it does not consider insertions and deletions in the DNA. By comparing 779 kb of sequence, RJ Britten found 1.4% nucleotide substitutions (this corresponds to 98.6% identity, close to the number often cited), but there are also 3.4% indels (insertions and deletions). This comes to a total of about 5% difference.
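The percentages above can be added up in a few lines of Python. This is only a sketch of the arithmetic using the two values quoted from the article; the variable names are my own.

```python
# Values quoted above from the PNAS article
substitutions_pct = 1.4
indels_pct = 3.4

# Identity when only base substitutions are counted
identity_substitutions_only = round(100 - substitutions_pct, 1)

# Divergence and identity when indels are counted as well
total_divergence_pct = round(substitutions_pct + indels_pct, 1)
identity_with_indels = round(100 - total_divergence_pct, 1)

print(identity_substitutions_only, total_divergence_pct, identity_with_indels)
```

The sum is 4.8%, which the article rounds to about 5%.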
One item not addressed in this article was divergence of pseudogenes between the two genomes. Pseudogenes are defective genes that have been broken by frameshifts, stop codons, loss of exons or insertions of large pieces of DNA to separate the ends of the gene. Sometimes chromosome rearrangements can occur in the middle of a functional gene causing it to be disrupted. In humans this sometimes contributes to cancer by breaking a tumor suppressor gene.
Some pseudogenes are not whole genes, but just one or a few exons that were duplicated. These are often close to the original gene or even inside it. See figure 2B in the paper (Pharmacogenetics 14, 1-18, 2004) for examples. Each dot represents an exon. The full length genes have 9 exons. The pseudogenes have fewer than that in this example, but a pseudogene can be full length (see Figure 2A: 2G2P, 2B7P1, 2T2P). The open circle in 2S1 is an extra exon between exons 3 and 4. The arrows on the bottom of the figure show the orientation of the gene. The scale on the bottom is in millions of base pairs. These gene clusters are about a half million to a million base pairs long.
Another genome feature not discussed in the PNAS article was the evolution of gene clusters. In many species, genes multiply by tandem duplication, making copies of themselves in a small region of the genome. This is called a gene cluster. There may be just two genes or as many as 15 genes in a gene cluster. These are also active sites of pseudogene formation, often just spare exons scattered in the gene cluster (called detritus exons).
Today we are going to look at Human-Chimp P450 gene clusters and look for changes in the number of genes in a cluster, the order and orientation of genes in the cluster and pseudogenes in the cluster. We will be creating a map of seven gene clusters of P450 genes. These are in the paper I gave you comparing human and mouse P450s. Humans have 57 P450 genes, with 58 pseudogenes. Some mouse clusters with multiple genes have only one gene in humans (Figure 5 C and D). We will want to know if chimp also has one gene or more than one.
The clusters shown in the paper are
2ABFGST 6 genes + 8 pseudogenes
2C 4 genes + 4 pseudogenes
3A 4 genes + 9 pseudogenes
4ABXZ 5 genes + 5 pseudogenes
4F 6 genes + 5 pseudogenes
2D 1 gene + 2 pseudogenes
2J 1 gene + 0 pseudogenes
These clusters contain 27 of the 57 human P450 genes and 33 of 58 pseudogenes.
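The totals above can be checked against the cluster list with a short Python sketch. The dictionary simply restates the seven clusters as (genes, pseudogenes) pairs; the variable names are my own.

```python
# (genes, pseudogenes) per cluster, copied from the list above
clusters = {
    "2ABFGST": (6, 8),
    "2C": (4, 4),
    "3A": (4, 9),
    "4ABXZ": (5, 5),
    "4F": (6, 5),
    "2D": (1, 2),
    "2J": (1, 0),
}

total_genes = sum(genes for genes, _ in clusters.values())
total_pseudogenes = sum(pseudo for _, pseudo in clusters.values())

# Human totals quoted earlier: 57 genes and 58 pseudogenes
print(f"{total_genes} of 57 genes, {total_pseudogenes} of 58 pseudogenes")
```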
For locations of all genes and pseudogenes on the human chromosomes see the ideogram links at the human P450 data page
This page also has the human P450 sequences and the pseudogene sequences in the FASTA file. You will need to get these for blast searching.
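If you want to pull single sequences out of the FASTA file before pasting them into the search form, something like the following Python sketch would do it. It assumes the headers look like ">CYP11B1 NM_000497" (name first, then an accession), as in the example records in this handout. The function name parse_fasta is my own, and the demo text is truncated to the first lines of two records.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA text.

    The name is the first word after '>'; sequence lines are joined
    so that line wrapping in the file does not matter.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

# Truncated demo input (only the beginning of each protein)
demo = """>CYP11B1 NM_000497
MALRAKAEVCMAVPWLSLQRAQALGTRAARVPRTVLPFEAMPRR
PGNRWLRLLQIWREQGYEDLHLEVHQTFQELGPIFRYDLGGAGMVCVMLPEDVEKLQQ
>CYP11B2 NM_000498
MALRAKAEVCVAAPWLSLQRARALGTRAARAPRTVLPFEAMPQH
"""

seqs = parse_fasta(demo)
for gene, seq in seqs.items():
    print(gene, len(seq))
```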
I will be acting as secretary this session. There will be no assignment this week and next week is spring break so we will not meet. What we need to do is search the chimp genome at the UCSC browser with each of the P450 sequences from human. They will have nearly exact matches, though they will be broken into exons.
Take a P450 sequence and search it against the chimp genome. Verify that it is an excellent match, then look at the details section.
For example I will use CYP11B1 and CYP11B2 on human chromosome 8. These genes form a two gene cluster (not in the paper).
>CYP11B1 NM_000497
MALRAKAEVCMAVPWLSLQRAQALGTRAARVPRTVLPFEAMPRR
PGNRWLRLLQIWREQGYEDLHLEVHQTFQELGPIFRYDLGGAGMVCVMLPEDVEKLQQ
VDSLHPHRMSLEPWVAYRQHRGHKCGVFLLNGPEWRFNRLRLNPEVLSPNAVQRFLPM
VDAVARDFSQALKKKVLQNARGSLTLDVQPSIFHYTIEASNLALFGERLGLVGHSPSS
ASLNFLHALEVMFKSTVQLMFMPRSLSRWTSPKVWKEHFEAWDCIFQYGDNCIQKIYQ
ELAFSRPQQYTSIVAELLLNAELSPDAIKANSMELTAGSVDTTVFPLLMTLFELARNP
NVQQALRQESLAAAASISEHPQKATTELPLLRAALKETLRLYPVGLFLERVASSDLVL
QNYHIPAGTLVRVFLYSLGRNPALFPRPERYNPQRWLDIRGSGRNFYHVPFGFGMRQC
LGRRLAEAEMLLLLHHVLKHLQVETLTQEDIKMVYSFILRPSMCPLLTFRAIN
>CYP11B2 NM_000498
MALRAKAEVCVAAPWLSLQRARALGTRAARAPRTVLPFEAMPQH
PGNRWLRLLQIWREQGYEHLHLEMHQTFQELGPIFRYNLGGPRMVCVMLPEDVEKLQQ
VDSLHPCRMILEPWVAYRQHRGHKCGVFLLNGPEWRFNRLRLNPDVLSPKAVQRFLPM
VDAVARDFSQALKKKVLQNARGSLTLDVQPSIFHYTIEASNLALFGERLGLVGHSPSS
ASLNFLHALEVMFKSTVQLMFMPRSLSRWISPKVWKEHFEAWDCIFQYGDNCIQKIYQ
ELAFNRPQHYTGIVAELLLKAELSLEAIKANSMELTAGSVDTTAFPLLMTLFELARNP
DVQQILRQESLAAAASISEHPQKATTELPLLRAALKETLRLYPVGLFLERVVSSDLVL
QNYHIPAGTLVQVFLYSLGRNAALFPRPERYNPQRWLDIRGSGRNFHHVPFGFGMRQC
LGRRLAEAEMLLLLHHVLKHFLVETLTQEDIKMVYSFILRPGTSPLLTFRAIN
Farther down on the details page is the side by side alignment of each exon.
We need the nucleotide coordinates for the beginning of the coding sequence (the start methionine in exon 1, here 147279934) and for the last amino acid in the last exon (here 147266643). We also need to know the chimp chromosome number and the gene orientation. These are at the top of the details page: [Chimp.chr7 (reverse strand):]. Also get the percentage identity for the whole protein.
Note that this info is on the initial results page from the BLAT server, but you should go in and look at the details page to see if there are large gaps etc.
Pay attention to the way numbering is reported. In the first line the orientation is +-, which means the search sequence was in plus orientation and the match was in minus orientation. Notice that the numbering has the end of the protein first and the start second. That will be reversed if the orientation is ++.
Put this info on a page and print it out and bring it to me, the chimp genome secretary.
CYP11B2, Chimp.chr7 (reverse strand):
Start = 147279934 end = 147266643
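The numbering rule and the record format above can be sketched in Python as follows. This is only an illustration: the function names are made up, only the two orientations mentioned in this handout (+- and ++) are handled, and the label "forward strand" for a ++ match is my assumption, since the example shows only the reverse strand case.

```python
def coding_endpoints(orientation, first, second):
    """Return (start_of_protein, end_of_protein) chimp coordinates.

    orientation: '+-' or '++' as shown on the BLAT results line.
    first, second: the two coordinates in the order they are reported.
    For '+-' the end of the protein is reported first, so we swap.
    """
    if orientation == "+-":
        return second, first
    if orientation == "++":
        return first, second
    raise ValueError(f"unexpected orientation: {orientation!r}")

def format_record(gene, chrom, orientation, first, second):
    """Build the two-line record to hand to the secretary."""
    start, end = coding_endpoints(orientation, first, second)
    # 'forward strand' is an assumed label, see the note above
    strand = "reverse strand" if orientation == "+-" else "forward strand"
    return f"{gene}, Chimp.{chrom} ({strand}):\nStart = {start} end = {end}"

record = format_record("CYP11B2", "chr7", "+-", 147266643, 147279934)
print(record)
```

For a gene on the reverse strand the start coordinate is larger than the end coordinate, which is what you should expect to see in the record.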
I will map out the structure of the gene clusters "live" in class with this information.
It is very important that you have an orthologous match. If a gene is deleted, the next most similar gene will probably be less than 96% identical. Pay attention to that. Also be aware that some genes could be duplicated in the chimp. I know of one example so far and I will be looking to see if you find it.
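The 96% rule of thumb above could be applied as in this Python sketch. The cutoff comes from the paragraph above; the function names and the match counts in the example are invented for illustration and are not real search results.

```python
# Rule of thumb from the text: a non-orthologous hit will probably
# be less than 96% identical
ORTHOLOG_MIN_IDENTITY = 96.0

def percent_identity(matches, aligned_length):
    """Percent of aligned positions that are identical."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return 100.0 * matches / aligned_length

def classify_hit(identity_pct, threshold=ORTHOLOG_MIN_IDENTITY):
    """Flag hits that may not be the true ortholog."""
    if identity_pct >= threshold:
        return "likely ortholog"
    return "below cutoff: may be the next most similar gene, check by hand"

# Hypothetical counts, not real data
for matches in (495, 470):
    pct = percent_identity(matches, 500)
    print(f"{pct:.1f}% -> {classify_hit(pct)}")
```

A hit that passes the cutoff can still be one of two duplicated chimp genes, so the browser view is worth checking either way.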
It may also be helpful to look at the genome browser window and zoom out 10X several times to see the structure of the genes in the region. This may simplify your work. Compare the chimp gene cluster with the orthologous one in the paper.
I do not know what to expect for the pseudogenes. This will be new for me as well as you. Chimp and human have been separated for about 6 million years, so they may be pretty different in the pseudogenes. Because the BLAT server is made for high percentage matches only, we may not find all the pseudogenes, even if they are there. This would require BLAST searches, not BLAT searches, and we won't do that in this class.
In our example, a search for 11B1 finds the same results as for 11B2. That is due to the high similarity between these two genes. Looking at the browser window after zooming out, we see two P450 11B genes: 11B1 and 11B2. But there is something very unusual about these genes. They seem to share the first two exons (on the right side). So what used to be two independent genes have become alternative splice variants of one gene with 7 variable exons and two constant exons. Look for odd stuff like this.
Some things of interest to look for:
The CYP27C1 gene is missing in rodents because a chromosome rearrangement deleted it. Is it in chimps?
The CYP4Z1 gene is not assembled very well in human. Can you find a full length CYP4Z1 in chimp? This gene is absent in mice and may be primate specific or human specific.
Humans have several pseudogenes that are nearly intact, and they may be real genes in chimps, not pseudogenes: CYP2T2P, CYP2T3P, CYP2G2P, CYP2AB1P, CYP2AC1P
To help keep this organized, please come to the front and select a gene or pseudogene to work on. Mark it with a colored electronic pen on the display. That way duplication of effort will be small. Get the sequences at this link