The Rice (Oryza sativa) P450 Genbank Inventory
Last modified Nov. 14, 2001
David R. Nelson
There has been much progress in sequencing the rice genome in the past two years. Monsanto
paid for a private sequencing effort that covered the genome to 5X depth. This data was
given to the Rice Genome Initiative to help finish its clone-by-clone sequencing strategy.
Unfortunately, that 5X data was not deposited in Genbank where everyone
could look at it. Monsanto/Pharmacia have decided to make the data available to
researchers by password access after legal documents are signed. There are still some
restrictions that apply. Before gaining access to the 5X data, I searched Genbank
exhaustively for rice P450 sequences to identify what was already available. Blast searches were
performed on the nr, est, htgs and gss sections using one member from each P450 clan in
plants (9 different sequences = CYP51A2, a 71B, 72A7, 74A, 86A1, 85A, 97A, 710A1, 711A1).
Additional searches were done as the results suggested. For example, if a partial 90C-like
sequence was found in HTGS, then HTGS was searched with 90C to find the rest of the sequence.
An alphabetical list of accession numbers was prepared and each search output was compared
against this list for new entries (see below for a convenient way to do this). In Sept. 2000,
20 families of plant P450s were not found in the rice set. To be sure that small fragments in
the EST or GSS sections of Genbank had not been overlooked, especially fragments from less
conserved regions like the extreme C-terminus, blast searches were done with one member from
each of these 20 families against the EST and GSS sections of Genbank, limited to rice. This
identified six additional entries in three different families. Three of these were identical
sequences containing a small fragment upstream of the I-helix in CYP73. One was from a
different region of CYP73. One was a single small exon including the heme binding site of a
CYP708 P450. One was from a CYP92 sequence. After these 9 x 4 searches (nine query sequences
against four Genbank sections) and the additional searches were done, 622 accession numbers of
rice P450 sequences had been identified. These contained 756 P450 sequence fragments, since some
accession numbers held clusters of nine or even 15 P450 genes. All sequences from the blast
output were compared against each other using Do-It-Yourself Wu-Blast 2.0 or, once it was
online, our new rice blast server. This resulted
in joining overlapping fragments into larger contigs. The result was 296 contigs, of which 172
are full length P450 sequences, 52 are pseudogene fragments and 72 are partial P450s that might
still be completed. All 296 sequences were compared to a database of all 273 Arabidopsis P450s
plus seven additional P450s from families not present in Arabidopsis (CYP80, CYP92, CYP99,
CYP719, CYP723, CYP725, CYP726). This assigned the fragments to specific families. Once this
had been done, 15 of the 53 plant P450 families still had no members in the rice set (CYP80,
82, 83, 702, 705, 708, 712, 714, 716, 718, 719, 720, 721, 725, 726). Some of these were
borderline cases, and some rice sequences could be placed in these families at 38-39%
identity. Some are from unusual plants like euphorbias (CYP726), yew (CYP725) or
Cryptomeria (CYP719), and it is not surprising that they do not match rice.
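
The family assignment step can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used here. It assumes you already have, for each rice contig, the name of its best-matching reference P450 and the percent identity of that match. The 40% cutoff is the conventional P450 family boundary and is an assumption of this sketch (the text above only mentions borderline cases at 38-39%). The function names and the example values are hypothetical.

```python
import re

# Assumed cutoff: the conventional P450 family boundary. It is not stated in
# the text above, so treat it as an adjustable parameter.
FAMILY_CUTOFF = 40.0


def family_of(cyp_name):
    """Return the family part of a CYP name, e.g. 'CYP71B2' -> 'CYP71'."""
    match = re.match(r"CYP(\d+)", cyp_name)
    if match is None:
        raise ValueError(f"not a CYP name: {cyp_name!r}")
    return "CYP" + match.group(1)


def assign_family(best_hit, percent_identity, cutoff=FAMILY_CUTOFF):
    """Label a contig with its best hit's family, flagging weak matches.

    Returns (family, status), where status is 'assigned' when the identity
    reaches the cutoff and 'borderline' otherwise.
    """
    family = family_of(best_hit)
    status = "assigned" if percent_identity >= cutoff else "borderline"
    return family, status


if __name__ == "__main__":
    # Invented best-hit rows, for illustration only.
    for contig, hit, identity in [("contig_001", "CYP71B2", 62.0),
                                  ("contig_002", "CYP708A2", 38.5)]:
        print(contig, *assign_family(hit, identity))
```

Matches flagged as borderline would still need to be judged by eye, as was done for the 38-39% cases above.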
Comparing families between the rice and Arabidopsis genomes, 34 of 45 families (76%) were
present in both species. CYP92 was found in rice but not in Arabidopsis. Even though the rice
genome is still incomplete in the public databases, it appears that most plant P450 families
existed before the monocot-dicot divergence. As more data comes in, the number of families
found in only one of the two species will probably drop to a very small number. Those that
remain may be specific to eudicots or even of more limited range. As mentioned above, there are
seven families that are not seen in Arabidopsis, and this kind of specialization may occur in
each major lineage of plants.
The following files and servers are available.
Rice P450 Blast Server
Rice alphanumeric accession number list
Use the accession number to locate the sequence in the contig collection below.
Rice contig collection, 296 sequences Nov. 14, 2001
These are sorted by clan and family
Rice FASTA file
This is the same as the contig collection, except that the extra information has been stripped
out to leave only a single identifier line with each sequence. Duplicate sequences are removed,
and the sequence order is not the same. This file can be used with the Do-It-Yourself blast
search.
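
For readers who want to prepare a similar file from their own sequences, here is a minimal Python sketch of the clean-up described for the FASTA file. It is not the script used to build the file above. It assumes the input is already FASTA-like, with the extra information sitting on the ">" line after the first word; the function names are hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_fasta(text):
    """Keep only the first word of each header and drop duplicate sequences.

    The first record seen for a given sequence is the one that is kept.
    """
    seen = set()
    records = []
    for header, seq in parse_fasta(text):
        seq = seq.upper()
        if seq in seen:
            continue
        seen.add(seq)
        words = header.split()
        identifier = words[0] if words else "unnamed"
        records.append(f">{identifier}\n{seq}")
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    # Invented records, for illustration only.
    example = ">c1 extra notes here\nMAL\nLLV\n>c2 a duplicate\nMALLLV\n>c3 other\nMKK\n"
    print(clean_fasta(example), end="")
```

Duplicates are detected by exact sequence match only, so near-identical sequences are all kept.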
The following is an easy way to compare new blast search output against an alphabetical listing of
accession numbers to identify new hits. Open your file of alphabetized accession numbers in
Word. Select and copy the blast output accession number list only and place it at the top of
the accession number file. Change it to Courier 9 point font. Select and color the blast
output list red. Using the replace command, replace all gb, emb and dbj occurrences with
nothing. This will delete them from the description list in the blast output and align most of
the accession numbers directly above one another. Those with 6 or 7 digit gi numbers will need
to be adjusted by hand so the accession numbers line up. On a Mac, hold down the alt key while
dragging the mouse to select the vertical block of text that precedes the accession numbers (the
gi numbers and some vertical bars ||) and delete this block. That will leave the accession
numbers flush against the left edge. Select all the content of the file with apple key + A (on a
Mac). Use the sort command from the table menu on the tool bar and sort all the accession
numbers. You may have to do this twice for some unknown reason. This will sort all the
accession numbers from the blast output in red with all the accession numbers in your list (in
black) and you can visually compare them in a minute or two to identify new hits. This whole
process takes only a couple of minutes and saves lots of time. I have done it with about 100
new accession numbers compared to about 700 old numbers.
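
The same comparison can also be done with a short script instead of Word. The Python sketch below is one possible way, not the method described above. It assumes the blast description lines carry identifiers of the form gi|number|gb|accession| (or emb or dbj in place of gb), and that the existing list holds accession numbers without version suffixes. The function names and the accession numbers in the example are made up.

```python
import re

# Matches the accession that follows a gb, emb or dbj tag, for example
# "gi|1234567|gb|AB000001.1|". Any ".1"-style version suffix is dropped.
ACCESSION = re.compile(r"\b(?:gb|emb|dbj)\|([A-Za-z_]*\d+)(?:\.\d+)?\|?")


def accessions_in(blast_lines):
    """Return the accession found on each blast description line, in order."""
    found = []
    for line in blast_lines:
        match = ACCESSION.search(line)
        if match:
            found.append(match.group(1).upper())
    return found


def new_hits(blast_lines, known_accessions):
    """Return the sorted accessions in the blast output that are not yet known."""
    known = {a.strip().upper() for a in known_accessions if a.strip()}
    return sorted(set(accessions_in(blast_lines)) - known)


if __name__ == "__main__":
    # Invented description lines and accession list, for illustration only.
    blast_output = [
        "Sequences producing significant alignments:",
        "gi|1234567|gb|AB000001.1|AB000001 Oryza sativa ...",
        "gi|7654321|emb|AJ000002.1| Oryza sativa ...",
        "gi|1111111|dbj|AP000003.2| Oryza sativa ...",
    ]
    old_list = ["AB000001", "AQ000009"]
    for accession in new_hits(blast_output, old_list):
        print(accession)
```

Lines without a gb, emb or dbj tag are skipped silently, so it is worth checking that the number of accessions extracted matches the number of hits in the blast output.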