A First Introduction to NCBI

David R. Nelson rev. Feb. 21, 2001



   The first place to go for access to sequence data is NCBI, the National Center for Biotechnology Information. There is no more comprehensive site. If you will be using bioinformatics at all in the future, you should create a bookmark folder for your bioinformatics links so they will be easily accessible. I have set my browser's home button to the NCBI site, and one of my first bookmarks is the BLAST server at NCBI. The link to NCBI is NCBI. For convenience, open these notes in a separate browser window so you can move back and forth between screens. In Netscape select FILE, NEW NAVIGATOR, then paste in the URL for this page so you will have a duplicate.

There are many layers and functions at NCBI, and we will not go over them all now. To start, I will give you a brief summary of the sequence content in some of the most useful sections of the database. First, go to the Genbank growth statistics.

You can see there are about 11 billion nucleotides in 10 million sequences as of Jan. 16, 2001. The growth is very steep, with almost 8 billion bases added in 2000 alone. The following table is taken from the release notes for Genbank release 121, dated December 15, 2000.



2.2.7 Selected Per-Organism Statistics 

The following table provides the number of entries and bases of DNA/RNA for
the twenty most sequenced organisms in Release 121.0 (chloroplast and
mitochondrial sequences not included):

Entries      Bases   Species

3918724 6702881570   Homo sapiens
2456194 1291602139   Mus musculus
 166554  487561384   Drosophila melanogaster
 181388  242674129   Arabidopsis thaliana
 114553  203544197   Caenorhabditis elegans
 188993  165539271   Tetraodon nigroviridis
 151411  125948974   Oryza sativa
 218598  106344366   Rattus norvegicus
 159473   71215626   Bos taurus
 141802   62817102   Glycine max
 104535   50991920   Medicago truncatula
  91334   49855996   Trypanosoma brucei
  97112   49415566   Lycopersicon esculentum
  54328   47639714   Giardia intestinalis
  77532   47590936   Strongylocentrotus purpuratus
  49938   44522016   Entamoeba histolytica
  57779   44489692   Hordeum vulgare
  83726   40906902   Danio rerio
  77506   36885212   Zea mays
  18361   32779082   Saccharomyces cerevisiae
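
One way to read this table is to divide the bases by the entries, which gives the average length of a Genbank record for each organism. The sketch below does this in Python for a few rows copied from the table. The names ROWS and average_length are made up for this illustration and are not part of any NCBI tool.

```python
# A few rows copied from the Release 121.0 table: (entries, bases, species).
ROWS = [
    (3918724, 6702881570, "Homo sapiens"),
    (2456194, 1291602139, "Mus musculus"),
    (166554, 487561384, "Drosophila melanogaster"),
    (18361, 32779082, "Saccharomyces cerevisiae"),
]


def average_length(entries, bases):
    """Return the mean number of bases per entry."""
    return bases / entries


# Print one line per organism with its average record length.
for entries, bases, species in ROWS:
    print(f"{species:28s} {average_length(entries, bases):8.0f} bases per entry")
```

Mouse, for instance, averages roughly 500 bases per entry, which is about what you would expect if most of its entries are short fragments such as ESTs.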

Follow the link to molecular databases from the left frame of the NCBI homepage. This page has links in the left frame to the EST, GSS and STS databases. Click on the EST link. At the bottom center, click on "Number of ESTs - dbEST summary by organism". I have copied the entries with more than 50,000 ESTs. ESTs (expressed sequence tags) are sequence fragments derived from cDNAs, so they represent genes that are expressed as mRNA.


Summary by Organism - February 9, 2001

Number of public entries: 7,352,227

Homo sapiens (human)                                3,169,953
Mus musculus + domesticus (mouse)                   1,936,347
Rattus sp. (rat)                                      263,362
Bos taurus (cattle)                                   160,371
Glycine max (soybean)                                 147,859
Drosophila melanogaster (fruit fly)                   116,471
Arabidopsis thaliana (thale cress)                    112,500
Caenorhabditis elegans (nematode)                     109,215
Lycopersicon esculentum (tomato)                      107,298
Medicago truncatula (barrel medic)                    101,752
Danio rerio (zebrafish)                                79,237
Zea mays (maize)                                       76,069
Oryza sativa (rice)                                    70,969
Hordeum vulgare (barley)                               68,665
Xenopus laevis (African clawed frog)                   66,076
Chlamydomonas reinhardtii                              64,973
Sus scrofa (pig)                                       57,060
Triticum aestivum (wheat)                              54,701
Sorghum bicolor (sorghum)                              51,888

Notice that mouse and human have over 5 million ESTs between them. You may be surprised to see that soybean has more ESTs than Drosophila, C. elegans or Arabidopsis, the three eukaryotic model organisms with completed genome sequences.
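
The "over 5 million" figure is easy to check from the counts in the summary. This is only a small arithmetic sketch; the variable names are chosen for the example.

```python
# Counts copied from the dbEST summary of February 9, 2001.
TOTAL_ESTS = 7_352_227
HUMAN_ESTS = 3_169_953
MOUSE_ESTS = 1_936_347

# Combined human and mouse ESTs, and their share of all public entries.
combined = HUMAN_ESTS + MOUSE_ESTS
share = combined / TOTAL_ESTS

print(f"human + mouse: {combined:,} ESTs ({share:.0%} of all public entries)")
```

So these two species alone account for more than two thirds of the public EST entries.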
Now go back and follow the link to the GSS database. Scroll down the page to the release notes for the current version and click.


dbGSS release 020901 

Summary by Organism - February 9, 2001



Number of public entries: 2,211,010


Homo sapiens (human)                    866,507
Mus musculus                            612,512
Tetraodon nigroviridis                  188,963
Oryza sativa (rice)                      93,107
Trypanosoma brucei                       90,540
Strongylocentrotus purpuratus            76,019
Arabidopsis thaliana                     61,265
Entamoeba histolytica                    49,129
Drosophila melanogaster                  44,787
Fugu rubripes                            42,896
Magnaporthe grisea (rice blast fungus)   12,674
Trypanosoma cruzi                        12,245
Leishmania major                         11,929

The table above lists the top organisms represented in the GSS (genome survey sequence) database. This is a database of genomic fragments about the same size as ESTs, but these of course contain introns and non-coding regions as well as exons. Note that the organisms represented differ from those in the EST database, which can sometimes help. The third most abundant organism is Tetraodon nigroviridis, the freshwater pufferfish. Fish genes can be surprisingly similar to human genes, and they are helpful when you are trying to assemble a gene from genomic DNA and cannot find the intron-exon boundaries. Do not forget them; they can help you.
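
To see which organisms turn up in the GSS list but not in the EST list, you can compare the two sets of names. The sketch below uses only the names copied into these notes. Keep in mind that the EST list here is cut off at 50,000 entries, so an organism missing from it may still have some ESTs in dbEST. The set names are made up for the example.

```python
# Organisms with more than 50,000 ESTs, from the dbEST summary in these notes.
EST_ORGANISMS = {
    "Homo sapiens", "Mus musculus", "Rattus sp.", "Bos taurus", "Glycine max",
    "Drosophila melanogaster", "Arabidopsis thaliana", "Caenorhabditis elegans",
    "Lycopersicon esculentum", "Medicago truncatula", "Danio rerio", "Zea mays",
    "Oryza sativa", "Hordeum vulgare", "Xenopus laevis",
    "Chlamydomonas reinhardtii", "Sus scrofa", "Triticum aestivum",
    "Sorghum bicolor",
}

# Organisms listed in the dbGSS summary in these notes.
GSS_ORGANISMS = {
    "Homo sapiens", "Mus musculus", "Tetraodon nigroviridis", "Oryza sativa",
    "Trypanosoma brucei", "Strongylocentrotus purpuratus",
    "Arabidopsis thaliana", "Entamoeba histolytica", "Drosophila melanogaster",
    "Fugu rubripes", "Magnaporthe grisea", "Trypanosoma cruzi",
    "Leishmania major",
}

# Set difference: names in the GSS list that are absent from the EST list.
only_in_gss = sorted(GSS_ORGANISMS - EST_ORGANISMS)

for name in only_in_gss:
    print(name)
```

Both pufferfish (Tetraodon and Fugu) show up only on the GSS side of this comparison.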

Below are the same data for the STS (sequence tagged site) section. You can get to it by going back and clicking on STS from the databases page. There are not as many STS fragments. These are genomic sequences made for mapping, and most of them have been placed on genetic maps. They are also a last resort if your gene is missing a small fragment and you cannot find the piece anywhere else. It might be here. Keep in mind that the human genome sequence as it stands now has 170,000 contigs, so there are about that many gaps in the sequence. You may need a small piece to fill in a hole in your gene.


dbSTS release 020901 

Summary by Organism - February 9, 2001



Number of public entries: 93,728


Homo sapiens (human)                        69,963
Rattus norvegicus (Norway rat)               8,361
Danio rerio (zebrafish)                      6,036
Drosophila melanogaster (fruit fly)          3,203
Bos taurus (cattle)                          1,109
Plasmodium falciparum (malaria parasite)       869
Mus musculus (house mouse)                     701
Kluyveromyces lactis                           658
Gallus gallus (chicken)                        630
Oryza sativa (rice)                            339
Sus scrofa (pig)                               212
Zea mays (maize)                               212
Cryptosporidium parvum                         161
Oreochromis niloticus (Nile tilapia)           161


That gives you a flavor of what is in Genbank. Now how do you search these data? The method is the BLAST search. NCBI has recently redone its BLAST search interface. Please go to the new BLAST page from the NCBI homepage. You will notice that there are several types of BLAST search, depending on what you have and what you want to find. Notice the first three sections on this page: nucleotide BLAST searches, protein BLAST searches and translated BLAST searches.

If you have nucleotide sequence that is from an untranslated region, or you do not know what is in it, the nucleotide search Blastn is what you should try to identify your sequence. It may match a known gene. If you have protein sequence and you want to search the known protein translations in Genbank, Blastp is one of your choices.

The translated BLAST searches are more powerful, and they take more computer time. Tblastn is one of the most useful search tools. It is the one I use almost all of the time with fragments of genes that I have been able to translate into probable open reading frames based on other known protein sequences. With Tblastn you search your protein sequence against the nucleotide sequences in the chosen Genbank sections, translated in all six reading frames (three per DNA strand). You can hardly miss anything that matches if it is out there. Nucleotide searches are not as sensitive: there are only four bases but 20 amino acids, so a protein-level match gives better search statistics than a nucleotide match. The Blastx program lets you find ORFs (open reading frames) in your nucleotide sequence by translating it and hunting for matches to known proteins. This is very helpful in analyzing fresh genomic sequence that is unannotated. The Tblastx program is the most computer intensive, since it translates the query sequence in six frames as well as the nucleotide database in six frames, so it is like doing six Tblastn searches. Not many servers will even offer this search, since it eats computer time so badly.
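
To make "translated in all six reading frames" concrete, here is a minimal Python sketch of a six-frame translation using the standard genetic code. It only illustrates the idea behind the translated searches and is not how the BLAST programs are implemented. The function names are made up for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons listed in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; a trailing partial codon is dropped.

    Codons containing anything other than A, C, G or T become X.
    """
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return six translations: three forward frames, then three reverse."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


for frame in six_frame_translate("ATGGCCTAA"):
    print(frame)
```

A Tblastn search compares your protein against translations like these for every nucleotide sequence in the database, which is why it costs more computer time than a plain protein search.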

The BLAST server page has a window where you paste your sequence. This may be in FASTA format or just raw sequence. FASTA format always starts with an >IDENTIFIER LINE. This may be anything you want to identify your sequence, and it will appear in the search output. If you will be doing many searches and saving or printing them, it is a good idea to include this identifier line so you will not mix up your output. The > symbol is required; otherwise the program will try to interpret your identifier as amino acid or nucleotide sequence. The sequence can have numbers in it, such as are found in the Genbank sequence format or other formats like GCG or PIR. These will be ignored. Non-standard letters like O and J will also be ignored. For this example, use this sequence.

>Dictyostelium CYP51 P450 C-terminal
ETQKDINDIVQKENQGEINFDGLKRMNRLETVIREVLRLHPPLIFLMRK
VMTPMEYKGKTIPAGHILAVSPQVGMRLPTVYKNPDSFEPKRFDVED
KTPFSFIAFGGGKHGCPGENFGILQIKTIWTVLSTKYNLEVGPVPPTD
FTSLVAGPKGPCMVKYSKKQK*
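
If you want to tidy a sequence yourself before pasting it, the rules just described can be sketched in a few lines of Python. This is a hypothetical helper written for these notes, not the code the BLAST server runs. It only mimics the dropping of numbers and spacing, and it does not remove letters like O and J.

```python
def read_fasta_record(text):
    """Split one FASTA record into (identifier, sequence).

    Digits, spaces and other non-letter characters in the sequence lines
    are dropped; a '*' marking the stop is kept.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' identifier line")
    identifier = lines[0][1:].strip()
    sequence = "".join(
        ch for line in lines[1:] for ch in line if ch.isalpha() or ch == "*"
    )
    return identifier, sequence.upper()


# The start of the example sequence, numbered the way some formats do it.
record = """>Dictyostelium CYP51 P450 C-terminal
  1 ETQKDINDIV QKENQGEINF
 21 DGLKRMNRLE
"""

identifier, sequence = read_fasta_record(record)
print(identifier)
print(sequence)
```

The identifier comes back without the > symbol, and the sequence comes back as one unbroken string of letters.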

Once your sequence is in the window, you need to select a program from the pull down menu. Choose Tblastn. You also need to select a database from another pull down menu. Select nr (the default). We will talk about the other options later. Below the sequence window is the blast button. Do not click it yet.

Further down the screen is the filter option. I usually recommend turning the filter off, unless there is a weird repetitive sequence like a run of QQQQQ in your sequence. If you leave the filter on, you may find some unpleasant runs of XXXXXXX in the blast output where you do not want them.

One of the best features of this blast page is the option to limit which sequences are searched. Here you can use the pull down menu to select a common species like mouse (Mus musculus) or a whole taxonomic range of organisms like green plants (Viridiplantae) or Fungi. I really wish more of these ranges were built in, but you can also type in a more specific group yourself, like Diplomonadida or Rhodophyta. I use this feature all the time to restrict the search output to just what I want. You may select Dictyostelium discoideum from the pull down menu if you want to find the exact sequence you are searching with, or you may expand the search to closely related organisms by typing in Mycetozoa. I think it may be more interesting to use Fungi on this first blast, since there are many CYP51s known from fungi in the nr database. Go ahead and click the blast button.
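
To see where the runs of X come from when the filter is on, here is a toy Python sketch that masks any run of five or more identical letters. This is only an illustration of the effect. It is not the algorithm the NCBI filter uses, and the function name and the run length of five are assumptions made for the example.

```python
import re


def mask_homopolymer_runs(sequence, min_run=5):
    """Replace every run of min_run or more identical letters with X's."""
    # One letter followed by at least (min_run - 1) copies of itself.
    pattern = re.compile(r"([A-Za-z])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda match: "X" * len(match.group(0)), sequence)


print(mask_homopolymer_runs("MKTQQQQQQLAV"))
```

The masked residues cannot take part in a match, which is the point of filtering, but it also means a real match that runs through such a region will show X's in the alignment.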

You will get a screen with a button that says Format. Click this button to begin formatting your results. This will give you another window that says how long the process will take before the next refresh. This could be short or long depending on usage at NCBI. Monday mornings are bad. Sometimes they say it will be 60 minutes or more. In the evenings it may turn around in less than a minute.

The output will have a graphical box at the top with colored bars representing the hits. Red is for the highest scoring hits, and the colors get cooler as the scores drop. Red hits are generally pretty good matches. Since we searched a slime mold sequence against Fungi here, the best hits are magenta, not red. Below the graphics box is a text list of your hits with the best scores at the top. The numbers such as e-19 are E values, which estimate how many matches this good you would expect to find by chance in a search of this size. The larger the negative exponent, the better the match and the less likely it is a chance occurrence.

Below the text list are the actual alignments. This is where you see what you have found. The output here shows a gap in the Dicty sequence compared to most of the fungal sequences. Since fungi have relatively few introns, this is probably not an intron in the fungi, but it may be an insertion that is found in fungi and not in the slime mold sequence. Notice that all the top hits are identified as CYP51 sequences or 14 alpha demethylase enzymes. This is because the cytochrome P450 family is extensively annotated and over 1000 of the genes are named. Even new sequences that are not officially named are usually identified by this type of blast search, and the authors often tag them as a P450 (in this case CYP51).
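
If you copy the e-numbers out of the text list and want to compare them, they can be read as ordinary floating point numbers. The Python sketch below assumes the value may be printed without a leading digit (as in e-19) and treats that as 1e-19. The hit names and values are invented for the example and are not results from the search above.

```python
def parse_evalue(text):
    """Convert an E value string such as '2e-12' or 'e-19' to a float."""
    text = text.strip().lower()
    if text.startswith("e"):
        # No leading digit was printed; read 'e-19' as '1e-19'.
        text = "1" + text
    return float(text)


# Hypothetical hits: (name, E value as printed).
hits = [("hit A", "3e-07"), ("hit B", "e-19"), ("hit C", "2e-12")]

# Smaller E values are better matches, so sort in ascending order.
ranked = sorted(hits, key=lambda hit: parse_evalue(hit[1]))

for name, evalue in ranked:
    print(name, evalue)
```

Sorting this way puts the hit with the largest negative exponent first, which matches the order of the text list in the blast output.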

Go to Module 2