Blast Searching at NCBI

David R. Nelson rev. Feb. 21, 2001



   Your first search was intended to get you started in blast searching with a simple example. There are many aspects of this tool that we need to discuss in more detail. The first is the selection of the different databases to search. Most first time users of a software package tend to use the default settings. In this case that could be disasterous, because you could miss a large number of sequences that are not to be found in the nr section of Genbank (the default). There are large EST projects on some organisms that will make only EST sequences. These will not be found in nr. One example is soybean (Glycine max) with 148,000 ESTs. If you are after a plant gene, it may be present in these ESTs but not in nr. The 160,000 Bos taurus (cattle) ESTs can also be used to help assemble human or mouse genes, by indicating the location of introns. Another example is Giardia (hiker’s diarrhea). Giardia is a common human pathogen and it is thought to be a very ancient eukaryote. Therefore, it is valuable to sequence for medical and evolutionary studies. The Giardia Genome project is being conducted at Woods Hole Oceanographic Institute and the data is searchable at NCBI in the HTGS section of Genbank. (HTGS = High Throughput Genomic Sequences).

If you have not added the blast page at NCBI to your bioinformatics bookmark file, please do it now. NCBI Blast page

I recommend that you add the TBLASTN page also. TBLASTN

You will be needing these over and over as we progress. As a general rule, it is good policy to search nr, HTGS, EST and GSS sections of Genbank when hunting for matches to a new sequence. You can also search STS if you are desperate, but there are not very many entries in STS.

The following is a real example taken from an alignment of mitochondrial carrier sequences. Mitochondrial carriers are membrane transport proteins of the inner mitochondrial membrane (and peroxisomal and hydrogenosomal membranes). These carriers enable communication betweeen the mitochondrial matrix and the cytosol. They are responsible for exporting ATP to the cytosol and importing ADP into the matrix to be remade in to ATP. There are 46 of these genes in humans and similar numbers in other species (35 in yeast). I am posting a link to a sequence alignment of 265 of these carriers that has some gaps in some of the sequences. We will try to fill in some of these gaps. Once you are at the page use the find command to search for GMPA-. This should take you to a small gap in sequence 92. This gap was not possible to fill when the alignment was last revised, because this sequence did not exist in Genbank. The sequence looks like this with eight missing amino acids.

ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQ-----TTYNGVT

The second gap is an alignment gap and there are no missing amino acids. Copy the sequence above and paste it in the blast window at NCBI. The dashes will be automatically removed as illegal characters. Select the mouse EST database and turn the filter off. Notice that the EST database has several options including EST, mouse EST, human EST and others EST. This is helpful to reduce searching unwanted sequences. If you want to find only mouse, use the mouse EST option. For all mammals use the pull down menu under options and select mammalia. For this search just use mouse ESTs. Do the blast.


>gi|1387895|gb|W76821.1|W76821 me73f10.r1 Soares mouse embryo NbME13.5 14.5 Mus musculus cDNA
           clone IMAGE:401227 5' similar to WP:K02F3.2 CE01348 ;.
          Length = 455

 Score = 48.1 bits (113), Expect = 2e-05
 Identities = 23/23 (100%), Positives = 23/23 (100%)
 Frame = +3

Query: 1   ANEDGQVSPGSLLLAGAIAGMPA 23
           ANEDGQVSPGSLLLAGAIAGMPA
Sbjct: 384 ANEDGQVSPGSLLLAGAIAGMPA 452

>gi|1387483|gb|W77458.1|W77458 me66c04.r1 Soares mouse embryo NbME13.5 14.5 Mus musculus cDNA
          clone IMAGE:400518 5' similar to WP:K02F3.2 CE01348 ;.
          Length = 471

 Score = 43.9 bits (102), Expect = 3e-04
 Identities = 21/21 (100%), Positives = 21/21 (100%)
 Frame = +2

Query: 24 VIKTRLQVAARAGQTTYNGVT 44
          VIKTRLQVAARAGQTTYNGVT
Sbjct: 2  VIKTRLQVAARAGQTTYNGVT 64

Shown above are the two ESTs that were at the existing boundaries of the small gap. Look at the first one W76821.1 (this is an accession number for the sequence. It has a version number. Some version numbers may be as high as 29). Notice on the 4th line that the Length is 455 nucleotides. Now look at the alignment. The upper sequence in the alignment is your query sequence. The lower sequence is the mouse EST hit. The line in between shows the matches between the two sequences. Look at the numbering on the lower sequence from 384 to 452. This is only three nucleotides from the end of the sequence. So this sequence stops where our gap was. The second EST W77458.1 starts where the gap ends. It has only one nucleotide protruding into the gap. If you click on the accession number of the first sequence it takes you to the sequence entry which was created on June 20, 1996. The second est was created on the same date. Notice that both ESTs have another identifier that looks like this: me73f10.r1. The r1 on the end of this name indicates it was sequenced from a reverse primer. This information may be valuable, since the insert of this clone may have been sequenced from the opposite end of the clone. That sequence would have a different accession number, but it would have the same clone name with .s1 at the end instead of .r1. The s1 indicates a standard read from a forward primer. The two sequences would be from the same clone, but from opposite ends. This can unite sequence fragmets that do not overlap in the middle. You can search for the clone name (without the .r1 part) in the NCBI homepage search window. If you find the .s1 match then you have more sequence information to use.

We have seen that the gap in the original sequence was caused by these sequences ending just a little short of overlapping. The two best hits in this search do span the gap, allowing the missing sequence to be filled in.




>gi|9904727|gb|BE624311.1|BE624311 uu45a06.y1 Soares_thymus_2NbMT Mus musculus cDNA clone
           IMAGE:3374866 5' similar to TR:O14575 O14575 SIMILAR TO
           ADP/ATP CARRIER PROTEINS ;.
          Length = 480

 Score = 80.1 bits (196), Expect = 4e-15
 Identities = 44/52 (84%), Positives = 44/52 (84%), Gaps = 8/52 (15%)
 Frame = +1

Query: 1   ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGVT 44
           ANEDGQVSPGSLLLAGAIAGMPA        VIKTRLQVAARAGQTTYNGVT
Sbjct: 283 ANEDGQVSPGSLLLAGAIAGMPAASLVTPADVIKTRLQVAARAGQTTYNGVT 438

>gi|3519846|gb|AI119522.1|AI119522 uf04h04.y1 Sugano mouse liver mlia Mus musculus cDNA clone
           IMAGE:1499671 5' similar to WP:K02F3.2 CE01348 ;.
          Length = 670

 Score = 80.1 bits (196), Expect = 4e-15
 Identities = 44/52 (84%), Positives = 44/52 (84%), Gaps = 8/52 (15%)
 Frame = +2

Query: 1   ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGVT 44
           ANEDGQVSPGSLLLAGAIAGMPA        VIKTRLQVAARAGQTTYNGVT
Sbjct: 482 ANEDGQVSPGSLLLAGAIAGMPAASLVTPADVIKTRLQVAARAGQTTYNGVT 637
BE624311 uu45a06.y1 was created August 24, 2000 so it is fairly new. The second one was created Sept. 2, 1998. Note that the clone name has a .y1 after it. That pairs up with a .z1 of the same name. The hit shown below does not come from the same gene. It is nearly identical on the right side, but not on the left. You need to be careful, especially with short sequences, that you are not making chimeric or hybrid sequences when you assemble protein sequences from fragments.




>gi|10753840|gb|BF022507.1|BF022507 uy43g06.y1 NCI_CGAP_Lu30 Mus musculus cDNA clone IMAGE:3662362 5'
           similar to TR:O75746 O75746 ARALAR1 PROTEIN. ;.
          Length = 450

 Score = 56.6 bits (135), Expect = 5e-08
 Identities = 32/51 (62%), Positives = 39/51 (75%), Gaps = 8/51 (15%)
 Frame = +3

Query: 1   ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGV 43
           A+E+G+V   +LL AGA+AG+PA        VIKTRLQVAARAGQTTY+GV
Sbjct: 294 ADENGRVGGINLLTAGALAGVPAASLVTPADVIKTRLQVAARAGQTTYSGV 446


That was an easy example of using the EST database to find a missing sequence fragment from a partial gene. Now we will try a harder problem with a larger gap, also in a mouse carrier. Go back to the sequence alignment link at the top of this page and search for -SFLAG in the alignment. You should find sequence 96 as shown below.



segment 3
94A ----------PRKSDGSGEAVFYWSLIAGLLSGMTSAFMVTPFDVVKTRLQADGEKK--------FKGIM    0
 95 ------------FNELAGKASFAHSFVSGCVAGSIAAVAVTPLDVLKTRIQTLKKGLGE----DMYSGIT    0
 96 ------------------------SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNE----DTYSGFL    0
96A ----------PRRNDGSGEAVFWCSFLAGLAAGSTAALAVNPFDVVKTRLQAIKKADGE----KEFKGIS    0

This gap extends upstream over several pages of the alignment. For your convenience they are shown here with some surrounding sequence. This will be important later. Notice that sequence 95 and 96 in the first segment below are nearly 100% identical. Sequence 95 is a human sequence and it is the ortholog of the mouse sequence with the gap. That will be useful to us.



segment 1
 94 GMYRGAAVNLTLVTPEKAIKLAANDFFRHQLS-KDG-----QKLTLLKEMLAGCGAGTCQVIVTTPMEML  112
94A GMYRGSAVNIVLITPEKAIKLTANDFFRYHLASDDG------VIPLSRATLAGGLAGLFQIVVTTPMELL    0
 95 GMYRGAAVNLTLVTPEKAIKLAANDFFRRLLM-EDG-----MQRNLKMEMLAGCGAGMCQVVVTCPMEML    0
 96 GMYRGAAVNLTLVTPEKAIKLAANDFLRH-----------------------------------------    0
96A GMYRGSGVNILLITPEKAIKLTANDYFRHKLTTKDG------KLPLTSQMVAGGLAGAFQIIVTTPMELL    0

segment 2
 94 KIQLQDAGRIAAQRKILAPRSTATQLTRDLLRSRG-IAGLYKGLGATLLRDVPFSVVYFPLFANLNQLG-    0
94A KIQMQDAGRVDRAAGREVKTITALGLTKTLLRERG-IFGLYKGVGATGVRDITFSMVYFPLMAWINDQG-    0
 95 KIQLQDAGRLAVHHQGSARRPSATLIAWELLRTQG-LAGLYRGLGATLLRDIPFSIIYFPLFANLNNLG-    0
 96 ----------------------------------------------------------------------    0
96A KIQMQDAGRVAKLAGKTVEKVSATQLASQLIKDKG-IFGLYKGIGATGLRDVTFSIIYFPLFATLNDLG-    0

Try to do the same technique as before. Copy the beginning and end of sequence 96 and place them in the tblastn window. Select mouse ESTs and turn off the filter. Do the blast. The results will show hits that extend into the gap region.



>gi|11620519|gb|BF533156.1|BF533156 602073695F1 NCI_CGAP_Li9 Mus musculus cDNA clone IMAGE:4210775 5'.
          Length = 890

 Score = 86.3 bits (212), Expect = 4e-17
 Identities = 42/42 (100%), Positives = 42/42 (100%)
 Frame = +1

Query: 30  SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL 71
           SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL
Sbjct: 124 SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL 249

>gi|10379135|gb|BE861312.1|BE861312 UI-M-AK0-adi-c-11-0-UI.r1 NIH_BMAP_MHY Mus musculus cDNA clone
           UI-M-AK0-adi-c-11-0-UI 5'.
          Length = 538

 Score = 65.5 bits (158), Expect = 8e-11
 Identities = 39/73 (53%), Positives = 43/73 (58%), Gaps = 13/73 (17%)
 Frame = +3

Query: 1   GMYRGAAVNLTLVTPEKAIKLAANDFLRH-------------SFLAGCVAGSAAAVAVNP 47
           GMYRGAAVNLTLVTPEKAIKLAANDFLR                LAGC AG        P
Sbjct: 309 GMYRGAAVNLTLVTPEKAIKLAANDFLRQLLMQDGTQRNLKMEMLAGCGAGICQVGITCP 488

Query: 48  CDVVKTRLQSLER 60
            +++K +LQ   R
Sbjct: 489 KEMLKIQLQDAGR 527

The first hit above (BF533156) extends 123 bp into the gap upstream. You can see this from the numbering of the lower sequence starting at 124. To get this upstream sequence you must click on the accession number from the blast search and go to the sequence entry. Copy the sequence and take it to the quick protein translator from MBS (Molecular Biology Shortcuts). Paste it in the window and select three frames since you know it is in the plus strand and then translate it. Copy all three frames back to a word processor and comparte them to the mouse and the human sequenceabove it to find the correct frame. The DNA sequence of BF533156


GCCAACCTGAATCAGCTGGGCAAACAGGGGGAAGTAGACAATGGAGAAGGGAACATCCCC
CTGTTTGCCAACCTGAATCAGCTGGGCCGCCCATCCTCTGAGGAGAAGTCGCCTTTCTAT
GTGTCCTTCCTAGCAGGCTGCGTGGCTGGGAGTGCAGCCGCTGTGGCCGTCAACCCTTGT
GATGTGGTGAAGACTCGGCTCCAGTCCCTTGAGAGAGGTGTTAATGAGGACACTTACTCT
GGGTTTCTGGACTGTGCAAGGAAGATCTGGAGACATGAAGGTCCCTCAGCCTTCCTGAAA
GGCGCATACTGCCGTGCGCTGGTCATTGCCCCGCTGTTTGGCATCGCCCAGGTGGTCTAC
TTTCTGGGCATTGCCGAGTCCCTGCTGGGGCTGCTGCAAGAACCCCAGGCCTGAGCCCAT
GGCTGCTTCTCTCCAGCCTATGGGCAGGGGCCAGAACAGGGTGACCAGCACAAGCCTGAG
GAGGAGTGGTCTCTCCCCGGTCCTCCTCATTAAGATGGGAAGGCAAGGGGAGGGTGCAGG
GTCCACATGGGTGATGCACACATAAGCCCCTGTGTGGTCCTGAAGGGACAACAAATGGGA
TCGAGGTCTTATCTATGTAGAAAATGCAGAAATCTGTACATCCCTCAAGCCAGTTCTGTC
CCATCCTTGTTACTCAAACCCAGTCCACTGGCTGAACACCCATGGGACAGAGCTGGTCTC
TGGGTGGGGGCCCCAGGCCTGGTTTGGGAGGGGGACCTACCTGGGGTTCACTGGGCCTGG
CCCTGGGGGCCCTGGCTTCCATAGGGGCCCACCCCCGATTTTTTGGGTTCCCCGCCAGGG
GTCTCGGCCGGCGAGCTTGCCGGTGGTCCCCTCGACCTGTCCCCGCTTGG


Only the first three lines are needed to include the first 124 bp. Translation of the first 180bp phase 1 has the known sequence SFLAG... And no stop codons, so this is probably correct if there are no frameshifts. Compare it to the human sequence.



Phase 1 ANLNQLGKQGEVDNGEGNIPLFANLNQLGRPSSEEKSPFYVSFLAGCVAGSAAAVAVNPC Phase 2 PT*ISWANRGK*TMEKGTSPCLPT*ISWAAHPLRRSRLSMCPS*QAAWLGVQPLWPSTL Phase 3 QPESAGQTGGSRQWRREHPPVCQPESAGPPIL*GEVAFLCVLPSRLRGWECSRCGRQPL Phase 1 mouse ANLNQLGKQGEVDNGEGNIPLFANLNQLGRPSSEEKSPFYVSFLAGCVAGSAAAVAVNPC ||||||| || | | || |||||| ||||| | YRGLGATLLRDIPFSIIYFPLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL Human 95
At least from PLFAN the sequences match, but upstream they do not and this region must be considered suspect. You have now filled in part of the gap. The region between FANLNNLG and GCVAG is not very strong and may contain a frameshift. Watch for better matches in this region. Look at the other hit BE861312. This has sequence going into the gap from the other side. Compare the sequence with the human sequence.


mouse
QLLMQDGTQRNLKMEMLAGCGAGICQVGITCPKEMLKIQLQDAGR
 ||| || ||||||||||||||| |||  ||||||||||||||||
RLLMEDGMQRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGR
Human 95
They are almost identical and so the frame is correct. This fills in much of the gap from the other side. There are only about 48 amino acids missing in the middle now. To try to fill this gap you can repeat the search of the EST database or you can go to another section to find the missing part, the HTGS section. I recommend this. You will find genomic DNA with introns here, so you need to have the human DNA to help identify the missing exon sequence. Construct a hybrid sequence with the new mouse sequence you just found and human from seq 95 to fill in the gap. Then blast the mouse HTGS section of Genbank. In the sequence below human is in lower case.


RLLMEDGMQRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGR
lavhhqgsarrpsatliawellrtqglaglyrglgatllrdipfsiiyf
PLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL

The result from this search shows a good match which is probably to the same gene. There are some differences, however this is a genomic sequence and the gene is on a 90,000 bp contig that is probably more accurate than the EST sequences. There are also three different exons here (see the nucleotide numbering). One of the regions with the greatest number of sequence differences is the region that looked like it might contain frameshifts earlier. The differences on the 2nd exon are mouse human differnces.


>gb|AC084273.11|AC084273 Mus musculus clone rp23-313e8, WORKING DRAFT SEQUENCE, 5 unordered pieces
          Length = 182050

Score = 81.6 bits (200), Expect = 2e-15
 Identities = 38/45 (84%), Positives = 40/45 (88%)
 Frame = -2

Query: 9      QRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGRLAVHHQGS 53
              QRNLKMEMLAGCGAG+CQVV+TCPMEMLKIQLQDAGRL   H  S
Sbjct: 142635 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLGEAHPNS 142501

 Score = 58.5 bits (140), Expect = 2e-08
 Identities = 30/34 (88%), Positives = 31/34 (90%)
 Frame = -3

Query: 55     RRPSATLIAWELLRTQGLAGLYRGLGATLLRDIP 88
              RRPSATLIA ELLRTQGL+GLYRGLGATLLR  P
Sbjct: 139970 RRPSATLIARELLRTQGLSGLYRGLGATLLR*AP 139869

Score = 92.4 bits (228), Expect = 1e-18
 Identities = 44/53 (83%), Positives = 47/53 (88%)
 Frame = -3

Query: 83     LLRDIPFSIIYFPLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL 135
              L RDIPFSIIYFPLFANLN LG +EL GKASF HSFV+GC AGS+AAVAVTPL
Sbjct: 139628 LYRDIPFSIIYFPLFANLNQLGVSELTGKASFTHSFVAGCTAGSVAAVAVTPL 139470

Probable sequence after assembly
QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLXXXXXXX
RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGKASFTHSFVAGCTAGSVAAVAVTPL
The sequence is almost done now, but there is the small matter of an intron exon joint to be settled. The human sequence has a phase 1 intron after DAGRL and the mouse is expected to have the same intron location. However, the sequence between the human and mouse does not match here for 7 amino acids. It would be desirable to find an EST for mouse that bridged this gap. The following search shows that there is an EST that does this and this sequence indicates that the mouse sequence is longer than the human sequence from the alignment.



>gb|AI848503.1|AI848503 UI-M-AP1-agf-c-11-0-UI.s2 NIH_BMAP_MST_N Mus musculus cDNA clone
           UI-M-AP1-agf-c-11-0-UI 3'.
          Length = 438

 Score =  200 bits (508), Expect = 5e-51
 Identities = 106/131 (80%), Positives = 106/131 (80%), Gaps = 18/131 (13%)
 Frame = -3

Query: 1   QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLXX------------------XX 42
           QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRL                      
Sbjct: 415 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLAVCHQASASATPTSRPYSTGST 236

Query: 43  XXXRRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK 102
              RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK
Sbjct: 235 STHRRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK 56

Query: 103 ASFTHSFVAGC 113
           ASFTHSFVAGC
Sbjct: 55  ASFTHSFVAGC 23
This actually identifies a flaw in the alignment where a part of the human gene was left out because it was too long to fit into the existing alignment. The match below is human on the bottom and mouse on the top, showing the extra sequence also exists in a human EST.



>gb|BF329616.1|BF329616 CM2-BN0273-080600-227-b08 BN0273 Homo sapiens cDNA.
          Length = 298

 Score =  167 bits (423), Expect = 9e-41
 Identities = 84/95 (88%), Positives = 88/95 (92%)
 Frame = -3

Query: 4   LKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLAVCHQASASATPTSRPYSTGSTSTH 63
           LKMEMLAGCGAG+CQVV+TCPMEMLKIQLQDAGRLAV HQ SASA  TSR Y+TGS STH
Sbjct: 287 LKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGRLAVHHQGSASAPSTSRSYTTGSASTH 108

Query: 64  RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPF 98
           RRPSATLIA ELLRTQGL+GLYRGLGATLLRDIPF
Sbjct: 107 RRPSATLIAWELLRTQGLAGLYRGLGATLLRDIPF 3
This effort now successfully fills in the gap in mouse sequence 96.