>gi|1387895|gb|W76821.1|W76821 me73f10.r1 Soares mouse embryo NbME13.5 14.5 Mus musculus cDNA clone IMAGE:401227 5' similar to WP:K02F3.2 CE01348 ;. Length = 455 Score = 48.1 bits (113), Expect = 2e-05 Identities = 23/23 (100%), Positives = 23/23 (100%) Frame = +3 Query: 1 ANEDGQVSPGSLLLAGAIAGMPA 23 ANEDGQVSPGSLLLAGAIAGMPA Sbjct: 384 ANEDGQVSPGSLLLAGAIAGMPA 452 >gi|1387483|gb|W77458.1|W77458 me66c04.r1 Soares mouse embryo NbME13.5 14.5 Mus musculus cDNA clone IMAGE:400518 5' similar to WP:K02F3.2 CE01348 ;. Length = 471 Score = 43.9 bits (102), Expect = 3e-04 Identities = 21/21 (100%), Positives = 21/21 (100%) Frame = +2 Query: 24 VIKTRLQVAARAGQTTYNGVT 44 VIKTRLQVAARAGQTTYNGVT Sbjct: 2 VIKTRLQVAARAGQTTYNGVT 64Shown above are the two ESTs that were at the existing boundaries of the small gap. Look at the first one W76821.1 (this is an accession number for the sequence. It has a version number. Some version numbers may be as high as 29). Notice on the 4th line that the Length is 455 nucleotides. Now look at the alignment. The upper sequence in the alignment is your query sequence. The lower sequence is the mouse EST hit. The line in between shows the matches between the two sequences. Look at the numbering on the lower sequence from 384 to 452. This is only three nucleotides from the end of the sequence. So this sequence stops where our gap was. The second EST W77458.1 starts where the gap ends. It has only one nucleotide protruding into the gap. If you click on the accession number of the first sequence it takes you to the sequence entry which was created on June 20, 1996. The second est was created on the same date. Notice that both ESTs have another identifier that looks like this: me73f10.r1. The r1 on the end of this name indicates it was sequenced from a reverse primer. This information may be valuable, since the insert of this clone may have been sequenced from the opposite end of the clone. That sequence would have a different accession number, but it would have the same clone name with .s1 at the end instead of .r1. The s1 indicates a standard read from a forward primer. The two sequences would be from the same clone, but from opposite ends. This can unite sequence fragmets that do not overlap in the middle. You can search for the clone name (without the .r1 part) in the NCBI homepage search window. If you find the .s1 match then you have more sequence information to use.
>gi|9904727|gb|BE624311.1|BE624311 uu45a06.y1 Soares_thymus_2NbMT Mus musculus cDNA clone IMAGE:3374866 5' similar to TR:O14575 O14575 SIMILAR TO ADP/ATP CARRIER PROTEINS ;. Length = 480 Score = 80.1 bits (196), Expect = 4e-15 Identities = 44/52 (84%), Positives = 44/52 (84%), Gaps = 8/52 (15%) Frame = +1 Query: 1 ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGVT 44 ANEDGQVSPGSLLLAGAIAGMPA VIKTRLQVAARAGQTTYNGVT Sbjct: 283 ANEDGQVSPGSLLLAGAIAGMPAASLVTPADVIKTRLQVAARAGQTTYNGVT 438 >gi|3519846|gb|AI119522.1|AI119522 uf04h04.y1 Sugano mouse liver mlia Mus musculus cDNA clone IMAGE:1499671 5' similar to WP:K02F3.2 CE01348 ;. Length = 670 Score = 80.1 bits (196), Expect = 4e-15 Identities = 44/52 (84%), Positives = 44/52 (84%), Gaps = 8/52 (15%) Frame = +2 Query: 1 ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGVT 44 ANEDGQVSPGSLLLAGAIAGMPA VIKTRLQVAARAGQTTYNGVT Sbjct: 482 ANEDGQVSPGSLLLAGAIAGMPAASLVTPADVIKTRLQVAARAGQTTYNGVT 637BE624311 uu45a06.y1 was created August 24, 2000 so it is fairly new. The second one was created Sept. 2, 1998. Note that the clone name has a .y1 after it. That pairs up with a .z1 of the same name. The hit shown below does not come from the same gene. It is nearly identical on the right side, but not on the left. You need to be careful, especially with short sequences, that you are not making chimeric or hybrid sequences when you assemble protein sequences from fragments.
>gi|10753840|gb|BF022507.1|BF022507 uy43g06.y1 NCI_CGAP_Lu30 Mus musculus cDNA clone IMAGE:3662362 5' similar to TR:O75746 O75746 ARALAR1 PROTEIN. ;. Length = 450 Score = 56.6 bits (135), Expect = 5e-08 Identities = 32/51 (62%), Positives = 39/51 (75%), Gaps = 8/51 (15%) Frame = +3 Query: 1 ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGV 43 A+E+G+V +LL AGA+AG+PA VIKTRLQVAARAGQTTY+GV Sbjct: 294 ADENGRVGGINLLTAGALAGVPAASLVTPADVIKTRLQVAARAGQTTYSGV 446That was an easy example of using the EST database to find a missing sequence fragment from a partial gene. Now we will try a harder problem with a larger gap, also in a mouse carrier. Go back to the sequence alignment link at the top of this page and search for -SFLAG in the alignment. You should find sequence 96 as shown below.
segment 3 94A ----------PRKSDGSGEAVFYWSLIAGLLSGMTSAFMVTPFDVVKTRLQADGEKK--------FKGIM 0 95 ------------FNELAGKASFAHSFVSGCVAGSIAAVAVTPLDVLKTRIQTLKKGLGE----DMYSGIT 0 96 ------------------------SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNE----DTYSGFL 0 96A ----------PRRNDGSGEAVFWCSFLAGLAAGSTAALAVNPFDVVKTRLQAIKKADGE----KEFKGIS 0This gap extends upstream over several pages of the alignment. For your convenience they are shown here with some surrounding sequence. This will be important later. Notice that sequence 95 and 96 in the first segment below are nearly 100% identical. Sequence 95 is a human sequence and it is the ortholog of the mouse sequence with the gap. That will be useful to us.
segment 1 94 GMYRGAAVNLTLVTPEKAIKLAANDFFRHQLS-KDG-----QKLTLLKEMLAGCGAGTCQVIVTTPMEML 112 94A GMYRGSAVNIVLITPEKAIKLTANDFFRYHLASDDG------VIPLSRATLAGGLAGLFQIVVTTPMELL 0 95 GMYRGAAVNLTLVTPEKAIKLAANDFFRRLLM-EDG-----MQRNLKMEMLAGCGAGMCQVVVTCPMEML 0 96 GMYRGAAVNLTLVTPEKAIKLAANDFLRH----------------------------------------- 0 96A GMYRGSGVNILLITPEKAIKLTANDYFRHKLTTKDG------KLPLTSQMVAGGLAGAFQIIVTTPMELL 0 segment 2 94 KIQLQDAGRIAAQRKILAPRSTATQLTRDLLRSRG-IAGLYKGLGATLLRDVPFSVVYFPLFANLNQLG- 0 94A KIQMQDAGRVDRAAGREVKTITALGLTKTLLRERG-IFGLYKGVGATGVRDITFSMVYFPLMAWINDQG- 0 95 KIQLQDAGRLAVHHQGSARRPSATLIAWELLRTQG-LAGLYRGLGATLLRDIPFSIIYFPLFANLNNLG- 0 96 ---------------------------------------------------------------------- 0 96A KIQMQDAGRVAKLAGKTVEKVSATQLASQLIKDKG-IFGLYKGIGATGLRDVTFSIIYFPLFATLNDLG- 0Try to do the same technique as before. Copy the beginning and end of sequence 96 and place them in the tblastn window. Select mouse ESTs and turn off the filter. Do the blast. The results will show hits that extend into the gap region.
>gi|11620519|gb|BF533156.1|BF533156 602073695F1 NCI_CGAP_Li9 Mus musculus cDNA clone IMAGE:4210775 5'. Length = 890 Score = 86.3 bits (212), Expect = 4e-17 Identities = 42/42 (100%), Positives = 42/42 (100%) Frame = +1 Query: 30 SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL 71 SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL Sbjct: 124 SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL 249 >gi|10379135|gb|BE861312.1|BE861312 UI-M-AK0-adi-c-11-0-UI.r1 NIH_BMAP_MHY Mus musculus cDNA clone UI-M-AK0-adi-c-11-0-UI 5'. Length = 538 Score = 65.5 bits (158), Expect = 8e-11 Identities = 39/73 (53%), Positives = 43/73 (58%), Gaps = 13/73 (17%) Frame = +3 Query: 1 GMYRGAAVNLTLVTPEKAIKLAANDFLRH-------------SFLAGCVAGSAAAVAVNP 47 GMYRGAAVNLTLVTPEKAIKLAANDFLR LAGC AG P Sbjct: 309 GMYRGAAVNLTLVTPEKAIKLAANDFLRQLLMQDGTQRNLKMEMLAGCGAGICQVGITCP 488 Query: 48 CDVVKTRLQSLER 60 +++K +LQ R Sbjct: 489 KEMLKIQLQDAGR 527The first hit above (BF533156) extends 123 bp into the gap upstream. You can see this from the numbering of the lower sequence starting at 124. To get this upstream sequence you must click on the accession number from the blast search and go to the sequence entry. Copy the sequence and take it to the quick protein translator from MBS (Molecular Biology Shortcuts). Paste it in the window and select three frames since you know it is in the plus strand and then translate it. Copy all three frames back to a word processor and comparte them to the mouse and the human sequenceabove it to find the correct frame. The DNA sequence of BF533156
GCCAACCTGAATCAGCTGGGCAAACAGGGGGAAGTAGACAATGGAGAAGGGAACATCCCC CTGTTTGCCAACCTGAATCAGCTGGGCCGCCCATCCTCTGAGGAGAAGTCGCCTTTCTAT GTGTCCTTCCTAGCAGGCTGCGTGGCTGGGAGTGCAGCCGCTGTGGCCGTCAACCCTTGT GATGTGGTGAAGACTCGGCTCCAGTCCCTTGAGAGAGGTGTTAATGAGGACACTTACTCT GGGTTTCTGGACTGTGCAAGGAAGATCTGGAGACATGAAGGTCCCTCAGCCTTCCTGAAA GGCGCATACTGCCGTGCGCTGGTCATTGCCCCGCTGTTTGGCATCGCCCAGGTGGTCTAC TTTCTGGGCATTGCCGAGTCCCTGCTGGGGCTGCTGCAAGAACCCCAGGCCTGAGCCCAT GGCTGCTTCTCTCCAGCCTATGGGCAGGGGCCAGAACAGGGTGACCAGCACAAGCCTGAG GAGGAGTGGTCTCTCCCCGGTCCTCCTCATTAAGATGGGAAGGCAAGGGGAGGGTGCAGG GTCCACATGGGTGATGCACACATAAGCCCCTGTGTGGTCCTGAAGGGACAACAAATGGGA TCGAGGTCTTATCTATGTAGAAAATGCAGAAATCTGTACATCCCTCAAGCCAGTTCTGTC CCATCCTTGTTACTCAAACCCAGTCCACTGGCTGAACACCCATGGGACAGAGCTGGTCTC TGGGTGGGGGCCCCAGGCCTGGTTTGGGAGGGGGACCTACCTGGGGTTCACTGGGCCTGG CCCTGGGGGCCCTGGCTTCCATAGGGGCCCACCCCCGATTTTTTGGGTTCCCCGCCAGGG GTCTCGGCCGGCGAGCTTGCCGGTGGTCCCCTCGACCTGTCCCCGCTTGG
At least from PLFAN the sequences match, but upstream they do not and this region must be considered suspect. You have now filled in part of the gap. The region between FANLNNLG and GCVAG is not very strong and may contain a frameshift. Watch for better matches in this region. Look at the other hit BE861312. This has sequence going into the gap from the other side. Compare the sequence with the human sequence.
Phase 1 ANLNQLGKQGEVDNGEGNIPLFANLNQLGRPSSEEKSPFYVSFLAGCVAGSAAAVAVNPC Phase 2 PT*ISWANRGK*TMEKGTSPCLPT*ISWAAHPLRRSRLSMCPS*QAAWLGVQPLWPSTL Phase 3 QPESAGQTGGSRQWRREHPPVCQPESAGPPIL*GEVAFLCVLPSRLRGWECSRCGRQPL Phase 1 mouse ANLNQLGKQGEVDNGEGNIPLFANLNQLGRPSSEEKSPFYVSFLAGCVAGSAAAVAVNPC ||||||| || | | || |||||| ||||| | YRGLGATLLRDIPFSIIYFPLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL Human 95
mouse QLLMQDGTQRNLKMEMLAGCGAGICQVGITCPKEMLKIQLQDAGR ||| || ||||||||||||||| ||| |||||||||||||||| RLLMEDGMQRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGR Human 95They are almost identical and so the frame is correct. This fills in much of the gap from the other side. There are only about 48 amino acids missing in the middle now. To try to fill this gap you can repeat the search of the EST database or you can go to another section to find the missing part, the HTGS section. I recommend this. You will find genomic DNA with introns here, so you need to have the human DNA to help identify the missing exon sequence. Construct a hybrid sequence with the new mouse sequence you just found and human from seq 95 to fill in the gap. Then blast the mouse HTGS section of Genbank. In the sequence below human is in lower case.
RLLMEDGMQRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGR lavhhqgsarrpsatliawellrtqglaglyrglgatllrdipfsiiyf PLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPLThe result from this search shows a good match which is probably to the same gene. There are some differences, however this is a genomic sequence and the gene is on a 90,000 bp contig that is probably more accurate than the EST sequences. There are also three different exons here (see the nucleotide numbering). One of the regions with the greatest number of sequence differences is the region that looked like it might contain frameshifts earlier. The differences on the 2nd exon are mouse human differnces.
>gb|AC084273.11|AC084273 Mus musculus clone rp23-313e8, WORKING DRAFT SEQUENCE, 5 unordered pieces Length = 182050 Score = 81.6 bits (200), Expect = 2e-15 Identities = 38/45 (84%), Positives = 40/45 (88%) Frame = -2 Query: 9 QRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGRLAVHHQGS 53 QRNLKMEMLAGCGAG+CQVV+TCPMEMLKIQLQDAGRL H S Sbjct: 142635 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLGEAHPNS 142501 Score = 58.5 bits (140), Expect = 2e-08 Identities = 30/34 (88%), Positives = 31/34 (90%) Frame = -3 Query: 55 RRPSATLIAWELLRTQGLAGLYRGLGATLLRDIP 88 RRPSATLIA ELLRTQGL+GLYRGLGATLLR P Sbjct: 139970 RRPSATLIARELLRTQGLSGLYRGLGATLLR*AP 139869 Score = 92.4 bits (228), Expect = 1e-18 Identities = 44/53 (83%), Positives = 47/53 (88%) Frame = -3 Query: 83 LLRDIPFSIIYFPLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL 135 L RDIPFSIIYFPLFANLN LG +EL GKASF HSFV+GC AGS+AAVAVTPL Sbjct: 139628 LYRDIPFSIIYFPLFANLNQLGVSELTGKASFTHSFVAGCTAGSVAAVAVTPL 139470 Probable sequence after assembly QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLXXXXXXX RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGKASFTHSFVAGCTAGSVAAVAVTPLThe sequence is almost done now, but there is the small matter of an intron exon joint to be settled. The human sequence has a phase 1 intron after DAGRL and the mouse is expected to have the same intron location. However, the sequence between the human and mouse does not match here for 7 amino acids. It would be desirable to find an EST for mouse that bridged this gap. The following search shows that there is an EST that does this and this sequence indicates that the mouse sequence is longer than the human sequence from the alignment.
>gb|AI848503.1|AI848503 UI-M-AP1-agf-c-11-0-UI.s2 NIH_BMAP_MST_N Mus musculus cDNA clone UI-M-AP1-agf-c-11-0-UI 3'. Length = 438 Score = 200 bits (508), Expect = 5e-51 Identities = 106/131 (80%), Positives = 106/131 (80%), Gaps = 18/131 (13%) Frame = -3 Query: 1 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLXX------------------XX 42 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRL Sbjct: 415 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLAVCHQASASATPTSRPYSTGST 236 Query: 43 XXXRRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK 102 RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK Sbjct: 235 STHRRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK 56 Query: 103 ASFTHSFVAGC 113 ASFTHSFVAGC Sbjct: 55 ASFTHSFVAGC 23This actually identifies a flaw in the alignment where a part of the human gene was left out because it was too long to fit into the existing alignment. The match below is human on the bottom and mouse on the top, showing the extra sequence also exists in a human EST.
>gb|BF329616.1|BF329616 CM2-BN0273-080600-227-b08 BN0273 Homo sapiens cDNA. Length = 298 Score = 167 bits (423), Expect = 9e-41 Identities = 84/95 (88%), Positives = 88/95 (92%) Frame = -3 Query: 4 LKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLAVCHQASASATPTSRPYSTGSTSTH 63 LKMEMLAGCGAG+CQVV+TCPMEMLKIQLQDAGRLAV HQ SASA TSR Y+TGS STH Sbjct: 287 LKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGRLAVHHQGSASAPSTSRSYTTGSASTH 108 Query: 64 RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPF 98 RRPSATLIA ELLRTQGL+GLYRGLGATLLRDIPF Sbjct: 107 RRPSATLIAWELLRTQGLAGLYRGLGATLLRDIPF 3This effort now successfully fills in the gap in mouse sequence 96.