Keep It Simple, Stupid | |
PerlMonks |
Re: Complicated pattern matchby scain (Curate) |
on Jan 21, 2003 at 04:03 UTC ( [id://228578]=note: print w/replies, xml ) | Need Help?? |
OK, so there have been several responses so far, and it appears that at least some of them work :-)
In the spirit of TMTOWTDI, I thought I would take a different approach. The problems dr_jgbn presents could be used to do several things, though the most obvious is to look for exons that are included in one transcript and not another. Also, the concern that Abagail-II raises is valid, therefore, I think it is reasonable and desireable to use industry standard analysis tools to arrive at a solution. What follows is a fairly long description of how I solved that problem.
First, one starts with two sequences that are related to one another. Typically, one would determine that with BLASTN. For this example, I used two sequences that are clearly related:
LOCUS AF418600 3389 bp mRNA linear PRI 30-DEC-2002 DEFINITION Homo sapiens ATP-binding cassette protein C13 (ABCC13) mRNA, complete cds. ACCESSION AF418600 VERSION AF418600.1 GI:19851883 KEYWORDS . SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 3389) AUTHORS Yabuuchi,H., Takayanagi,S.-I., Yoshinaga,K., Taniguchi,N., Aburatani,H. and Ishikawa,T. TITLE ABCC13, an unusual truncated ABC transporter, is highly expressed in fetal human liver JOURNAL Biochem. Biophys. Res. Commun. 299 (3), 410-417 (2002) MEDLINE 22334951 PUBMED 12445816 REFERENCE 2 (bases 1 to 3389) AUTHORS Yabuuchi,H., Takayanagi,S.-I. and Ishikawa,T. TITLE Direct Submission JOURNAL Submitted (07-SEP-2001) Biomolecular Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8501, Japan FEATURES Location/Qualifiers source 1..3389 /organism="Homo sapiens" /db_xref="taxon:9606" /chromosome="21" /map="21q11.2" /tissue_type="placenta" gene 1..3389 /gene="ABCC13" CDS 294..1271 /gene="ABCC13" /note="transporter" /codon_start=1 /product="ATP-binding cassette protein C13" /protein_id="AAL99903.1" /db_xref="GI:19851884" /translation="MLSSTQNAGGSYQRVRGALDTQKCSPEKSASFFSKVTYSWFSRV ITLGYKRPLEREDLFELKESDSFCTACPIFEKQWRKEVLRNQERQKVKVSCYKEAHIK KPSLLYALWNTFKSILIQVALFKVFADILSFTSPLIMKQIIIFCEHSSDFGWNGYGYA VALLVVVFLQTLILQQYQRFNMLTSAKVKTAVNGLIYKKVSLATLCVYFLLDEGNILT ATKVFTSMSLFNILRIPLFELPTVISAVVQTKISLGRLEDFLNTEELLPQSIETNYTG DHAIGFTDASFSWDKTGMPVLKEALWLMFLSRPGFRIAFCKKTFSLAPS" BASE COUNT 1087 a 639 c 698 g 965 t ORIGIN 1 ttcccgtctc cctccacacc ttgcctcctc caggcccaac cccattccac cagcaccccg 61 ccccaggcgc actggctagg actggaaggc aaggactaag gtgaaagctg actgcccgca 121 ccgccccagg cgcactgggt aggactgaaa ggcagggact aaggtgacag ctgactctgc 181 cggggcggag tgcgtgtcct cccaccgtgc tggcgctgaa ctgactgtcc gctgccaagg 241 gaagtgacag ccgcagccgg gctctcagcc agcggccggg cgcccagcgg accatgctct 301 ccagtacgca gaacgcgggc ggctcctatc agcgggtccg cggggcgctt gatacacaga 361 aatgcagtcc agagaaaagt gcctcatttt tcagtaaagt gacatattcc tggtttagca 421 gagtaattac tttaggctat aagagacctt tggaaagaga ggatcttttt gaactaaagg 481 aaagtgattc cttctgcact gcgtgtccca tctttgaaaa acaatggaga aaggaagttt 541 taaggaatca agagaggcaa aaagtaaagg tatcttgtta taaagaggca catatcaaga 601 aaccatctct actctatgca ttgtggaaca cctttaaatc catcctgatt caagttgcct 661 tattcaaagt gtttgctgat attttgtcct tcactagccc actcataatg aagcaaatta 721 tcattttctg tgaacacagc tcagattttg gctggaatgg ctatggctat gcagtggcac 781 ttcttgttgt agtctttttg caaactctga ttcttcagca atatcaacgt tttaacatgc 841 tcacctcagc aaaagttaag acagctgtaa atggactgat ctacaaaaag gtgtccctgg 901 caaccttgtg tgtctatttc ttactggatg aaggaaacat tttaacagcc actaaagtgt 961 tcacatcgat gtctttgttt aatattttaa ggattcctct gtttgagtta ccaaccgtga 1021 tctcagctgt ggtccagaca aagatatccc tgggccgttt ggaagacttt ctcaacactg 1081 aggagcttct tcctcaaagt attgaaacga actatacagg agatcatgct attgggttta 1141 cagatgcttc tttctcctgg gataaaacag gaatgccagt tctaaaagag gctctgtggc 1201 ttatgtttct cagcaggcct ggattcagaa ttgcattttg caagaaaaca ttctctttgg 1261 ctccatcatg aagaaagagt tttatgagca agtattggaa gcctgtgctc tccttccaga 1321 tttggagcag ttgccaaagg gagatcaaac tgagattgga gaaagagtaa ggaaacggct 1381 gtgaatataa gtgggggcca gcagcatcga gtaagcctgg ccagagctgt ctacagtggg 1441 gccgacgtct acctcctgga tgatcccttg tctgctattg atgttcatgt tggaaagcag 1501 ctttttgaaa aagtgatagg gtccttgggc cttttgaaaa accgggtagc catagtttca 1561 tttatgcttt cttgtggggc aggggatata tctagcactg tgcctaaaat tccatttatt 1621 ttgtttattt gttgaagcaa agtagtcaca tttcacatga aaaaaatggg atgctcagat 1681 tgcaatttca caggcaaatg tcaatgggta atttttgaag ttttacttac aagtttactc 1741 tggtgacaat ttatgtgatt gttatctctt catctaatac catatcctaa ggggatgtat 1801 gtgccataat gaattgttgt tgactcaagt gatttttaca atggtatttt ggtgcacacc 1861 aaaatattaa taaaatttta tattattaac atagcaattt gacattatta tcacagcaat 1921 catagtgatt cagctgtttt gttaaagggc caagaagatg aaaattagtg tttaccagta 1981 catgccatat ttcacagggt aaattagttc aatccctatt tcacacataa aggagctcag 2041 ggctcacaat gacttgatat aggttgctca acttaagaac aatagagatt tttaaaagta 2101 tgtctgactc caaagccaca ttcttcataa tatgtgttta tggaattatt aatatagaga 2161 actttgaaaa tatatgtata aaattataaa atatatgtct cagtaaaaat aaacaaacat 2221 ctttatatat tgacataaac ccaagattaa aaatgtcaca cttaattgta tgtatatgta 2281 taaaataata tgaatgattt aaaaataagg actatgaatt atgacttgga ttaaacactt 2341 tttttgtaat ggtactgtat atttaaaaat aagcaagaat gacatatatg aaattgtttt 2401 aaatttgcaa ttaaattgat tcatttctaa tgaacgaaaa aatacccgaa ctaaagtgtt 2461 aaaaatcaca tttagaaact attttctcat ctagcagcat gtcattattc cttgtaaaat 2521 acatagatgc tgccttaaac tggcacctac caaaaatcat actctgaaat agtaaaggaa 2581 aaaagtcatt gtagcttttt ttaaaaagcc atttttcgta atctctgaga taaaacactt 2641 tgtttttcag actcatattt tagtgacaca caatctcaca cttctgccac aaatgaatct 2701 tattgttgta atgaaaagtg gcagaatagc tcaaatggga atataccagg agctgctgtg 2761 taaaaccaaa aatctcacta atttgcacca agtcatcagt gaacaagaaa aagctcatgc 2821 tctaaagcaa gtcagtgcaa tcaactccag aactagacca aaagacaaaa tcctggagca 2881 aaaacatagt gggatatgag tccccttctg agagtcatgc aaggatttct aggcactgtt 2941 aagaacccac gtagaaagaa gatgttaagt gacggtagaa ccagttttca gggttgcctg 3001 attttttgaa gcagcacttc acagaagtcg aggtgaccag gattgacatg caagttctaa 3061 gagacaggcc ctcattggat caaggaaaac agctctcaat gaaaaaagaa aagatccctg 3121 ttggtgggtt gaagttctcc atcattctgc agtacctcca agcctttggc tggctctggg 3181 tgtggctgac tgtggtcact tacttagggc agaatttggt gagcattgga cagaatctat 3241 ggcttagtgc atgggcaaaa gaagcaaaaa acatgaatga gttcacagaa tggaagcaaa 3301 taagaagcaa taaacttaat atctatggat tattgggatt aataaaaggt aaaagaaaat 3361 gtaaaaaaaa aaaaaaaaaa aaaaaaaaa //and LOCUS ABCC13 1073 bp mRNA linear PRI 23-DEC-2002 DEFINITION Homo sapiens ATP-binding cassette, sub-family C (CFTR/MRP), member 13 (ABCC13), transcript variant 3, mRNA. ACCESSION NM_172026 VERSION NM_172026.1 GI:25952076 KEYWORDS . SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 1073) AUTHORS Allikmets,R., Gerrard,B., Hutchinson,A. and Dean,M. TITLE Characterization of the human ABC superfamily: isolation and mapping of 21 new genes using the expressed sequence tags database JOURNAL Hum. Mol. Genet. 5 (10), 1649-1655 (1996) MEDLINE 97049974 PUBMED 8894702 REFERENCE 2 (bases 1 to 1073) AUTHORS Yabuuchi,H., Takayanagi,S., Yoshinaga,K., Taniguchi,N., Aburatani,H. and Ishikawa,T. TITLE ABCC13, an unusual truncated ABC transporter, is highly expressed in fetal human liver JOURNAL Biochem. Biophys. Res. Commun. 299 (3), 410-417 (2002) MEDLINE 22334951 PUBMED 12445816 COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from AF518320.1 and AF418600.1. Summary: The protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, and White). This ABC transporter is a member of the MRP subfamily which is involved in multi-drug resistance. Alternative splicing of this gene results in multiple transcript variants; however, not all variants have been fully described. Transcript Variant: This variant (3) lacks a segment in the coding region, compared to transcript variant 1. This results in a longer protein (isoform c) compared to isoform a. This variant (3) has been alternatively referred to as transcript variant C. COMPLETENESS: complete on the 3' end. FEATURES Location/Qualifiers source 1..1073 /organism="Homo sapiens" /db_xref="taxon:9606" /chromosome="21" /map="21q11.1" gene 1..1073 /gene="ABCC13" /note="synonyms: PRED6, C21orf73" /db_xref="LocusID:150000" CDS 294..1031 /gene="ABCC13" /note="ATP-binding cassette transporter C13; multidrug resistance associated protein-like" /codon_start=1 /product="ATP-binding cassette protein C13 isoform c" /protein_id="NP_742023.1" /db_xref="GI:25952077" /db_xref="LocusID:150000" /translation="MLSSTQNAGRLIHRVITLGYKRPLEREDLFELKESDSFCTACPI FEKQWRKEVLRNQERQKVKVSCYKEAHIKKPSLLYALWNTFKSILIQVALFKVFADIL SFTSPLIMKQIIIFCEHSSDFGWNGYGYAVALLVVVFLQTLILQQYQRFNMLTSAKVK TAVNGLIYKKALLLSNVSRQKFSTGEIINLMSATHGLDSKPQSPLVCPFSNPNGRISP LARAGSSSVSRGGSPCVCYTNKCFSCN" misc_feature 426..869 /gene="ABCC13" /note="SunT; Region: ABC-type bacteriocin/lantibiotic exporters, contain an N-terminal double-glycine peptidase domain Defense mechanisms" /db_xref="CDD:COG2274" misc_feature 516..869 /gene="ABCC13" /note="MdlB; Region: ABC-type multidrug transport system, ATPase and permease components Defense mechanisms" /db_xref="CDD:COG1132" misc_feature 558..869 /gene="ABCC13" /note="ABC_membrane; Region: ABC transporter transmembrane region. This family represents a unit of six transmembrane helices. Many members of the ABC transporter family (pfam00005) have two such regions" /db_xref="CDD:pfam00664" variation 889 /gene="ABCC13" /allele="G" /allele="A" /db_xref="dbSNP:2822558" polyA_signal 1032..1037 /gene="ABCC13" polyA_site 1052 /gene="ABCC13" BASE COUNT 306 a 261 c 238 g 268 t ORIGIN 1 ttcccgtctc cctccacacc ttgcctcctc caggcccaac cccattccac cagcaccccg 61 ccccaggcgc actggctagg actggaaggc aaggactaag gtgaaagctg actgcccgca 121 ccgccccagg cgcactgggt aggactgaaa ggcagggact aaggtgacag ctgactctgc 181 cggggcggag tgcgtgtcct cccaccgtgc tggcgctgaa ctgactgtcc gctgccaagg 241 gaagtgacag ccgcagccgg gctctcagcc agcggccagg cgccccgcgg accatgctct 301 ccagtacgca gaacgcgggg cgcttgatac acagagtaat tactttaggc tataagagac 361 ctttggaaag agaggatctt tttgaactaa aggaaagtga ttccttctgc actgcgtgtc 421 ccatctttga aaaacaatgg agaaaggaag ttttaaggaa tcaagagagg caaaaagtaa 481 aggtatcttg ttataaagag gcacatatca agaaaccatc tctactctat gcattgtgga 541 acacctttaa atccatcctg attcaagttg ccttattcaa agtgtttgct gatattttgt 601 ccttcactag cccactcata atgaagcaaa ttatcatttt ctgtgaacac agctcagatt 661 ttggctggaa tggctatggc tatgcagtgg cacttcttgt tgtagtcttt ttgcaaactc 721 tgattcttca gcaatatcaa cgttttaaca tgctcacctc agcaaaagtt aagacagctg 781 taaatggact gatctacaaa aaggccttac ttttatcaaa tgtttctcga caaaagtttt 841 ccactgggga aattattaac ttgatgtcag caactcatgg acttgacagc aaacctcaat 901 ctcctctggt ctgccccttt tcaaatccta atggccgtat atctcctttg gcaagagctg 961 ggtccagcag tgttagcagg ggtggcagtc cttgtgtttg ttataccaat aaatgcttta 1021 gctgcaacta aaataaaaaa gttaaaggta agaaaaaaaa aaaaaaaaaa aaa //Using these as starting sequences, I wrote a script that does a bl2seq (blast two sequences against each other), then look at the results, determine what is missing with respect to each sequence. The bl2seq output is shown here: Query= a.seq (3389 letters) >b.seq Length = 1073 Score = 938 bits (473), Expect = 0.0 Identities = 473/473 (100%) Strand = Plus / Plus Query: 419 cagagtaattactttaggctataagagacctttggaaagagaggatctttttgaactaaa 478 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 332 cagagtaattactttaggctataagagacctttggaaagagaggatctttttgaactaaa 391 Query: 479 ggaaagtgattccttctgcactgcgtgtcccatctttgaaaaacaatggagaaaggaagt 538 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 392 ggaaagtgattccttctgcactgcgtgtcccatctttgaaaaacaatggagaaaggaagt 451 Query: 539 tttaaggaatcaagagaggcaaaaagtaaaggtatcttgttataaagaggcacatatcaa 598 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 452 tttaaggaatcaagagaggcaaaaagtaaaggtatcttgttataaagaggcacatatcaa 511 Query: 599 gaaaccatctctactctatgcattgtggaacacctttaaatccatcctgattcaagttgc 658 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 512 gaaaccatctctactctatgcattgtggaacacctttaaatccatcctgattcaagttgc 571 Query: 659 cttattcaaagtgtttgctgatattttgtccttcactagcccactcataatgaagcaaat 718 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 572 cttattcaaagtgtttgctgatattttgtccttcactagcccactcataatgaagcaaat 631 Query: 719 tatcattttctgtgaacacagctcagattttggctggaatggctatggctatgcagtggc 778 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 632 tatcattttctgtgaacacagctcagattttggctggaatggctatggctatgcagtggc 691 Query: 779 acttcttgttgtagtctttttgcaaactctgattcttcagcaatatcaacgttttaacat 838 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 692 acttcttgttgtagtctttttgcaaactctgattcttcagcaatatcaacgttttaacat 751 Query: 839 gctcacctcagcaaaagttaagacagctgtaaatggactgatctacaaaaagg 891 ||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 752 gctcacctcagcaaaagttaagacagctgtaaatggactgatctacaaaaagg 804 Score = 617 bits (311), Expect = e-180 Identities = 317/319 (99%) Strand = Plus / Plus Query: 1 ttcccgtctccctccacaccttgcctcctccaggcccaaccccattccaccagcaccccg 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 1 ttcccgtctccctccacaccttgcctcctccaggcccaaccccattccaccagcaccccg 60 Query: 61 ccccaggcgcactggctaggactggaaggcaaggactaaggtgaaagctgactgcccgca 120 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 61 ccccaggcgcactggctaggactggaaggcaaggactaaggtgaaagctgactgcccgca 120 Query: 121 ccgccccaggcgcactgggtaggactgaaaggcagggactaaggtgacagctgactctgc 180 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 121 ccgccccaggcgcactgggtaggactgaaaggcagggactaaggtgacagctgactctgc 180 Query: 181 cggggcggagtgcgtgtcctcccaccgtgctggcgctgaactgactgtccgctgccaagg 240 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 181 cggggcggagtgcgtgtcctcccaccgtgctggcgctgaactgactgtccgctgccaagg 240 Query: 241 gaagtgacagccgcagccgggctctcagccagcggccgggcgcccagcggaccatgctct 300 ||||||||||||||||||||||||||||||||||||| ||||||| |||||||||||||| Sbjct: 241 gaagtgacagccgcagccgggctctcagccagcggccaggcgccccgcggaccatgctct 300 Query: 301 ccagtacgcagaacgcggg 319 ||||||||||||||||||| Sbjct: 301 ccagtacgcagaacgcggg 319 Score = 44.1 bits (22), Expect = 2e-07 Identities = 22/22 (100%) Strand = Plus / Plus Query: 339 cgcggggcgcttgatacacaga 360 |||||||||||||||||||||| Sbjct: 314 cgcggggcgcttgatacacaga 335 Lambda K H 1.37 0.711 1.31 Gapped Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Gap Penalties: Existence: 5, Extension: 2 Number of Hits to DB: 6 Number of Sequences: 0 Number of extensions: 6 Number of successful extensions: 6 Number of sequences better than 1.0e-04: 1 Number of HSP's better than 0.0 without gapping: 1 Number of HSP's successfully gapped in prelim test: 0 Number of HSP's that attempted gapping in prelim test: 0 Number of HSP's gapped (non-prelim): 4 length of query: 3389 length of database: 1073 effective HSP length: 10 effective length of query: 3379 effective length of database: 1063 effective search space: 3591877 effective search space used: 3591877 T: 0 A: 0 X1: 6 (11.9 bits) X2: 15 (29.7 bits) S1: 12 (24.3 bits) S2: 18 (36.2 bits)Finally, the code is below. It uses Bio::Tools::BPbl2seq, which is part of BioPerl. Note that the code could do several more things, like using BioPerl tools to do the blast on a remote server and turning the sequences into Bio::Seq objects, so that object methods could be used. For this example, that didn't seem necessary. There are also analysis options that one might want to consider; I'll address those after the output. So, ignoring the two end pieces that are different, there appear to be two short pieces of sequence A that contains sequence that is not contained in sequence B. There are several caveats that go along with this data. First, because BLAST filters out simple repeats, it is possible that some edges are missing. Also, there are frequently overlaps around the edges of BLAST hits, since the junctions around exons frequently look similar. So a more detailed analysis of the found exons is usually required. At a minimum, that analysis should including looking for likely splice sites. If the genome is available, the found exons could be blasted against the genome and the up and down stream regions could be cropped out and run through a splice finder. Along with the splice sites, it is frequently useful to look at the resulting translation into protein to see how the predicted splicing would affect the protein (ie, does it result in early termination, does it include or exclude a potentially important motif, etc).
Scott
In Section
Seekers of Perl Wisdom
|
|