sharkbait has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I have a problem (repeated in comments). I have a search string $mystring that I am searching for a matching string against a NCBI text file, nothing complicated, basic regex. The only slight modification is when I find a match, I need the string immediately following the match, not the match itself. The problem is I'm stuck. I've narrowed it down to maybe a regex issue? But basically, the files, basic text files, load, but the input from the search pattern (it's in the code as a CSV because initially I set it as such but got fed up fighting with Text::CSV so I just used text instead) doesn't seem to recognize each line of the array as a string. When I do an if loop to debug, Perl doesn't seem to recognize search patterns as strings. Below are sample text copied and pasted. I'd greatly appreciate any help! Thanks and Happy Mardi Gras!!

BB_B10 BB_B29 BB_B18 BB_B13 BB_B14 BB_B12 BB_B04 BB_B16 BB_B22 BB_B17 BB_B27 BB_B19 BB_B07 BB_B23 BB_B09 BB_B02 BB_B28 BB_B24 BB_B03 BB_B05 BB_B06
/locus_tag="BB_B01" /db_xref="GeneID:1194411" CDS 46..324 /locus_tag="BB_B01" /note="catalyzes the hydrolysis of acylphosphate" /codon_start=1 /transl_table=11 /product="acylphosphatase" /protein_id="NP_046987.2" /db_xref="GI:364556794" /db_xref="GeneID:1194411" /translation="MYKQQYFISGKVQGVGFRFFTEQIANNMKLKGFVK +NLNDGRVEI VAFFNTKEQMKKFEKLLNGNKYSNIKNIEKIVLDENYPFQFNDFKIYY" misc_feature 46..321 /locus_tag="BB_B01" /note="acylphosphatase; Provisional; Region: PRK1 +4432" /db_xref="CDD:184678" gene complement 308..751 /locus_tag="BB_B02" /db_xref="GeneID:1194420" CDS complement 308..751 /locus_tag="BB_B02" /note="hypothetical protein; identified by Glimme +r; putative" /codon_start=1 /transl_table=11 /product="hypothetical protein" /protein_id="NP_046988.1" /db_xref="GI:11497029" /db_xref="GeneID:1194420" /translation="MKIGPHYFFKKILKSNDNRTIYISYLYDRLASVKP +AGEWLRIYF KDSKRGKKYFILFNRNSSNGSFISCSFLKTSCNCGLDIKFSDGNLNIFC +RNRKSLEFL KFKVEHFFRTSVSCYKNNNSYVHNIKPKNKVKVLVKREASPNNKF" gene complement 837..2186 /locus_tag="BB_B03" /db_xref="GeneID:1194419" CDS complement 837..2186 /locus_tag="BB_B03" /note="hypothetical protein; identified by Glimme +r; putative" /codon_start=1 /transl_table=11 /product="hypothetical protein" /protein_id="NP_046989.1" /db_xref="GI:11497028" /db_xref="GeneID:1194419" /translation="MPPKVKIKNDFEIFRKELEILYKKYLNNELSYLKL +KEKLKILAE NHKAILFRKDKFTNRSIILNLSKTRKIIKEYINLSVIERIRRDNTFLFF +WKSRRIKEL KNIGIKDRKKIEELIFSNQMNDEKSYFQYFIDLFVTPKWLNDYAHKYKI +EKINSYRKE QIFVKINLNTYIEIIKLLLNQSRDIRLKFYGVLMAIGRRPVEVMKLSQF +YIADKNHIR MEFIAKKRENNIVNEVVFPVFADPELIINSIKEIRYMEQTENLTKEIIS +SNLAYSYNR LFRQIFNNIFAPEESVYFCRAIYCKFSYLAFAPKNMEMNYWITKVLGHE +PNDITTAFH YNRYVLDNLDDKADNSLLTLLNQRIYTYVRRKATYSTLTMDRLESLIKE +HHIFDDNYI KTLIVIKNLMLKDNLETLAMVRGLNVKIRKAFKATYGYNYNYIKLTEYL +SIIFNYKL" gene complement 2476..3798 /locus_tag="BB_B04" /db_xref="GeneID:1194410" CDS complement 2476..3798 /locus_tag="BB_B04"
#!/usr/bin/perl -w use Text::CSV; my @arrayOfVals; open(my $tmp, "<", "/Users/bioinformatics/Desktop/NC_001903.gbk.txt") +|| die "Could not open $!"; while (<$tmp>) { chomp; push(@arrayOfVals, $_); } close($tmp); my @arrayFromCSV; open(my $tmpFile, "<", "/Users/bioinformatics/Desktop/cp26_diffexpr.tx +t") || die "Could not open $!"; while (<$tmpFile>) { chomp; push(@arrayFromCSV, $_); } close($tmpFile); foreach(@arrayFromCSV) { if ($_ eq "BB_B10") { print "MAtch!!\n"; } else {print "$_\n";} } my @wkArray = @arrayOfVals; my @secArrayFromCSV = @arrayFromCSV; while (my $matchCheck = shift @wkArray) { $csvVal = shift(@secArrayFromCSV); if ( $csvVal =~ /$matchCheck/) { my $geneValue = shift(@wkArray); print "$geneValue\n"; } else { print "no match"; } }

Replies are listed 'Best First'.
Re: Need Help with Maybe a Regex Issue
by hdb (Monsignor) on Feb 27, 2014 at 07:21 UTC

    Under the assumption that the strings you are looking for are always the values of the /locus_tag, it is better to load the values you are searching for into a hash to use the exists function in addition to a regex. So you could loop over your second file, extract the value of the /locus_tag and if it exists in the hash, print the next line.

    use strict; use warnings; my %hashOfVals; open(my $tmp, "<", "NC_001903.gbk.txt") || die "Could not open $!"; undef @hashOfVals{map {chomp; $_} <$tmp>}; close($tmp); open(my $tmpFile, "<", "cp26_diffexpr.txt") || die "Could not open $!" +; my @arrayFromCSV = <$tmpFile>; chomp @arrayFromCSV; close($tmpFile); for (0..$#arrayFromCSV-1) { print "$1: $arrayFromCSV[$_+1]\n" if $arrayFromCSV[$_] =~ /locus_tag="(.*)"/ and exists $hashOfV +als{$1}; }

      Thanks for the reply! The code compiled and ran, but produced no output. I ran into the same problem with the C-style for loops (deleted now from the code, copied and pasted below). I wasn't completely sure if my problem was a regex issue, a problem in the implementation, or something simple I missed?

      for (my $j=0; $j<$#arrayFromCSV; $j++) { for (my $i=0; $i<$#wkArray; $i++) { if ("$arrayFromCSV[$j]" =~ /$wkArray[$i]/g) { print "$wkArray[$i+1]\n"; } } }
Re: Need Help with Maybe a Regex Issue
by graff (Chancellor) on Feb 27, 2014 at 04:20 UTC
    Based on what you seem to be trying to do in your third while loop (shifting values off of @wkArray), I think you actually want to use hashes instead of arrays.

    As it is, your code assumes that the two input files have matching elements in matching order. Do you know that the two files are already sorted in the same way, and actually have matching contents?

    It looks like you tried to include sample data in your post. Please edit that to put "code" tags around the data. In fact, you could include an adequate sample of data as part of your posted code, like this:

    (start with a <code> tag, of course, or just <c>) # your sample code # goes here… (but please just include what you're actually using) # when it's time to read the sample data in your code, do this: while (<DATA>) { # do stuff with the sample data } # more code… enough to exit cleanly with coherent output… __DATA__ your sample data goes here… (don't forget the closing "code" tag…)
    Of course, if there are supposed to be two different input files, I think only one of them can be "faked" with the __DATA__ trick, and you'll need to post a separate "code" block for the other data file.
      Thanks! Sorry the data files weren't formatted better, I'll fix that now. The first file is essentially a list of strings, and the second is a list of line-by-line strings.

      Ok, I got it to work (many thank again to hdb, Kenosis, and graff). I found a format from the NCBI called a GFF file worked much better for parsing. However, I couldn't get the suggested hash function to work, so my updated question is how would I get this to work using a hash instead and more resemble a Perl script than a cobbled Perl/C/something construct? As it stands, it functions, but it's inefficient and not elegant. (Truncated) Sample files also attached. EDIT: I made one more modification, I finally got the hang of using a hash.

      ##gff-version 3 #!gff-spec-version 1.20 #!processor NCBI annotwriter ##sequence-region NC_001903.1 1 26498 ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2 +24326 NC_001903.1 RefSeq region 1 26498 . + . ID=id0 +;Name=cp26;Dbxref=taxon:224326;Is_circular=true;gbkey=Src;genome=plas +mid;mol_type=genomic DNA;plasmid-name=cp26;strain=B31 NC_001903.1 RefSeq gene 46 324 . + . ID=gene0; +Name=BB_B01;Dbxref=GeneID:1194411;gbkey=Gene;locus_tag=BB_B01 NC_001903.1 RefSeq CDS 46 324 . + 0 ID=cds0;Na +me=NP_046987.2;Parent=gene0;Note=catalyzes the hydrolysis of acylphos +phate;Dbxref=Genbank:NP_046987.2,GeneID:1194411;gbkey=CDS;product=acy +lphosphatase;protein_id=NP_046987.2;transl_table=11 NC_001903.1 RefSeq gene 308 751 . - . ID=gene1 +;Name=BB_B02;Dbxref=GeneID:1194420;gbkey=Gene;locus_tag=BB_B02 NC_001903.1 RefSeq CDS 308 751 . - 0 ID=cds1;N +ame=NP_046988.1;Parent=gene1;Note=hypothetical protein%3B identified +by Glimmer%3B putative;Dbxref=Genbank:NP_046988.1,GeneID:1194420;gbke +y=CDS;product=hypothetical protein;protein_id=NP_046988.1;transl_tabl +e=11 NC_001903.1 RefSeq gene 837 2186 . - . ID=gene +2;Name=BB_B03;Dbxref=GeneID:1194419;gbkey=Gene;locus_tag=BB_B03 NC_001903.1 RefSeq CDS 837 2186 . - 0 ID=cds2; +Name=NP_046989.1;Parent=gene2;Note=hypothetical protein%3B identified + by Glimmer%3B putative;Dbxref=Genbank:NP_046989.1,GeneID:1194419;gbk +ey=CDS;product=hypothetical protein;protein_id=NP_046989.1;transl_tab +le=11 NC_001903.1 RefSeq gene 2476 3798 . - . ID=gen +e3;Name=BB_B04;Dbxref=GeneID:1194410;gbkey=Gene;locus_tag=BB_B04 NC_001903.1 RefSeq CDS 2476 3798 . - 0 ID=cds3 +;Name=NP_046990.2;Parent=gene3;Note=similar to GB:U07818 PID:466474 S +P:Q45400 percent identity: 36.93%3B identified by sequence similarity +%3B putative;Dbxref=Genbank:NP_046990.2,GeneID:1194410;gbkey=CDS;prod +uct=chitibiose transporter protein ChbC;protein_id=NP_046990.2;transl +_table=11 NC_001903.1 RefSeq gene 4084 4431 . + . ID=gen +e4;Name=BB_B05;Dbxref=GeneID:1194409;gbkey=Gene;locus_tag=BB_B05 NC_001903.1 RefSeq CDS 4084 4431 . + 0 ID=cds4 +;Name=NP_046991.1;Parent=gene4;Note=similar to SP:P46319 PID:895750 P +ID:1783268 GB:AL009126 percent identity: 27.62%3B identified by seque +nce similarity%3B putative;Dbxref=Genbank:NP_046991.1,GeneID:1194409; +gbkey=CDS;product=chitibiose transporter protein ChbA;protein_id=NP_0 +46991.1;transl_table=11 NC_001903.1 RefSeq gene 4482 4757 . + . ID=gen +e5;Name=BB_B06;Dbxref=GeneID:1194408;gbkey=Gene;locus_tag=BB_B06 NC_001903.1 RefSeq CDS 4482 4757 . + 0 ID=cds5 +;Name=NP_046992.2;Parent=gene5;Note=similar to SP:P46318 PID:895748 P +ID:1783266 GB:AL009126 percent identity: 46.32%3B identified by seque +nce similarity%3B putative;Dbxref=Genbank:NP_046992.2,GeneID:1194408; +gbkey=CDS;product=chitibiose transporter protein ChbB;protein_id=NP_0 +46992.2;transl_table=11 NC_001903.1 RefSeq gene 4769 5866 . + . ID=gen +e6;Name=BB_B07;Dbxref=GeneID:1194407;gbkey=Gene;locus_tag=BB_B07 NC_001903.1 RefSeq CDS 4769 5866 . + 0 ID=cds6 +;Name=NP_046993.1;Parent=gene6;Note=similar to GB:M88764 SP:Q09090 PI +D:469166 PID:1199570 percent identity: 21.29%3B identified by sequenc +e similarity%3B putative;Dbxref=Genbank:NP_046993.1,GeneID:1194407;gb +key=CDS;product=alpha3-beta1 integrin-binding protein;protein_id=NP_0 +46993.1;transl_table=11 NC_001903.1 RefSeq gene 5888 6517 . - . ID=gen +e7;Name=BB_B08;Dbxref=GeneID:1194418;gbkey=Gene;locus_tag=BB_B08;pseu +do=true NC_001903.1 RefSeq gene 6677 7714 . + . ID=gen +e8;Name=BB_B09;Dbxref=GeneID:1194417;gbkey=Gene;locus_tag=BB_B09 NC_001903.1 RefSeq CDS 6677 7714 . + 0 ID=cds7 +;Name=NP_046995.1;Parent=gene8;Note=hypothetical protein%3B identifie +d by Glimmer%3B putative;Dbxref=Genbank:NP_046995.1,GeneID:1194417;gb +key=CDS;product=hypothetical protein;protein_id=NP_046995.1;transl_ta +ble=11 NC_001903.1 RefSeq gene 7836 8765 . + . ID=gen +e9;Name=BB_B10;Dbxref=GeneID:1194406;gbkey=Gene;locus_tag=BB_B10 NC_001903.1 RefSeq CDS 7836 8765 . + 0 ID=cds8 +;Name=NP_046996.1;Parent=gene9;Note=similar to GP:1655797 percent ide +ntity: 33.68%3B identified by sequence similarity%3B putative;Dbxref= +Genbank:NP_046996.1,GeneID:1194406;gbkey=CDS;product=hypothetical pro +tein;protein_id=NP_046996.1;transl_table=11 NC_001903.1 RefSeq gene 8781 9299 . + . ID=gen +e10;Name=BB_B11;Dbxref=GeneID:1194405;gbkey=Gene;locus_tag=BB_B11;pse +udo=true NC_001903.1 RefSeq gene 9275 10036 . + . ID=ge +ne11;Name=BB_B12;Dbxref=GeneID:1194404;gbkey=Gene;locus_tag=BB_B12 NC_001903.1 RefSeq CDS 9275 10036 . + 0 ID=cds +9;Name=NP_046998.1;Parent=gene11;Note=similar to GP:2182756 percent i +dentity: 46.00%3B identified by sequence similarity%3B putative;Dbxre +f=Genbank:NP_046998.1,GeneID:1194404;gbkey=CDS;product=hypothetical p +rotein;protein_id=NP_046998.1;transl_table=11 NC_001903.1 RefSeq gene 10104 10652 . + . ID=g +ene12;Name=BB_B13;Dbxref=GeneID:1194403;gbkey=Gene;locus_tag=BB_B13 NC_001903.1 RefSeq CDS 10104 10652 . + 0 ID=cd +s10;Name=NP_046999.1;Parent=gene12;Note=similar to GB:U03641 PID:4582 +18 percent identity: 42.59%3B identified by sequence similarity%3B pu +tative;Dbxref=Genbank:NP_046999.1,GeneID:1194403;gbkey=CDS;product=hy +pothetical protein;protein_id=NP_046999.1;transl_table=11 NC_001903.1 RefSeq gene 10920 11417 . - . ID=g +ene13;Name=BB_B14;Dbxref=GeneID:1194402;gbkey=Gene;locus_tag=BB_B14
      BB_B10 BB_B29 BB_B18 BB_B13 BB_B14 BB_B12 BB_B04 BB_B16 BB_B22 BB_B17 BB_B27 BB_B19 BB_B07 BB_B23 BB_B09 BB_B02 BB_B28 BB_B24 BB_B03 BB_B05 BB_B06
      #!/usr/bin/perl -w use Data::Dumper; my @arrayOfVals; open(my $tmp, "<", "/Users/bioinformatics/Desktop/NC_001903.gff.txt") +|| die "Could not open $!"; LINE: while (<$tmp>) { chomp; next LINE if /^#/; # discard header, unneeded and will interfere push(@arrayOfVals, $_); } close($tmp); # input each line as an array my @tmpArray; open (my $arrVal, "<", "/Users/bioinformatics/Desktop/cp26_dff.txt") | +| die "Could not open $!"; while (<$arrVal>) { chomp; push(@tmpArray, $_); } close($arrVal); # same as above, each line of file is match string in array my @tmpDictKeyArray; my @tmpDictValueArray; for (my $k=0; $k<$#arrayOfVals; $k++) { $gffCompare = $arrayOfVals[$k]; $gffDescription = $arrayOfVals[$k+1]; # rationale: $gffCompare is the compare String, # and the information I need will always follow one entry after in + the array # if match is TRUE, then entry+1 will display the information for (my $j=0; $j<$#tmpArray; $j++) { $tmpArrayVal = $tmpArray[$j]; if ($gffCompare =~ /$tmpArrayVal/) { if ($gffCompare =~ /.*;locus_tag=(.*)/) { push(@tmpDictKeyArray, $1); # Assign to array for hash } if ($gffDescription =~ /.*;Name=(.*);protein_/) { push(@tmpDictValueArray, $1); # see above } } } } my %hashArray; @hashArray{@tmpDictKeyArray} = @tmpDictValueArray; print Dumper(\%hashArray);
        Here is a version which may help. You do not need the next LINE bit, and the C-style loop hides a bug - you never test for the last item in @tmpArray (BB_B06).
        my @arrayOfVals; open(my $tmp, "<", "/Users/bioinformatics/Desktop/NC_001903.gff.txt") +or die "Could not open $!"; while (<$tmp>) { chomp; next if /^#/; # discard header, unneeded and will interfere push(@arrayOfVals, $_); } close($tmp); my %hashOfKeys; open (my $arrVal, "<", "/Users/bioinformatics/Desktop/cp26_dff.txt") o +r die "Could not open $!"; while (<$arrVal>) { chomp; $hashOfKeys{$_} = 1; } close($arrVal); my %hash; for my $k (0 .. $#arrayOfVals -1) { if ($arrayOfVals[$k] =~ /.*;locus_tag=(.*)/) { next unless $hashOfKeys{$1}; my $key = $1; if ($arrayOfVals[$k+1] =~ /.*;Name=(.*);protein_/) { $hash{$key} = $1; } } } print Dumper(\%hash);
Re: Need Help with Maybe a Regex Issue
by Kenosis (Priest) on Feb 27, 2014 at 00:13 UTC

    It's evident that you're attempting to search a GenBank file. What, specifically, are you trying to find in that file, i.e., in which GenBank field?

      Thanks for the reply! I have a list (technically a spreadsheet that I made a list) of bacterial genes from an RNASeq sequencing run. I've isolated a sublist of strings that correspond to gene names that I am interested in investigating further, but I need to get information on them from the GenBank file, nothing too elaborate, basically Gene Name, GeneBank ID or GeneID, and a brief description. Initially I set up a simple script to type in the name to search (see the commented out STDIN section), which matched the pattern to the string in the GeneBank file-I just copied and pasted the lines I needed-but when I realized there were way too many for this to be feasible, I tried the current approach of inputing the list of genes from the RNASeq run as search patterns to match.