Need Help with Maybe a Regex Issue

sharkbait has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I have a problem (repeated in comments). I have a search string $mystring that I am searching for a matching string against a NCBI text file, nothing complicated, basic regex. The only slight modification is when I find a match, I need the string immediately following the match, not the match itself. The problem is I'm stuck. I've narrowed it down to maybe a regex issue? But basically, the files, basic text files, load, but the input from the search pattern (it's in the code as a CSV because initially I set it as such but got fed up fighting with Text::CSV so I just used text instead) doesn't seem to recognize each line of the array as a string. When I do an if loop to debug, Perl doesn't seem to recognize search patterns as strings. Below are sample text copied and pasted. I'd greatly appreciate any help! Thanks and Happy Mardi Gras!!


BB_B10

BB_B29

BB_B18

BB_B13

BB_B14

BB_B12

BB_B04

BB_B16

BB_B22

BB_B17

BB_B27

BB_B19

BB_B07

BB_B23

BB_B09

BB_B02

BB_B28

BB_B24

BB_B03

BB_B05

BB_B06
[download]

                     /locus_tag="BB_B01"
                     /db_xref="GeneID:1194411"
     CDS             46..324
                     /locus_tag="BB_B01"
                     /note="catalyzes the hydrolysis of acylphosphate"
                     /codon_start=1
                     /transl_table=11
                     /product="acylphosphatase"
                     /protein_id="NP_046987.2"
                     /db_xref="GI:364556794"
                     /db_xref="GeneID:1194411"
                     /translation="MYKQQYFISGKVQGVGFRFFTEQIANNMKLKGFVK
+NLNDGRVEI
                     VAFFNTKEQMKKFEKLLNGNKYSNIKNIEKIVLDENYPFQFNDFKIYY"
     misc_feature    46..321
                     /locus_tag="BB_B01"
                     /note="acylphosphatase; Provisional; Region: PRK1
+4432"
                     /db_xref="CDD:184678"
     gene            complement  308..751  
                     /locus_tag="BB_B02"
                     /db_xref="GeneID:1194420"
     CDS             complement  308..751  
                     /locus_tag="BB_B02"
                     /note="hypothetical protein; identified by Glimme
+r;
                     putative"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_046988.1"
                     /db_xref="GI:11497029"
                     /db_xref="GeneID:1194420"
                     /translation="MKIGPHYFFKKILKSNDNRTIYISYLYDRLASVKP
+AGEWLRIYF
                     KDSKRGKKYFILFNRNSSNGSFISCSFLKTSCNCGLDIKFSDGNLNIFC
+RNRKSLEFL
                     KFKVEHFFRTSVSCYKNNNSYVHNIKPKNKVKVLVKREASPNNKF"
     gene            complement  837..2186  
                     /locus_tag="BB_B03"
                     /db_xref="GeneID:1194419"
     CDS             complement  837..2186  
                     /locus_tag="BB_B03"
                     /note="hypothetical protein; identified by Glimme
+r;
                     putative"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_046989.1"
                     /db_xref="GI:11497028"
                     /db_xref="GeneID:1194419"
                     /translation="MPPKVKIKNDFEIFRKELEILYKKYLNNELSYLKL
+KEKLKILAE
                     NHKAILFRKDKFTNRSIILNLSKTRKIIKEYINLSVIERIRRDNTFLFF
+WKSRRIKEL
                     KNIGIKDRKKIEELIFSNQMNDEKSYFQYFIDLFVTPKWLNDYAHKYKI
+EKINSYRKE
                     QIFVKINLNTYIEIIKLLLNQSRDIRLKFYGVLMAIGRRPVEVMKLSQF
+YIADKNHIR
                     MEFIAKKRENNIVNEVVFPVFADPELIINSIKEIRYMEQTENLTKEIIS
+SNLAYSYNR
                     LFRQIFNNIFAPEESVYFCRAIYCKFSYLAFAPKNMEMNYWITKVLGHE
+PNDITTAFH
                     YNRYVLDNLDDKADNSLLTLLNQRIYTYVRRKATYSTLTMDRLESLIKE
+HHIFDDNYI
                     KTLIVIKNLMLKDNLETLAMVRGLNVKIRKAFKATYGYNYNYIKLTEYL
+SIIFNYKL"
     gene            complement  2476..3798  
                     /locus_tag="BB_B04"
                     /db_xref="GeneID:1194410"
     CDS             complement  2476..3798  
                     /locus_tag="BB_B04"
[download]

 
#!/usr/bin/perl -w
use Text::CSV;

my @arrayOfVals;
open(my $tmp, "<", "/Users/bioinformatics/Desktop/NC_001903.gbk.txt") 
+|| die "Could not open $!";

while (<$tmp>) {
    chomp;
    push(@arrayOfVals, $_);
}
close($tmp);

my @arrayFromCSV;
open(my $tmpFile, "<", "/Users/bioinformatics/Desktop/cp26_diffexpr.tx
+t") || die "Could not open $!";

while (<$tmpFile>) {
    chomp;
    push(@arrayFromCSV, $_);
}
close($tmpFile);

foreach(@arrayFromCSV) {
    if ($_ eq "BB_B10") {
        print "MAtch!!\n";
    } else {print "$_\n";}
}

my @wkArray = @arrayOfVals;
my @secArrayFromCSV = @arrayFromCSV;


 while (my $matchCheck = shift @wkArray) {            
        $csvVal = shift(@secArrayFromCSV);
        if ( $csvVal =~ /$matchCheck/) {
            my $geneValue = shift(@wkArray);
             print "$geneValue\n";
        } else {
            print "no match";
            }
}
[download]

Comment on Need Help with Maybe a Regex Issue Select or Download Code

Replies are listed 'Best First'.
Re: Need Help with Maybe a Regex Issue by hdb (Monsignor) on Feb 27, 2014 at 07:21 UTC
Under the assumption that the strings you are looking for are always the values of the `/locus_tag`, it is better to load the values you are searching for into a hash to use the exists function in addition to a regex. So you could loop over your second file, extract the value of the `/locus_tag` and if it exists in the hash, print the next line. `use strict; use warnings; my %hashOfVals; open(my $tmp, "<", "NC_001903.gbk.txt") \|\| die "Could not open $!"; undef @hashOfVals{map {chomp; $_} <$tmp>}; close($tmp); open(my $tmpFile, "<", "cp26_diffexpr.txt") \|\| die "Could not open $!" +; my @arrayFromCSV = <$tmpFile>; chomp @arrayFromCSV; close($tmpFile); for (0..$#arrayFromCSV-1) { print "$1: $arrayFromCSV[$_+1]\n" if $arrayFromCSV[$_] =~ /locus_tag="(.*)"/ and exists $hashOfV +als{$1}; }` [download]	[reply] [d/l] [select]
Re^2: Need Help with Maybe a Regex Issue by sharkbait (Initiate) on Feb 27, 2014 at 15:43 UTC
Thanks for the reply! The code compiled and ran, but produced no output. I ran into the same problem with the C-style for loops (deleted now from the code, copied and pasted below). I wasn't completely sure if my problem was a regex issue, a problem in the implementation, or something simple I missed? `for (my $j=0; $j<$#arrayFromCSV; $j++) { for (my $i=0; $i<$#wkArray; $i++) { if ("$arrayFromCSV[$j]" =~ /$wkArray[$i]/g) { print "$wkArray[$i+1]\n"; } } }` [download]	[reply] [d/l]
Re: Need Help with Maybe a Regex Issue by graff (Chancellor) on Feb 27, 2014 at 04:20 UTC
Based on what you seem to be trying to do in your third while loop (shifting values off of @wkArray), I think you actually want to use hashes instead of arrays. As it is, your code assumes that the two input files have matching elements in matching order. Do you know that the two files are already sorted in the same way, and actually have matching contents? It looks like you tried to include sample data in your post. Please edit that to put "code" tags around the data. In fact, you could include an adequate sample of data as part of your posted code, like this: `(start with a <code> tag, of course, or just <c>) # your sample code # goes here… (but please just include what you're actually using) # when it's time to read the sample data in your code, do this: while (<DATA>) { # do stuff with the sample data } # more code… enough to exit cleanly with coherent output… __DATA__ your sample data goes here… (don't forget the closing "code" tag…)` [download] Of course, if there are supposed to be two different input files, I think only one of them can be "faked" with the __DATA__ trick, and you'll need to post a separate "code" block for the other data file.	[reply] [d/l]
Re^2: Need Help with Maybe a Regex Issue by sharkbait (Initiate) on Feb 27, 2014 at 15:11 UTC
Thanks! Sorry the data files weren't formatted better, I'll fix that now. The first file is essentially a list of strings, and the second is a list of line-by-line strings.	[reply]
Re^2: Need Help with Maybe a Regex Issue by sharkbait (Initiate) on Feb 27, 2014 at 18:10 UTC
Ok, I got it to work (many thank again to hdb, Kenosis, and graff). I found a format from the NCBI called a GFF file worked much better for parsing. However, I couldn't get the suggested hash function to work, so my updated question is how would I get this to work using a hash instead and more resemble a Perl script than a cobbled Perl/C/something construct? As it stands, it functions, but it's inefficient and not elegant. (Truncated) Sample files also attached. EDIT: I made one more modification, I finally got the hang of using a hash. ##gff-version 3 #!gff-spec-version 1.20 #!processor NCBI annotwriter ##sequence-region NC_001903.1 1 26498 ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2 +24326 NC_001903.1 RefSeq region 1 26498 . + . ID=id0 +;Name=cp26;Dbxref=taxon:224326;Is_circular=true;gbkey=Src;genome=plas +mid;mol_type=genomic DNA;plasmid-name=cp26;strain=B31 NC_001903.1 RefSeq gene 46 324 . + . ID=gene0; +Name=BB_B01;Dbxref=GeneID:1194411;gbkey=Gene;locus_tag=BB_B01 NC_001903.1 RefSeq CDS 46 324 . + 0 ID=cds0;Na +me=NP_046987.2;Parent=gene0;Note=catalyzes the hydrolysis of acylphos +phate;Dbxref=Genbank:NP_046987.2,GeneID:1194411;gbkey=CDS;product=acy +lphosphatase;protein_id=NP_046987.2;transl_table=11 NC_001903.1 RefSeq gene 308 751 . - . ID=gene1 +;Name=BB_B02;Dbxref=GeneID:1194420;gbkey=Gene;locus_tag=BB_B02 NC_001903.1 RefSeq CDS 308 751 . - 0 ID=cds1;N +ame=NP_046988.1;Parent=gene1;Note=hypothetical protein%3B identified +by Glimmer%3B putative;Dbxref=Genbank:NP_046988.1,GeneID:1194420;gbke +y=CDS;product=hypothetical protein;protein_id=NP_046988.1;transl_tabl +e=11 NC_001903.1 RefSeq gene 837 2186 . - . ID=gene +2;Name=BB_B03;Dbxref=GeneID:1194419;gbkey=Gene;locus_tag=BB_B03 NC_001903.1 RefSeq CDS 837 2186 . - 0 ID=cds2; +Name=NP_046989.1;Parent=gene2;Note=hypothetical protein%3B identified + by Glimmer%3B putative;Dbxref=Genbank:NP_046989.1,GeneID:1194419;gbk +ey=CDS;product=hypothetical protein;protein_id=NP_046989.1;transl_tab +le=11 NC_001903.1 RefSeq gene 2476 3798 . - . ID=gen +e3;Name=BB_B04;Dbxref=GeneID:1194410;gbkey=Gene;locus_tag=BB_B04 NC_001903.1 RefSeq CDS 2476 3798 . - 0 ID=cds3 +;Name=NP_046990.2;Parent=gene3;Note=similar to GB:U07818 PID:466474 S +P:Q45400 percent identity: 36.93%3B identified by sequence similarity +%3B putative;Dbxref=Genbank:NP_046990.2,GeneID:1194410;gbkey=CDS;prod +uct=chitibiose transporter protein ChbC;protein_id=NP_046990.2;transl +_table=11 NC_001903.1 RefSeq gene 4084 4431 . + . ID=gen +e4;Name=BB_B05;Dbxref=GeneID:1194409;gbkey=Gene;locus_tag=BB_B05 NC_001903.1 RefSeq CDS 4084 4431 . + 0 ID=cds4 +;Name=NP_046991.1;Parent=gene4;Note=similar to SP:P46319 PID:895750 P +ID:1783268 GB:AL009126 percent identity: 27.62%3B identified by seque +nce similarity%3B putative;Dbxref=Genbank:NP_046991.1,GeneID:1194409; +gbkey=CDS;product=chitibiose transporter protein ChbA;protein_id=NP_0 +46991.1;transl_table=11 NC_001903.1 RefSeq gene 4482 4757 . + . ID=gen +e5;Name=BB_B06;Dbxref=GeneID:1194408;gbkey=Gene;locus_tag=BB_B06 NC_001903.1 RefSeq CDS 4482 4757 . + 0 ID=cds5 +;Name=NP_046992.2;Parent=gene5;Note=similar to SP:P46318 PID:895748 P +ID:1783266 GB:AL009126 percent identity: 46.32%3B identified by seque +nce similarity%3B putative;Dbxref=Genbank:NP_046992.2,GeneID:1194408; +gbkey=CDS;product=chitibiose transporter protein ChbB;protein_id=NP_0 +46992.2;transl_table=11 NC_001903.1 RefSeq gene 4769 5866 . + . ID=gen +e6;Name=BB_B07;Dbxref=GeneID:1194407;gbkey=Gene;locus_tag=BB_B07 NC_001903.1 RefSeq CDS 4769 5866 . + 0 ID=cds6 +;Name=NP_046993.1;Parent=gene6;Note=similar to GB:M88764 SP:Q09090 PI +D:469166 PID:1199570 percent identity: 21.29%3B identified by sequenc +e similarity%3B putative;Dbxref=Genbank:NP_046993.1,GeneID:1194407;gb +key=CDS;product=alpha3-beta1 integrin-binding protein;protein_id=NP_0 +46993.1;transl_table=11 NC_001903.1 RefSeq gene 5888 6517 . - . ID=gen +e7;Name=BB_B08;Dbxref=GeneID:1194418;gbkey=Gene;locus_tag=BB_B08;pseu +do=true NC_001903.1 RefSeq gene 6677 7714 . + . ID=gen +e8;Name=BB_B09;Dbxref=GeneID:1194417;gbkey=Gene;locus_tag=BB_B09 NC_001903.1 RefSeq CDS 6677 7714 . + 0 ID=cds7 +;Name=NP_046995.1;Parent=gene8;Note=hypothetical protein%3B identifie +d by Glimmer%3B putative;Dbxref=Genbank:NP_046995.1,GeneID:1194417;gb +key=CDS;product=hypothetical protein;protein_id=NP_046995.1;transl_ta +ble=11 NC_001903.1 RefSeq gene 7836 8765 . + . ID=gen +e9;Name=BB_B10;Dbxref=GeneID:1194406;gbkey=Gene;locus_tag=BB_B10 NC_001903.1 RefSeq CDS 7836 8765 . + 0 ID=cds8 +;Name=NP_046996.1;Parent=gene9;Note=similar to GP:1655797 percent ide +ntity: 33.68%3B identified by sequence similarity%3B putative;Dbxref= +Genbank:NP_046996.1,GeneID:1194406;gbkey=CDS;product=hypothetical pro +tein;protein_id=NP_046996.1;transl_table=11 NC_001903.1 RefSeq gene 8781 9299 . + . ID=gen +e10;Name=BB_B11;Dbxref=GeneID:1194405;gbkey=Gene;locus_tag=BB_B11;pse +udo=true NC_001903.1 RefSeq gene 9275 10036 . + . ID=ge +ne11;Name=BB_B12;Dbxref=GeneID:1194404;gbkey=Gene;locus_tag=BB_B12 NC_001903.1 RefSeq CDS 9275 10036 . + 0 ID=cds +9;Name=NP_046998.1;Parent=gene11;Note=similar to GP:2182756 percent i +dentity: 46.00%3B identified by sequence similarity%3B putative;Dbxre +f=Genbank:NP_046998.1,GeneID:1194404;gbkey=CDS;product=hypothetical p +rotein;protein_id=NP_046998.1;transl_table=11 NC_001903.1 RefSeq gene 10104 10652 . + . ID=g +ene12;Name=BB_B13;Dbxref=GeneID:1194403;gbkey=Gene;locus_tag=BB_B13 NC_001903.1 RefSeq CDS 10104 10652 . + 0 ID=cd +s10;Name=NP_046999.1;Parent=gene12;Note=similar to GB:U03641 PID:4582 +18 percent identity: 42.59%3B identified by sequence similarity%3B pu +tative;Dbxref=Genbank:NP_046999.1,GeneID:1194403;gbkey=CDS;product=hy +pothetical protein;protein_id=NP_046999.1;transl_table=11 NC_001903.1 RefSeq gene 10920 11417 . - . ID=g +ene13;Name=BB_B14;Dbxref=GeneID:1194402;gbkey=Gene;locus_tag=BB_B14 [download] `BB_B10 BB_B29 BB_B18 BB_B13 BB_B14 BB_B12 BB_B04 BB_B16 BB_B22 BB_B17 BB_B27 BB_B19 BB_B07 BB_B23 BB_B09 BB_B02 BB_B28 BB_B24 BB_B03 BB_B05 BB_B06` [download] #!/usr/bin/perl -w use Data::Dumper; my @arrayOfVals; open(my $tmp, "<", "/Users/bioinformatics/Desktop/NC_001903.gff.txt") +\|\| die "Could not open $!"; LINE: while (<$tmp>) { chomp; next LINE if /^#/; # discard header, unneeded and will interfere push(@arrayOfVals, $_); } close($tmp); # input each line as an array my @tmpArray; open (my $arrVal, "<", "/Users/bioinformatics/Desktop/cp26_dff.txt") \| +\| die "Could not open $!"; while (<$arrVal>) { chomp; push(@tmpArray, $_); } close($arrVal); # same as above, each line of file is match string in array my @tmpDictKeyArray; my @tmpDictValueArray; for (my $k=0; $k<$#arrayOfVals; $k++) { $gffCompare = $arrayOfVals[$k]; $gffDescription = $arrayOfVals[$k+1]; # rationale: $gffCompare is the compare String, # and the information I need will always follow one entry after in + the array # if match is TRUE, then entry+1 will display the information for (my $j=0; $j<$#tmpArray; $j++) { $tmpArrayVal = $tmpArray[$j]; if ($gffCompare =~ /$tmpArrayVal/) { if ($gffCompare =~ /.;locus_tag=(.)/) { push(@tmpDictKeyArray, $1); # Assign to array for hash } if ($gffDescription =~ /.;Name=(.);protein_/) { push(@tmpDictValueArray, $1); # see above } } } } my %hashArray; @hashArray{@tmpDictKeyArray} = @tmpDictValueArray; print Dumper(\%hashArray); [download]	[reply] [d/l] [select]
Re^3: Need Help with Maybe a Regex Issue by tangent (Parson) on Feb 27, 2014 at 21:34 UTC
Here is a version which may help. You do not need the `next LINE` bit, and the C-style loop hides a bug - you never test for the last item in @tmpArray (BB_B06). my @arrayOfVals; open(my $tmp, "<", "/Users/bioinformatics/Desktop/NC_001903.gff.txt") +or die "Could not open $!"; while (<$tmp>) { chomp; next if /^#/; # discard header, unneeded and will interfere push(@arrayOfVals, $_); } close($tmp); my %hashOfKeys; open (my $arrVal, "<", "/Users/bioinformatics/Desktop/cp26_dff.txt") o +r die "Could not open $!"; while (<$arrVal>) { chomp; $hashOfKeys{$_} = 1; } close($arrVal); my %hash; for my $k (0 .. $#arrayOfVals -1) { if ($arrayOfVals[$k] =~ /.;locus_tag=(.)/) { next unless $hashOfKeys{$1}; my $key = $1; if ($arrayOfVals[$k+1] =~ /.;Name=(.);protein_/) { $hash{$key} = $1; } } } print Dumper(\%hash); [download]	[reply] [d/l] [select]
Re^4: Need Help with Maybe a Regex Issue by sharkbait (Initiate) on Mar 06, 2014 at 19:22 UTC
Re: Need Help with Maybe a Regex Issue by Kenosis (Priest) on Feb 27, 2014 at 00:13 UTC
It's evident that you're attempting to search a GenBank file. What, specifically, are you trying to find in that file, i.e., in which GenBank field?	[reply]
Re^2: Need Help with Maybe a Regex Issue by sharkbait (Initiate) on Feb 27, 2014 at 15:07 UTC
Thanks for the reply! I have a list (technically a spreadsheet that I made a list) of bacterial genes from an RNASeq sequencing run. I've isolated a sublist of strings that correspond to gene names that I am interested in investigating further, but I need to get information on them from the GenBank file, nothing too elaborate, basically Gene Name, GeneBank ID or GeneID, and a brief description. Initially I set up a simple script to type in the name to search (see the commented out STDIN section), which matched the pattern to the string in the GeneBank file-I just copied and pasted the lines I needed-but when I realized there were way too many for this to be feasible, I tried the current approach of inputing the list of genes from the RNASeq run as search patterns to match.	[reply]