eweaverp has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks...

My gorgeous little regex

($temp, $id) = $string =~ m/(Gallus\sgallus|Chicken).*?([0-9]+)>.*?$right.*?\n/i;
doesn't match the line
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=18656900&dopt=GenBank" >gb|AF468789.1|</a> Gallus gallus LIM domain-containing transcri... <a href = #18656900> 48</a> 0.006
But why? Am I abusing the '|' perhaps? I intended it to mean 'or'.

Thanks...

~evan

Context of match:
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=31543124&dopt=GenBank" >ref|NM_008505.2|</a> Mus musculus LIM domain only 2 (Lmo2), mRNA <a href = #31543124>476</a> e-132
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=12850762&dopt=GenBank" >dbj|AK013416.1|</a> Mus musculus 10, 11 days embryo whole body ... <a href = #12850762>476</a> e-132
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=27551003&dopt=GenBank" >emb|AL928544.3|</a> Mouse DNA sequence from clone RP23-358C5 on... <a href = #27551003>391</a> e-106
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=06633806&dopt=GenBank" >ref|NM_005574.2|</a> Homo sapiens LIM domain only 2 (rhombotin-... <a href = #6633806>218</a> 3e-54
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=27502790&dopt=GenBank" >gb|BC042426.1|</a> Homo sapiens, LIM domain only 2 (rhombotin-l... <a href = #27502790>218</a> 3e-54
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=27356720&dopt=GenBank" >gb|AC132216.7|</a> Homo sapiens chromosome 11, clone RP13-786C1... <a href = #27356720>218</a> 3e-54
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=23272667&dopt=GenBank" >gb|BC035607.1|</a> Homo sapiens, LIM domain only 2 (rhombotin-l... <a href = #23272667>218</a> 3e-54
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=21706498&dopt=GenBank" >gb|BC034041.1|</a> Homo sapiens, LIM domain only 2 (rhombotin-l... <a href = #21706498>218</a> 3e-54
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=00663012&dopt=GenBank" >emb|X61118.1|HSTTG2</a> Human TTG-2 mRNA for a cysteine rich pr... <a href = #663012>218</a> 3e-54
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=00200748&dopt=GenBank" >gb|M64360.1|MUSRHOM2B</a> Mouse rhom-2 mRNA, complete cds <a href = #200748> 68</a> 7e-09
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=18656900&dopt=GenBank" >gb|AF468789.1|</a> Gallus gallus LIM domain-containing transcri... <a href = #18656900> 48</a> 0.006
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=28626638&dopt=GenBank" >gb|AC120535.5|</a> Oryza sativa chromosome 3 BAC OSJNBa0092N01 ... <a href = #28626638> 46</a> 0.025
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=24413773&dopt=GenBank" >emb|AL939114.1|SCO939114</a> Streptomyces coelicolor A3(2) comp... <a href = #24413773> 42</a> 0.39
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=14794889&dopt=GenBank" >gb|AF357202.1|</a> Streptomyces nodosus amphotericin biosynthet... <a href = #14794889> 40</a> 1.5
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=22474846&dopt=GenBank" >gb|AC131565.1|</a> Homo sapiens chromosome 5 clone RP11-1273L8,... <a href = #22474846> 40</a> 1.5
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=21070666&dopt=GenBank" >gb|AC114959.2|</a> Homo sapiens chromosome 5 clone RP11-170L13,... <a href = #21070666> 40</a> 1.5
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=18767410&dopt=GenBank" >gb|AC011368.5|</a> Homo sapiens chromosome 5 clone CTB-104K23, ... <a href = #18767410> 40</a> 1.5
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=16356869&dopt=GenBank" >gb|AC011351.5|</a> Homo sapiens chromosome 5 clone CTC-325N22, ... <a href = #16356869> 40</a> 1.5
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=20451702&dopt=GenBank" >emb|AL731593.1|OSJN00235</a> Oryza sativa genomic DNA, chromoso... <a href = #20451702> 40</a> 1.5
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=21912446&dopt=GenBank" >emb|AL606441.2|OSJN00012</a> Oryza sativa genomic DNA, chromoso... <a href = #21912446> 40</a> 1.5
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=30984154&dopt=GenBank" >dbj|AP005785.2|</a> Oryza sativa (japonica cultivar-group) geno... <a href = #30984154> 40</a> 1.5
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=29122863&dopt=GenBank" >dbj|AP005167.2|</a> Oryza sativa (japonica cultivar-group) geno... <a href = #29122863> 40</a> 1.5

Replies are listed 'Best First'.
Re: Regex?
by sauoq (Abbot) on Jul 01, 2003 at 22:46 UTC

    Seems to work fine for me. I'm not sure exactly what you want to match with it, but this:

    $string = <DATA>;
    $right = quotemeta('</a>');
    ($temp, $id) = $string =~ m/(Gallus\sgallus|Chicken).*?([0-9]+)>.*?$right.*?\n/i;
    print "$temp: $id\n";
    __DATA__
    <a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=18656900&dopt=GenBank" >gb|AF468789.1|</a> Gallus gallus LIM domain-containing transcri... <a href = #18656900> 48</a> 0.006
    prints this:
    Gallus gallus: 18656900
    -sauoq
    "My two cents aren't worth a dime.";
    

      You're right, it works. Thanks for your sample code though, it helped me exclude what I thought was certainly the problem. The _real_ problem was a variable name collision farther up that made it look like the regex was failing.

      Bleh. Oh well. Thanks all... one more notch for the Stick of Experience.

      ~evan

Re: Regex?
by mr_mischief (Monsignor) on Jul 01, 2003 at 22:35 UTC
    Why do you include '\n' in the regex?

    Christopher E. Stith
      He's right -- take that out and it works just fine for me. And if I put a linebreak at the end of $string that works too.



      “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
      M-J D

        His sample data looks like it has newlines. Did you test yours with a newline in the data? It should work fine (for some definition of "fine") if the data contains one.

        This may well be the problem. He may be removing the newline, for instance. But, as given, there isn't an error at all. The code works.

        -sauoq
        "My two cents aren't worth a dime.";
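The newline point can be checked directly. What follows is a minimal sketch, assuming the line is read and then chomped somewhere before the match; the sub name `match_hit` and the shortened sample line (with a placeholder href) are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the regex from the original post to one line
# and returns the two captures (or an empty list on failure).
sub match_hit {
    my ($string) = @_;
    my $right = quotemeta('</a>');
    return $string =~ m/(Gallus\sgallus|Chicken).*?([0-9]+)>.*?$right.*?\n/i;
}

# Shortened sample row; the href is a placeholder, the rest follows the post.
my $line = qq{<a href="x" >gb|AF468789.1|</a> Gallus gallus LIM domain-containing transcri... <a href = #18656900> 48</a> 0.006\n};

my @with_newline = match_hit($line);

# Same data with the trailing newline removed, as chomp would do.
chomp(my $chomped = $line);
my @without_newline = match_hit($chomped);

print "with newline: @with_newline\n";
print "after chomp: ", (@without_newline ? "@without_newline" : "no match"), "\n";
```

With the newline present the pattern captures the organism name and the id; after `chomp` the trailing `\n` in the pattern has nothing to match, so the whole match fails.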
        
Re: Regex?
by dragonchild (Archbishop) on Jul 01, 2003 at 22:57 UTC
    Would HTML::Parser help?

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Regex?
by eweaverp (Scribe) on Jul 01, 2003 at 22:02 UTC

    Whoops...insert this

    my $right = quotemeta('</a>');

    before the regex. My bad.

    ~evan
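
Putting the pieces of this thread together, here is a minimal sketch that assumes the hit list is processed one line at a time. The sub name `find_hit_id` and the shortened sample rows (with placeholder hrefs) are illustrative only. Everything is declared with `my` under `use strict`, which keeps `$right` and the captures local to the sub; that is one way to make a variable name collision less likely, not a guaranteed cure. The organism pattern is passed as a `qr//` object, so its `|` alternation stays grouped when interpolated.

```perl
use strict;
use warnings;

# Hypothetical helper: return the id of the first line whose description
# matches the organism pattern, or undef if no line matches.
sub find_hit_id {
    my ($organism_re, @lines) = @_;
    my $right = quotemeta('</a>');    # literal closing tag, safe to interpolate
    for my $line (@lines) {
        # No trailing \n in the pattern, so chomped lines match as well.
        if ($line =~ m/$organism_re.*?([0-9]+)>.*?$right/) {
            return $1;
        }
    }
    return;
}

# Shortened sample rows; the hrefs are placeholders.
my @rows = (
    '<a href="x" >ref|NM_008505.2|</a> Mus musculus LIM domain only 2 (Lmo2), mRNA <a href = #31543124>476</a> e-132',
    '<a href="x" >gb|AF468789.1|</a> Gallus gallus LIM domain-containing transcri... <a href = #18656900> 48</a> 0.006',
);

# The /i flag lives on the qr// object, since that is where the letters are.
my $id = find_hit_id(qr/Gallus\sgallus|Chicken/i, @rows);
print defined $id ? "id: $id\n" : "no hit\n";
```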