justin423 has asked for the wisdom of the Perl Monks concerning the following question:

There is probably an easy way to do this, but I can't find it anywhere.

I want to match a regex only once in a long file of data.

The regex is this:

    foreach my $line (@lines) {
        if ($line=~ m/<FILENAME>.*\.htm/) {
            $doc_title_temp=substr $line, 10;
            $doc_title=$doc_title_temp;
            print "Filename is $doc_title"
        };
    }

But one of the documents I was searching contains the two <FILENAME> lines shown below the link. The regex matched both and assigned the variable twice, so it ended up holding the second value.

https://www.sec.gov/Archives/edgar/data/831001/000095010323011632/0000950103-23-011632.txt

<FILENAME>dp198076_424b2-us2342673.htm

<FILENAME>dp198076_exfilingfees.htm

I always want to match on the first, so the only logic I need is to just match one time in the long file and not match again.

Re: pattern matching once
by Marshall (Canon) on Aug 11, 2023 at 03:44 UTC
    To just get the first occurrence, stop searching after the first one is seen.
    foreach my $line (@lines) {
        if ($line=~ m/<FILENAME>.*\.htm/) {
            $doc_title_temp=substr $line, 10;
            $doc_title=$doc_title_temp;
            print "Filename is $doc_title";
            last;   ############### this stops
        };
    }
    With small re-write:
    use strict;
    use warnings;

    my @lines = ("https://www.sec.gov/Archives/edgar/data/831001/000095010323011632/0000950103-23-011632.txt\n",
                 "<FILENAME>dp198076_424b2-us2342673.htm\n",
                 "<FILENAME>dp198076_exfilingfees.htm\n");

    foreach my $line (@lines) {
        if (my ($doc_title) = $line=~ m/<FILENAME>(.*\.htm)/) {
            print "Filename is $doc_title\n";
            last;   ############### this stops
        }
    }
    __END__
    Filename is dp198076_424b2-us2342673.htm
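The same "stop at the first match" idea also works when reading straight from a filehandle, so the whole file never has to sit in @lines. This is only a sketch: the sample data is hypothetical, and the in-memory filehandle stands in for a real file so that the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the EDGAR submission file.
my $data = <<'END';
<TYPE>424B2
<FILENAME>dp198076_424b2-us2342673.htm
<TYPE>EX-FILING FEES
<FILENAME>dp198076_exfilingfees.htm
END

# Open the string as an in-memory file to keep the sketch self-contained.
# With a real file this would be: open my $fh, '<', $path or die ...
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $doc_title;
while (my $line = <$fh>) {
    if ($line =~ m/<FILENAME>(.*\.htm)\b/) {
        $doc_title = $1;
        last;    # stop reading as soon as the first match is seen
    }
}
close $fh;

print "Filename is $doc_title\n" if defined $doc_title;
```

Because of the last, the rest of the file is never read, which may matter for large filings.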
Re: pattern matching once
by eyepopslikeamosquito (Archbishop) on Aug 11, 2023 at 07:36 UTC

      Ok, this is surprising.

      It either stopped matching completely (when I put chomp before it or added the \b) or, when it did match, appended a \ to the end of the result.

      e.g. a link came out like this

      https://www.sec.gov/Archives/edgar/data/831001/000095010323011811/dp198116_424b2-us2343462.htm\

        To help us communicate without going around in circles, please read and try to follow:

        Here is my test program t1.pl:

        use strict;
        use warnings;

        # Small standalone test program derived from [id://11153804]
        # @lines contains some test lines derived from:
        # https://www.sec.gov/Archives/edgar/data/831001/000095010323011632/0000950103-23-011632.txt
        my @lines = (
            '<FILENAME>dp198076_424b2-us2342673.htm',
            '<FILENAME>dp198076_exfilingfees.htm',
            '<FILENAME>dp198076_oopswrongextension.htmfred',
        );

        foreach my $line (@lines) {
            print "line:$line:\n";
            if ($line =~ m/<FILENAME>(.*\.htm)/) {
                print " matched line: filename='$1'\n";
            }
            else {
                print " did not match line\n";
            }
        }

        When I run this program, this is what I see:

        line:<FILENAME>dp198076_424b2-us2342673.htm:
         matched line: filename='dp198076_424b2-us2342673.htm'
        line:<FILENAME>dp198076_exfilingfees.htm:
         matched line: filename='dp198076_exfilingfees.htm'
        line:<FILENAME>dp198076_oopswrongextension.htmfred:
         matched line: filename='dp198076_oopswrongextension.htm'

        When I change this line above from:

        if ($line =~ m/<FILENAME>(.*\.htm)/) {
        to:
        if ($line =~ m/<FILENAME>(.*\.htm)\b/) {

        when I run this program I see instead:

        line:<FILENAME>dp198076_424b2-us2342673.htm:
         matched line: filename='dp198076_424b2-us2342673.htm'
        line:<FILENAME>dp198076_exfilingfees.htm:
         matched line: filename='dp198076_exfilingfees.htm'
        line:<FILENAME>dp198076_oopswrongextension.htmfred:
         did not match line

        If that is not what you were asking about, please clarify your question by posting a Short, Self-Contained, Correct Example that we can run.

        You probably put the \b within the capture group.
        use strict;
        use warnings;

        my @lines = ("https://www.sec.gov/Archives/edgar/data/831001/000095010323011632/0000950103-23-011632.txt\n",
                     "<FILENAME>dp198076_424b2-us2342673.htmSomeCrap\n",
                     "<FILENAME>dp198076_424b2-us2342673.htm\n",
                     "<FILENAME>dp198076_exfilingfees.htm\n",
                    );

        foreach my $line (@lines) {
            if (my ($doc_title) = $line=~ m/<FILENAME>(.*\.htm)\b/) {
                print "Filename is $doc_title\n";
                last;   ############### this stops
            }
        }
        __END__
        Filename is dp198076_424b2-us2342673.htm
Re: pattern matching once
by karlgoethebier (Abbot) on Aug 11, 2023 at 18:40 UTC

    No pattern matching at all:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use HTML::TokeParser;
    use feature qw/say/;

    my $parser = HTML::TokeParser->new(q(0000950103-23-011632.txt));
    say $parser->get_text if $parser->get_tag(q(filename));

    «The Crux of the Biscuit is the Apostrophe»

Re: pattern matching once
by jwkrahn (Abbot) on Aug 11, 2023 at 03:57 UTC

    Change:

    if ( $line =~ m/<FILENAME>.*\.htm/ ) {

    To:

    if ( $line =~ ?<FILENAME>.*\.htm? ) {
    Naked blocks are fun! -- Randal L. Schwartz, Perl hacker
      if ( $line =~ ?<FILENAME>.*\.htm? ) {

      This should be:

      if ( $line =~ m?<FILENAME>.*\.htm? ) {

      Greetings,
      -jo

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

        perldoc perlop
        
        ...
        
               "m/PATTERN/msixpodualngc"
               "/PATTERN/msixpodualngc"
        
        ...
        
               "m?PATTERN?msixpodualngc"
               "?PATTERN?msixpodualngc"
        
        

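        With the // delimiters the m at the beginning is optional. With the ?? delimiters it was optional only in older perls: the bare ?PATTERN? form was deprecated in Perl 5.14 and removed in 5.22, so m?PATTERN? is the spelling that works everywhere.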
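For readers who have not met m?PATTERN? before, here is a minimal sketch of how it behaves. It succeeds only once, and stays "used up" until reset is called, so code that processes more than one document has to re-arm it. The sub name first_filename_once and the second document's filenames are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first <FILENAME>...htm value in @lines.
sub first_filename_once {
    my @lines = @_;
    my $found;
    for my $line (@lines) {
        # m?...? succeeds at most once until reset() is called,
        # so later matching lines are ignored even without "last".
        if ($line =~ m?<FILENAME>(.*\.htm)\b?) {
            $found = $1;
        }
    }
    reset;    # re-arm every m?? pattern in the current package
    return $found;
}

my @doc1 = (
    "<FILENAME>dp198076_424b2-us2342673.htm\n",
    "<FILENAME>dp198076_exfilingfees.htm\n",
);
my @doc2 = (
    "<FILENAME>second_doc_main.htm\n",     # made-up filename
    "<FILENAME>second_doc_fees.htm\n",     # made-up filename
);

my $first1 = first_filename_once(@doc1);
my $first2 = first_filename_once(@doc2);

print "Filename is $first1\n";
print "Filename is $first2\n";
```

Without the reset, the second call would find nothing, because the pattern had already matched once. An explicit last in the loop is arguably easier to follow, since it does not rely on hidden state.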

Re: pattern matching once
by AnomalousMonk (Archbishop) on Aug 11, 2023 at 19:37 UTC

    Following is a "pure regex" approach. It may be of interest/useful if:

    • All text will fit in available memory (a line-by-line approach has no inherent memory limit);
    • eyepopslikeamosquito's wise caution (really haukex's) against using regex to parse (X|HT)ML is heeded;
    • Speed is not critical (I suspect the regex engine will be heavily loaded and line-by-line will be quicker, but only Benchmark-ing will tell the tale).
    Win8 Strawberry 5.8.9.5 (32) Fri 08/11/2023 14:59:13
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    use Test::More;
    use Test::NoWarnings;

    my @Tests = (
      'ALL these should match',
      #            match
      # string    success
      [ '<FILENAME>.htm' => 1, ],
      [ '<FILENAME>dp198076_424b2-us2342673.htm' => 1, ],
      [ 'fizz <FILENAME>dp198076_424b2-us2342673.htm fuzz' => 1, ],

      'NONE of these should match',
      [ '' => '', ],
      [ ' ' => '', ],
      [ 'xxx' => '', ],
      [ "\n\n\n" => '', ],
      [ '<FILENAME>dp198076_424b2-us2342673.htm <FILENAME>dp198076_exfilingfees.htm' => '', ],
      [ '<FILENAME>.htmfred' => '', ],  # see pm#11153807
      [ 'fizz <FILENAME>dp198076_424b2-us2342673.htm foo bar <FILENAME>dp198076_exfilingfees.htm fuzz' => '', ],
      [ 'fizz <FILENAME>dp198076_424b2-us2342673.htm foo <FILENAME>xxx.htm bar <FILENAME>dp198076_exfilingfees.htm fuzz' => '', ],
      [ '<FILENAME>xxx.htm<FILENAME>yyy.htm' => '', ],
      [ '<FILENAME>.htm<FILENAME>.htm' => '', ],
      );  # end @Tests

    my @additional = qw(Test::NoWarnings);  # each of these adds 1 test

    plan 'tests' => (scalar grep { ref eq 'ARRAY' } @Tests)
                  + @additional
                  ;

    my $rx_once = qr{ <FILENAME> (?: (?! [.]htm) .)* [.]htm \b }xms;

    VECTOR:
    for my $ar_vector (@Tests) {
        if (not ref $ar_vector) {
            note $ar_vector;
            next VECTOR;
            }

        my ($string, $expected) = @$ar_vector;

        my $got = $string =~ m{
            \A
            (?: (?! $rx_once) .)*  # no match before single match
            $rx_once               # match just once
            (?: (?! $rx_once) .)*  # no match after single match
            \Z
            }xms;

        is $got, $expected;

        }  # end for VECTOR
    ^Z
    1..14
    # ALL these should match
    ok 1
    ok 2
    ok 3
    # NONE of these should match
    ok 4
    ok 5
    ok 6
    ok 7
    ok 8
    ok 9
    ok 10
    ok 11
    ok 12
    ok 13
    ok 14 - no warnings

    Update: Note that as the code stands, the test vector
        [ '<FILENAME>.htm<FILENAME>.htmfred' => 1, ]
    will match/pass, i.e., will be accepted as having no duplication. Should this be the case?
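
If the whole file is already in memory, as the first bullet above assumes, a simpler sketch may be enough for the original question: a list-context match without /g returns only the first (leftmost) match, and the same pattern with /g can count how many there are. This is an illustration under that assumption, not a replacement for the exactly-once test above. The pattern is the one used elsewhere in the thread, with a non-greedy .*? added here.

```perl
use strict;
use warnings;

# Hypothetical slurped file contents.
my $text = <<'END';
<FILENAME>dp198076_424b2-us2342673.htm
<FILENAME>dp198076_exfilingfees.htm
END

# Without /g, list context yields the captures of the first match only.
my ($first) = $text =~ m/<FILENAME>(.*?\.htm)\b/;

# With /g, list context yields the captures of every match.
my @all = $text =~ m/<FILENAME>(.*?\.htm)\b/g;

print "First filename: $first\n";
print "Total matches:  ", scalar(@all), "\n";
```

Since . does not match a newline without the /s modifier, each capture stays within its own line.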


    Give a man a fish:  <%-{-{-{-<