pattern matching once

justin423 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: pattern matching once by Marshall (Canon) on Aug 11, 2023 at 03:44 UTC
To just get the first occurrence, stop searching after the first one is seen.

```perl
foreach my $line (@lines) {
    if ($line =~ m/<FILENAME>.*\.htm/) {
        $doc_title_temp = substr $line, 10;
        $doc_title = $doc_title_temp;
        print "Filename is $doc_title";
        last;   ############### this stops
    }
}
```

With a small rewrite:

```perl
use strict;
use warnings;

my @lines = ("https://www.sec.gov/Archives/edgar/data/831001/000095010323011632/0000950103-23-011632.txt\n",
             "<FILENAME>dp198076_424b2-us2342673.htm\n",
             "<FILENAME>dp198076_exfilingfees.htm\n");

foreach my $line (@lines) {
    if (my ($doc_title) = $line =~ m/<FILENAME>(.*\.htm)/) {
        print "Filename is $doc_title\n";
        last;   ############### this stops
    }
}
__END__
Filename is dp198076_424b2-us2342673.htm
```
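The same early-exit idea also works when reading straight from a filehandle, so the rest of the file is never read once the first name has been found. The following is only a sketch: the helper name `first_htm_from_fh` is made up for illustration, and an in-memory filehandle stands in for the real downloaded file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the first .htm
# filename found on the filehandle, or undef if there is none.
sub first_htm_from_fh {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        # Returning here stops reading; later lines are never looked at.
        return $1 if $line =~ m/<FILENAME>(.*\.htm)\b/;
    }
    return;
}

# In-memory "file" built from the sample lines used in the thread.
my $text = "https://www.sec.gov/Archives/edgar/data/831001/000095010323011632/0000950103-23-011632.txt\n"
         . "<FILENAME>dp198076_424b2-us2342673.htm\n"
         . "<FILENAME>dp198076_exfilingfees.htm\n";

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $name = first_htm_from_fh($fh);
print "Filename is $name\n" if defined $name;
close $fh;
```

For a real file, the `open` call would take a path instead of `\$text`; everything else stays the same.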
Re: pattern matching once by eyepopslikeamosquito (Archbishop) on Aug 11, 2023 at 07:36 UTC
Do you have a specification of the format of these files? Be aware that using a regex is strongly frowned upon for parsing HTML and XML, as mentioned at:

- Why a regex really isn't good enough for HTML and XML, even for "simple" tasks by haukex
- Parsing HTML/XML with Regular Expressions by haukex
- regex match open tags except XHTML... (SO)

Though Marshall has already provided you with a regex solution, note that if you only want to match `.htm` (and not `.htmfred` say) his regex: `m/<FILENAME>(.*\.htm)/` should read: `m/<FILENAME>(.*\.htm)\b/` using the `\b` regex assertion to match only on a word boundary (see perlre for more detail).
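To make the effect of `\b` concrete, here is a small sketch. The helper name `htm_name` and the sample names (`a.htm`, `a.html`, `a.htm.bak`) are made up for illustration. It shows that `\b` rejects any word character after `.htm`, which includes `.html`, but still accepts `.htm` followed by punctuation.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured filename, or undef if the
# line does not contain <FILENAME>....htm ending on a word boundary.
sub htm_name {
    my ($line) = @_;
    return $line =~ m/<FILENAME>(.*\.htm)\b/ ? $1 : undef;
}

my @samples = (
    '<FILENAME>a.htm',       # matches: end of string is a boundary
    '<FILENAME>a.htmfred',   # no match: "m" is followed by a word character
    '<FILENAME>a.html',      # no match, for the same reason
    '<FILENAME>a.htm.bak',   # matches: "." is not a word character
);

for my $line (@samples) {
    my $name = htm_name($line);
    print "$line => ", (defined $name ? $name : '(no match)'), "\n";
}
```

If a name like `a.htm.bak` should also be rejected, anchoring the pattern to the end of the line (for example `\.htm\s*$`) may be closer to what is wanted; that is a design choice, not something the thread settles.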
Re^2: pattern matching once by justin423 (Scribe) on Aug 11, 2023 at 13:33 UTC
OK, this is surprising. When I put chomp before it or added the \b, it either stopped matching completely or, when it did match, a \ was added to the end of the result. E.g. a link came out like this: `https://www.sec.gov/Archives/edgar/data/831001/000095010323011811/dp198116_424b2-us2343462.htm\`
Re^3: pattern matching once by eyepopslikeamosquito (Archbishop) on Aug 11, 2023 at 14:24 UTC
To help us communicate without going around in circles, please read and try to follow:

- I know what I mean. Why don't you?
- Short, Self-Contained, Correct Example

Here is my test program `t1.pl`:

```perl
use strict;
use warnings;

# Small standalone test program derived from [id://11153804]

# @lines contains some test lines derived from:
# https://www.sec.gov/Archives/edgar/data/831001/000095010323011632/0000950103-23-011632.txt
my @lines = (
    '<FILENAME>dp198076_424b2-us2342673.htm',
    '<FILENAME>dp198076_exfilingfees.htm',
    '<FILENAME>dp198076_oopswrongextension.htmfred',
);

foreach my $line (@lines) {
    print "line:$line:\n";
    if ($line =~ m/<FILENAME>(.*\.htm)/) {
        print " matched line: filename='$1'\n";
    }
    else {
        print " did not match line\n";
    }
}
```

When I run this program, this is what I see:

```
line:<FILENAME>dp198076_424b2-us2342673.htm:
 matched line: filename='dp198076_424b2-us2342673.htm'
line:<FILENAME>dp198076_exfilingfees.htm:
 matched line: filename='dp198076_exfilingfees.htm'
line:<FILENAME>dp198076_oopswrongextension.htmfred:
 matched line: filename='dp198076_oopswrongextension.htm'
```

When I change this line above from: `if ($line =~ m/<FILENAME>(.*\.htm)/) {` to: `if ($line =~ m/<FILENAME>(.*\.htm)\b/) {` and run the program again, I see instead:

```
line:<FILENAME>dp198076_424b2-us2342673.htm:
 matched line: filename='dp198076_424b2-us2342673.htm'
line:<FILENAME>dp198076_exfilingfees.htm:
 matched line: filename='dp198076_exfilingfees.htm'
line:<FILENAME>dp198076_oopswrongextension.htmfred:
 did not match line
```

If that is not what you were asking about, please clarify your question by posting a Short, Self-Contained, Correct Example that we can run.
Re^3: pattern matching once by Marshall (Canon) on Aug 11, 2023 at 16:50 UTC
You probably put the \b within the capture group.

```perl
use strict;
use warnings;

my @lines = ("https://www.sec.gov/Archives/edgar/data/831001/000095010323011632/0000950103-23-011632.txt\n",
             "<FILENAME>dp198076_424b2-us2342673.htmSomeCrap\n",
             "<FILENAME>dp198076_424b2-us2342673.htm\n",
             "<FILENAME>dp198076_exfilingfees.htm\n",
            );

foreach my $line (@lines) {
    if (my ($doc_title) = $line =~ m/<FILENAME>(.*\.htm)\b/) {
        print "Filename is $doc_title\n";
        last;   ############### this stops
    }
}
__END__
Filename is dp198076_424b2-us2342673.htm
```
Re^4: pattern matching once by justin423 (Scribe) on Aug 11, 2023 at 16:58 UTC
Re^5: pattern matching once by Marshall (Canon) on Aug 11, 2023 at 19:45 UTC
Some notes below your chosen depth have not been shown here
Re: pattern matching once by karlgoethebier (Abbot) on Aug 11, 2023 at 18:40 UTC
No pattern matching at all:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser;
use feature qw/say/;

my $parser = HTML::TokeParser->new(q(0000950103-23-011632.txt));
say $parser->get_text if $parser->get_tag(q(filename));
```

«The Crux of the Biscuit is the Apostrophe»
Re: pattern matching once by jwkrahn (Abbot) on Aug 11, 2023 at 03:57 UTC
Change: `if ( $line =~ m/<FILENAME>.*\.htm/ ) {`

To: `if ( $line =~ ?<FILENAME>.*\.htm? ) {`

Naked blocks are fun! -- Randal L. Schwartz, Perl hacker
Re^2: pattern matching once by jo37 (Curate) on Aug 11, 2023 at 09:30 UTC
`if ( $line =~ ?<FILENAME>.*\.htm? ) {`

This should be: `if ( $line =~ m?<FILENAME>.*\.htm? ) {`

Greetings, -jo

`$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
Re^3: pattern matching once by jwkrahn (Abbot) on Aug 11, 2023 at 17:27 UTC
perldoc perlop

```
...
"m/PATTERN/msixpodualngc"
"/PATTERN/msixpodualngc"
...
"m?PATTERN?msixpodualngc"
"?PATTERN?msixpodualngc"
```

With the // or ?? delimiters the m at the beginning is optional.

Naked blocks are fun! -- Randal L. Schwartz, Perl hacker
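For readers who want to try the match-once operator, here is a minimal sketch. It uses the `m?...?` spelling because, as far as I know, the bare `?PATTERN?` form was removed in perl 5.22 and is a syntax error on current perls. The helper name `first_filename` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collects whatever the match-once operator captures
# while looping over all the lines passed in.
sub first_filename {
    my @lines = @_;
    reset;    # re-arm every m?...? in the current package
    my @found;
    for my $line (@lines) {
        # m?...? succeeds at most once until reset() is called again,
        # so only the first matching line is captured.
        push @found, $1 if $line =~ m?<FILENAME>(.*\.htm)\b?;
    }
    return @found;
}

my @found = first_filename(
    "<FILENAME>dp198076_424b2-us2342673.htm\n",
    "<FILENAME>dp198076_exfilingfees.htm\n",
);
print "Filename is $_\n" for @found;
```

Unlike `last`, this does not leave the loop early; the remaining lines are still visited, they just never match. Without the call to `reset`, a second call to the helper would find nothing, because the operator would already have matched once.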
Re^4: pattern matching once by haukex (Archbishop) on Aug 11, 2023 at 17:42 UTC
Re^5: pattern matching once by jwkrahn (Abbot) on Aug 11, 2023 at 20:03 UTC
Re^4: pattern matching once by haj (Vicar) on Aug 11, 2023 at 17:45 UTC
Re^5: pattern matching once by choroba (Cardinal) on Aug 12, 2023 at 18:40 UTC
Re: pattern matching once by AnomalousMonk (Archbishop) on Aug 11, 2023 at 19:37 UTC
Following is a "pure regex" approach. It may be of interest/useful if:

- All text will fit in available memory (a line-by-line approach has no inherent memory limit);
- eyepopslikeamosquito's wise caution (really haukex's) against using regex to parse (X|HT)ML is heeded;
- Speed is not critical (I suspect the regex engine will be heavily loaded and line-by-line will be quicker, but only Benchmark-ing will tell the tale).

```
Win8 Strawberry 5.8.9.5 (32) Fri 08/11/2023 14:59:13
C:\@Work\Perl\monks
>perl
use strict;
use warnings;

use Test::More;
use Test::NoWarnings;

my @Tests = (
    'ALL these should match',
    #   string                                            match success
    [ '<FILENAME>.htm'                                    => 1, ],
    [ '<FILENAME>dp198076_424b2-us2342673.htm'            => 1, ],
    [ 'fizz <FILENAME>dp198076_424b2-us2342673.htm fuzz'  => 1, ],

    'NONE of these should match',
    [ ''         => '', ],
    [ ' '        => '', ],
    [ 'xxx'      => '', ],
    [ "\n\n\n"   => '', ],
    [ '<FILENAME>dp198076_424b2-us2342673.htm <FILENAME>dp198076_exfilingfees.htm' => '', ],
    [ '<FILENAME>.htmfred' => '', ],  # see pm#11153807
    [ 'fizz <FILENAME>dp198076_424b2-us2342673.htm foo bar <FILENAME>dp198076_exfilingfees.htm fuzz' => '', ],
    [ 'fizz <FILENAME>dp198076_424b2-us2342673.htm foo <FILENAME>xxx.htm bar <FILENAME>dp198076_exfilingfees.htm fuzz' => '', ],
    [ '<FILENAME>xxx.htm<FILENAME>yyy.htm' => '', ],
    [ '<FILENAME>.htm<FILENAME>.htm'       => '', ],
    );  # end @Tests

my @additional = qw(Test::NoWarnings);  # each of these adds 1 test

plan 'tests' => (scalar grep { ref eq 'ARRAY' } @Tests)
              + @additional
              ;

my $rx_once = qr{ <FILENAME> (?: (?! [.]htm) .)* [.]htm \b }xms;

VECTOR:
for my $ar_vector (@Tests) {
    if (not ref $ar_vector) {
        note $ar_vector;
        next VECTOR;
        }

    my ($string, $expected) = @$ar_vector;

    my $got = $string =~ m{
        \A
        (?: (?! $rx_once) .)*  # no match before single match
        $rx_once               # match just once
        (?: (?! $rx_once) .)*  # no match after single match
        \Z
        }xms;

    is $got, $expected;
    }  # end for VECTOR
^Z
1..14
# ALL these should match
ok 1
ok 2
ok 3
# NONE of these should match
ok 4
ok 5
ok 6
ok 7
ok 8
ok 9
ok 10
ok 11
ok 12
ok 13
ok 14 - no warnings
```

Update: Note that as the code stands, the test vector `[ '<FILENAME>.htm<FILENAME>.htmfred' => 1, ]` will match/pass, i.e., will be accepted as having no duplication. Should this be the case?

Give a man a fish: `<%-{-{-{-<`
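A possibly simpler way to ask "does this text contain exactly one filename?" is to count the matches of the inner pattern instead of wrapping it in negative lookaheads. This is only a sketch, not the approach used in the post above: the helper name `has_single_filename` is made up for illustration, and the pattern is the `$rx_once` pattern written without `/x`. On the test vectors in the post it should give the same yes/no answers.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the text contains exactly one
# <FILENAME>....htm sequence that ends on a word boundary.
sub has_single_filename {
    my ($text) = @_;
    # The "= () =" idiom forces list context, so $count receives the
    # number of matches found by the /g match.
    my $count = () = $text =~ m{<FILENAME>(?:(?![.]htm).)*[.]htm\b}gs;
    return $count == 1;
}

for my $text (
    'fizz <FILENAME>dp198076_424b2-us2342673.htm fuzz',
    '<FILENAME>xxx.htm<FILENAME>yyy.htm',
    '<FILENAME>.htmfred',
) {
    print has_single_filename($text) ? "exactly one: " : "rejected:    ", $text, "\n";
}
```

Like the original, this accepts `'<FILENAME>.htm<FILENAME>.htmfred'`, because the second entry never counts as a match; whether that is the desired behaviour is the open question raised in the update.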