in reply to pattern matching once

Do you have a specification of the format of these files?

Be aware that using a regex to parse HTML and XML is strongly frowned upon.

Though Marshall has already provided you with a regex solution, note that if you only want to match .htm (and not .htmfred say) his regex:

m/<FILENAME>(.*\.htm)/
should read:
m/<FILENAME>(.*\.htm)\b/
using the \b regex assertion to match only on a word boundary (see perlre for more detail).

Re^2: pattern matching once
by justin423 (Scribe) on Aug 11, 2023 at 13:33 UTC

    Ok, this is surprising.

    It either stopped matching completely (if I put chomp before it or added the \b), or it added a \ to the end of the link when it did match.

    e.g. a link came out like this

    https://www.sec.gov/Archives/edgar/data/831001/000095010323011811/dp198116_424b2-us2343462.htm\

      To help us communicate without going around in circles, please read and try to follow:

      Here is my test program t1.pl:

      use strict;
      use warnings;

      # Small standalone test program derived from [id://11153804]
      # @lines contains some test lines derived from:
      #   https://www.sec.gov/Archives/edgar/data/831001/000095010323011632/0000950103-23-011632.txt
      my @lines = (
          '<FILENAME>dp198076_424b2-us2342673.htm',
          '<FILENAME>dp198076_exfilingfees.htm',
          '<FILENAME>dp198076_oopswrongextension.htmfred',
      );

      foreach my $line (@lines) {
          print "line:$line:\n";
          if ($line =~ m/<FILENAME>(.*\.htm)/) {
              print " matched line: filename='$1'\n";
          }
          else {
              print " did not match line\n";
          }
      }

      When I run this program, this is what I see:

      line:<FILENAME>dp198076_424b2-us2342673.htm:
       matched line: filename='dp198076_424b2-us2342673.htm'
      line:<FILENAME>dp198076_exfilingfees.htm:
       matched line: filename='dp198076_exfilingfees.htm'
      line:<FILENAME>dp198076_oopswrongextension.htmfred:
       matched line: filename='dp198076_oopswrongextension.htm'

      When I change this line above from:

      if ($line =~ m/<FILENAME>(.*\.htm)/) {
      to:
      if ($line =~ m/<FILENAME>(.*\.htm)\b/) {

      when I run this program I see instead:

      line:<FILENAME>dp198076_424b2-us2342673.htm:
       matched line: filename='dp198076_424b2-us2342673.htm'
      line:<FILENAME>dp198076_exfilingfees.htm:
       matched line: filename='dp198076_exfilingfees.htm'
      line:<FILENAME>dp198076_oopswrongextension.htmfred:
       did not match line

      If that is not what you were asking about, please clarify your question by posting a Short, Self-Contained, Correct Example that we can run.

      You probably put the \b within the capture group.
      use strict;
      use warnings;

      my @lines = (
          "https://www.sec.gov/Archives/edgar/data/831001/000095010323011632/0000950103-23-011632.txt\n",
          "<FILENAME>dp198076_424b2-us2342673.htmSomeCrap\n",
          "<FILENAME>dp198076_424b2-us2342673.htm\n",
          "<FILENAME>dp198076_exfilingfees.htm\n",
      );

      foreach my $line (@lines) {
          if (my ($doc_title) = $line =~ m/<FILENAME>(.*\.htm)\b/) {
              print "Filename is $doc_title\n";
              last;    ############### this stops
          }
      }
      __END__
      Filename is dp198076_424b2-us2342673.htm

        You were not that far off...

        The file is actually this:

        "<FILENAME>dp198076_424b2-us2342673.htm \n", "<FILENAME>dp198076_exfilingfees.htm\n",

        with a space after the .htm in some cases but not others, so the \b didn't work all the time.
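For what it's worth, a space after .htm should not by itself defeat \b, since the position between "m" and a space is still a word boundary. If some lines still fail, one option is to spell out what may follow the extension instead of relying on \b. The following is only a sketch under that assumption: the helper name extract_htm_filename is made up for illustration, and the test lines are taken from the samples posted above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): extract the file name from a
# <FILENAME> line, tolerating trailing whitespace after ".htm".
sub extract_htm_filename {
    my ($line) = @_;
    # \s*\z allows only whitespace (space, "\r", "\n") between ".htm" and
    # the end of the string, so ".htmfred" is rejected.
    return $line =~ m/<FILENAME>(.*\.htm)\s*\z/ ? $1 : undef;
}

my @lines = (
    "<FILENAME>dp198076_424b2-us2342673.htm \n",
    "<FILENAME>dp198076_exfilingfees.htm\n",
    "<FILENAME>dp198076_oopswrongextension.htmfred\n",
);

foreach my $line (@lines) {
    my $name = extract_htm_filename($line);
    print defined $name ? "Filename is $name\n" : "did not match line\n";
}
```

Because the pattern is anchored with \z, no chomp is needed first, and the captured name never includes the trailing space or newline.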