
You were not that far off...

The file is actually this:

"<FILENAME>dp198076_424b2-us2342673.htm \n", "<FILENAME>dp198076_exfilingfees.htm\n",

with a space after the .htm in some cases but not in others, so the /b didn't work all the time.

Re^5: pattern matching once
by Marshall (Canon) on Aug 11, 2023 at 19:45 UTC
    You will have to show some runnable code where the \b fails. Both of your example lines work fine in my example code.

    \b means approximately "word boundary": it matches at a position between a word character and a non-word character. After the "m" of ".htm", any white space character (space, \n, \t or other such character) satisfies that boundary condition. End of the string also satisfies it (i.e. having no character following ".htm").

    What do you mean by " so the /b didn't work all the time"?

    Look carefully and make sure that there is no space before the \b in:
    if (my ($doc_title) = $line=~ m/<FILENAME>(.*\.htm)\b/) {
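
A minimal sketch of a runnable test for that pattern, using the two sample lines from the post above plus an end-of-string case. The helper name extract_filename is hypothetical and only there for illustration; it is not code from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the captured file name,
# or undef when the pattern does not match.
sub extract_filename {
    my ($line) = @_;
    # List-context match: $doc_title gets capture group 1, or undef on failure.
    my ($doc_title) = $line =~ m/<FILENAME>(.*\.htm)\b/;
    return $doc_title;
}

my @lines = (
    "<FILENAME>dp198076_424b2-us2342673.htm \n",  # space, then newline
    "<FILENAME>dp198076_exfilingfees.htm\n",      # newline only
    "<FILENAME>dp198076_exfilingfees.htm",        # nothing after ".htm"
);

for my $line (@lines) {
    my $name = extract_filename($line);
    print defined $name ? "matched: $name\n" : "no match\n";
}
```

All three lines match, and the capture stops at ".htm" in each case, so a trailing space does not end up in $doc_title.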

      An HTML space &nbsp.
        That still works: "<FILENAME>dp198076_424b2-us2342673.htm&nbsp\n"
        That is because \b matches at a boundary between a word character and a non-word character, and & is not a word character. Word characters are the ones that you can use in a Perl variable name: [a-zA-Z0-9_]
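
A short sketch of which characters after ".htm" do and do not give a word boundary. The helper name htm_boundary_ok and the test strings other than the &nbsp one are made up for illustration; they are not data from the original file.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when some ".htm" in the text is followed
# by a word boundary, 0 otherwise.
sub htm_boundary_ok {
    my ($text) = @_;
    return $text =~ m/\.htm\b/ ? 1 : 0;
}

my @cases = (
    "a.htm&nbsp\n",  # '&' is a non-word character: boundary
    "a.htm ",        # space is a non-word character: boundary
    "a.htm",         # end of string after a word character: boundary
    "a.html",        # 'l' is a word character: no boundary
    "a.htm_old",     # '_' is a word character: no boundary
    "a.htm2",        # digits are word characters: no boundary
);

for my $text (@cases) {
    (my $shown = $text) =~ s/\n/\\n/g;   # make the newline visible in output
    printf "%-14s %s\n", $shown, htm_boundary_ok($text) ? "boundary" : "no boundary";
}
```

The cases where \b really does fail are the ones where ".htm" is immediately followed by a letter, digit or underscore, which is not what a trailing space or &nbsp would produce.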

        So we are back at the same problem: you say that there is a problem, but refuse to show any actual code.
        If you are actually parsing an HTML doc, you should be using one of the HTML decoder modules before trying to use regex. I believe that haukex has posted some links on that subject. I think you are well advised to read his post in detail.