in reply to Regex keep matching the last possible match (but should get all)

Actually, the very first .+ in your regex will gobble up as many characters as it can while still letting the rest of the pattern match.

In other words, after meeting the first <TD ALIGN=LEFT, .+ will match everything up to the last extinfo.cgi in your long string.

To see what I mean, put the first .+ in capturing parentheses and print $1.
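A minimal sketch of that experiment. The real page is not shown in this thread, so the two-row sample in `$html` and the host names in it are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample: two table cells, each holding an extinfo.cgi link.
my $html = join '',
    q{<TD ALIGN=LEFT><A HREF='extinfo.cgi?host=alpha'>alpha</A></TD>},
    q{<TD ALIGN=LEFT><A HREF='extinfo.cgi?host=beta'>beta</A></TD>};

# Wrap the first .+ in parentheses so we can see what it swallowed.
my ($greedy) = $html =~ /<TD ALIGN=LEFT(.+)extinfo\.cgi/;

print "the first .+ matched: $greedy\n";
```

With this sample, the captured text runs through the whole first cell and only stops just before the last extinfo.cgi, which is the behaviour described above.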

.+ (and its even more treacherous brother .*) will quickly escape your control if you are not careful. A useful technique to control what gets matched is to say which character(s) you don't want: [^>]+ means match anything except the '>' character, or in other words, everything up to the end of the current HTML tag. It prevents the regex quantifiers from running away.
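A small side-by-side sketch of the difference, again on a made-up sample string (`$html` and its contents are assumptions, not the original poster's data):

```perl
use strict;
use warnings;

# Hypothetical sample: two table cells on one line.
my $html = join '',
    q{<TD ALIGN=LEFT><A HREF='extinfo.cgi?host=alpha'>alpha</A></TD>},
    q{<TD ALIGN=LEFT><A HREF='extinfo.cgi?host=beta'>beta</A></TD>};

# Greedy .+ runs on to the LAST '>' in the string.
my ($runaway) = $html =~ /(<TD.+>)/;

# [^>]+ cannot step over a '>', so it stops at the end of the current tag.
my ($bounded) = $html =~ /(<TD[^>]+>)/;

print "with .+    : $runaway\n";
print "with [^>]+ : $bounded\n";
```

Here the .+ version captures the entire sample, while the [^>]+ version captures only `<TD ALIGN=LEFT>`.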

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

Re^2: Regex keep matching the last possible match (but should get all)
by Anonymous Monk on May 18, 2015 at 12:48 UTC

    Dear Perl-Monk,

    now I understand what's up with the [^<]+ that has been suggested already. And you are totally right: if I put what comes before the first wanted group in brackets, I get the whole HTML file up to the very last extinfo...

    Sadly, when I try to use [^>]+ it still grabs all the content up to the last position. *snip*

    /'extinfo\.cgi[^>]+(.+)<\/A.+'status.+>(.+)<\/TD.+nowrap>(.+)<\/TD.+nowrap>(.+s)<\/TD>.+'>(\d\/\d)<\/TD>.+'>(.+)<\/TD>/g)

    This will place the whole HTML file in $1, except for the parts matched by the later groups. How do I have to write the part after 'extinfo\.cgi' to make it stop at the > and capture the group correctly?

    I tried to use

    /'extinfo\.cgi[^>]+>(.+)<\/A.+'status.+>(.+)<\/TD.+nowrap>(.+)<\/TD.+nowrap>(.+s)<\/TD>.+'>(\d\/\d)<\/TD>.+'>(.+)<\/TD>/g)
    which yields the same result: all of the .html file inside $1, except for what the last groups match.

    I tried to use something like /bla\w{,100}>(.+)<, but this won't match any more. *sigh* A whole working day so far, just for a single regex... And I see it coming that I will have to insert this "stop at the next whatever" everywhere, because the next .+ inside the first () will keep going to the end too, won't it?

    Greetings, a tired Visitor

      Please re-read and understand when to use [^>] and when to use [^<]. They are to be used in different situations, as I already told you.
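
      A rough sketch of that distinction, applied to a cut-down version of the first part of the regex above. The sample string is made up, so treat the markup and host names as assumptions: [^>]+ is for skipping the rest of a tag you are inside, and [^<]+ is for capturing the text that sits between two tags.

      ```perl
      use strict;
      use warnings;

      # Hypothetical sample: two table cells, each holding an extinfo.cgi link.
      my $html = join '',
          q{<TD ALIGN=LEFT><A HREF='extinfo.cgi?host=alpha'>alpha</A></TD>},
          q{<TD ALIGN=LEFT><A HREF='extinfo.cgi?host=beta'>beta</A></TD>};

      # [^>]+ skips to the end of the opening <A ...> tag;
      # ([^<]+) captures the link text and stops at the next tag.
      my @names = $html =~ /'extinfo\.cgi[^>]+>([^<]+)<\/A>/g;

      print "@names\n";   # prints "alpha beta"
      ```

      With /g in list context, each link yields its own capture instead of one match that stretches to the end of the string.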

      It's me again (I truly need an account here). Stop pondering my problem for a while, because I think I figured out how to write the RegEx-Chain of Doom I need to get what I want.