Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl-Monks,

a month ago I didn't even know that there is a language called Perl (except for some dubious notes from fellows who shared obfuscated code snippets for the enjoyment of us from Java-island).

Anyway... right now I'm in the middle of it, slowly finding my way around. But right now, there is a riddle I can't solve.

First: I heard of the module I should use to parse HTML files, and I might use it in future, but right now - as the title suggests - I have a regex that snatches stuff out of an HTML file. And for reasons I can't determine, it finds only the very last possible match.

So... the html-file does look like this:

<TR>
<TD></TD>
<TD CLASS='statusOdd'><TABLE BORDER=0 WIDTH='100%' CELLSPACING=0 CELLPADDING=0><TR><TD ALIGN=LEFT><TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR>
<TD ALIGN=LEFT valign=center CLASS='statusOdd'><A HREF='extinfo.cgi?requestWithPrivateInformations'>Description</A></TD></TR>
</TABLE>
</TD>
<TD ALIGN=RIGHT CLASS='statusOdd'>
<TABLE BORDER=0 cellspacing=0 cellpadding=0>
<TR>
</TR>
</TABLE>
</TD>
</TR></TABLE></TD>
<TD CLASS='statusOK'>OK</TD>
<TD CLASS='statusOdd' nowrap>2015-05-17 01:59:48</TD>
<TD CLASS='statusOdd' nowrap>145d 19h 53m 11s</TD>
<TD CLASS='statusOdd'>1/4</TD>
<TD CLASS='statusOdd' valign='center'>something that shoudn't be published in the Internet;</TD>
</TR>

Beware that the table that shines through this snippet contains a lot of similar tag constructs (about 50 table lines, all of which I want in the array you see in the next code segment). Some of the readers might have heard of Nagios (well, actually I think most might use it too), and yes, that's state information. I made the information obscure to protect the affected customers' data, so don't ponder the nonsense you see there.

And that's the regex I use to get pieces of information out of it:

my @superContainer;
while($longline =~ /<TD ALIGN=LEFT.+extinfo.cgi.+'>(.+)<\/A.+'status.+>(.+)<\/TD.+nowrap>(.+)<\/TD.+nowrap>(.+s)<\/TD>.+'>(\d\/\d)<\/TD>.+'>(.+)<\/TD>*?/g){
    my @subContainer = ($1, $2, $3, $4, $5, $6);
    push @superContainer, \@subContainer
}

Ah... no line-recognition? That's okay, because (you don't see this) I placed the whole .html file in a single line string to avoid... happenings that might happen if you have line-terminations in your source. The whole html-file - one $longline.

Used with the whole file that Nagios sends over, I get exactly ONE match, which is the very last occurrence. I tried this with something that looked like "aaaa bbbb ccc aa bbbb cc dd aaa cc dd" and used a similar RegEx to snatch all... well, in short, this: /(a+).+(a+)/ And it did what I expected: providing $1 ... $n with the As inside, even for multiple possible matches.

Hm... one thing - I added /regex/gc to my construct, but that didn't do anything. In fact, most of the answers the internet had (*? at the end, or the beginning, some letters behind //, more and MORE brackets) didn't change my result: only the very last match is recognized.

In the debugger, I get something like this at the end when I ask for x @superContainer:

0  ARRAY(0x1e2b628)
   0  'Description'
   1  'OK'
   2  '2015-05-17 01:59:48'
   3  '145d 19h 53m 11s'
   4  '1/4'
   5  'something that shoudn't be published in the Internet;'
but where are the other 49 matches? *cry*

Can someone please provide this humble person with an explanation, how to tell the RegEx to find ALL of these matches that are in this file?

Greetings someone that might register himself in near future

Replies are listed 'Best First'.
Re: Regex keep matching the last possible match (but should get all)
by CountZero (Bishop) on May 18, 2015 at 12:07 UTC
    Actually, the very first .+ in your regex will gobble up as many characters as possible to still match the whole pattern that follows.

    In other words, after meeting the first <TD ALIGN=LEFT, .+ will match everything up to the last extinfo.cgi in your long string.

    To see what I mean, put the first .+ between brackets and print $1.

    .+ (and its even more treacherous brother .*) will quickly escape your control if you are not careful. A useful technique to control what gets matched is to indicate the character(s) you don't want: [^>]+ means match anything except the '>' character, or in other words, everything up to the end of the current HTML tag. It prevents the regex quantifiers from running away.
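A minimal sketch of the point above, using two made-up rows that only loosely resemble the HTML in the question (the `$html` string and its contents are assumptions for illustration). It captures what the first greedy .+ swallows, then shows negated character classes finding both rows:

```perl
use strict;
use warnings;

# Two made-up table rows on one line, loosely modelled on the question's HTML.
my $html = "<TR><TD CLASS='a'><A HREF='extinfo.cgi?one'>First</A></TD></TR>"
         . "<TR><TD CLASS='a'><A HREF='extinfo.cgi?two'>Second</A></TD></TR>";

# Greedy: the captured .+ runs on to the LAST "extinfo.cgi" it can reach,
# so it swallows the whole first row along the way.
my ($skipped) = $html =~ /<TD(.+)extinfo\.cgi/;
print "greedy .+ swallowed ", length($skipped), " characters\n";

# Negated classes: [^>]+ cannot leave the tag, [^<]+ cannot leave the text.
my @labels = $html =~ /extinfo\.cgi[^>]+>([^<]+)</g;
print "labels: @labels\n";
```

Because neither class can cross a tag boundary, each match stays inside its own row and /g can move on to the next one.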

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      Dear Perl-Monk,

      now I understand what's up with the [^<]+ that has been suggested already. And you are totally right - if I put in brackets what's before the wanted first group, I get the whole html-file up to the very last extinfo...

      Sadly, when I try to use [^>]+ it still does grab all the content up to the last position. *snip*

      /'extinfo\.cgi[^>]+(.+)<\/A.+'status.+>(.+)<\/TD.+nowrap>(.+)<\/TD.+nowrap>(.+s)<\/TD>.+'>(\d\/\d)<\/TD>.+'>(.+)<\/TD>/g)

      this will place the whole html file in $1 except for the next groups. How do I have to write this area behind the 'extinfo\.cgi' to make it stop at the > and get the group correctly?

      I tried to use

      /'extinfo\.cgi[^>]+>(.+)<\/A.+'status.+>(.+)<\/TD.+nowrap>(.+)<\/TD.+nowrap>(.+s)<\/TD>.+'>(\d\/\d)<\/TD>.+'>(.+)<\/TD>/g)
      which yields the same result: all of the .html-file inside $1 except for the last group-matches.

      I tried to use something like /bla\w{,100}>(.+)<, but this won't match any more. *sigh* A whole working day by now just for making a single RegEx... And I see it coming that I have to insert this "stop at the next whatever" everywhere, because the next .+ between the first () will keep going to the end too, won't it?

      Greetings, a tired Visitor

        Please re-read and understand when to use [^>] and when to use [^<]. They are to be used in different situations, as I already told you.

        It's me again (I truly need an account here). Stop pondering my problem for a while, because I think I figured out how to write the RegEx-Chain of Doom I need to get what I want.
Re: Regex keep matching the last possible match (but should get all)
by Corion (Patriarch) on May 18, 2015 at 09:28 UTC

    You could try making all your gobbling matches less greedy: change .+ to .+?. Even better would be [^>]+ for stuff within HTML tags and [^<]+ for stuff that is supposed to capture (text) content.

    Even better would be to use a real HTML parser like for example HTML::TreeBuilder::XPath.
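As a rough sketch of the parser route, assuming the CPAN module HTML::TreeBuilder::XPath is installed (it is not part of core Perl): the table below is made up for illustration and is not the poster's real Nagios page, and the XPath expressions are one plausible way to walk it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN module, assumed to be installed

# Made-up status table; note the mixed tag case between the two rows.
my $html = "<TABLE><TR><TD CLASS='statusOK'>OK</TD><TD CLASS='statusOdd'>1/4</TD></TR>"
         . "<tr><td class='statusWARNING'>WARNING</td><td class='statusOdd'>2/4</td></tr></TABLE>";

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# One array reference per table row, holding the text of each cell.
my @rows;
for my $tr ($tree->findnodes('//tr')) {
    push @rows, [ map { $_->as_text } $tr->findnodes('./td') ];
}
$tree->delete;   # free the parse tree

print join(' | ', @$_), "\n" for @rows;
```

The parser normalises tag names, so a change from <TD> to <td> does not affect the result the way it would with a hand-written pattern.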

      Dear Perl-Monks

      I will have a look at the links provided at the page in time, but in the meanwhile I created a counter-example to verify that my codings indeed will yield results in a way I expect them to do.

      consider the following file

      blabla:(123):falleriefallera dingdong moep blubb 4711 dingdong blob))hop((gob))sob((0815))ding knickknack boing 44 nothing here blabla:(123):falleriefallera dingdong moep blubb 471 dingdong blob))hop((gob))sob((0815))ding knickknack boing 45 nothing here too blabla:(1344):falleriefallera dingdong moep blubb 4711 dingdong blob))hop((gob))sob((0815))ding knickknack boing 46 nothing again blabla:(123):falleriefallera dingdong moep blubb 4711 dingdong blob))hop((gob))sob((0825))ding knickknack boing 47

      access it using the following perl-script:

      use strict;
      use warnings;

      # 1. get file and stuff it into an array
      # that what it will be in target code
      open FILE, 'target.txt' or die "nope dude: $!";
      my @stuff;
      while(<FILE>){
          chomp $_;
          push @stuff, $_;
      }
      print "reading done ";

      # 2. make a long line out of it
      # because I still have problems using an array for this :(
      my $longline;
      foreach my $x (@stuff){
          $longline .= $x;
      }

      # 3. get all matches and place them in an array array x)
      my @super;
      while ($longline =~ /\D+(\d+)\D+(\d+)\D+(\d+)\D+(\d+)/g){
          my @sub = ($1, $2, $3, $4);
          push @super, \@sub;
      }

      # 4. we should have four entries in that @super
      print scalar @super, "\n";

      will yield this (at least the debugger thinks so):

      0  ARRAY(0x1f08820)
         0  123
         1  4711
         2  0815
         3  44
      1  ARRAY(0x2199678)
         0  123
         1  471
         2  0815
         3  45
      2  ARRAY(0x21994e0)
         0  1344
         1  4711
         2  0815
         3  46
      3  ARRAY(0x219f128)
         0  123
         1  4711
         2  0825
         3  47

      so it will work in the way I hoped for. IF I can ever create a valid regex for this. But now I'm busy looking into these walkthroughs.

      By the way: using .+? didn't make the RegEx work, but I don't understand how [^>] should be utilized to help me in my case :( Because... I do find the correct piece of plain text in my file, so how should I include "no >" and "no <" inside?

      Greetings, a random visitor

        Your example is far more restricted because a character in \D (a non-digit) can never be matched by a character in \d (a digit) and vice-versa.

        This is why I suggested that you could use [^>]+ for characters within tags or [^<]+ for characters outside of tags (the text content). Both will only match normal characters and never the character that closes (or opens) a tag.
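A small sketch of why the counter-example behaved: \D+ and \d+ cannot match the same character, so neither can run into the other's territory, whereas .+ can match digits as well. The `$line` string is a made-up fragment in the style of the test file.

```perl
use strict;
use warnings;

# Made-up fragment in the style of the counter-example file.
my $line = "boing 44 nothing here blabla:(123):moep 4711";

# \D+ must stop at the first digit, so each match ends at the nearest number.
my @exclusive = $line =~ /\D+(\d+)/g;

# .+ may also eat digits: it runs to the end of the string and gives back
# only the single character that (\d+) needs in order to match.
my @greedy = $line =~ /.+(\d+)/g;

print "exclusive classes: @exclusive\n";   # 44 123 4711
print "greedy dot:        @greedy\n";      # 1
```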

Re: Regex keep matching the last possible match (but should get all)
by aaron_baugher (Curate) on May 18, 2015 at 12:59 UTC

    Each greedy match, working left to right (an important point when you have several as in this case), swallows up as many characters as it can while still allowing the match to succeed. So:

    $s = 'aaaaaaaaa';
    $s =~ /.+(a+)/;     # greedy match
    print "$1\n";       # prints "a" (a single 'a')

    $s = 'aaaaaaaaa';
    $s =~ /.+?(a+)/;    # non-greedy match
    print "$1\n";       # prints "aaaaaaaa" (all the 'a's but the first, which .+? needs)

    So when your pattern is matching too much, start from the end and work your way backwards. Look at what each wildcard pattern (like .+) can match, and think about how to restrict it so it won't match too much. Is there a character that can't appear in it? If so, exclude that character. For instance, if it can't contain any HTML tags, you could use a wildcard like: [^<>]+? That will say, "as few characters as possible, not including angle brackets, while allowing the pattern to match."
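To illustrate, here is a sketch with a made-up string (the cell contents and CLASS values are assumptions). A lazy .+? still stretches across tags when the rest of the pattern fails at the nearest candidate, while [^<>]+? cannot, so the engine has to restart at a later cell instead:

```perl
use strict;
use warnings;

# Made-up cells: only the second plain <TD> is followed by a CLASS='ok' cell.
my $html = "<TD>one</TD><TD CLASS='bad'>x</TD><TD>two</TD><TD CLASS='ok'>y</TD>";

# Lazy dot: starts at the first <TD> and keeps growing until the tail fits.
my ($lazy) = $html =~ m{<TD>(.+?)</TD><TD CLASS='ok'>};

# Restricted class: cannot cross an angle bracket, so the attempt at the
# first <TD> fails and the match is found at the later cell.
my ($strict) = $html =~ m{<TD>([^<>]+?)</TD><TD CLASS='ok'>};

print "lazy dot:   $lazy\n";
print "restricted: $strict\n";   # two
```

The lazy version captures "one", the whole 'bad' cell and "two" in a single group, which is the same kind of overrun the original poster saw after switching to .+?.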

    Another suggestion: whenever your regex contains forward slashes, use a different regex delimiter so you don't have to backslash the slashes. Even better, learn to use whitespace in your regexes. See how much clearer the last one here is:

    $text =~ /$mm\/$dd\/$yy/;    # works, but ugly

    $text =~ m|$mm/$dd/$yy|;     # use different delimiter

    $text =~ m|
        $mm     # month
        /
        $dd     # day
        /
        $yy     # year
    |x;         # allow whitespace

    The third example might seem like overkill for such a simple regex, but the longer and more complicated they get, the more these methods help.
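As a sketch of how the same two ideas might look on a pattern of the kind discussed in this thread: the `$row` string is a shortened, made-up row, and the lazy .*? gap is an assumption chosen for this sample rather than anything taken from a real Nagios page.

```perl
use strict;
use warnings;

# Shortened, made-up status row on a single line.
my $row = "<TD><A HREF='extinfo.cgi?host=x'>Description</A></TD>"
        . "<TD CLASS='statusOK'>OK</TD>"
        . "<TD CLASS='statusOdd' nowrap>2015-05-17 01:59:48</TD>"
        . "<TD CLASS='statusOdd' nowrap>145d 19h 53m 11s</TD>"
        . "<TD CLASS='statusOdd'>1/4</TD>";

# m{...} avoids backslashing "/", and /x allows layout and comments.
my @fields = $row =~ m{
    extinfo\.cgi [^>]+ > ([^<]+) </A>    # link text
    .*?                                  # lazy gap up to the status cell
    'status [^>]+ > ([^<]+) </TD>        # state
    <[^>]+> ([^<]+)   </TD>              # time of last check
    <[^>]+> ([^<]+ s) </TD>              # duration, ends in "s"
    <[^>]+> (\d/\d)   </TD>              # attempt counter
}xs;

print "$_\n" for @fields;
```

Under /x a literal space in the pattern has to be written as \s or escaped, which is worth keeping in mind when converting an existing one-line regex.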

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

Re: Regex keep matching the last possible match (but should get all)
by Anonymous Monk on May 18, 2015 at 13:27 UTC

    Dear Perl-Monks

    Thanks a lot for all the help you provided. I finally got it working. For the next weeks I will gladly delegate every RegEx to other workers here, that's for sure.

    If you wonder how it looks at the end... still without the fine art of using other delimiters.

    /'extinfo\.cgi[^>]+>([^<]+)<\/A>.{150,190}'status[^>]+>([^<]+)<\/TD><[^>]+>([^<]+)<\/TD><[^>]+>([^<]+s)<\/TD><[^>]+>(\d\/\d)<\/TD><[^>]+>([^<]+)<\/TD>/g

    I will smooth this out tomorrow, but at least it yields what I was hoping for: about 50 arrays containing the pieces of information I was looking for.

    Thanks again, and until next time :)

      ... I will gladly delegate every RegEx to other workers ...

      That which does not make your brain explode makes you smarter. I understand that your recent experience with the wily regex has been harrowing, but now is not the time to retreat. Rather, consolidate what you have gained and secure some new footholds.

      One point you seem not to have grasped is the critical difference between greedy and "lazy" (as I like to call it) matching; e.g., between  .+ and  .+? quantification. Please re-read previous sections in this thread bearing on this.

      Perhaps the most important insight to take away is that regexes, much as I love them, are not always the best solution to a problem. Maybe re-consider the advice initially offered by Corion about using a real HTML parser for HTML parsing.

      And yes, please do register as a user.

      Until next time...


      Give a man a fish:  <%-(-(-(-<

        Maybe re-consider the advice initially offered by Corion about using a real HTML parser for HTML parsing.

        Definitely. In 20 years of writing Perl, I've written a lot of long, ugly regexes to pull data out of HTML files as a one-time, quick-and-dirty solution. But I wouldn't count on any of them to be reliable enough to use repeatedly or for automated purposes. For anything reliable, use a module that won't break the day someone changes <TD> to <td> or rearranges a couple of tags.

        If the assignment says to do it with a regex, then that's what you do. But in a real-life parsing task, there's usually a better way.

        Aaron B.