sv123 has asked for the wisdom of the Perl Monks concerning the following question:

Yet another regex question. Within a large HTML document there are several blocks of code like the one below. I'm interested in capturing what lies between the <div> and </div> tags. So far, I've had no luck. Any help is always appreciated. I'd like to do this with regular expressions if possible. Thanks again!
<div id="blah_999999"> <a href="http://linklinkli.nk" onmouseover="hhjhj" onmouseout="fghfghf"> <img width="100" height="100" src="http://RANDOM URL" alt="ALT TEXT" id="ffdg" /> </a> </div>

Re: capturing between divs
by kennethk (Abbot) on Apr 12, 2009 at 16:05 UTC
    The best solutions are those where you use someone else's code, which is exactly what repellent, gmargo and CountZero suggested in response to your previous query. The point is that you don't have to worry about variations and edge cases. For example, how do you know the HTML even conforms to the standard? Browsers are notoriously forgiving of non-conformant HTML. It's generally not a good idea to parse HTML or XML by hand, since regexes have a hard time with nested data structures. All of this is already handled by HTML::TokeParser::Simple (or any of probably a dozen other HTML parsers on CPAN).

    If you are really motivated to use regular expressions, read perlretut and pay particular attention to non-greedy quantifiers.
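
To make the greedy versus non-greedy distinction concrete, here is a minimal sketch. The two-div input string is made up for illustration and is not the poster's actual page; the patterns are only one plausible way to write the match.

```perl
use strict;
use warnings;

# Hypothetical input with two sibling divs (invented for this example).
my $html = '<div id="a">first</div><div id="b">second</div>';

# Greedy: .* runs on to the LAST </div>, so both blocks are swallowed.
my ($greedy) = $html =~ m{<div[^>]*>(.*)</div>}s;

# Non-greedy: .*? stops at the FIRST </div> it can reach.
my ($lazy) = $html =~ m{<div[^>]*>(.*?)</div>}s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture holds everything from "first" through "second", including the markup in between, while the non-greedy capture holds only "first".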

Re: capturing between divs
by shmem (Chancellor) on Apr 12, 2009 at 19:06 UTC

      When you have a reasonable idea of what to expect, a linear regexp can often be much simpler than a complex nested recursive processing module.

      Unfortunately I've been working with a bunch of buffoons lately who couldn't even perform a simple slurp without importing some module from CPAN. And now we have a bazillion CPAN modules that have to be custom compiled into packages for distribution onto our end systems when some simple code could have alleviated much extra effort for sys admins.

      Blindly relying on CPAN to solve all problems is sheer stupidity. Better to understand the problem first. Why not ask the author of the post whether their divs could ever be nested (e.g. is their source random pages from the internet) etc?

Re: capturing between divs
by Your Mother (Archbishop) on Apr 12, 2009 at 17:54 UTC

    What everyone else is saying. If you're not using an HTML parser, you will have much more brittle code. Regular expressions, except in the most trivial cases, are harder to use correctly than any of the parsers.

      "If you're not using an HTML parser, you will have much more brittle code."

      ...of course, if you don't understand how the parser works then you're no better off than if you used a simple regexp in the first place.

      Try this for size:

      while ( $html =~ m{<div[^>]*>(.*?)</div>}sgi ) {
          my $inside_div = $1;
          # process contents of $inside_div ...
      }

      This regexp simply looks for content between div tags. It does not support nested divs, but if you want complex parsing you're better off using a complex parser.
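
As a rough sketch of what "does not support nested divs" means in practice, the same loop can be run over a nested snippet. The input string here is hypothetical, invented only to show the behaviour.

```perl
use strict;
use warnings;

# Hypothetical nested input (not from the original post).
my $html = '<div id="outer">A<div id="inner">B</div>C</div>';

my @captures;
while ( $html =~ m{<div[^>]*>(.*?)</div>}sgi ) {
    # Each capture ends at the first </div> seen, whichever div it closes.
    push @captures, $1;
}

print scalar(@captures), " capture(s)\n";
print "[$_]\n" for @captures;
```

The outer div's capture stops at the inner closing tag, so it comes back as A plus the unclosed inner div, and the trailing "C" is never captured at all.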

      The regexp has the s (multi-line), g (global), and i (case-insensitive) flags set.

        but if you want complex parsing you're better off using a complex parser.

        Well... no. If you want correct parsing, you should use a parser. A regex like that can indeed work, but it has many edge cases where it will fail, and I am personally sick of inheriting code that fails when there are numerous well-known, deeply tested and vetted ways to solve the problem correctly.
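
Two such edge cases, sketched under the assumption that the pattern is the one quoted earlier in the thread. Both inputs are made up for illustration; they are not taken from the poster's document, and they are only examples of the kind of failure meant here.

```perl
use strict;
use warnings;

# The pattern from the thread, compiled once for reuse.
my $re = qr{<div[^>]*>(.*?)</div>}si;

# Case 1: a quoted attribute value may contain a literal ">",
# which makes [^>]* end the opening tag too early.
my ($attr) = '<div title="a > b">text</div>' =~ $re;

# Case 2: markup inside an HTML comment is matched like live markup.
my ($commented) = '<!-- <div>old</div> --><div>new</div>' =~ $re;

print "case 1: [$attr]\n";
print "case 2: [$commented]\n";
```

In the first case the capture starts in the middle of the attribute instead of at "text"; in the second, the commented-out "old" is returned before the live "new".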

        For quick one-offs, or if you know your input intimately, a regex on HTML can be okay, but for production code it is just Wrong™. Also, I realize you might understand this and just chose poor terms, but s makes . match newlines (it is the single-line modifier); m is the multi-line modifier. See perlre.
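
A small sketch of the difference between the two modifiers, using a two-line string made up for the purpose:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Without /s, "." does not match a newline, so .* cannot cross lines.
my $dot_plain = $text =~ /first.*second/  ? 1 : 0;
# With /s, "." matches any character, newline included.
my $dot_s     = $text =~ /first.*second/s ? 1 : 0;

# Without /m, ^ anchors only at the very start of the string.
my $caret_plain = $text =~ /^second/  ? 1 : 0;
# With /m, ^ also matches right after each embedded newline.
my $caret_m     = $text =~ /^second/m ? 1 : 0;

print "dot:   plain=$dot_plain s=$dot_s\n";
print "caret: plain=$caret_plain m=$caret_m\n";
```

So /s changes what the dot matches, while /m changes where ^ and $ are allowed to anchor; the two are independent and can be combined.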