in reply to capturing between divs

I agree with what everyone else is saying. If you're not using an HTML parser, you will have much more brittle code. Regular expressions, except in the most trivial cases, are harder to use correctly than any of the parsers.

Re^2: capturing between divs
by monarch (Priest) on Apr 13, 2009 at 03:58 UTC

    "If you're not using an HTML parser, you will have much more brittle code."

    ...of course, if you don't understand how the parser works, then you're no better off than if you used a simple regexp in the first place.

    Try this for size:

    while ( $html =~ m{<div[^>]*>(.*?)</div>}sgi ) {
        my $inside_div = $1;
        # process contents of $inside_div ...
    }

    This regexp simply looks for content between div tags. It does not support nested divs, but if you want complex parsing you're better off using a complex parser.
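
    To make the nested-div limitation concrete, here is a minimal sketch that wraps the same regexp in a helper and runs it on a flat sample and a nested one. The name div_contents and both sample strings are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Collect the contents of every <div>...</div> pair using the regexp above.
# div_contents is a hypothetical helper name used only for this sketch.
sub div_contents {
    my ($html) = @_;
    my @found;
    while ( $html =~ m{<div[^>]*>(.*?)</div>}sgi ) {
        push @found, $1;
    }
    return @found;
}

# Flat input: two sibling divs, one upper-case and spanning two lines.
my @flat = div_contents(qq{<div class="a">one</div>\n<DIV>two\nlines</DIV>});

# Nested input: an inner div inside an outer one.
my @nested = div_contents('<div>outer <div>inner</div> tail</div>');

print "flat:   [$_]\n" for @flat;
print "nested: [$_]\n" for @nested;
```

    On the nested sample the lazy (.*?) stops at the first </div>, so the only capture is "outer <div>inner" and the trailing " tail" is never returned.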

    The regexp has the s (multi-line), g (global), and i (case-insensitive) flags set.

      but if you want complex parsing you're better off using a complex parser.

      Well... no. If you want correct parsing, you should use a parser. A regex like that can indeed work, but it has many edge cases where it will fail, and I am personally sick of inheriting code that fails when there are numerous well-known, deeply tested, and vetted ways to solve the problem correctly.
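
      As one possible illustration of the parser route, here is a rough sketch that assumes the CPAN module HTML::TreeBuilder is installed. The thread does not name a specific parser, so this choice, the helper name div_texts, and the sample string are all assumptions.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, not in core; assumed to be installed

# Return the text content of every <div>, nested ones included.
# div_texts is a hypothetical helper name used only for this sketch.
sub div_texts {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @texts = map { $_->as_text } $tree->look_down( _tag => 'div' );
    $tree->delete;    # free the tree explicitly once the text is extracted
    return @texts;
}

my @texts = div_texts('<div>outer <div>inner</div> tail</div>');
print scalar(@texts), " divs found\n";
print "[$_]\n" for @texts;
```

      Unlike the regexp, this should report both the outer and the inner div, because the parser tracks which closing tag belongs to which opening tag.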

      For quick one-offs, or if you know your input intimately, a regex on HTML can be okay, but for production code it is just Wrong™. Also, I realize you might understand this and just chose poor terms, but s is the single-line modifier (it makes . match newlines), while m is the multi-line modifier; see perlre.
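
      The difference between the two modifiers can be checked with a short sketch. The sample string and variable names here are invented for the demonstration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Without /s, '.' does not match "\n", so the match cannot span both lines.
my $dot_plain = $text =~ /first.*second/ ? 1 : 0;

# With /s (single line), '.' also matches "\n".
my $dot_s = $text =~ /first.*second/s ? 1 : 0;

# Without /m, '^' anchors only at the very start of the string.
my $caret_plain = $text =~ /^second/ ? 1 : 0;

# With /m (multi-line), '^' also matches right after an embedded newline.
my $caret_m = $text =~ /^second/m ? 1 : 0;

print "dot_plain=$dot_plain dot_s=$dot_s caret_plain=$caret_plain caret_m=$caret_m\n";
# prints: dot_plain=0 dot_s=1 caret_plain=0 caret_m=1
```

      So /s changes what . matches, while /m changes where ^ and $ can anchor; the two are independent and can be combined.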