capturing between divs

sv123 has asked for the wisdom of the Perl Monks concerning the following question:

Re: capturing between divs by kennethk (Abbot) on Apr 12, 2009 at 16:05 UTC
The best solutions are those where you use someone else's code, which is exactly what repellent, gmargo and CountZero suggested in response to your previous query. This is specifically so you don't have to worry about variations and edge cases. For example, how do you know the HTML even conforms to the standard? Browsers are notoriously forgiving of non-conformant HTML. It's generally not a good idea to parse HTML or XML manually, since regexes have a hard time with nested data structures. All of this is already handled in HTML::TokeParser::Simple (or any of probably a dozen other HTML parsers on CPAN). If you are really motivated to use regular expressions, read perlretut and pay particular attention to non-greedy quantifiers.
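To illustrate the point about non-greedy quantifiers, here is a minimal sketch in core Perl. The sample string is a made-up input, not the original poster's data.

```perl
use strict;
use warnings;

# Hypothetical sample input: two sibling divs on one line.
my $html = '<div>first</div><div>second</div>';

# Greedy: .* takes as much as it can, so the match runs to the LAST </div>.
my ($greedy) = $html =~ m{<div>(.*)</div>};

# Non-greedy: .*? takes as little as it can, so it stops at the FIRST </div>.
my ($lazy) = $html =~ m{<div>(.*?)</div>};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture swallows the markup between the two divs, while the non-greedy one returns only the contents of the first div.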
Re: capturing between divs by shmem (Chancellor) on Apr 12, 2009 at 19:06 UTC
Time to quote myself: Why do I insist in parsing HTML with regular expressions, despite warnings all over the place?
Re^2: capturing between divs by monarch (Priest) on Apr 13, 2009 at 04:07 UTC
When you have a reasonable idea of what to expect, a linear regexp can often be much simpler than a complex nested recursive processing module. Unfortunately I've been working with a bunch of buffoons lately who couldn't even perform a simple slurp without importing some module from CPAN. And now we have a bazillion CPAN modules that have to be custom compiled into packages for distribution onto our end systems, when some simple code could have saved the sys admins much extra effort. Blindly relying on CPAN to solve all problems is sheer stupidity. Better to understand the problem first. Why not ask the author of the post whether their divs could ever be nested (e.g. is their source random pages from the internet)?
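For reference, a "simple slurp" needs nothing beyond core Perl. The sketch below assumes a throwaway temp file (created with the core File::Temp module purely so the example is self-contained); the file contents are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway file so the sketch runs on its own (hypothetical contents).
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "<div>one</div>\n<div>two</div>\n";
close $tmp_fh or die "close failed: $!";

# Slurp: with $/ (the input record separator) undefined,
# a single read returns the entire file as one string.
my $html = do {
    local $/;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    <$fh>;
};

print $html;
```

Localizing `$/` inside the `do` block keeps the change from leaking into the rest of the program.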
Re: capturing between divs by Your Mother (Archbishop) on Apr 12, 2009 at 17:54 UTC
What everyone else is saying. If you're not using an HTML parser, you will have much more brittle code. Regular expressions, except in the most trivial cases, are harder to use correctly than any of the parsers.
Re^2: capturing between divs by monarch (Priest) on Apr 13, 2009 at 03:58 UTC
"If you're not using an HTML parser, you will have much more brittle code." ...of course, if you don't understand how the parser works then you're no better off than if you used a simple regexp in the first place. Try this for size: `while ( $html =~ m{<div[^>]*>(.*?)</div>}sgi ) { my $inside_div = $1; # process contents of $inside_div ... }` This regexp simply looks for content between div tags. It does not support nested divs, but if you want complex parsing you're better off using a complex parser. The regexp has the `s` (multi-line), `g` (global), and `i` (case-insensitive) flags set.
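The nested-div limitation mentioned above can be seen in a short sketch. It assumes the pattern is meant to read `<div[^>]*>(.*?)</div>` (with the `*` quantifiers in place) and uses a made-up input string.

```perl
use strict;
use warnings;

# Hypothetical input with one div nested inside another.
my $html = '<div id="outer">a<div id="inner">b</div>c</div>';

my @captures;
while ( $html =~ m{<div[^>]*>(.*?)</div>}sgi ) {
    push @captures, $1;
}

# The non-greedy match pairs the OUTER opening tag with the INNER closing tag,
# so the single capture is a truncated fragment rather than either div's contents.
print scalar(@captures), " capture(s): @captures\n";
```

The loop yields one capture, `a<div id="inner">b`, and the trailing `c</div>` is never matched because no opening tag precedes it.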
Re^3: capturing between divs by Your Mother (Archbishop) on Apr 13, 2009 at 06:18 UTC
"but if you want complex parsing you're better off using a complex parser." Well... no. If you want correct parsing, you should use a parser. A regex like that can indeed work, but it has many edge cases where it will fail, and I am personally sick of inheriting code that fails when there are numerous well-known, deeply tested and vetted ways to solve the problem correctly. For quick one-offs, or if you know your input intimately, a regex on HTML can be okay, but for production code it is just Wrong™. Also, I realize you might understand this and just chose poor terms, but `s` means `.` matches newlines (the single-line modifier); `m` is the multi-line modifier. See perlre.
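The difference between the two modifiers can be checked with a minimal sketch; the strings below are made-up test inputs.

```perl
use strict;
use warnings;

# Hypothetical input: a div whose contents span two lines.
my $text = "<div>line one\nline two</div>";

# Without /s, . does not match "\n", so the pattern cannot cross the line break.
my $without_s = $text =~ m{<div>(.*?)</div>}  ? 1 : 0;

# With /s, . matches any character, including "\n".
my $with_s    = $text =~ m{<div>(.*?)</div>}s ? 1 : 0;

# /m changes ^ and $ so that they match at the start and end of every line,
# not only at the start and end of the whole string.
my @starts_plain = "a\nb\nc" =~ /^(\w)/g;
my @starts_m     = "a\nb\nc" =~ /^(\w)/mg;

print "without /s: $without_s, with /s: $with_s\n";
print "without /m: @starts_plain; with /m: @starts_m\n";
```

So `/s` affects what `.` matches, while `/m` affects where `^` and `$` match; the two are independent and can be combined.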