Don't attempt to parse HTML (or XML) with regexps; that won't work. Use a proper parser, like HTML::Parser.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
What is going on with the non-greedy regexp?
Non-greedy stops at the earliest position where the pattern will match (before the first </div>).
Greedy stops at the latest position where the pattern will match (before the last </div>).
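To make the difference concrete, here is a minimal sketch. The sample string and variable names are made up for illustration and are not the original poster's data.

```perl
use strict;
use warnings;

# Hypothetical sample string with two sibling <div> elements.
my $html = '<div>one</div><div>two</div>';

# Greedy: .* takes as much as it can, so the match ends at the last </div>.
my ($greedy) = $html =~ m{<div>(.*)</div>};

# Non-greedy: .*? takes as little as it can, so the match ends at the first </div>.
my ($lazy) = $html =~ m{<div>(.*?)</div>};

print "greedy:     $greedy\n";   # one</div><div>two
print "non-greedy: $lazy\n";     # one
```

Both matches start at the same place, the first <div>; only the end position differs.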
The subpattern is not greedy. However, having a non-greedy subpattern does not mean that of all possible matches anywhere in the string, Perl is going to find the smallest overall match.
It just means that when it's time to match the non-greedy subpattern, of all the possible submatches that start at the current point and lead to an overall match, it'll pick the shortest one.
So the <div matches the first div, and from there Perl finds the smallest submatch that makes the entire pattern match.
Hence the result you're getting.
If you want to match <div> and </div> with no <div> in between, write your pattern accordingly. Or just use one of the many HTML parsing modules out there.
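As a rough illustration of the point above, the sketch below runs a non-greedy pattern against a made-up nested string, then a variant that refuses to cross another <div>. The string and variable names are assumptions for the demo only.

```perl
use strict;
use warnings;

# Hypothetical nested input.
my $html = 'A<div>B<div>C</div>D</div>E';

# The match is anchored at the FIRST <div>; .*? then stops at the first </div>.
my ($leftmost) = $html =~ m{<div>(.*?)</div>};

# Forbid a nested <div> between the tags, so only the innermost div can match.
my ($innermost) = $html =~ m{<div>((?:(?!<div>).)*?)</div>};

print "non-greedy only: $leftmost\n";    # B<div>C
print "no nested div:   $innermost\n";   # C
```

The first result is not the shortest possible match in the string; it is the shortest one that begins at the leftmost place a match can start.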
Given these two things you've said so far:
Please note that capital letters from my example represent any HTML string, with other tags included.
I'm just doing some global substitutions to clean up a slurped doc.
I have a hunch the task could be complicated enough that using regexes really isn't the way to go (unless you know something for certain about the parts you need to "clean up" that you haven't mentioned here so far).
In any case, it's definitely worthwhile to work your way out of this misconception, that using a real parser is too hard for a "simple" problem like yours, which, if I understand correctly, involves locating the content of the inner-most "div" within a set of nested divs.
Your own sample string doesn't do justice to your statement of the problem, so here's a basic parser demo that includes a test string "with other tags included" -- it took me about 15 minutes (update: well, less than 30, for sure -- my how time flies), including looking stuff up in the HTML::Parser man page:
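#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my $html = <<EOT;
<html><div>foo_a<div>foo_b
<div>foo_c0 <a href=\"bar\">baz</a> foo_c1</div>
foo_d</div>foo_e</div></html>
EOT

# $divtext accumulates the content of the div currently being read;
# $indiv is true between a <div> start tag and the next </div>.
my ( $divtext, $indiv );

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&div_check, "tagname,text"],
                           text_h  => [sub { $divtext .= $_[0] if $indiv }, "dtext"],
                           end_h   => [\&work_on_divtext, "tagname,text"] );
$p->parse( $html );
$p->eof;    # flush any text the parser is still buffering

# Start-tag handler: a new <div> restarts the collection, so only the
# innermost div's content survives; other start tags are kept verbatim.
sub div_check
{
    my ( $tag, $text ) = @_;
    if ( $tag eq 'div' ) {
        $divtext = '';
        $indiv = 1;
    }
    elsif ( $indiv ) {
        $divtext .= $text;
    }
}

# End-tag handler: </div> prints whatever was collected and stops
# collecting; other end tags are kept verbatim.
sub work_on_divtext
{
    my ( $tag, $text ) = @_;
    if ( $tag eq 'div' ) {
        print "=$divtext=\n" if ( $divtext );
        $divtext = '';
        $indiv = 0;
    }
    elsif ( $indiv ) {
        $divtext .= $text;
    }
}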
Depending on what you are really trying to accomplish, that script could probably be made a fair bit simpler than it is, but personally, I don't think it's all that complicated, and it's a lot more reliable, flexible, readable, maintainable, etc, etc, than any regex solution I could come up with (assuming I could come up with one, which I'm not sure I'd want to try).
Well, I've spent much more than an hour (actually over 4 hours) reading docs on advanced patterns and trying. :-P
Finally, each of the following patterns did what I wanted:
$a =~ m%<div\b[^>]*>((?:(?!</div>)(?!<div\b).)*)</div>%;
$a =~ m%<div\b[^>]*>((?:(?!<div\b).)*?)</div>%;
My first attempt was to use the simple "<div\b[^>]*>([^<>]*)</div>" pattern, but it failed when the string contained any other tag.
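A small sketch of why the first attempt fails while the lookahead version works. The test string is the one used in this thread; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $s = 'A<div class="X">B<div class="Y">C<span class="Z">D</span>E</div>F</div>G';

# First attempt: [^<>]* cannot step over the <span> tag, so no div matches at all.
my $simple = $s =~ m%<div\b[^>]*>([^<>]*)</div>% ? $1 : undef;

# Lookahead version: any character is allowed unless a <div starts there.
my $inner = $s =~ m%<div\b[^>]*>((?:(?!<div\b).)*?)</div>% ? $1 : undef;

print "simple:    ", ( defined $simple ? $simple : '(no match)' ), "\n";
print "lookahead: ", ( defined $inner  ? $inner  : '(no match)' ), "\n";
```

The lookahead pattern captures the content of the class Y div, span included, because only an opening <div is forbidden inside the capture.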
The following example shows better what I needed to do (in this case, class Y container being removed):
#!/usr/bin/perl
$a = 'A<div class="X">B<div class="Y">C<span class="Z">D</span>E</div>F</div>G';
print "before: $a\n";
# $1 rather than \1 on the replacement side; \1 triggers a warning under -w
$a =~ s%<div\b[^>]*>((?:(?!<div\b).)*?)</div>%$1%g;
print "after: $a\n";
Anyway, there is too much to learn on this topic...
Thanks to everyone!
Just thought that starting at <div> after B would result in a better nongreedy result.
Is there a way to match a string contained between two delimiters which does not contain a given word? (not a class)
No, the non-greedy modifier only affects /.*/. It doesn't affect any other part of the match, namely the earlier /<div>/.
Is there a way to match a string contained between two delimiters which does not contain a given word? (not a class)
The following would do here:
m{<div>(?:(?!</div>).)*</div>}
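The same construct extends to excluding an arbitrary word: put the word in the negative lookahead next to the closing delimiter. A minimal sketch, assuming the word to exclude is "bad" (the word and the sample string are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical input: the first div contains the forbidden word, the second does not.
my $html = '<div>has bad word</div><div>all good</div>';

# Each character is consumed only where neither "bad" nor "</div>" begins,
# so the captured text cannot contain the word or run past the closing tag.
my ($clean) = $html =~ m{<div>((?:(?!bad|</div>).)*)</div>};

print "matched: $clean\n";   # all good
```

The first div is skipped because the loop stops in front of "bad" and the closing tag is not there, so the engine moves on to the next <div>.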