Advanced regular expression help

icanwin has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I need one regular expression that matches both texts. The text structure is similar, but sometimes the text contains a corrupted part that I don't want to match.

text1 contains the <ul>*</ul> part I need to avoid. The regular expression I have only matches text1, not text2. I've tried some variations with | without success.

Is it possible to do what I need in just one regular expression?

Note that I'm retrieving the text using $1 and $3. I have to retrieve the text using the same variables in both cases (I cannot use $1 and $3 for text1 and $1 and $2 for text2).

Any help will be appreciated.

my $text1 = qq{
<div id="aaaa">
    text tex text
<ul id="ccc">bla bla bla</ul>
    more text
</div>
};

my $text2 = qq{
<div id="aaaa">
    text text text

    more text
</div>
};

my $regex = '<div id="aaaa">(.*)<ul id="ccc">(.*)</ul>(.*)</div>';
if ( $text1 =~ /$regex/sg ) {
    warn "Text 1 found ".$1.$3;
}
if ( $text2 =~ /$regex/sg ) {
    warn "Text 2 found ".$1.$3;
}


Re: Advanced regular expression help by moritz (Cardinal) on Sep 12, 2008 at 11:13 UTC
The old wisdom applies: parsing HTML with regexes is not a good idea. If the data is line based, try to parse it line by line. However, if you insist on using regexes... I don't quite get it: do you want the `<ul id="ccc">(.*)</ul>` part to be optional? If yes, make it optional: `(?:<ul id="ccc">(.*)</ul>)?`. You have to take care that the `.*` doesn't consume too much text. What do you want the delimiter to be? Newlines? Then use `\n` or `$` or `^` together with the `/m` modifier. Also note that `.` won't match a newline unless the `/s` modifier is present (more on that in perlre).
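A minimal sketch of the optional-group idea, assuming the two sample texts from the question. The sub name `extract_text` and the negative lookahead `(?!<ul\b|</div>)` that keeps the first group from running past the `<ul>` or the closing `</div>` are assumptions made for illustration, not something given in the thread. The `<ul>` part sits in an optional non-capturing group, so the wanted text stays in $1 and $3 for both inputs.

```perl
use strict;
use warnings;

my $text1 = qq{
<div id="aaaa">
    text tex text
<ul id="ccc">bla bla bla</ul>
    more text
</div>
};

my $text2 = qq{
<div id="aaaa">
    text text text

    more text
</div>
};

# $1: everything up to (but not including) "<ul" or "</div>"
# $2: contents of the optional <ul id="ccc"> element (undef if absent)
# $3: whatever follows the optional <ul> up to "</div>" (may be empty)
# /s lets "." match newlines.
my $regex = qr{<div id="aaaa">((?:(?!<ul\b|</div>).)*)(?:<ul id="ccc">(.*?)</ul>)?(.*?)</div>}s;

# Hypothetical helper: returns $1 . $3, or undef when nothing matches.
sub extract_text {
    my ($html) = @_;
    return unless $html =~ $regex;
    return $1 . $3;
}

print "Text 1 found: ", extract_text($text1), "\n";
print "Text 2 found: ", extract_text($text2), "\n";
```

For text2 the whole body ends up in $1 and $3 is an empty string, so the same `$1.$3` expression works in both cases. This only holds for input shaped like the samples; nested or differently attributed tags would need a real parser.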
Re^2: Advanced regular expression help by Andrew Coolman (Hermit) on Sep 12, 2008 at 18:16 UTC
As moritz suggested. Anyway, if you want to use a regex and only need $1 and $3, try making the optional part capturing, since you don't care about $2. Something like this:
`my $regex = '<div id="aaaa">([.\w\s]*?)(<ul id="ccc">[.\s\w]*?</ul>)?(\s+[.\s\w]*?)</div>';`
This is the output, if that's what you seek:
`Text 1 found text tex text more text`
`Text 2 found text text text more text`
Regards
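The reason this keeps the numbering stable: capture groups are numbered by the position of their opening parenthesis, so an optional group that does not take part in the match simply leaves its variable undefined without shifting the later ones. A tiny sketch with a toy pattern (the sub name `captures` and the strings 'abc' and 'ac' are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: in list context a match returns its captures,
# with undef for any group that did not take part in the match.
sub captures {
    my ($str) = @_;
    my @c = $str =~ /(a)(b)?(c)/;
    return @c;
}

for my $str ('abc', 'ac') {
    my @c = captures($str);
    printf "%s: \$1=%s \$2=%s \$3=%s\n",
        $str, map { defined $_ ? $_ : 'undef' } @c;
}
# prints:
# abc: $1=a $2=b $3=c
# ac: $1=a $2=undef $3=c
```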
Re: Advanced regular expression help by wfsp (Abbot) on Sep 12, 2008 at 15:55 UTC
You want all the text (sans tags)? Here's my go with a parser:

#!/usr/local/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $text1 = qq{
<div id="aaaa">
    text tex text
<ul id="ccc">bla bla bla</ul>
    more text
</div>
};

my $text2 = qq{
<div id="aaaa">
    text text text

    more text
</div>
};

my $txt;
$txt = retrieve($text1);
print $txt;
print q{-} x 20;
$txt = retrieve($text2);
print $txt;

sub retrieve{
    my $html = shift;
    my $p = HTML::TokeParser::Simple->new(\$html)
        or die qq{cant parse text: $!\n};
    my $txt;
    while (my $t = $p->get_token){
        $txt .= $t->as_is if $t->is_text;
    }
    return $txt;
}

If you have more exacting requirements, or if (when) the spec changes, this approach is, imo, easier to adapt than a regex approach.

update: added output

    text tex text
bla bla bla
    more text
--------------------
    text text text

    more text