icanwin has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I need one regular expression to match both texts. The text structure is similar, but sometimes I have a corrupted text I don't want to match.

text1 contains the <ul>*</ul> text I need to skip. The regular expression I have only matches text1, not text2. I've tried some variations with | without success.

Is it possible to do what I need in just one regular expression?

Note that I'm retrieving the text using $1 and $3. I have to retrieve the text using the same variables in both cases (I cannot use $1 and $3 for text1 and $1 and $2 for text2).

Any help will be appreciated.

my $text1 = qq{ <div id="aaaa"> text tex text <ul id="ccc">bla bla bla</ul> more text </div> };
my $text2 = qq{ <div id="aaaa"> text text text more text </div> };

my $regex = '<div id="aaaa">(.*)<ul id="ccc">(.*)</ul>(.*)</div>';

if ( $text1 =~ /$regex/sg ) {
    warn "Text 1 found ".$1.$3;
}
if ( $text2 =~ /$regex/sg ) {
    warn "Text 2 found ".$1.$3;
}

Re: Advanced regular expression help
by moritz (Cardinal) on Sep 12, 2008 at 11:13 UTC

    The old wisdom applies: parsing HTML with regexes is not a good idea. If the data is line-based, try to parse it line by line.

    However if you insist on using regexes...

    I don't quite get it - do you want the <ul id="ccc">(.*)</ul> part to be optional? If yes, make it optional: (?:<ul id="ccc">(.*)</ul>)?.

    You have to take care that the .* doesn't consume too much text. What do you want the delimiter to be? Newlines? Then use \n or $ or ^ and use the /m modifier.
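    To illustrate the two points above (an optional <ul> block, and captures that must not swallow too much), here is a minimal sketch. The helper name extract_text is hypothetical, and it assumes at most one <ul id="ccc"> block that is not nested. Note that simply making the first capture lazy is not enough: a lazy (.*?) followed by an optional group lets the optional group match nothing, so the sketch uses a negative lookahead to stop the first capture at the <ul> tag instead.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the text of the
# div with the optional <ul id="ccc"> block skipped, or undef on no match.
sub extract_text {
    my ($html) = @_;

    # Capture 1: everything up to (but not including) a <ul tag, if any.
    # Capture 2: the content of the optional <ul id="ccc"> block.
    # Capture 3: whatever follows, up to the closing </div>.
    my $regex = qr{<div id="aaaa">((?:(?!<ul\b).)*)(?:<ul id="ccc">(.*?)</ul>)?(.*?)</div>}s;

    return unless $html =~ $regex;
    return $1 . $3;    # same variables in both cases; capture 2 is ignored
}

my $text1 = q{ <div id="aaaa"> text tex text <ul id="ccc">bla bla bla</ul> more text </div> };
my $text2 = q{ <div id="aaaa"> text text text more text </div> };

for my $html ($text1, $text2) {
    my $text = extract_text($html);
    print defined $text ? "found:$text\n" : "no match\n";
}
```

    When there is no <ul> block, the first capture takes the whole text and the third capture is empty, so concatenating the first and third captures works for both inputs.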

    Also note that . won't match a newline unless the /s modifier is present (more on that in perlre).
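    A small sketch of how the /s and /m modifiers behave, using a made-up two-line string (the sample data is an assumption, not the poster's input):

```perl
use strict;
use warnings;

# Made-up sample containing an embedded newline.
my $html = "<div>line one\nline two</div>";

# Without /s, "." stops at the newline, so the closing tag is never reached.
my $without_s = $html =~ m{<div>(.*)</div>}  ? 1 : 0;

# With /s, "." also matches the newline and the pattern succeeds.
my $with_s    = $html =~ m{<div>(.*)</div>}s ? 1 : 0;

# With /m, ^ and $ match at the start and end of each line.
my @lines = $html =~ /^(.*)$/mg;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
print "line: $_\n" for @lines;
```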

      As moritz suggested. Anyway, if you want to use a regex and only need $1 and $3, try making the optional part a capturing group, since you don't care about $2. Something like this:
      my $regex = '<div id="aaaa">([.\w\s]*?)(<ul id="ccc">[.\s\w]*?</ul>)?( +[.\s\w]*?)</div>';
      This is the output if that's what you seek:
      Text 1 found text tex text more text
      Text 2 found text text text more text
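      For reference, a runnable sketch of that regex against both sample strings. The helper name join_captures is hypothetical. One caveat worth knowing: inside a character class, . is a literal dot, so [.\w\s] only admits dots, word characters and whitespace. Text containing other punctuation (a comma, say) would not match.

```perl
use strict;
use warnings;

# The regex from the reply above, compiled once.
my $regex = qr{<div id="aaaa">([.\w\s]*?)(<ul id="ccc">[.\s\w]*?</ul>)?( +[.\s\w]*?)</div>};

# Hypothetical helper: returns capture 1 joined with capture 3, or undef.
sub join_captures {
    my ($html) = @_;
    return unless $html =~ $regex;
    return $1 . $3;
}

my $text1 = q{ <div id="aaaa"> text tex text <ul id="ccc">bla bla bla</ul> more text </div> };
my $text2 = q{ <div id="aaaa"> text text text more text </div> };
my $comma = q{ <div id="aaaa"> text, more text </div> };   # made-up input with a comma

for my $html ($text1, $text2, $comma) {
    my $text = join_captures($html);
    print defined $text ? "found:$text\n" : "no match\n";
}
```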

      Regards
Re: Advanced regular expression help
by wfsp (Abbot) on Sep 12, 2008 at 15:55 UTC
    You want all the text (sans tags)?

    Here's my go with a parser:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    use HTML::TokeParser::Simple;

    my $text1 = qq{ <div id="aaaa"> text tex text <ul id="ccc">bla bla bla</ul> more text </div> };
    my $text2 = qq{ <div id="aaaa"> text text text more text </div> };

    my $txt;
    $txt = retrieve($text1);
    print $txt;
    print q{-} x 20;
    $txt = retrieve($text2);
    print $txt;

    sub retrieve{
      my $html = shift;
      my $p = HTML::TokeParser::Simple->new(\$html)
        or die qq{cant parse text: $!\n};

      # collect only the text tokens, dropping every tag
      my $txt;
      while (my $t = $p->get_token){
        $txt .= $t->as_is if $t->is_text;
      }
      return $txt;
    }
    If you have more exacting requirements or if (when) the spec changes this approach is, imo, easier to adapt than a regex approach.

    update: added output

    text tex text bla bla bla more text -------------------- text text text more text
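    As one example of the kind of adaptation mentioned above, here is a sketch that also drops the text inside <ul>...</ul>, which is what the original question asked for. The sub name retrieve_without_ul and the depth counter are assumptions added for illustration; the token methods (is_start_tag, is_end_tag, is_text, as_is) are the ones HTML::TokeParser::Simple provides.

```perl
use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical variant of retrieve(): collects text tokens, but skips
# anything that sits inside a <ul> element.
sub retrieve_without_ul {
    my $html = shift;
    my $p = HTML::TokeParser::Simple->new(\$html)
        or die qq{cannot parse text: $!\n};

    my $txt   = '';
    my $depth = 0;    # how many <ul> elements we are currently inside
    while (my $t = $p->get_token) {
        if ($t->is_start_tag('ul')) { $depth++;               next; }
        if ($t->is_end_tag('ul'))   { $depth-- if $depth > 0; next; }
        $txt .= $t->as_is if $t->is_text && $depth == 0;
    }
    return $txt;
}

my $text1 = q{ <div id="aaaa"> text tex text <ul id="ccc">bla bla bla</ul> more text </div> };
print retrieve_without_ul($text1), "\n";
```

    Counting depth rather than using a simple flag keeps the sketch working if <ul> elements are nested.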