Re: Advanced regular expression help

You want all the text (sans tags)?

Here's my go with a parser:

#!/usr/local/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $text1 = qq{
<div id="aaaa">
    text tex text
<ul id="ccc">bla bla bla</ul>
    more text
</div>
};

my $text2 = qq{
<div id="aaaa">
    text text text

    more text
</div>
};

my $txt;
$txt = retrieve($text1);
print $txt;
print q{-} x 20;
$txt = retrieve($text2);
print $txt;

sub retrieve{
  my $html = shift;
  my $p = HTML::TokeParser::Simple->new(\$html)
    or die qq{can't parse text: $!\n};
  my $txt = q{};  # start empty so input with no text returns '' rather than undef
  while (my $t = $p->get_token){
    $txt .= $t->as_is if $t->is_text;
  }
  return $txt;
}

If you have more exacting requirements, or if (when) the spec changes, this approach is, imo, easier to adapt than a regex.
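To illustrate what "easier to adapt" might look like, here is a rough sketch of one possible change: keeping the text but skipping anything inside a nested <ul>. That requirement is only my assumption for the sake of the example, and the sub name retrieve_outside_ul is made up; the only parser calls used are get_token, is_start_tag, is_end_tag, is_text and as_is.

```perl
#!/usr/local/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical variation on retrieve(): collect text tokens,
# but ignore everything between <ul> and </ul>.
sub retrieve_outside_ul {
  my $html = shift;
  my $p = HTML::TokeParser::Simple->new(\$html)
    or die qq{can't parse text: $!\n};
  my $txt   = q{};
  my $depth = 0;    # number of <ul> elements we are currently inside
  while (my $t = $p->get_token){
    if    ($t->is_start_tag(q{ul})) { $depth++ }
    elsif ($t->is_end_tag(q{ul}))   { $depth-- if $depth }
    elsif ($t->is_text && !$depth)  { $txt .= $t->as_is }
  }
  return $txt;
}

my $sample = qq{<div id="aaaa">\n    text\n<ul id="ccc">bla bla bla</ul>\n    more text\n</div>\n};
print retrieve_outside_ul($sample);
```

The only change from the original loop is the depth counter, so a later tweak (a different tag, or only one particular id) would be another one-line condition rather than a rewritten pattern.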

update: added output

    text tex text
bla bla bla
    more text

--------------------

    text text text

    more text