cdherold has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to extract entire sections of text around, and including, a keyword between two symbols. For instance, in the following, the keyword is "ISTF" and I want to get everything between <tr> and </tr>.
<tr> The last company to fail was ISTF with losses of $60M.</tr>
Any help greatly appreciated,
Chris

I want to use something like:
for ($text =~ /<tr>(.*?)<\/tr>/gmsi){ $selected_text=$1
but I need to get the keyword in there so that it selectively pulls out that sentence, as opposed to others that are also flanked by <tr> and </tr>.

Replies are listed 'Best First'.
(Ovid) Re: How do I extract all text around a keyword between two symbols?
by Ovid (Cardinal) on Aug 01, 2001 at 00:57 UTC

    Don't use regular expressions to parse HTML. There are simply too many ways to write HTML. What happens if the closing tag is </td > (note the white space), for example?

    The following code uses HTML::TokeParser and stuffs the data you want into the @data array.

    #!/usr/bin/perl -w
    require 5.004;
    use strict;
    use HTML::TokeParser;
    use Data::Dumper;

    my $doc = shift or &usage;
    my $p = HTML::TokeParser->new($doc) || die "Can't open: $!";
    my @data;

    # walk through document and get each tag
    while (my $token = $p->get_tag) {
        my $tag = $token->[0];
        if ( $tag eq 'td' ) {
            # get_text until a closing 'td' is found
            push @data, $p->get_trimmed_text( "/td" );
        }
    }
    print Dumper \@data;

    sub usage {
        print "\ttest.pl some.html";
        exit;
    }

    Cheers,
    Ovid

    Vote for paco!

Re: How do I extract all text around a keyword between two symbols?
by ChemBoy (Priest) on Aug 01, 2001 at 00:30 UTC

    To do this reliably (assuming that's a real-life example up there), you are going to need to use some derivative of HTML::Parser: probably HTML::TokeParser (which is recommended for small jobs) or HTML::TableExtract.

    If you really are getting things out of table rows, then of course HTML::TableExtract is an easy call, and it's an easy module to use, and you're set. If a more general HTML solution is needed, then I would recommend something along the lines of the following pseudocode (using TokeParser):

    my ($temp,$flag);
    while ($token = $parser->get_token) {
        if ($token eq $start_token) {
            $flag = 1
        }
        elsif ($token eq $end_token) {
            $flag = 0;
            return $temp if $temp =~ /KEY/;
            $temp = '';
        }
        $temp .= $token if $flag;
    }
    Please note that this is NOT correct syntax for HTML::TokeParser (though it's not as far off as I was originally expecting it to be), it's just an approximation from which (hopefully) you can figure out how to do what you want to do.
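
    For reference, here is a minimal runnable sketch of the same flag-and-buffer idea using the real HTML::TokeParser token format, where get_token returns array refs whose first element is "S" (start tag), "E" (end tag) or "T" (text). The sample HTML and the helper name rows_with_keyword are assumptions made up for illustration; they are not from the original post or from the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input; the second row contains nested markup.
my $html = <<'HTML';
<table>
<tr><td>The first company to fail was ACME.</td></tr>
<tr><td>The <em>last</em> company to fail was ISTF with losses of $60M.</td></tr>
</table>
HTML

# rows_with_keyword() is an illustrative helper, not part of HTML::TokeParser.
# It returns the text of every <$tag>...</$tag> section containing $keyword.
sub rows_with_keyword {
    my ($html, $tag, $keyword) = @_;

    # A scalar reference tells the parser to treat the string as the document.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Can't parse document: $!";

    my ($inside, $buffer, @found) = (0, '');
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' and $token->[1] eq $tag) {
            # Start tag: begin collecting text into a fresh buffer.
            ($inside, $buffer) = (1, '');
        }
        elsif ($type eq 'E' and $token->[1] eq $tag) {
            # End tag: keep the buffer only if it mentions the keyword.
            push @found, $buffer
                if $inside and $buffer =~ /\b\Q$keyword\E\b/;
            $inside = 0;
        }
        elsif ($type eq 'T' and $inside) {
            # Text token: append it; nested tags such as <em> are skipped.
            $buffer .= $token->[1];
        }
    }
    return @found;
}

print "$_\n" for rows_with_keyword($html, 'tr', 'ISTF');
```

    With the sample above this should print only the ISTF row, with the <em> markup dropped. Note that text tokens from get_token are not entity-decoded, so a real script may want to decode them before matching.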



    If God had meant us to fly, he would *never* have given us the railroads.
        --Michael Flanders

Re: How do I extract all text around a keyword between two symbols?
by tadman (Prior) on Aug 01, 2001 at 00:18 UTC
    You could use HTML::TableExtract, HTML::Parser, or just have a go with your technique:

    foreach ($text =~ m#<tr>(.*?)</tr>#gmsi) {
        # Note that foreach assigns the memorized text to $_
        if (/\bISTF\b/) {
            print "$_\n";
        }
    }
Re: How do I extract all text around a keyword between two symbols?
by tachyon (Chancellor) on Aug 01, 2001 at 00:42 UTC

    This works assuming you want quick and dirty rather than using the elegant reliable Parser solution.

    $string =<<'TEXT';
    <tr>1 The last company to fail was ISTF with losses of $60M.</tr>
    <tr>2 The last company to fail was ITF with losses of $60M.</tr>
    <tr>3 The last company to fail wasISTF with losses of $60M.</tr>
    < tr>4 The last company to fail was ISTF with losses of $60M.</tr >
    TEXT

    print "Found $1\n" while $string =~ m|<\s*tr\s*>([^<]*\bISTF\b[^<]*)<\s*/tr\s*>|gi;

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      I'd say that's a little bit too quick and dirty; it breaks easily on something like:

      <tr>The <em>last</em> company to fail was ISTF.</tr>

      Parsing HTML is tricky; that's what the parser modules (mentioned multiple times in this thread) are for.

      -- Hofmator
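
      To make the failure modes concrete, here is a small sketch using only core Perl that runs two regex styles from this thread against three rows. The sample rows, the variable names and the matches() helper are assumptions chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical inputs: a plain row, a row with a nested tag,
# and a row whose closing tag contains white space.
my @rows = (
    '<tr>The last company to fail was ISTF.</tr>',
    '<tr>The <em>last</em> company to fail was ISTF.</tr>',
    '<tr>The last company to fail was ISTF.</tr >',
);

# Style 1: tolerate white space in the tags, but allow no "<" in between.
my $no_nested = qr{<\s*tr\s*>([^<]*\bISTF\b[^<]*)<\s*/tr\s*>}i;

# Style 2: literal tags only, anything in between.
my $literal = qr{<tr>(.*?)</tr>}is;

# Illustrative helper: 1 if the pattern matches the string, else 0.
sub matches {
    my ($re, $string) = @_;
    return $string =~ $re ? 1 : 0;
}

for my $row (@rows) {
    printf "no_nested=%d literal=%d  %s\n",
        matches($no_nested, $row), matches($literal, $row), $row;
}
```

      Each pattern misses a row that the other one catches: the first fails on the nested <em>, the second on the padded </tr >. Patching one regex to cover both cases only pushes the problem to the next HTML variation, which is the argument for a parser.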

Re: How do I extract all text around a keyword between two symbols?
by Agermain (Scribe) on Aug 01, 2001 at 00:23 UTC

    I'm no regex master, but if it were me I'd do it in two stages:

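    # In list context m//g returns the captured text of every match,
    # so loop over that list instead of reading $1 inside the loop.
    for my $selected ($text =~ /<tr>(.*?)<\/tr>/gsi) {
        if ($selected =~ /ISTF/i) {
            $selected_text = $selected;
            ## Your code here
        }
    }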

    andre germain
    "Wherever you go, there you are."