How do I extract all text around a keyword between two symbols?

cdherold has asked for the wisdom of the Perl Monks concerning the following question:

(Ovid) Re: How do I extract all text around a keyword between two symbols? by Ovid (Cardinal) on Aug 01, 2001 at 00:57 UTC
Don't use regular expressions to parse HTML. There are simply too many ways to write HTML. What happens if the closing tag is `</td >` (note the white space), for example? The following code uses HTML::TokeParser and stuffs the data you want into the `@data` array. `#!/usr/bin/perl -w require 5.004; use strict; use HTML::TokeParser; use Data::Dumper; my $doc = shift or &usage; my $p = HTML::TokeParser->new($doc) || die "Can't open: $!"; my @data; # walk through document and get each tag while (my $token = $p->get_tag) { my $tag = $token->[0]; if ( $tag eq 'td' ) { # get_text until a closing 'td' is found push @data, $p->get_trimmed_text( "/td" ); } } print Dumper \@data; sub usage { print "\ttest.pl some.html"; exit; }` Cheers, Ovid
Re: How do I extract all text around a keyword between two symbols? by ChemBoy (Priest) on Aug 01, 2001 at 00:30 UTC
To do this reliably (assuming that's a real-life example up there), you are going to need some derivative of HTML::Parser, probably HTML::TokeParser (which is recommended for small jobs) or HTML::TableExtract. If you really are getting things out of table rows, then HTML::TableExtract is an easy call: it's an easy module to use, and you're set. If a more general HTML solution is needed, then I would recommend something along the lines of the following pseudocode (using TokeParser): `my ($temp,$flag); while ($token = $parser->get_token) { if ($token eq $start_token) { $flag = 1 } elsif ($token eq $end_token) { $flag = 0; return $temp if $temp =~/KEY/; $temp = ''; } $temp .= $token if $flag; }` Please note that this is NOT correct syntax for HTML::TokeParser (though it's not as far off as I was originally expecting it to be); it's just an approximation from which (hopefully) you can figure out how to do what you want to do. If God had meant us to fly, he would never have given us the railroads. --Michael Flanders
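A minimal sketch of how the flag-based pseudocode above might look with real HTML::TokeParser calls. `get_token` returns an array reference whose first element is the token type (`S` for a start tag, `E` for an end tag, `T` for text). The sample HTML and the `elements_matching` helper are hypothetical, made up for illustration, since the original question's markup is not shown in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input (not from the original question).
my $html = <<'HTML';
<table>
<tr><td>The last company to fail was ISTF.</td></tr>
<tr><td>The <em>previous</em> one was ITF.</td></tr>
</table>
HTML

# Hypothetical helper: return the text of every <$tag>...</$tag>
# element whose collected text contains $key as a whole word.
sub elements_matching {
    my ($html, $tag, $key) = @_;

    # Passing a reference makes TokeParser read from the string itself.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Can't parse HTML string";

    my ($inside, $text, @found) = (0, '');
    while (my $token = $parser->get_token) {
        my ($type, $value) = @$token;   # $value: tag name, or text for 'T'
        if ($type eq 'S' && $value eq $tag) {
            ($inside, $text) = (1, '');             # start collecting
        }
        elsif ($type eq 'E' && $value eq $tag) {
            push @found, $text
                if $inside && $text =~ /\b\Q$key\E\b/;
            $inside = 0;                            # stop collecting
        }
        elsif ($type eq 'T' && $inside) {
            $text .= $value;                        # append raw text
        }
    }
    return @found;
}

print "$_\n" for elements_matching($html, 'tr', 'ISTF');
```

Nested tags such as `<em>` produce their own `S` and `E` tokens, so they are skipped while the text around them is still collected. Text tokens from `get_token` are not entity-decoded; use `get_text` or `get_trimmed_text`, as in Ovid's reply, if decoded text is needed.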
Re: How do I extract all text around a keyword between two symbols? by tadman (Prior) on Aug 01, 2001 at 00:18 UTC
You could use HTML::TableExtract, HTML::Parser, or just have a go with your technique: `foreach ($text =~ m#<tr>(.*?)</tr>#gmsi) { # Note that foreach assigns the captured text to $_ if (/\bISTF\b/) { print "$_\n"; } }`
Re: How do I extract all text around a keyword between two symbols? by tachyon (Chancellor) on Aug 01, 2001 at 00:42 UTC
This works, assuming you want quick and dirty rather than the elegant, reliable parser solution. `$string =<<'TEXT'; <tr>1 The last company to fail was ISTF with losses of $60M.</tr> <tr>2 The last company to fail was ITF with losses of $60M.</tr> <tr>3 The last company to fail wasISTF with losses of $60M.</tr> < tr>4 The last company to fail was ISTF with losses of $60M.</tr > TEXT print "Found $1\n" while $string =~ m|<\s*tr\s*>([^<]*\bISTF\b[^<]*)<\s*/tr\s*>|gi;` cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Re: How do I extract all text around a keyword between two symbols? by Hofmator (Curate) on Aug 01, 2001 at 16:31 UTC
I'd say that's a little bit too quick and dirty; it breaks easily on something like: `<tr>The <em>last</em> company to fail was ISTF.</tr>` Parsing HTML is tricky; that's what the parser modules (mentioned multiple times in this thread) are for. -- Hofmator
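A small sketch of the failure described above, using only core Perl. The first row is Hofmator's counterexample and the second is adapted from tachyon's sample data. The one-pass pattern uses `[^<]*`, which cannot cross the inner `<em>` tag. A two-stage variant (capture each row non-greedily, then test for the keyword) handles this particular case, but it is still a regex and would still break on other valid HTML, such as `<tr>` tags with attributes or nested tables.

```perl
use strict;
use warnings;

# Row 1: Hofmator's counterexample. Row 2: plain text, no nested tags.
my $text = <<'TEXT';
<tr>The <em>last</em> company to fail was ISTF.</tr>
<tr>The last company to fail was ISTF with losses of $60M.</tr>
TEXT

# One pass: [^<]* stops at the first '<', so row 1 can never match.
my @one_pass = $text =~ m|<\s*tr\s*>([^<]*\bISTF\b[^<]*)<\s*/tr\s*>|gi;

# Two stages: grab each row's contents, then filter on the keyword.
my @two_stage = grep { /\bISTF\b/ }
                $text =~ m|<\s*tr\s*>(.*?)<\s*/tr\s*>|gis;

printf "one pass:  %d row(s)\n", scalar @one_pass;
printf "two stage: %d row(s)\n", scalar @two_stage;
```

With this input the one-pass pattern finds only the second row, while the two-stage version finds both.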
Re: How do I extract all text around a keyword between two symbols? by Agermain (Scribe) on Aug 01, 2001 at 00:23 UTC
I'm no regex master, but if it were me I'd do it in two stages: `for ($text =~ /<tr>(.*?)<\/tr>/gmsi){ $selected=$_; if ($selected =~ /ISTF/is) { $selected_text = $selected ; ## Your code here } }` andre germain "Wherever you go, there you are."