Re: Text Parse Question with RegEx

Trying to parse HTML with regexen will make your life unhappy. Instead use one of the many modules designed to do the job. I recommend HTML::TreeBuilder, although for dealing with tables you might also look at HTML::TableContentParser, HTML::TableParser and a slew of other HTML table munging modules.

Consider:

use strict;
use warnings;
use HTML::TreeBuilder;

# original my $var stuff here

my $tree = HTML::TreeBuilder->new;
my @lines;

$tree->parse ($var);
$tree->eof (); # tell the parser there is no more input so the tree is complete

push @lines, $_->as_text () . "\n" for $tree->find ('tr');

print join "\n", @lines[2..4];

Prints:

 19th Ave and Eighth Street S.E. 

 Boser and Liker Trail S.E. 

 Lambert and Jerry Drive S.E.

DWIM is Perl's answer to Gödel

Re^2: Text Parse Question with RegEx by Anonymous Monk on Oct 23, 2006 at 22:02 UTC
The problem is that this data is contained on many different pages and will show up in different rows. The HTML also varies around the rows I want. The only things I can parse by are that 'and' is in the row and that the row's HTML doesn't have a `<br>` in it.
Re^3: Text Parse Question with RegEx by GrandFather (Saint) on Oct 23, 2006 at 22:07 UTC
So the following is probably what you want:

# ... as for first sample
my $tree = HTML::TreeBuilder->new;

$tree->parse ($var);
$tree->eof (); # tell the parser there is no more input

for ($tree->find ('tr')) {
    next unless $_->as_text () =~ /\band\b/; # row must contain the word 'and'
    next if $_->find ('br');                 # skip rows containing a <br>
    print $_->as_text () . "\n";
}

Prints:

19th Ave and Eighth Street S.E.
Boser and Liker Trail S.E.
Lambert and Jerry Drive S.E.
Re^3: Text Parse Question with RegEx by blazar (Canon) on Oct 24, 2006 at 13:05 UTC
"The problem is that this data is contained on many different pages and will show up in different rows. The HTML also varies around the rows I want."

That is one more reason not to use regexen and to adopt a solution based on a proper HTML parser instead, just as GrandFather suggested.
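
For what it's worth, here is a minimal sketch of how the parser-based filter discussed in this thread might be wrapped up for use across many pages. The sub name `matching_rows` and the sample pages in `@pages` are made up for illustration; only the rule itself (the row contains the word 'and' and has no `<br>` in it) comes from the thread. It assumes each page's HTML is already in a string.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the original posts): return the text of
# every <tr> that contains the word 'and' and has no <br> element in it.
sub matching_rows {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content ($html);
    my @rows;

    for my $tr ($tree->find ('tr')) {
        my $text = $tr->as_text ();

        next unless $text =~ /\band\b/; # must contain the word 'and'
        next if $tr->find ('br');       # must not contain a <br>

        $text =~ s/^\s+|\s+$//g;        # trim surrounding whitespace
        push @rows, $text;
    }

    $tree->delete (); # free the tree explicitly once we are done with it
    return @rows;
}

# Made-up sample pages; the surrounding HTML differs from page to page.
my @pages = (
    '<table><tr><td>Closed<br>roads</td></tr>'
        . '<tr><td> 19th Ave and Eighth Street S.E. </td></tr></table>',
    '<div><table><tr><th>Updated and<br>checked</th></tr>'
        . '<tr><td>Boser and Liker Trail S.E.</td></tr></table></div>',
);

print "$_\n" for map { matching_rows ($_) } @pages;
```

Because the filter works on the parsed tree rather than on the raw markup, the HTML around the wanted rows can vary from page to page without the sub needing to change.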