Text Parse Question with RegEx

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to match the text in the bottom three HTML rows. The first row also matches, but it captures nothing and adds an empty list element to the @results array.

1. Why is the first HTML row matching?

It has <br> in there and it's not specified in my regex.

2. How do I only match text with 'and' in there? @results should have only the three last rows and three list elements.

@results should look like:

$VAR1 = '19th Ave and Eighth Street S.E.';
$VAR2 = 'Boser and Liker Trail S.E.';
$VAR3 = 'Lambert and Jerry Drive S.E.';

My Try:

#!/usr/bin/perl -w

use strict;
use Data::Dumper;


my $var = qq(         <tr>
      <td width="651" height="48"><br>
      This community is grat and
      has lots to offer.</td>
    </tr>
    <tr>
      <td width="651" height="16">Southeast</td>
    </tr>
    <tr>
      <td width="351" height="9">&nbsp;19th Ave and Eighth Street S.E.
+</td>
    </tr>
    <tr>
      <td width="351" height="6">&nbsp;Boser and Liker Trail S.E.</td>
    </tr>
    <tr>
      <td width="351" height="6">&nbsp;Lambert and Jerry Drive S.E.&nb
+sp;</td>
    </tr>
);


my @results = $var =~ m/>[&nbsp;]?(.*?)[&nbsp;]?</g;


print Dumper @results;

Re: Text Parse Question with RegEx by GrandFather (Saint) on Oct 23, 2006 at 21:54 UTC
Trying to parse HTML with regexen will make your life unhappy. Instead, use one of the many modules designed to do the job. I recommend HTML::TreeBuilder, although for dealing with tables you might also look at HTML::TableContentParser, HTML::TableParser and a slew of other HTML table munging modules. Consider:

use strict;
use warnings;
use HTML::TreeBuilder;

# original my $var stuff here

my $tree = HTML::TreeBuilder->new;
my @lines;

$tree->parse ($var);
push @lines, $_->as_text () . "\n" for $tree->find ('tr');
print join "\n", @lines[2..4];

Prints:

19th Ave and Eighth Street S.E.
Boser and Liker Trail S.E.
Lambert and Jerry Drive S.E.

DWIM is Perl's answer to Gödel
Re^2: Text Parse Question with RegEx by Anonymous Monk on Oct 23, 2006 at 22:02 UTC
The problem is that this data is contained on many different pages and will show up in different rows. The HTML also varies around the rows I want. The only thing I can parse by is that 'and' is in the row and the row's HTML doesn't have a `<br>` in it.
Re^3: Text Parse Question with RegEx by GrandFather (Saint) on Oct 23, 2006 at 22:07 UTC
So the following is probably what you want:

# ... as for first sample

my $tree = HTML::TreeBuilder->new;

$tree->parse ($var);
for ($tree->find ('tr')) {
    next unless $_->as_text () =~ /\band\b/;
    next if $_->find ('br');
    print $_->as_text () . "\n";
}

Prints:

19th Ave and Eighth Street S.E.
Boser and Liker Trail S.E.
Lambert and Jerry Drive S.E.

DWIM is Perl's answer to Gödel
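If the goal is the @results array from the question rather than printed lines, the same filter can collect into a list. The following is only a sketch under a few assumptions: the sample HTML is a trimmed-down stand-in for the original $var (wrapped in a `<table>` here), the explicit `eof` and `delete` calls are additions, and the trimming step assumes the parser hands back `&nbsp;` decoded as the character \xA0.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;

# Trimmed-down stand-in for the question's $var (an assumption, not the real pages).
my $var = <<'HTML';
<table>
  <tr><td width="651" height="48"><br>
      This community is great and
      has lots to offer.</td></tr>
  <tr><td width="651" height="16">Southeast</td></tr>
  <tr><td width="351" height="9">&nbsp;19th Ave and Eighth Street S.E.</td></tr>
  <tr><td width="351" height="6">&nbsp;Boser and Liker Trail S.E.</td></tr>
  <tr><td width="351" height="6">&nbsp;Lambert and Jerry Drive S.E.&nbsp;</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($var);
$tree->eof;    # tell the parser that the input is complete

my @results;
for my $row ($tree->find('tr')) {
    next if $row->find('br');                # skip rows that contain a <br>
    my $text = $row->as_text;
    next unless $text =~ /\band\b/;          # keep rows with the word "and"
    $text =~ s/\A[\s\xA0]+|[\s\xA0]+\z//g;   # trim whitespace and decoded &nbsp;
    push @results, $text;
}
$tree->delete;    # release the parse tree

print Dumper \@results;
```

With this sample, @results should end up holding the three street intersections with no stray non-breaking spaces at either end.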
Re^3: Text Parse Question with RegEx by blazar (Canon) on Oct 24, 2006 at 13:05 UTC
"The problem is that this data is contained on many different pages and will show up in different rows. The HTML also varies around the rows I want."

One more reason not to use regexen and to adopt a solution based on a proper HTML parser instead, just as GrandFather suggested.
Re: Text Parse Question with RegEx by eff_i_g (Curate) on Oct 23, 2006 at 22:14 UTC
GrandFather's approach is the way to go. Here's what your regex is doing:

`[&nbsp;]` is really "&" or "n" or "b" or "s" or "p" or ";", since it is inside a character class. This should have been obvious, since Dumper shows you "nbsp;" even though it wasn't inside the capturing parentheses, and it mysteriously left off the "&"!

Also, since everything is optional (you're using two ?'s and a .*?), your minimal match is nothing at all in between the tags, which succeeds.

P.S. Pass a reference to Dumper: `print Dumper \@results;`
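To make the two points above concrete, here is a small core-Perl sketch. The sample HTML is a shortened stand-in for the question's data, and the "fixed" pattern is only an illustration that assumes each wanted `<td>` holds plain text with no nested tags; it is not a substitute for the parser-based approach.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# A character class matches ONE character from its set, so this is
# "optionally one of & n b s p ;" and not the literal entity.
my $class = qr/[&nbsp;]?/;
print "class matches a lone 'n'\n" if 'n' =~ /\A$class\z/;

# A non-capturing group makes the whole entity optional instead.
my $entity = qr/(?:&nbsp;)?/;
print "group rejects a lone 'n'\n" unless 'n' =~ /\A$entity\z/;

# Shortened stand-in for the question's data (an assumption).
my $html = <<'HTML';
<tr>
  <td><br>
  This community is great and
  has lots to offer.</td>
</tr>
<tr>
  <td>Southeast</td>
</tr>
<tr>
  <td>&nbsp;Boser and Liker Trail S.E.</td>
</tr>
<tr>
  <td>&nbsp;Lambert and Jerry Drive S.E.&nbsp;</td>
</tr>
HTML

# The pattern from the question: every piece is optional, so "><" matches
# with an empty capture, and the class eats only the "&" of the entity.
my @original = $html =~ m/>[&nbsp;]?(.*?)[&nbsp;]?</g;
print Dumper \@original;

# Illustrative fix: require the word "and" and forbid tags inside the cell.
my @fixed = $html =~ m{
    <td[^>]*>                  # opening td tag with any attributes
    \s*(?:&nbsp;)*             # optional leading entity, grouped
    ( [^<]*? \band\b [^<]*? )  # tag-free text containing the word "and"
    (?:&nbsp;)*\s*             # optional trailing entity
    </td>
}gx;
print Dumper \@fixed;
```

On this sample the original pattern returns an empty string for the `<td><br>` row and captures that begin with "nbsp;", while the illustrative pattern should return only the two street rows. The row with the `<br>` drops out because `[^<]` cannot cross a tag.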