Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to match the text in the bottom three HTML rows. The first row also matches, but it captures nothing and adds an empty element to the @results array.

1. Why is the first HTML row matching?

It has <br> in there and it's not specified in my regex.

2. How do I match only text containing 'and'? @results should hold exactly three elements, one for each of the last three rows.

@results should look like:
$VAR1 = '19th Ave and Eighth Street S.E.';
$VAR2 = 'Boser and Liker Trail S.E.';
$VAR3 = 'Lambert and Jerry Drive S.E.';

My Try:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my $var = qq(
<tr>
<td width="651" height="48"><br>
This community is grat and has lots to offer.</td>
</tr>
<tr>
<td width="651" height="16">Southeast</td>
</tr>
<tr>
<td width="351" height="9">&nbsp;19th Ave and Eighth Street S.E.</td>
</tr>
<tr>
<td width="351" height="6">&nbsp;Boser and Liker Trail S.E.</td>
</tr>
<tr>
<td width="351" height="6">&nbsp;Lambert and Jerry Drive S.E.&nbsp;</td>
</tr>
);

my @results = $var =~ m/>[&nbsp;]?(.*?)[&nbsp;]?</g;
print Dumper @results;

Replies are listed 'Best First'.
Re: Text Parse Question with RegEx
by GrandFather (Saint) on Oct 23, 2006 at 21:54 UTC

    Trying to parse HTML with regexen will make your life unhappy. Instead use one of the many modules designed to do the job. I recommend HTML::TreeBuilder, although for dealing with tables you might also look at HTML::TableContentParser, HTML::TableParser and a slew of other HTML table munging modules.

    Consider:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # original my $var stuff here

    my $tree = HTML::TreeBuilder->new;
    my @lines;

    $tree->parse ($var);
    push @lines, $_->as_text () . "\n" for $tree->find ('tr');
    print join "\n", @lines[2..4];

    Prints:

    19th Ave and Eighth Street S.E.
    Boser and Liker Trail S.E.
    Lambert and Jerry Drive S.E.

    DWIM is Perl's answer to Gödel
      The problem is that this data is contained on many different pages and will show up in different rows. The HTML also varies around the rows I want.

      The only thing the data gives me to parse by is that 'and' appears in the row and the row's HTML doesn't contain a <br>.

        So the following is probably what you want:

        # ... as for first sample

        my $tree = HTML::TreeBuilder->new;

        $tree->parse ($var);

        for ($tree->find ('tr')) {
            next unless $_->as_text () =~ /\band\b/;
            next if $_->find ('br');
            print $_->as_text () . "\n";
        }

        Prints:

        19th Ave and Eighth Street S.E.
        Boser and Liker Trail S.E.
        Lambert and Jerry Drive S.E.

        DWIM is Perl's answer to Gödel
        The problem is that this data is contained on many different pages and will show up in different rows. The HTML also varies around the rows I want.

        One more reason not to use regexen and to adopt a solution based on a proper HTML parser instead, just as GrandFather suggested.

Re: Text Parse Question with RegEx
by eff_i_g (Curate) on Oct 23, 2006 at 22:14 UTC
    GrandFather's approach is the way to go.

    Here's what your regex is doing:

    [&nbsp;] is really "&" or "n" or "b" or "s" or "p" or ";" since it is inside a character class, so it matches just one of those characters rather than the whole entity. This should have been obvious since Dumper shows you "nbsp;", even though you meant it to stay outside the capturing parentheses, and it mysteriously left off the "&"!

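    Also, since everything is optional (you're using two ?'s and a .*?), your minimal match is nothing at all in between the tags, which succeeds. That is why the first row, where the <td ...> tag is followed directly by <br>, puts an empty element into @results.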
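
To make the two points above concrete, here is a minimal sketch. The one-line sample strings and the variable names are assumptions made up for illustration, modelled on the rows in the question. The (?:&nbsp;) alternative is shown only to contrast it with the character class; it is not a recommendation to parse HTML with a regex.

```perl
use strict;
use warnings;

# Hypothetical one-line samples modelled on the rows in the question.
my $empty_cell = '<td width="651" height="48"><br>';
my $data_cell  = '<td width="351" height="9">&nbsp;19th Ave and Eighth Street S.E.</td>';

# The original pattern: [&nbsp;] is a character class that matches
# ONE character out of & n b s p ; (not the entity as a whole).
my $class_re = qr/>[&nbsp;]?(.*?)[&nbsp;]?</;

# A literal alternative: (?:&nbsp;) treats the six characters as one unit.
my $literal_re = qr/>(?:&nbsp;)?(.*?)(?:&nbsp;)?</;

# Every part of the pattern is optional, so ">" followed directly by "<"
# matches and captures an empty string.
my ($empty_capture) = $empty_cell =~ $class_re;

# The class consumes only the "&", so "nbsp;" ends up inside the capture.
my ($class_capture) = $data_cell =~ $class_re;

# The grouped literal consumes the whole entity before the capture starts.
my ($literal_capture) = $data_cell =~ $literal_re;

print "empty cell, class:  '$empty_capture'\n";
print "data cell, class:   '$class_capture'\n";
print "data cell, literal: '$literal_capture'\n";
```

The empty capture from the first sample is the empty list element described in the question, and the second capture shows where the stray "nbsp;" comes from.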

    P.S. Pass a reference to Dumper: print Dumper \@results;
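
A small sketch of the difference the P.S. refers to, assuming a two-element stand-in for @results (the data here is illustrative, not the output of the original script):

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical data standing in for the @results array from the question.
my @results = ('19th Ave and Eighth Street S.E.', 'Boser and Liker Trail S.E.');

# List argument: each element is dumped separately as $VAR1, $VAR2, ...
my $as_list = Dumper(@results);

# Reference argument: the whole array is dumped as a single $VAR1 = [ ... ];
my $as_ref = Dumper(\@results);

print "List:\n$as_list\nReference:\n$as_ref";
```

Dumping the reference keeps the array in one structure, which makes empty or unexpected elements easier to spot.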