in reply to Text Parse Question with RegEx

Trying to parse HTML with regexen will make your life unhappy. Instead, use one of the many modules designed for the job. I recommend HTML::TreeBuilder, although for dealing with tables you might also look at HTML::TableContentParser, HTML::TableParser and a slew of other HTML table munging modules.

Consider:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # original my $var stuff here

    my $tree = HTML::TreeBuilder->new;
    my @lines;

    $tree->parse ($var);
    $tree->eof ();    # signal end of input so the tree is completed

    # Collect the text of every table row
    push @lines, $_->as_text () . "\n" for $tree->find ('tr');
    print join "\n", @lines[2..4];

Prints:

    19th Ave and Eighth Street S.E.
    Boser and Liker Trail S.E.
    Lambert and Jerry Drive S.E.

DWIM is Perl's answer to Gödel

Re^2: Text Parse Question with RegEx
by Anonymous Monk on Oct 23, 2006 at 22:02 UTC
    The problem is that this data is contained on many different pages and will show up in different rows. The HTML also varies around the rows I want.

    The only thing the data has that I can parse by is that 'and' is in the row and the row's HTML doesn't have a <br> in it.

      So the following is probably what you want:

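          # ... as for first sample

          my $tree = HTML::TreeBuilder->new;

          $tree->parse ($var);
          $tree->eof ();    # signal end of input so the tree is completed

          for ($tree->find ('tr')) {
              next unless $_->as_text () =~ /\band\b/;    # row must contain the word 'and'
              next if $_->find ('br');                    # skip rows containing a <br>
              print $_->as_text () . "\n";
          }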

      Prints:

          19th Ave and Eighth Street S.E.
          Boser and Liker Trail S.E.
          Lambert and Jerry Drive S.E.

      DWIM is Perl's answer to Gödel
      The problem is that this data is contained on many different pages and will show up in different rows. The HTML also varies around the rows I want.

      One more reason to avoid regexen and adopt a solution based on a proper HTML parser instead, just as GrandFather suggested.
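
To illustrate the point about varying pages, here is a minimal sketch that applies the same 'and'/no-<br> filter to two pages with different surrounding markup. The helper name intersection_rows and both HTML snippets are made up for this example and are not from the original data; it assumes HTML::TreeBuilder is installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the thread): returns the text of every
# <tr> that contains the word 'and' and has no <br> inside it.
sub intersection_rows {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content ($html);
    my @rows;

    for my $tr ($tree->find ('tr')) {
        next unless $tr->as_text () =~ /\band\b/;
        next if $tr->find ('br');
        push @rows, $tr->as_text ();
    }

    $tree->delete ();    # release the parse tree
    return @rows;
}

# Two invented pages: the wanted row sits in a different position and
# is wrapped in different markup on each.
my $page_a = '<table><tr><td>Header<br>line</td></tr>'
           . '<tr><td>Boser and Liker Trail S.E.</td></tr></table>';
my $page_b = '<table border="1">'
           . '<tr class="x"><td><b>Lambert and Jerry Drive S.E.</b></td></tr>'
           . '<tr><td>Sand and gravel<br>notes</td></tr></table>';

print "$_\n" for intersection_rows ($page_a), intersection_rows ($page_b);
```

Because the filter works on the parsed tree rather than on the raw text, attributes, nesting and row position do not need to be anticipated in a pattern.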