Re: regexp text parsing issue.

It looks like the first reply actually provided a working solution. But here's another one, based on the fact that you control the code that generates the html in the first place, and you seem to know for sure that:

# each table consists of three lines:
#  the <table> tag (nothing else) on one line
#  all <tr><td>content</td>...</tr> (nothing else) on the next line
#  the </table> tag (nothing else) on the third line

my $Txt = '';   # accumulates the (possibly modified) HTML

while (<>) {
   if ( /<tr>/ ) {
       s/<br>/\n/gi unless ( /id="CODE"/ );
   }
   $Txt .= $_;
}

I put it that way, because based on the code you posted, it seemed like you weren't actually "extracting" all the tables -- you were just changing "br" tags to line-feeds within tables when the table data did not include 'id="CODE"'.

If you really wanted to isolate each instance of table data from the rest of the html, it would be easy enough to add a few extra steps to that sort of loop, in order to do whatever you want with the table lines.
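Here is a minimal sketch of those "few extra steps", assuming the three-line table layout described above. The sub name process_lines and the sample data are made up for illustration; the loop takes a list of lines instead of reading from <> so that it can be run on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: walks the HTML line by line, converts <br> to "\n"
# in table rows that lack id="CODE", and also collects each row line.
# Returns the full text and a reference to the list of row lines.
sub process_lines {
    my @lines = @_;
    my ( $txt, @tables ) = ('');
    my $in_table = 0;
    for my $line (@lines) {
        if ( $line =~ /^<table\b[^>]*>\s*$/i ) {
            $in_table = 1;                  # opening tag on its own line
        }
        elsif ( $line =~ m{^</table>\s*$}i ) {
            $in_table = 0;                  # closing tag on its own line
        }
        elsif ( $in_table and $line =~ /<tr>/i ) {
            $line =~ s/<br>/\n/gi unless $line =~ /id="CODE"/;
            push @tables, $line;            # isolate the table data
        }
        $txt .= $line;
    }
    return ( $txt, \@tables );
}

my ( $demo_txt, $demo_tables ) = process_lines(
    "<table>\n", "<tr><td>a<br>b</td></tr>\n", "</table>\n" );
printf "%d table row line(s) collected\n", scalar @$demo_tables;
```

Anything outside a table passes through untouched, so a <br> in ordinary body text is left alone.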

The point is, since you are writing the input data, and can easily enforce a specific formatting style on the HTML text (or you've done this already), then take full advantage of that style in order to simplify down-stream processing. Even if the format you created is different from what I assumed above, it should be easy to work out appropriate rules for "parsing" on a line-by-line basis that will do the right thing.

But as I said, the first reply is also usable. It's shorter, but you have to be familiar with Perl's "flip-flop" operator in order to understand it.
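For anyone who hasn't met the flip-flop before, here is a small sketch of the idiom. This is not the code from the first reply, just an illustration; the sub name table_lines and the sample lines are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: returns only the lines from each <table...> line
# through the following </table> line, inclusive.
sub table_lines {
    my @out;
    for my $line (@_) {
        # In scalar context ".." is the flip-flop: false until the left
        # test succeeds, then true up to and including the line on which
        # the right test succeeds.
        push @out, $line if $line =~ /<table/i .. $line =~ m{</table>}i;
    }
    return @out;
}

my @demo = table_lines(
    "intro\n", "<table>\n", "<tr><td>a</td></tr>\n", "</table>\n", "outro\n" );
print @demo;    # prints the three table lines only
```

One thing to keep in mind: the operator remembers its on/off state between iterations (that is the whole point), so input that ends in the middle of a table leaves it "on".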

(Personally, if I'm deliberately going to avoid using HTML::TokeParser or a similar module for this sort of task, I'd prefer to use a coding style that is more explicit / less obscure, in the sense of using constructs that are relatively basic and more familiar to a broader range of programmers, rather than the more esoteric perl idioms, like flip-flop operators or whole-document regex substitutions that include executable code with embedded substitutions.)

BTW, the reason the OP code was doing the wrong thing (taking all tables together as a single table) was that you were using the "s" modifier on the outer regex, so the ".*" of "(.*|\n)" could match across newlines and, being greedy, matched everything from the first <table to the last </table> at once.
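To see the greedy behaviour in isolation, here is a small sketch on made-up sample data (the $html string is invented, and the patterns are simplified from the OP's):

```perl
use strict;
use warnings;

my $html = "<table>\n<tr><td>one</td></tr>\n</table>\n<p>text</p>\n"
         . "<table>\n<tr><td>two</td></tr>\n</table>\n";

# Greedy: with /s, ".*" runs to the end of the string and backtracks
# only as far as the LAST </table>, so both tables land in one match.
my @greedy = $html =~ m{(<table.*?>.*</table>)}gs;

# Non-greedy: ".*?" stops at the FIRST </table> it can reach.
my @lazy = $html =~ m{(<table.*?>.*?</table>)}gs;

printf "greedy: %d match(es), non-greedy: %d match(es)\n",
    scalar @greedy, scalar @lazy;
```

This prints "greedy: 1 match(es), non-greedy: 2 match(es)"; the single greedy match also swallows the paragraph between the two tables.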


Replies are listed 'Best First'.

Re^2: regexp text parsing issue.
by Anonymous Monk on Mar 19, 2005 at 19:45 UTC

Thanks, graff, for getting me to see my mistake. Although I did not use your suggestion, your attention to the single-line matching modifier and greedy mode got me thinking about my syntax. The 's' modifier was not the culprit in this case, but I can see how it could be in others.

My issue was the location of the question mark (?) in the pattern. I originally had it outside the parentheses of capture group $2, so the match was not stopping at each </table> tag. All I had to do was move the non-greedy indicator inside group $2, right after the '.*', like so: '.*?|\n'.
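A small sketch of why the position matters, run on made-up sample data. The helper name convert_br is invented, and the "outside" pattern is only my reading of the description above, since the original pattern isn't quoted in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the <br>-to-newline substitution using
# whichever table pattern is passed in.
sub convert_br {
    my ( $txt, $re ) = @_;
    $txt =~ s{$re}{
        my ( $first, $td, $last ) = ( $1, $2, $3 );
        $td =~ s/<br>/\n/gi unless $td =~ /id="CODE"/;
        "$first$td$last";
    }ge;
    return $txt;
}

my $sample = qq{<table>\n<tr><td>a<br>b</td></tr>\n</table>\n}
           . qq{<table>\n<tr><td id="CODE">c<br>d</td></tr>\n</table>\n};

# "?" outside the group: the group is merely optional and ".*" stays
# greedy, so both tables are captured as one block.
my $outside = convert_br( $sample, qr{(<table.*?>)(.*|\n)?(</table>)}is );

# "?" right after ".*": non-greedy, so each table is handled separately.
my $inside = convert_br( $sample, qr{(<table.*?>)(.*?|\n)(</table>)}is );

print $outside eq $sample ? "outside: nothing changed\n" : "outside: changed\n";
print $inside  eq $sample ? "inside: nothing changed\n"  : "inside: changed\n";
```

With the "?" outside, the single combined match contains id="CODE", so nothing gets converted at all. With it inside, the first table gets its line-feed and the CODE table is left alone.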

Now it works beautifully!

In the end the code that works is:

$Txt =~ s!(<table.*?>)(.*?|\n)(</table>)!
   my $TD = $2;
   my $first = $1;
   my $last = $3;
   unless ($TD =~ /id="CODE"/) {$TD =~ s#<br>#\n#isg;}
   "$first$TD$last"!eisg;