in reply to regexp text parsing issue.
I put it that way, because based on the code you posted, it seemed like you weren't actually "extracting" all the tables -- you were just changing "br" tags to line-feeds within tables when the table data did not include 'id="CODE"'.

    # each table consists of three lines:
    #  the <table> tag (nothing else) on one line
    #  all <tr><td>content</td>...</tr> (nothing else) on the next line
    #  the </table> tag (nothing else) on the third line

    my ($open,$content);
    my $Txt = '';
    while (<>) {
        if ( /<tr>/ ) {
            s/<br>/\n/gi unless ( /id="CODE"/ );
        }
        $Txt .= $_;
    }
If you really wanted to isolate each instance of table data from the rest of the html, it would be easy enough to add a few extra steps to that sort of loop, in order to do whatever you want with the table lines.
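As a rough sketch of those "extra steps", assuming the three-line table layout described above: the sub name extract_tables, the $in_table flag and the sample data are made up for illustration and are not from the OP's code.

```perl
use strict;
use warnings;

# Collect the content of each table separately, assuming every table is
# laid out as: <table> line, content line(s), </table> line.
sub extract_tables {
    my @lines = @_;
    my @tables;           # one string per table found
    my $in_table = 0;     # explicit state flag (hypothetical name)
    my $current  = '';

    for my $line (@lines) {
        if ( $line =~ /^<table\b/i ) {
            $in_table = 1;
            $current  = '';
        }
        elsif ( $line =~ m{^</table>}i ) {
            push @tables, $current if $in_table;
            $in_table = 0;
        }
        elsif ($in_table) {
            # same rule as before: leave id="CODE" content untouched
            my $copy = $line;
            $copy =~ s/<br>/\n/gi unless $copy =~ /id="CODE"/;
            $current .= $copy;
        }
    }
    return @tables;
}

# Made-up sample input, just to exercise the sub.
my @html = (
    "<p>intro</p>\n",
    "<table>\n",
    "<tr><td>a<br>b</td></tr>\n",
    "</table>\n",
    "<table>\n",
    "<tr><td id=\"CODE\">x<br>y</td></tr>\n",
    "</table>\n",
);
my @tables = extract_tables(@html);
printf "%d tables found\n", scalar @tables;
```

To run it on real input instead of the sample, something like extract_tables(<>) should work, since <> in list context returns all the input lines.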
The point is, since you are writing the input data, and can easily enforce a specific formatting style on the HTML text (or you've done this already), then take full advantage of that style in order to simplify down-stream processing. Even if the format you created is different from what I assumed above, it should be easy to work out appropriate rules for "parsing" on a line-by-line basis that will do the right thing.
But as I said, the first reply is also usable. It's shorter, but you have to be familiar with perl's "flip-flop" operator in order to understand it.
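For anyone who hasn't met the flip-flop operator, here is a minimal illustration. This is not the code from the first reply, just a sketch of the idiom; the sub name table_lines and the sample lines are invented.

```perl
use strict;
use warnings;

# Keep only the lines from an opening <table> tag through its </table> tag.
sub table_lines {
    my @kept;
    for (@_) {
        # In scalar (boolean) context, ".." is the flip-flop operator:
        # it turns true when the left pattern matches, and stays true
        # up to and including the line where the right pattern matches.
        push @kept, $_ if /<table\b/i .. m{</table>}i;
    }
    return @kept;
}

# Made-up sample input.
my @kept = table_lines(
    "<p>before</p>\n",
    "<table>\n",
    "<tr><td>1</td></tr>\n",
    "</table>\n",
    "<p>after</p>\n",
);
print @kept;
```

The operator remembers its on/off state between evaluations, which is exactly what makes it compact and also what makes it puzzling the first time you see it.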
(Personally, if I'm deliberately going to avoid using HTML::TokeParser or a similar module for this sort of task, I'd prefer to use a coding style that is more explicit / less obscure, in the sense of using constructs that are relatively basic and more familiar to a broader range of programmers, rather than the more esoteric perl idioms, like flip-flop operators or whole-document regex substitutions that include executable code with embedded substitutions.)
BTW, the reason the OP code was doing the wrong thing (taking all tables together as a single table) was that you were using the "s" modifier on the outer regex, which lets "." match newlines, so the greedy ".*" of "(.*|\n)" was matching everything from the first <table to the last </table> at once.
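The greedy behaviour is easy to see in isolation. The pattern below is a simplified stand-in, not the OP's actual regex, and the sample HTML is made up:

```perl
use strict;
use warnings;

my $html = "<table>\n<tr><td>one</td></tr>\n</table>\n"
         . "<p>between</p>\n"
         . "<table>\n<tr><td>two</td></tr>\n</table>\n";

# Greedy: with /s, "." matches newlines too, so ".*" runs on to the
# LAST </table> in the string and both tables come back as one match.
my @greedy = $html =~ m{(<table.*</table>)}gs;

# Non-greedy: ".*?" stops at the first </table> it can reach,
# so each table is matched separately.
my @lazy = $html =~ m{(<table.*?</table>)}gs;

printf "greedy: %d match(es), non-greedy: %d match(es)\n",
    scalar @greedy, scalar @lazy;
```

So if you do stay with a whole-document regex, a "?" after the quantifier is the usual way to keep each table separate.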
Re^2: regexp text parsing issue.
by Anonymous Monk on Mar 19, 2005 at 19:45 UTC