It looks like the first reply actually provided a working solution. But here's another one, based on the fact that you control the code that generates the html in the first place, and you seem to know for sure that:
    # each table consists of three lines:
    #  the <table> tag (nothing else) on one line
    #  all <tr><td>content</td>...</tr> (nothing else) on the next line
    #  the </table> tag (nothing else) on the third line

    my ($open,$content);
    my $Txt = '';
    while (<>) {
        if ( /<tr>/ ) {
            s/<br>/\n/gi unless ( /id="CODE"/ );
        }
        $Txt .= $_;
    }
I put it that way, because based on the code you posted, it seemed like you weren't actually "extracting" all the tables -- you were just changing "br" tags to line-feeds within tables when the table data did not include 'id="CODE"'.

If you really wanted to isolate each instance of table data from the rest of the html, it would be easy enough to add a few extra steps to that sort of loop, in order to do whatever you want with the table lines.
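Something along these lines might do for the "extra steps". It is only a sketch, assuming the three-line table layout described above; the helper name split_tables and its return values are made up for illustration, and unlike the simpler loop it only touches <tr> lines that sit between a <table> line and a </table> line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the OP's code): takes the HTML as a list
# of lines, returns the rewritten text plus a reference to the list of
# row lines that were found inside tables.
sub split_tables {
    my @lines = @_;                 # work on a copy of the input lines
    my ( $txt, @rows ) = ('');
    my $in_table = 0;
    for my $line (@lines) {
        if    ( $line =~ /^<table\b/i )  { $in_table = 1 }
        elsif ( $line =~ /^<\/table>/i ) { $in_table = 0 }
        elsif ( $in_table and $line =~ /<tr>/i ) {
            # same rule as before: leave id="CODE" rows alone
            $line =~ s/<br>/\n/gi unless $line =~ /id="CODE"/;
            push @rows, $line;      # keep the table data separately
        }
        $txt .= $line;
    }
    return ( $txt, \@rows );
}
```

Calling it as my ($txt, $rows) = split_tables(<>); would read all input lines, and @$rows would then hold just the table data for whatever further processing you have in mind.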

The point is, since you are writing the input data, and can easily enforce a specific formatting style on the HTML text (or you've done this already), then take full advantage of that style in order to simplify down-stream processing. Even if the format you created is different from what I assumed above, it should be easy to work out appropriate rules for "parsing" on a line-by-line basis that will do the right thing.

But as I said, the first reply is also usable. It's shorter, but you have to be familiar with Perl's "flip-flop" operator in order to understand it.
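For anyone who hasn't met the flip-flop (range) operator in scalar context, here is a minimal sketch of how it can mark the lines between a <table> tag and its </table> tag. This is my own illustration, not necessarily what the first reply wrote; the sub name convert_br_in_tables is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: converts <br> to newlines only on lines that fall
# between a <table ...> line and the following </table> line (inclusive).
sub convert_br_in_tables {
    my @lines = @_;                 # copy, so the caller's data is untouched
    my $out = '';
    for (@lines) {
        # In scalar context, ".." turns true when the left pattern matches
        # and stays true up to and including the line where the right
        # pattern matches.
        if ( /<table\b/i .. /<\/table>/i ) {
            s/<br>/\n/gi unless /id="CODE"/;
        }
        $out .= $_;
    }
    return $out;
}
```

Note that the operator keeps its on/off state between iterations, which is exactly what makes it compact and also what makes it puzzling to readers who don't know the idiom.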

(Personally, if I'm deliberately going to avoid using HTML::TokeParser or a similar module for this sort of task, I'd prefer to use a coding style that is more explicit / less obscure, in the sense of using constructs that are relatively basic and more familiar to a broader range of programmers, rather than the more esoteric perl idioms, like flip-flop operators or whole-document regex substitutions that include executable code with embedded substitutions.)

BTW, the reason the OP code was doing the wrong thing (taking all tables together as a single table) was that you were using the "s" modifier on the outer regex, so "." could match newlines and the greedy ".*" of "(.*|\n)" was matching everything from the first <table to the last </table> at once.
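To see that effect in isolation, here is a small sketch with made-up sample input (not the OP's data or regex). It uses the /s modifier so that "." matches newlines, and compares a greedy ".*" with a non-greedy ".*?".

```perl
use strict;
use warnings;

# Made-up sample: two separate tables with other text in between.
my $html = "<table>\nA\n</table>\nmiddle\n<table>\nB\n</table>\n";

# Greedy: ".*" runs to the end of the string and backtracks only as far
# as the LAST </table>, so both tables come back as one match.
my @greedy = $html =~ /(<table.*<\/table>)/gs;

# Non-greedy: ".*?" stops at the FIRST </table> it can reach, so each
# table is matched on its own.
my @lazy = $html =~ /(<table.*?<\/table>)/gs;

printf "greedy: %d match(es)\n", scalar @greedy;   # prints "greedy: 1 match(es)"
printf "lazy:   %d match(es)\n", scalar @lazy;     # prints "lazy:   2 match(es)"
```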


In reply to Re: regexp text parsing issue. by graff
in thread regexp text parsing issue. by Anonymous Monk
