Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to construct a regexp that will do some html parsing for me. I also do not want to use any HTML parsing modules.

What I am trying to do is extract all tables from the source code and parse the data inside the table. I have been able to successfully do this with the following code.

$Txt =~ s!(<table.*?>)(.*|\n)?(</table>)!
    my $TD = $2;
    my $first = $1;
    my $last = $3;
    unless ($TD =~ /id="CODE"/) {
        $TD =~ s#<br>#\n#isg;
    }
    "$first$TD$last"
!eisg;

However, if I have two or more tables in the source (which is very common), I only get one single match that encompasses all the tables. It is as if it sees the first opening <table> tag and then ignores all the other closing and opening tags until it gets to the last closing </table> tag.

For example:

<table>
<tr><td>First Table</td</tr>
</table>
<table>
<tr><td>Second Table</td</tr>
</table>
<table>
<tr><td>Third Table</td</tr>
</table>

This should parse as three separate matches, but it is only recognized as a single match. I know this has to be a simple thing I am overlooking, but I just cannot see it. Any ideas?

Replies are listed 'Best First'.
Re: regexp text parsing issue.
by sh1tn (Priest) on Mar 19, 2005 at 01:12 UTC
    You may want to see perldoc perlre or at least perldoc perlrequick.
    ...
    my $reg = { 'open', qr{<table>}, 'close', qr{</table>} };

    while( <DATA> ){
        if(m#$reg->{open}# ... m#$reg->{close}#){
            /$reg->{open}|$reg->{close}/ and next;
            # print or whatever ...
        }
    }
    ...
    # STDOUT:
    # <tr><td>First Table</td</tr>
    # <tr><td>Second Table</td</tr>
    # <tr><td>Third Table</td</tr>

    __DATA__
    <table>
    <tr><td>First Table</td</tr>
    </table>
    <table>
    <tr><td>Second Table</td</tr>
    </table>
    <table>
    <tr><td>Third Table</td</tr>
    </table>


Re: regexp text parsing issue.
by jhourcle (Prior) on Mar 19, 2005 at 01:14 UTC

    Regular expressions have their uses. SGML parsing is not one of them. You've already found one of those situations where it simply doesn't work (also, try embedded tables). It's even worse when you try to deal with badly formatted HTML (and there's a whole lot of it out there, thanks to incorrectly written WYSIWYG editors and 'webmasters' who have no idea what HTML is).

    Would you care to explain your reasons for not wanting to use existing parsers, as it's possible that there may be other ways to solve your problem.

    (I'd personally try to build a tree, if I knew I was always going to be working with well formed SGML, but you haven't even mentioned why you're trying to do this)

      Would you care to explain your reasons for not wanting to use existing parsers, as it's possible that there may be other ways to solve your problem.
      Sure. I really do not want to force users to install a third-party module just to use the application. If the HTML parser were part of the standard Perl distribution, then that might be a possibility. I am not trying to parse other people's web pages, so I am confident that the HTML I am trying to parse will be the same every time. The HTML is generated by my CGI script.

        You're trying to parse something, that you're generating from a CGI? So you have control of what's being generated in the first place... then why are you using HTML (which is difficult to parse)? Generate an alternate output, that can be more easily parsed (or directly used by whatever it is that you're trying to do.)

        This is exactly what SOAP, WDDX, XML, and all those other acronyms are for. (although, they do have some overhead, but you're sure to get your data across cleanly) Here's another simple way to pass data out of your CGI:

        use Data::Dumper;
        print "Content-type: text/plain\n\n", Dumper($my_data);

        CGIs don't have to generate HTML. XML can be your friend. So can plain text, when used right. (tab delim, CSV, etc)
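
        As a rough sketch of the tab-delimited option (the row data and variable names here are made up for illustration, not taken from the original script), the CGI could emit one record per line, and the consumer could split it back apart using core Perl only:

```perl
use strict;
use warnings;

# Hypothetical data; in the real script this would come from wherever
# the table contents are produced.
my @rows = (
    [ 'First Table',  1 ],
    [ 'Second Table', 2 ],
    [ 'Third Table',  3 ],
);

# Producer side: one record per line, fields separated by tabs.
my $body = join '', map { join( "\t", @$_ ) . "\n" } @rows;
print "Content-type: text/plain\n\n", $body;

# Consumer side: no HTML parsing needed, just two splits.
my @parsed = map { [ split /\t/ ] } split /\n/, $body;
```

        This only holds up as long as the field values themselves never contain tabs or newlines; if they might, a real CSV or XML writer is the safer choice.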

Re: regexp text parsing issue.
by holli (Abbot) on Mar 19, 2005 at 09:51 UTC
    Of course you can use HTML::Parser and code the table parsing by hand, but I suggest using HTML::TableContentParser, which is a subclass of HTML::Parser:
    use strict;
    use HTML::TableContentParser;

    my $html = qq{
    <table>
    <tr><td>1</td><td>2</td><td>3</td></tr>
    <tr><td>4</td><td>5</td><td>6</td></tr>
    <tr><td>7</td><td>8</td><td>9</td></tr>
    </table>
    <table>
    <tr><td>11</td><td>12</td><td>13</td></tr>
    <tr><td>14</td><td>15</td><td>16</td></tr>
    <tr><td>17</td><td>18</td><td>19</td></tr>
    </table>
    };

    my $p = HTML::TableContentParser->new();
    my $tables = $p->parse($html);
    for my $table (@$tables) {
        print "new table!\n";
        for my $row (@{$table->{rows}}) {
            print "new row: ";
            for my $column (@{$row->{cells}}) {
                print "[$column->{data}] ";
            }
            print "\n";
        }
    }

    That prints:

    new table!
    new row: [1] [2] [3]
    new row: [4] [5] [6]
    new row: [7] [8] [9]
    new table!
    new row: [11] [12] [13]
    new row: [14] [15] [16]
    new row: [17] [18] [19]

    Easy, nice, reliable and clean. Enjoy! ;-)


    holli, /regexed monk/
Re: regexp text parsing issue.
by graff (Chancellor) on Mar 19, 2005 at 15:29 UTC
    It looks like the first reply actually provided a working solution. But here's another one, based on the fact that you control the code that generates the html in the first place, and you seem to know for sure that:
    # each table consists of three lines:
    #  the <table> tag (nothing else) on one line
    #  all <tr><td>content</td>...</tr> (nothing else) on the next line
    #  the </table> tag (nothing else) on the third line

    my ($open,$content);
    my $Txt = '';
    while (<>) {
        if ( /<tr>/ ) {
            s/<br>/\n/gi unless ( /id="CODE"/ );
        }
        $Txt .= $_;
    }

    I put it that way, because based on the code you posted, it seemed like you weren't actually "extracting" all the tables -- you were just changing "br" tags to line-feeds within tables when the table data did not include 'id="CODE"'.

    If you really wanted to isolate each instance of table data from the rest of the html, it would be easy enough to add a few extra steps to that sort of loop, in order to do whatever you want with the table lines.
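
    One way those extra steps might look, as a sketch only: it assumes the same one-tag-per-line layout described above, and the @lines, @tables and $current names are invented for the example. It keeps an explicit "am I inside a table?" variable and collects the content lines of each table separately:

```perl
use strict;
use warnings;

# Stand-in for the lines read from the generated HTML.
my @lines = (
    '<html><body>',
    '<table>', '<tr><td>First Table</td></tr>',  '</table>',
    '<p>text between tables</p>',
    '<table>', '<tr><td>Second Table</td></tr>', '</table>',
    '<table>', '<tr><td>Third Table</td></tr>',  '</table>',
    '</body></html>',
);

my @tables;     # one array ref of content lines per table
my $current;    # undef while we are outside any table

for my $line (@lines) {
    if ( $line =~ /<table\b/i ) {
        $current = [];                       # a new table starts here
    }
    elsif ( $line =~ m{</table>}i ) {
        push @tables, $current if $current;  # table finished, keep it
        undef $current;
    }
    elsif ($current) {
        push @$current, $line;               # a line inside a table
    }
}

printf "table %d: %s\n", $_ + 1, join( ' ', @{ $tables[$_] } ) for 0 .. $#tables;
```

    Nested tables would defeat this, but the generating script can simply avoid producing them.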

    The point is, since you are writing the input data, and can easily enforce a specific formatting style on the HTML text (or you've done this already), then take full advantage of that style in order to simplify down-stream processing. Even if the format you created is different from what I assumed above, it should be easy to work out appropriate rules for "parsing" on a line-by-line basis that will do the right thing.

    But as I said, the first reply is also usable. It's shorter, but you have to be familiar with perl's "flip-flop" operator in order to understand it.
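
    For anyone who has not met it, here is a small stand-alone illustration of the flip-flop (range) operator in scalar context. The sample lines are invented, and this is a simplified variant of the idea rather than a copy of the first reply's code:

```perl
use strict;
use warnings;

my @lines = (
    'before',
    '<table>', '<tr><td>First Table</td></tr>',  '</table>',
    'between',
    '<table>', '<tr><td>Second Table</td></tr>', '</table>',
    'after',
);

my @inside;
for (@lines) {
    # The "..." expression becomes true on a line matching <table> and
    # stays true up to and including the next line matching </table>.
    if ( /<table>/ ... m{</table>} ) {
        next if m{<table>|</table>};    # skip the delimiter lines
        push @inside, $_;
    }
}

print "$_\n" for @inside;
```

    Only the two <tr> lines end up in @inside; 'before', 'between' and 'after' are never inside the range.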

    (Personally, if I'm deliberately going to avoid using HTML::TokeParser or a similar module for this sort of task, I'd prefer to use a coding style that is more explicit / less obscure, in the sense of using constructs that are relatively basic and more familiar to a broader range of programmers, rather than the more esoteric perl idioms, like flip-flop operators or whole-document regex substitutions that include executable code with embedded substitutions.)

    BTW, the reason the OP code was doing the wrong thing (taking all tables together as a single table) was that you were using the "s" modifier on the outer regex (so "." also matches newlines), and the ".*" of "(.*|\n)" was being greedy, matching everything from the first <table to the last </table> at once.
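
    The difference is easy to see in isolation. The following is a minimal sketch with made-up sample input and simplified patterns (not the exact substitution from the question); it counts how many matches each form finds:

```perl
use strict;
use warnings;

# Sample input modeled on the three tables from the question.
my $html = join "\n",
    '<table>', '<tr><td>First Table</td></tr>',  '</table>',
    '<table>', '<tr><td>Second Table</td></tr>', '</table>',
    '<table>', '<tr><td>Third Table</td></tr>',  '</table>';

# Greedy: with /s, ".*" runs to the end of the string and backtracks
# only as far as the LAST </table>.
my @greedy = $html =~ m!(<table.*?>.*</table>)!isg;

# Non-greedy: ".*?" stops at the FIRST </table> it can reach.
my @lazy = $html =~ m!(<table.*?>.*?</table>)!isg;

printf "greedy: %d match(es)\n",     scalar @greedy;
printf "non-greedy: %d match(es)\n", scalar @lazy;
```

    The greedy pattern reports one match and the non-greedy pattern reports three.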

      Ahh man I am such an idiot!

      Thanks, graff, for getting me to see my mistake. Although I did not use your suggestion, your attention to the single-line matching modifier and greedy mode got me thinking about my syntax. The 's' modifier was not the culprit in this case, but I can see how it could be in others.

      My issue was the location of the question mark (?) in the pattern. I originally had it outside the parentheses of the $2 capture group, where it only made the group optional instead of making '.*' non-greedy, so the match was not stopping at each </table> tag. All I had to do was move the non-greedy indicator inside the $2 group, after the '.*', like so: '.*?|\n'.

      Now it works beautifully!

      In the end the code that works is:

      $Txt =~ s!(<table.*?>)(.*?|\n)(</table>)!
          my $TD = $2;
          my $first = $1;
          my $last = $3;
          unless ($TD =~ /id="CODE"/) {
              $TD =~ s#<br>#\n#isg;
          }
          "$first$TD$last"
      !eisg;