fishy has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I'm starting with Raku Grammars.

Parsing a simple HTML table (see below), at first using only regexes (no grammar), I constructed an array of arrays to get out the header and row data.

    use v6;

    my $file_name = "Z2020_G_004_202202161115.html";
    my @table_header = ();
    my @table_data = ();
    my @file_lines = slurp( $file_name ).split: / \n /;

    for @file_lines {
        last if / '<tfoot>' /;
        if / '<th>' (.+?) '</th>' $ / {
            @table_header.push: $0;
            next;
        }
        if / '<tr>' $ / {
            @table_data.push: [];
            next;
        }
        if / '<td>' (.*?) '</td>' $ / {
            @table_data[*-1].push: $0;
        }
    }

    print( "{@table_header.join: ';'}\n" );
    for @table_data {
        print( "{.join: ';'}\n" ) if .elems;
    }

I get the expected result:

    Clasification;Descrip;Cod Program;Descrip Program;Clasification Program;Credits;Payment
    ; ;1360; ;Services;150.000,00;62400
    0,00; ;20.504,57;20.504,57;Services;0,00;-20.504,57
    0,00; ;59.179,70;59.179,70;Services;6.254,79;-59.179,70
    0,00; ;16.518,85;16.518,85;Services;0,00;33.481,15

Then I wrote a grammar in order to get the same result.

    use v6;

    # use lib $*PROGRAM.IO.parent.add: 'lib';
    # use Grammar::Debugger;
    # use Grammar::Tracer;

    my @table_header;
    my @table_data;

    grammar html_table {
        token TOP { <.rubbish>+? <head> <body> <.rubbish>+ }
        rule head { <.ws> <.theadl> <.ws> <.trl> <hrow>* <.ws> <.trr> <.ws> <.theadr> }
        rule hrow { <.ws> <.thl> <data> <.thr> { @table_header.push: ~$<data> } }
        rule body { <.ws> <.tbodyl> [<.ws> <.trl> { @table_data.push: [] } <brow>* <.trr> ]* <.ws> <.tbodyr> }
        rule brow { <.ws> <.tdl> <data> <.tdr> { @table_data[\*-1].push: ~$<data> } }
        token theadl { '<thead>' }
        token theadr { '</thead>' }
        token tbodyl { '<tbody>' }
        token tbodyr { '</tbody>' }
        token trl { '<tr>' }
        token trr { '</tr>' }
        token thl { '<th>' }
        token thr { '</th>' }
        token tdl { '<td>' }
        token tdr { '</td>' }
        regex data { .*? }
        regex rubbish { \N* \n }
    }

    my $file_name = "Z2020_G_004_202202161115.html";
    my $file_content = slurp( $file_name );
    my $p = html_table.parse( $file_content );

    if $p.defined {
        print( "{@table_header.join: ';'}\n" );
        for @table_data {
            print( "{.join: ';'}\n" ) if .elems;
        }
    }

But I don't get the same result:

    Clasification;Descrip;Cod Program;Descrip Program;Clasification Program;Credits;Payment
    ;;1360;;Services;150.000,00;62400;0,00;;20.504,57;20.504,57;Services;0,00;-20.504,57;0,00;;59.179,70;59.179,70;Services;6.254,79;-59.179,70;0,00;;16.518,85;16.518,85;Services;0,00;33.481,15

It seems that I'm not constructing an array of arrays (just one array).

Am I indexing the last element correctly in rule 'brow'?

Grateful for any help.

I would also welcome any comments on improving the grammar, or other ways of parsing this simple HTML table (using a grammar, please).

Update:

When I leave out the escaping backslash,

{ @table_data[*-1].push: ~$<data> }

it works as expected!
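For anyone following along, here is a minimal sketch of what the un-escaped `*-1` subscript does. It is detached from the grammar, and the cell values 'a', 'b' and 'c' are made up for illustration: each `push: []` opens a new row, and `[*-1]` always addresses the most recently pushed row.

```raku
use v6;

# Illustrative data only; these values are not from the HTML file.
my @table_data;

@table_data.push: [];            # open the first row
@table_data[*-1].push: 'a';      # *-1 means "last element", i.e. the row just opened
@table_data[*-1].push: 'b';

@table_data.push: [];            # open the second row
@table_data[*-1].push: 'c';      # lands in the second row, not the first

say @table_data.elems;           # 2
say .join(';') for @table_data;  # a;b  then  c
```

With the backslash in place the subscript evidently no longer meant "last element", which would explain why all cells ended up in a single row.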

(got too excited about my first real grammar :-o)
Now it's actions time...

Update:

After having some funny playtime, here is a polished version:

    use v6;

    grammar HTML_table {
        token TOP {
            <.rubbish>+?
            <.ws> '<thead>' <header> <.ws> '</thead>'
            <.ws> '<tbody>' <row>+ <.ws> '</tbody>'
            <.rubbish>+
        }
        rule header { <?> '<tr>' ~ '</tr>' <field>* }
        regex field { <.ws> '<th>' ~ '</th>' (.*?) }
        rule row { <?> '<tr>' ~ '</tr>' <data>* }
        regex data { <.ws> '<td>' ~ '</td>' (.*?) }
        regex rubbish { \N* \n }
    }

    class HTML_table_actions {
        method header($/) { make $<field>>>.made; }
        method field($/) {
            # make ~$/[0]; # verbatim
            make $/[0].defined ?? $/[0].Str.trim !! '';
        }
        method row($/) { make $<data>>>.made; }
        method data($/) {
            # make ~$/[0]; # verbatim
            make $/[0].defined ?? $/[0].Str.trim !! '';
        }
    }

    my $parser;
    my @file_list = dir(test => / :i '.' html $ /);
    my $file_name = @file_list[0].substr: 0, 16;
    my $output_file = open $file_name ~ ".csv", :w;
    my $file_content;

    for @file_list {
        $file_content = slurp($_, enc => 'iso-8859-1');
        say "Parsing: $_";
        $parser = HTML_table.parse($file_content, actions => HTML_table_actions.new);
        unless $parser { say "Unable to parse: $_"; last; };
        once { $output_file.print("{ $parser<header>.made.join: ';'; }\n") };
        $output_file.print("{ .join: ';'; }\n") for $parser<row>>>.made;
    }
    $output_file.close;
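To show the two ideas the polished version relies on in isolation, here is a stripped-down sketch: the `~` goal-matching operator for paired tags, and an actions class that builds the result with `make` / `made`. The names `Row` and `RowActions` and the one-line input string are hypothetical, chosen only for this example, and the sketch assumes cell contents never contain a `<`.

```raku
use v6;

# Hypothetical mini-grammar for a single table row; not the grammar above.
grammar Row {
    # 'open' ~ 'close' inner  matches: open, then inner, then close
    token TOP  { '<tr>' ~ '</tr>' <cell>* }
    token cell { '<td>' ~ '</td>' (<-[<]>*) }
}

class RowActions {
    # Each cell attaches its trimmed text; TOP collects the cells into a list.
    method TOP($/)  { make $<cell>>>.made }
    method cell($/) { make $0.Str.trim }
}

my $input = '<tr><td>1360</td><td> </td><td>Services</td></tr>';
my $m = Row.parse($input, actions => RowActions.new);

say $m.made.join(';') if $m;
```

The point of the actions class is that the grammar stays free of side effects: nothing is pushed into outer arrays while parsing, and the row data is read from `.made` afterwards.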

Sample input:

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="es" lang="es">
    <head>
      <link rel="stylesheet" href="/VisualizadorPortalCiudadano/portal/css/jquery.treeview.css" type="text/css" />
      <!-- <script type="text/javascript" src="js/ui.tabs.js"></script>-->
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
      <meta name="description" content="Short description of your site here." />
      <meta name="keywords" content="keywords, go, here, seperated, by, commas" />
    </head>
    <body onContextMenu="return false;">
    <div class="container-fluid">
      <div class="panel-heats row well">
        <div class="col-sm-4 col-md-4 col-xs-4 col-lg-4">
        </div>
      </div>
      <div id="Port" class="container-fluid">
        <div id="Visual">
          <div class="page-header">
          </div>
          <div style="display: none;">
            <p id="currentFormat">europeanFormat</p>
            <div class="europeanFormat">
              <p class="decimal-separator">,<p>
              <p class="grouping-separator">\.<p>
            </div>
            <div class="ukFormat">
              <p class="decimal-separator">\.<p>
              <p class="grouping-separator">,<p>
            </div>
          </div>
        </div>
        <table class="tablesorter">
          <thead>
            <tr>
              <th>Clasification</th>
              <th>Descrip</th>
              <th>Cod Program</th>
              <th>Descrip Program</th>
              <th>Clasification Program</th>
              <th>Credits</th>
              <th>Payment</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td></td>
              <td> </td>
              <td>1360</td>
              <td> </td>
              <td>Services</td>
              <td>150.000,00</td>
              <td>62400</td>
            </tr>
            <tr>
              <td>0,00</td>
              <td> </td>
              <td>20.504,57</td>
              <td>20.504,57</td>
              <td>Services</td>
              <td>0,00</td>
              <td>-20.504,57</td>
            </tr>
            <tr>
              <td>0,00</td>
              <td> </td>
              <td>59.179,70</td>
              <td>59.179,70</td>
              <td>Services</td>
              <td>6.254,79</td>
              <td>-59.179,70</td>
            </tr>
            <tr>
              <td>0,00</td>
              <td> </td>
              <td>16.518,85</td>
              <td>16.518,85</td>
              <td>Services</td>
              <td>0,00</td>
              <td>33.481,15</td>
            </tr>
          </tbody>
          <tfoot>
            <tr>
              <td colspan="6">Total</td>
              <td>89.478.403,32</td>
              <td>32.751.626,25</td>
              <td>122.230.029,57</td>
              <td>102.342.399,26</td>
              <td>89.476.722,29</td>
              <td>84.657.323,46</td>
              <td>4.819.398,83</td>
              <td>32.753.307,28</td>
            </tr>
          </tfoot>
        </table>
        <div id="loadingPortal" style="display: none;">
          <div id="overlay-loading" style="z-index: 1001; position: absolute; top: 0; left: 0; background: #aaaaaa url(/images/jquery/dialog/ui-bg_flat_0_aaaaaa_40x100.png) 50% 50% repeat-x; opacity: .1;filter:Alpha(Opacity=10);" >
          </div>
          <div id="loading-div" style="display: block; position: absolute; width: 0; top: 0; z-index: 1002; clear: both;">
            <img id="loading1" style="position: absolute; z-index: 1002;" src="images/loading_black.gif" alt="loading" />
          </div>
        </div>
      </div>
    </div>
    </body>
    </html>