http://qs1969.pair.com?node_id=501370

PerlPilgrim has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have spent several days perusing all of the good ideas on this site about how to parse and manipulate tables, but alas, no one has specifically asked about this particular situation, which I will describe.

Currently, I am using (with permission) vendor web sites which have product data in tabular form. My goal is to integrate their pages into our e-commerce system. I grab the pages with HTTP::Request, modify the pages, and then re-serve them as if they are our own. (This is a simplification: some pages are static, so we fetch them with wget from a cron job and store them locally to be polite.) The tables are consistent, and I need to extract part number, description and application, and then insert a form into each row, which contains a button to add the item to the shopping cart.

Thus far, my approach, which works nicely (today), but is not the proper approach from what I have read, is to parse the pages using regexp, split and join. My ultimate goal is to use one of the modules to accomplish this in a cleaner, more robust fashion. It looks like HTML::ElementTable is the way to go, but most examples I have seen build the table from scratch. Reading the CPAN docs shows that this module will operate on HTML::Element objects, but the only way I know of to build them from an HTML string is with HTML::TreeBuilder, which appears to be very CPU-hungry.

Is there a better way to create the HTML::Element objects from an HTML string? Also, once I do the necessary manipulation, will the as_HTML subroutine recreate the original document satisfactorily? Is this even the direction I want to go with this?
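
For reference, the HTML::TreeBuilder route mentioned in the question can be sketched as follows. This is only a minimal illustration, assuming HTML::TreeBuilder is installed; the one-cell table and the " (in stock)" text are made up for the example. It also shows that as_HTML emits a normalized document (implicit html, head and body elements are added) rather than a byte-for-byte copy of the input.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up fragment standing in for a vendor page.
my $html = '<table><tr><td>AEM-218300C</td></tr></table>';

# Build a tree of HTML::Element objects from the string.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Find every <td> element and append some text to it.
my @cells = $tree->look_down( _tag => 'td' );
$_->push_content(' (in stock)') for @cells;

# Serialize the tree again; '<>&' lists the characters to entity-encode.
my $out = $tree->as_HTML('<>&');
print "$out\n";

# Free the tree explicitly when done with it.
$tree->delete;
```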

Many thanks in advance for any wisdom that may be shared.

With kind regards,

Mark

Replies are listed 'Best First'.
Re: Table Manipulation
by Roy Johnson (Monsignor) on Oct 19, 2005 at 18:32 UTC
    Sounds like a job for HTML::TableExtract. By our own mojotoad.
    The information from each extracted table is stored in table objects. Tables can be extracted as text, HTML, or HTML::ElementTable structures (for in-place editing).
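
A minimal sketch of the plain-text extraction mode, assuming HTML::TableExtract 2.x is installed. The table is a cut-down version of the vendor HTML posted later in this thread, and the hash keys `description` and `stock_no` are names chosen for this example only.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Cut-down version of the vendor table from this thread.
my $html = <<'HTML';
<table>
<tr><td><font color="white">DESCRIPTION</font></td><td>STOCK_NO</td></tr>
<tr><td>1997-02 6 CYL 4.0L TJ WRANGLER</td><td> <div align="center">AEM-218300C</div></td></tr>
<tr><td>2000-UP 6 CYL 3.7L KJ LIBERTY</td><td> <div align="center">AEM-218302C</div></td></tr>
</table>
HTML

# Select tables by their column headers; columns come back in this order.
my $te = HTML::TableExtract->new( headers => [qw(DESCRIPTION STOCK_NO)] );
$te->parse($html);

my @items;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # Cells may be undef or carry stray whitespace, so normalize them.
        my ( $description, $stock_no ) =
            map { my $t = defined $_ ? $_ : ''; $t =~ s/^\s+|\s+$//g; $t } @$row;
        push @items, { description => $description, stock_no => $stock_no };
    }
}

print "$_->{stock_no}: $_->{description}\n" for @items;
```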

    Caution: Contents may have been coded under pressure.
      Great, thanks! I checked it out, and if it creates the right objects for HTML::ElementTable to play with, then it should be exactly what I need. I'll mess with it for the next day or so and see what happens. ~Mark
Re: Table Manipulation
by mojotoad (Monsignor) on Oct 19, 2005 at 22:00 UTC
    Hi Mark,

    The functionality bridging between HTML table extraction and the creation of the element structures is fairly new. I'd appreciate any feedback you have.

    In the meantime, I recommend using version 2.06 of HTML::TableExtract at minimum.
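
One way to check which version a host actually provides is sketched below. The helper name `module_at_least` is hypothetical; it relies only on `require` and the standard `VERSION` method that every Perl module inherits.

```perl
use strict;
use warnings;

# Return 1 if $module can be loaded and is at least version $min, else 0.
sub module_at_least {
    my ( $module, $min ) = @_;
    ( my $file = "$module.pm" ) =~ s{::}{/}g;
    my $ok = eval {
        require $file;             # dies if the module is not installed
        $module->VERSION($min);    # dies if the installed version is older
        1;
    };
    return $ok ? 1 : 0;
}

my $ok  = module_at_least( 'HTML::TableExtract', 2.06 );
my $msg = $ok
    ? "HTML::TableExtract is 2.06 or newer\n"
    : "HTML::TableExtract is missing or older than 2.06\n";
print $msg;
```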

    Cheers,
    Matt

      Matt,

      I use pair for web hosting, and they have v.1.08, which is probably why it didn't like the tree method, right? I asked them to upgrade to the latest, and so I'll get back to it when they finish. (I figured installing a local copy was a bit silly if they already have the module installed, even an older version.) I still don't have a code sample of my regexp implementation, I'll get to it sooner or later. I'll try to give you feedback, if I can figure out what is due to my inexperience and what is not. I look forward to trying this out!

      Thanks again,

      Mark
      Matt,

      I have finally gotten TableExtract and (hopefully) all prerequisites installed in my local directory at pair networks, since they didn't seem too keen on installing them centrally for some reason. That done, I have a small snippet of the entire script that I will post here. (The rest of the script is not relevant to this process nor this discussion.) I think this should be enough to stand alone, and at least get the point across. I am getting the following message when I run the script, which tells me that something isn't installed quite right, I think.
      Can't call method "tag" without a package or object reference at /usr/home/mllott/perl/lib/perl5/site_perl/5.8.3/HTML/ElementTable.pm line 367.
      For help, please send mail to the webmaster (mllott@pair.com), giving this error message and the time and date of the error.
      (Nice how they tell me to contact myself when something goes wrong, I always get a good laugh out of that.) What I am doing is to get a feel for the module by trying out one of your examples on an excerpt of the HTML that I will eventually be processing. I started by commenting out everything, then un-commenting one line at a time. It's the $te->parse($html); line where things stop working.

      Here is the snippet:
      #!/usr/bin/perl -w

      $html = html_test();

      use HTML::TableExtract qw(tree);
      $te = HTML::TableExtract->new( headers => [qw(DESCRIPTION STOCK_NO)]);
      $te->parse($html);

      # $table = $te->first_table_found;
      # $table_tree = $table->tree;
      # $table_tree->cell(2,2)->replace_content('Test');
      # $table_html = $table_tree->as_HTML;
      # $table_text = $table_tree->as_text;
      # $document_tree = $te->tree;
      # $html = $document_tree->as_HTML;

      print "Content-type: text/html\n\n".$html;

      sub html_test {
          return qq(
      <html>
      <head>
      <title>Test Page</title>
      <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
      </head>
      <body bgcolor="#FFFFFF" text="#000000">
      <div id="Layer1" style="position:absolute; left:22px; top:24px; width:104px; height:29px; z-index:1"><font face="Georgia, Times New Roman, Times, serif"><b><font size="4">Page-131</font></b></font></div>
      <table width="525" border="0" cellspacing="1">
        <tr align="left" bgcolor="#0033CC">
          <td bgcolor="#0033CC"> <div align="center"><font color='white'>DESCRIPTION</font></div></td>
          <td bgcolor="#0033CC"> <div align="center"><font color="#FFFFFF">STOCK_NO</font></div></td>
        </tr>
        <tr bgcolor="#FFFF00">
          <td width="406" align="left">1997-02 6 CYL 4.0L TJ WRANGLER</td>
          <td width="112"> <div align="center">AEM-218300C</div></td>
        </tr>
        <tr bgcolor="#FFFFFF">
          <td width="406" align="left">1997-02 4 CYL 2.5L TJ WRANGLER</td>
          <td width="112"> <div align="center">AEM-218301C</div></td>
        </tr>
        <tr bgcolor="#FFFF00">
          <td align="left">2000-UP 6 CYL 3.7L KJ LIBERTY</td>
          <td> <div align="center">AEM-218302C</div></td>
        </tr>
        <tr bgcolor="#FFFFFF">
          <td height="22" align="left">1993-98 V8 (BOTH 5.2L&amp; 5.9L) ZJ GRAND CHEROKEE</td>
          <td><div align="center">AEM-218303C</div></td>
        </tr>
      </table>
      </body>
      </html>
      );
      }
      (Once I am able to manipulate tables, I will be using the STOCK_NO field to grab data from a database, and add a price and shopping cart button to the table. This part will be no sweat.)
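
The row-building step described above might look roughly like this once the description and stock number are in hand. Everything here is an assumption for illustration: the `%price_for` hash stands in for the database lookup, `/cgi-bin/cart.cgi` is a made-up form target, and `cart_row` and `escape_html` are hypothetical helper names. It uses core Perl only and does not depend on tree mode.

```perl
use strict;
use warnings;

# Hypothetical price table; a real system would query the database instead.
my %price_for = ( 'AEM-218300C' => '189.95' );

# Minimal escaping for text placed into HTML content or attribute values.
sub escape_html {
    my ($text) = @_;
    my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Build one table row with two extra cells: the price and an add-to-cart form.
sub cart_row {
    my ( $description, $stock_no ) = @_;
    my $price = exists $price_for{$stock_no} ? $price_for{$stock_no} : 'call';
    my $form  = sprintf(
        '<form method="post" action="/cgi-bin/cart.cgi">'
      . '<input type="hidden" name="stock_no" value="%s">'
      . '<input type="submit" value="Add to cart"></form>',
        escape_html($stock_no)
    );
    return sprintf(
        "<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>\n",
        escape_html($description), escape_html($stock_no),
        escape_html($price), $form
    );
}

print cart_row( '1997-02 6 CYL 4.0L TJ WRANGLER', 'AEM-218300C' );
```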

      Any feedback? Thanks in advance!

      Regards,

      Mark
Re: Table Manipulation
by idsfa (Vicar) on Oct 20, 2005 at 19:22 UTC

    It might also behoove you to look over jZed's AnyData module. It layers over HTML::TableExtract to present either a tied hash or a DBI interface.


    The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. -- Cyrus H. Gordon