vxp has asked for the wisdom of the Perl Monks concerning the following question:

There's an HTML file to parse. It looks like the following:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html lang="en-US"><head><title>No title</title>
<link rev="made" href="mailto:noreply%40domain.com">
<link rel="stylesheet" type="text/css" href="/css/eman.css">
<script language="javascript" src="/js/eman.js" type="text/javascript"></script>
</head><body onload="eman_form_focus(); " bgcolor="#FFFFFF">
<!-- HOSTNAME: hostname.domain.com -->
<form action=View.pcgi>
<table border=0 bgcolor='#cccc99' width='100%'><tr><td align=left><font size=+1> WLC download configuration jobs<//font></td><td align=right><input type="submit" name="selector.showb" value="Use Selection Bar"></td></tr></table><hr width='100%' noshade><table width='625' cellspacing=0 cellmarging=0 cellpadding=10 border=1>
<table class="datatbl" border="1">
<tr align="left" valign="top">
<th class="colhdrinactive" style="text-align : left; "><a class="colsortlink" href="http://hostname.domain.com/OPDATA/Config/View.pcgi?table_sid=bbe93377ed6087e8fa79f7a135af7b2a&table_seq=2&sortby=0&startrow=0&DEVICE=WLC&JOB_TYPE=download&TITLE=1&JOB_STATUS=any" title="Job Description">Job Description</a></th>
<th class="colhdrinactive" style="text-align : left; "><a class="colsortlink" href="http://hostname.domain.com/OPDATA/Config/View.pcgi?table_sid=bbe93377ed6087e8fa79f7a135af7b2a&table_seq=2&sortby=1&startrow=0&DEVICE=WLC&JOB_TYPE=download&TITLE=1&JOB_STATUS=any" title="Job Owner">Job Owner</a></th>
<th class="colhdrinactive" style="text-align : left; "><a class="colsortlink" href="http://hostname.domain.com/OPDATA/Config/View.pcgi?table_sid=bbe93377ed6087e8fa79f7a135af7b2a&table_seq=2&sortby=2&startrow=0&DEVICE=WLC&JOB_TYPE=download&TITLE=1&JOB_STATUS=any" title="Job Status">Job Status</a></th>
<th class="colhdrinactive" style="text-align : left; "><a class="colsortlink" href="http://hostname.domain.com/OPDATA/Config/View.pcgi?table_sid=bbe93377ed6087e8fa79f7a135af7b2a&table_seq=2&sortby=3&startrow=0&DEVICE=WLC&JOB_TYPE=download&TITLE=1&JOB_STATUS=any" title="Timestamp">Timestamp</a></th>
</tr>
<tr class="row1" valign="top" align="left">
<td><a href='Modify.pcgi?bottom=1&SESSION_ID=41f647b1a8c1e6ad9f8bd25672459223'>WLC Download Summary</a></td>
<td>eman</td>
<td>running</td>
<td>19:13:19 21/Jun/2009 EDT</td>
</tr>
<tr class="row2" valign="top" align="left">
<td><a href='Modify.pcgi?bottom=1&SESSION_ID=b533b920ee57d39133edf75c234e8ffc'>WLC Download Summary</a></td>
<td>eman</td>
<td>running</td>
<td>19:55:45 20/Jun/2009 EDT</td>
</tr>
<tr class="row1" valign="top" align="left">
<td><a href='Modify.pcgi?bottom=1&SESSION_ID=0ce53c9933be10114e6da3b90940f458'>WLC Download Summary</a></td>
<td>eman</td>
<td>running</td>
<td>19:51:41 19/Jun/2009 EDT</td>
</tr>
... and some more stuff just like above.

The task at hand is to get the _first_ "SESSION_ID" value over there. In the example above, that'd be "SESSION_ID=41f647b1a8c1e6ad9f8bd25672459223".

I've looked at various CPAN modules, such as HTML::TreeBuilder, but I'm not sure how to pick out that particular field, since there's nothing unique about it other than it always being the first entry in that HTML input.

Any input/suggestions appreciated!

PS. Essentially, I guess, I'm looking for the Perl equivalent of the following command (cleaner that way :) ):

grep SESSION val.html | awk -F'=' '{ print $4 }' | awk -F"'" '{ print $1 }' | head -1

[root@mybox ~]# grep SESSION val.html | awk -F'=' '{ print $4 }' | awk -F"'" '{ print $1 }' | head -1
41f647b1a8c1e6ad9f8bd25672459223
[root@mybox ~]#
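
For comparison, here is a minimal core-Perl sketch of what that pipeline does, assuming the ID is always a run of hex digits right after "SESSION_ID=". The helper name first_session_id and the inline two-row sample are illustrative only and not from the original post. It treats the HTML as plain text, exactly as grep/awk do, so it is only as robust as the pipeline itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first SESSION_ID value found in a string of HTML, or undef.
# (first_session_id is a hypothetical helper name, used only for this sketch.)
sub first_session_id {
    my ($html) = @_;
    # A non-global match stops at the first occurrence, like `head -1`.
    my ($id) = $html =~ /SESSION_ID=([0-9a-f]+)/i;
    return $id;
}

# Inline sample: the first two data rows from the table in the question.
my $html = <<'HTML';
<td><a href='Modify.pcgi?bottom=1&SESSION_ID=41f647b1a8c1e6ad9f8bd25672459223'>WLC Download Summary</a></td>
<td><a href='Modify.pcgi?bottom=1&SESSION_ID=b533b920ee57d39133edf75c234e8ffc'>WLC Download Summary</a></td>
HTML

print first_session_id($html), "\n";   # prints 41f647b1a8c1e6ad9f8bd25672459223
```

Against the real file, the same match can probably be run as a one-liner by slurping the file first: perl -0777 -ne 'print "$1\n" if /SESSION_ID=([0-9a-f]+)/' val.html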

Replies are listed 'Best First'.
Re: Parsing HTML to get a value from a specific table row
by locked_user sundialsvc4 (Abbot) on Jun 22, 2009 at 15:47 UTC

    Without getting into a discussion of exactly which Perl modules to use, I would approach this problem in this way (using existing very-high-level modules throughout):

    1. Parse the HTML file into an XML data-structure.
    2. Use XPath-expressions to search for what you want. (It helps considerably if there is an "id" field.)
    3. Iterate through the search-list, extracting the necessary parts (using XPath again if necessary) and processing them.
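
The three steps above can be sketched roughly as follows. This is one possible reading, not necessarily what the poster had in mind: it assumes the CPAN module HTML::TreeBuilder::XPath is installed, which builds an HTML tree that can be queried with XPath rather than a true XML document. The inline sample is trimmed from the question, and since the links carry no "id" attribute, the XPath expression matches on the href content instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN module, assumed to be installed

# Trimmed sample based on the table rows in the question.
my $html = <<'HTML';
<table class="datatbl" border="1">
<tr class="row1"><td><a href='Modify.pcgi?bottom=1&SESSION_ID=41f647b1a8c1e6ad9f8bd25672459223'>WLC Download Summary</a></td></tr>
<tr class="row2"><td><a href='Modify.pcgi?bottom=1&SESSION_ID=b533b920ee57d39133edf75c234e8ffc'>WLC Download Summary</a></td></tr>
</table>
HTML

# Step 1: parse the HTML into a tree.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Step 2: search with an XPath expression (results come back in document order).
my @links = $tree->findnodes('//a[contains(@href, "SESSION_ID")]');

# Step 3: iterate over the matches and extract the part of interest.
my @ids;
for my $link (@links) {
    my ($id) = $link->attr('href') =~ /SESSION_ID=([0-9a-f]+)/;
    push @ids, $id if defined $id;
}

print "$ids[0]\n" if @ids;   # the first SESSION_ID in the document

$tree->delete;               # release the tree's memory
```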

    Strange as it may seem, “this sort of task should be as easy as falling off a log.” No amount of fancy, complicated programming should be needed on your part: all of the heavy lifting has been done for you.

    Depending on your actual requirements, it might be possible to do the task with no programming at all. XSL stylesheets are an extremely powerful transformational tool.

Re: Parsing HTML to get a value from a specific table row
by metaperl (Curate) on Jun 22, 2009 at 15:34 UTC
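    If you use HTML::Element (via HTML::TreeBuilder):
    use HTML::TreeBuilder;
    my $file  = 'val.html';
    my $tree  = HTML::TreeBuilder->new_from_file($file);
    # In scalar context, look_down returns the first matching element
    my $first = $tree->look_down(href => qr/SESSION_ID/);
    my $href  = $first->attr('href');
    # The href contains several '=' signs, so capture the ID directly instead of splitting on '='
    my ($val) = $href =~ /SESSION_ID=(\w+)/;
    warn $val;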
      absolutely perfect!
Re: Parsing HTML to get a value from a specific table row
by Anonymous Monk on Jun 22, 2009 at 15:27 UTC