mazdajai has asked for the wisdom of the Perl Monks concerning the following question:

I used the look_down method in HTML::TreeBuilder, but I am getting duplicate rows. Thoughts, anyone?
my $h = HTML::TreeBuilder->new;
$h->parse_file($tsmin);
my @warnings = $h->look_down( class => qr/Alt(Warning|Error)/ );
foreach my $node (@warnings) {
    my @filtered = $node->as_HTML( );
    say "dump of my @filtered";
    # say $fh2 @filtered;
}
Above output:
dump of my <tr class="AltWarning" height="22"><td align="middle" class="AltWarningNoVline" height="17" width="10"></td><td align="left" class="AltWarning" height="17">Missed</td><td align="left" class="AltWarning" height="17"></td><td align="left" class="AltWarning" height="17">2015-05-11-18.00</td><td align="left" class="AltWarning" height="17"></td><td align="left" class="AltWarning" height="17">NJDLYBACKUP_6PM</td><td align="left" class="AltWarning" height="17">SERVER1</td><td align="left" class="AltWarning" height="17">ST13_DOMAIN</td></tr>
dump of my <td align="middle" class="AltWarningNoVline" height="17" width="10"></td>
dump of my <td align="left" class="AltWarning" height="17">Missed</td>
dump of my <td align="left" class="AltWarning" height="17"></td>
dump of my <td align="left" class="AltWarning" height="17">2015-05-11-18.00</td>
dump of my <td align="left" class="AltWarning" height="17"></td>
dump of my <td align="left" class="AltWarning" height="17">6PM</td>
dump of my <td align="left" class="AltWarning" height="17">SERVER1</td>
dump of my <td align="left" class="AltWarning" height="17">DOMAINA</td>

Replies are listed 'Best First'.
Re: duplicate table with HTML::TreeBuilder look_down method
by kcott (Archbishop) on May 13, 2015 at 21:54 UTC

    G'day mazdajai,

    Welcome to the Monastery.

    In the context of this question, I'll assume "rows" refers to 'tr' elements. Your output only shows one 'tr' element: no duplicates there.

    The single 'tr' element contains a number of 'td' elements. All of these are unique except this, which appears twice:

    <td align="left" class="AltWarning" height="17"></td>

    Accordingly, this 'td' element appears twice in the output.

    Due to a lack of expected output, I can't tell what you want to keep and what you want to discard. See How do I post a question effectively? for details on what to provide to get a better answer from us. Here's a couple of guesses.

    • To get only the 'tr' or 'td' elements, use "_tag => 'wanted_element'" in the look_down() method.
    • To exclude duplicate 'td' elements, use grep. The standard idiom looks like this:
      my %seen; @warnings = grep { ! $seen{$_}++ } @warnings;
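Both guesses can be tried on a small tree. The following is a minimal sketch, assuming a made-up one-row table rather than the poster's real report. The `%seen` hash is keyed on each node's serialized HTML (an assumption made here for illustration), so that two separate element objects with identical markup count as the same item.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample: one warning row containing two identical empty cells.
my $html = <<'HTML';
<table>
<tr class="AltWarning">
<td class="AltWarning">Missed</td>
<td class="AltWarning"></td>
<td class="AltWarning">2015-05-11-18.00</td>
<td class="AltWarning"></td>
</tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Guess 1: restrict the match to 'td' elements, so the 'tr' is not returned.
my @cells = $tree->look_down( _tag => 'td', class => qr/Alt(Warning|Error)/ );

# Guess 2: drop repeats, comparing cells by their markup rather than by object.
my %seen;
my @unique = grep { !$seen{ $_->as_HTML }++ } @cells;

# Extract the text before the tree is deleted.
my @cell_text   = map { $_->as_text } @cells;
my @unique_text = map { $_->as_text } @unique;

print scalar(@cells), " cells matched, ", scalar(@unique), " unique\n";
$tree->delete;
```

Note that de-duplicating this way also discards the second empty cell, which may or may not be wanted when the position of a cell within the row matters.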

    -- Ken

Re: duplicate table with HTML::TreeBuilder look_down method
by codiac (Beadle) on May 14, 2015 at 10:48 UTC
    It's duplicated because the tr has the same class as the tds, so the first node matched is the tr, which contains all the tds, and then the tds make up the rest of the nodes matched. Just add an extra parameter to the look_down call so it only searches for td elements.
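The effect is easy to see by listing the tag of every matched node. This is a small sketch on a made-up two-row table (the sample HTML and variable names are assumptions for illustration), comparing a class-only search with searches that also constrain `_tag`.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample: one normal row and one warning row.
my $html = <<'HTML';
<table>
<tr class="AltLight"><td class="AltLight">Completed</td><td class="AltLight">ServerA</td></tr>
<tr class="AltWarning"><td class="AltWarning">Missed</td><td class="AltWarning">ServerB</td></tr>
</table>
HTML

my $tree  = HTML::TreeBuilder->new_from_content($html);
my $class = qr/Alt(Warning|Error)/;

# Class alone: the row matches, and so does every cell inside it.
my @any = map { $_->tag } $tree->look_down( class => $class );

# Class plus tag: only the cells.
my @cells = map { $_->tag } $tree->look_down( _tag => 'td', class => $class );

# Class plus tag: only the rows.
my @rows = map { $_->tag } $tree->look_down( _tag => 'tr', class => $class );

print "class only: @any\n";
print "td only:    @cells\n";
print "tr only:    @rows\n";
$tree->delete;
```

Because `as_HTML` on the row serializes its cells too, printing every node from the class-only search shows each cell twice: once inside the row and once on its own.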
      Thanks, everyone. Kcott, grep would work, but I am hoping to use the filters in look_down. Ken, the TD appears twice in my output and I believe you hit the nail on the head. I am still working on my filters in look_down. If I add _tag => "td", I lose the tr because it isn't in the filter. What is the correct syntax to nest multiple tags and classes in look_down filters? MY CODE:
      my $h = HTML::TreeBuilder->new;
      $h->parse_file($tsmin);
      my @warnings = $h->look_down( _tag => "td", class => qr/Alt(Warning|Error)/ );
      foreach my $warning (@warnings) {
          my @filtered = $warning->as_HTML( );
          say "dump of my @filtered";
          say $fh2 @filtered;
      }
      Standard Input:
      <html>
      <head>
      <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
      <meta name="GENERATOR" content="TSM Reporting">
      <meta name="ProgId" content="FrontPage.Editor.Document">
      <title>TSM Operational Reporting</title>
      </head>
      <DIV class=HeaderBar>Daily Report TSM 24 hour Report for TSM1TSG generated at 2015-05-12 09:00:26 on DIRECTOR covering 2015-05-11 09:00:26 to 2015-05-12 09:00:25 </DIV>
      <body>
      <table border="0" width="100%%">
      <DIV class=FooterBar>Server name: <a href="http://TSM1T.example.com:1880"> TSM1T</a>, platform: Linux/ppc64, version: 6.3.4.200, date/time: 05/12/2015 09:00:01</DIV>
      <tr><td width="100%"><p>
      <DIV class=HeaderBar>Client Schedules</DIV>
      <TABLE class=HeaderFrame height=100 cellSpacing=0 cols=3 cellPadding=0 width="100%" border=0 align="left">
      <TR vAlign=top height=100>
      <TD vAlign=top width="100%" height="100">
      <DIV style="overflow: auto; width: "100%"; height: 200; valign: top">
      <TABLE cellSpacing=0 cols=4 cellPadding=0 width="100%" border=0 height="100">
        <TR height=25 nowrap>
          <TD class=HeaderTitleNoVLine height="14" width="10">&nbsp;</TD>
          <TD class=HeaderTitle noWrap align=left height="14">Status</TD>
          <TD class=HeaderTitle noWrap align=left height="14">Results</TD>
          <TD class=HeaderTitle noWrap align=left height="14">Schedule Start</TD>
          <TD class=HeaderTitle noWrap align=left height="14">Actual Start</TD>
          <TD class=HeaderTitle noWrap align=left height="14">Schedule Name</TD>
          <TD class=HeaderTitle noWrap align=left height="14">Node Name</TD>
          <TD class=HeaderTitle noWrap align=left height="14">Domain Name</TD></TR>
        <TR class=AltLight height=22>
          <TD class=AltLightNoVline align=middle height="17" width="10"> </TD>
          <TD class=AltLight align=left height="17">Completed</TD>
          <TD class=AltLight align=left height="17">Successful</TD>
          <TD class=AltLight align=left height="17">2015-05-11-17.00</TD>
          <TD class=AltLight align=left height="17">2015-05-11-17.10</TD>
          <TD class=AltLight align=left height="17">DAILYBACKUP_5PM</TD>
          <TD class=AltLight align=left height="17">ServerA</TD>
          <TD class=AltLight align=left height="17">ST10_DOMAIN</TD></TR>
        <TR class=AltWarning height=22>
          <TD class=AltWarningNoVline align=middle height="17" width="10"> </TD>
          <TD class=AltWarning align=left height="17">Missed</TD>
          <TD class=AltWarning align=left height="17"></TD>
          <TD class=AltWarning align=left height="17">2015-05-11-18.00</TD>
          <TD class=AltWarning align=left height="17"></TD>
          <TD class=AltWarning align=left height="17">DAILYBACKUP_6PM</TD>
          <TD class=AltWarning align=left height="17">ServerB</TD>
          <TD class=AltWarning align=left height="17">ST10_DOMAIN</TD></TR>
        <TR class=AltWarning height=22>
          <TD class=AltWarningNoVline align=middle height="17" width="10"> </TD>
          <TD class=AltWarning align=left height="17">Missed</TD>
          <TD class=AltWarning align=left height="17"></TD>
          <TD class=AltWarning align=left height="17">2015-05-11-18.00</TD>
          <TD class=AltWarning align=left height="17"></TD>
          <TD class=AltWarning align=left height="17">NJDLYBACKUP_6PM</TD>
          <TD class=AltWarning align=left height="17">ServerC</TD>
          <TD class=AltWarning align=left height="17">ST13_DOMAIN</TD></TR>
        <TR class=AltDark height=22>
          <TD class=AltDarkNoVline align=middle height="17" width="10"> </TD>
          <TD class=AltDark align=left height="17">QATSWAS85</TD>
          <TD class=AltDark align=left height="17">37899</TD>
          <TD class=AltDark align=left height="17">104,113</TD>
          <TD class=AltDark align=left height="17">617</TD>
          <TD class=AltDark align=left height="17">0</TD>
          <TD class=AltDark align=left height="17">0</TD>
          <TD class=AltDark align=left height="17">0</TD>
          <TD class=AltDark align=left height="17">25</TD>
          <TD class=AltDark align=left height="17">13</TD>
          <TD class=AltDark align=left nowrap height="17">251.30 MB</TD>
          <TD class=AltDark align=left height="17">00:00:58</TD>
          <TD class=AltDark align=left height="17">4,378.98</TD>
          <TD class=AltDark align=left height="17">0%</TD>
        </TR>
        <TR class=AltLight height=22>
          <TD class=AltLightNoVline align=middle height="17" width="10"> </TD>
          <TD class=AltLight align=left height="17">ServerD</TD>
          <TD class=AltLight align=left height="17">38048</TD>
          <TD class=AltLight align=left height="17">31,461</TD>
          <TD class=AltLight align=left height="17">51</TD>
          <TD class=AltLight align=left height="17">0</TD>
          <TD class=AltLight align=left height="17">0</TD>
          <TD class=AltLight align=left height="17">0</TD>
          <TD class=AltLight align=left height="17">2</TD>
          <TD class=AltLight align=left height="17">2</TD>
          <TD class=AltLight align=left nowrap height="17">24.14 MB</TD>
          <TD class=AltLight align=left height="17">00:00:12</TD>
          <TD class=AltLight align=left height="17">1,946.00</TD>
          <TD class=AltLight align=left height="17">0%</TD>
        </TR>
      </TABLE>
      </DIV></TD>
      </TR></TABLE>
      </td>
      </tr>
      <tr><td width="100%"><p>
      MY OUTPUT:
      <td align="middle" class="AltWarningNoVline" height="17" width="10"></td>
      <td align="left" class="AltWarning" height="17">Missed</td>
      <td align="left" class="AltWarning" height="17"></td>
      <td align="left" class="AltWarning" height="17">2015-05-11-18.00</td>
      <td align="left" class="AltWarning" height="17"></td>
      <td align="left" class="AltWarning" height="17">DAILYBACKUP_6PM</td>
      <td align="left" class="AltWarning" height="17">ServerB</td>
      <td align="left" class="AltWarning" height="17">ST10_DOMAIN</td>
      <td align="middle" class="AltWarningNoVline" height="17" width="10"></td>
      <td align="left" class="AltWarning" height="17">Missed</td>
      <td align="left" class="AltWarning" height="17"></td>
      <td align="left" class="AltWarning" height="17">2015-05-11-18.00</td>
      <td align="left" class="AltWarning" height="17"></td>
      <td align="left" class="AltWarning" height="17">NJDLYBACKUP_6PM</td>
      <td align="left" class="AltWarning" height="17">ServerC</td>
      <td align="left" class="AltWarning" height="17">ST13_DOMAIN</td>

        I see what you are trying to do now. You want the first set of <td> elements to be separated from the second set (and any others that might happen to match the search term), correct? There are quite a few ways to do that. This way captures all matching <tr> and <td> elements and then uses the presence of a <tr> element to put the next set of <td> elements into a new anonymous array reference.
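The grouping idea described in this reply can be sketched as follows. This is not the reply author's code; it is a minimal illustration on a made-up three-row table, and it assumes `look_down` returns matches in document order so that each `tr` arrives before its own `td` children. The `@groups` name and the sample HTML are hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Data::Dumper;

# Hypothetical sample: one normal row followed by two warning rows.
my $html = <<'HTML';
<table>
<tr class="AltLight"><td class="AltLight">Completed</td><td class="AltLight">ServerA</td></tr>
<tr class="AltWarning"><td class="AltWarning">Missed</td><td class="AltWarning">ServerB</td></tr>
<tr class="AltWarning"><td class="AltWarning">Missed</td><td class="AltWarning">ServerC</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @groups;
for my $node ( $tree->look_down( _tag => qr/^t[rd]$/, class => qr/Alt(Warning|Error)/ ) ) {
    if ( $node->tag eq 'tr' ) {
        push @groups, [];    # a matching row starts a new group
    }
    elsif (@groups) {
        push @{ $groups[-1] }, $node->as_HTML;    # cells join the current group
    }
}
$tree->delete;

print Dumper( \@groups );
```

An alternative with the same effect is to call `look_down` for the matching `tr` elements first and then call `look_down( _tag => 'td' )` on each row, which avoids relying on traversal order. Run over the poster's full report, a loop like the one above would be expected to give a nested structure of the same shape as the dump that follows.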

        Output:

        $VAR1 = [
          [
            '<td align="middle" class="AltWarningNoVline" height="17" width="10"></td>',
            '<td align="left" class="AltWarning" height="17">Missed</td>',
            '<td align="left" class="AltWarning" height="17"></td>',
            '<td align="left" class="AltWarning" height="17">2015-05-11-18.00</td>',
            '<td align="left" class="AltWarning" height="17"></td>',
            '<td align="left" class="AltWarning" height="17">DAILYBACKUP_6PM</td>',
            '<td align="left" class="AltWarning" height="17">ServerB</td>',
            '<td align="left" class="AltWarning" height="17">ST10_DOMAIN</td>'
          ],
          [
            '<td align="middle" class="AltWarningNoVline" height="17" width="10"></td>',
            '<td align="left" class="AltWarning" height="17">Missed</td>',
            '<td align="left" class="AltWarning" height="17"></td>',
            '<td align="left" class="AltWarning" height="17">2015-05-11-18.00</td>',
            '<td align="left" class="AltWarning" height="17"></td>',
            '<td align="left" class="AltWarning" height="17">NJDLYBACKUP_6PM</td>',
            '<td align="left" class="AltWarning" height="17">ServerC</td>',
            '<td align="left" class="AltWarning" height="17">ST13_DOMAIN</td>'
          ]
        ];

        There is lots of room for improvement in the code that I wrote, but hopefully this works for you or at least helps you realize your goal.

        jeffa

        L-LL-L--L-LL-L--L-LL-L--
        -R--R-RR-R--R-RR-R--R-RR
        B--B--B--B--B--B--B--B--
        H---H---H---H---H---H---
        (the triplet paradiddle with high-hat)
        
Re: duplicate table with HTML::TreeBuilder look_down method
by GotToBTru (Prior) on May 13, 2015 at 21:11 UTC

    I don't see any duplicates there.

    What do you think you should be getting? Perhaps you could post part of your html file, enough to duplicate the problem.

    Dum Spiro Spero