html to hash table

phantom85 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am working with WWW::Mechanize to submit a form and get a class schedule, and I have extracted the part of the HTML I am interested in. My problem now is that I want to put the data in a hash table and I don't know where to start. Following is the code I have so far; I print the result to a file just to test whether it returns the correct data.

use WWW::Mechanize qw();
use IO::Socket::SSL qw();
use HTML::TreeBuilder;
use 5.10.0;
use strict;
use warnings;

my $mech =  WWW::Mechanize->new(ssl_opts => {
    SSL_verify_mode => IO::Socket::SSL::SSL_VERIFY_NONE,
    verify_hostname => 0, });
my $url = "scheduleurl";

$mech->get($url);
my $filename = 'out.htm';
my $result = $mech->submit_form(
form_number => 2,
fields =>
{
    "ctl00\$ContentPlaceHolder1\$TermDDL" => 2171,
    "ctl00\$ContentPlaceHolder1\$ClassSubject" => 'CS',
}
,button => "ctl00\$ContentPlaceHolder1\$SearchButton"
);
# submit_form() above already submits the form, so no separate $mech->submit() is needed
#print $result->content();
open(my $fhandle, '>', $filename) or die "Could not open file '$filename' $!";
my $tree = HTML::TreeBuilder->new_from_content($result->content);

if (my $div = $tree->look_down(_tag => "div", id => "class_list")) {
    #print $div->as_text(), "\n";
    #say $fhandle $div->as_HTML(), "\n";

    # find() takes tag names only; look_down() is the one that takes attr => value pairs
    my @list = $div->find('ol');
    #print Dumper \@list;    # needs "use Data::Dumper;"
    foreach (@list) {
        say $fhandle $_->as_HTML();
    }
}
close $fhandle;
$tree->delete();

So the script prints this to the file. I have pasted just one item in the list, but there are multiple items with the same format.

<ol>
<li><span class="ClassTitle"><strong>CS 128</strong></span> Section 01
   <table border="0" cellpadding="5" cellspacing="0" class="GridView" id="ClassDetails_TBL" width="99%">
      <tr>
         <th align="right" id="TableHeaderCell8" nowrap>Class Nbr</th>
            <td id="TableCell13">11647</td>
         <th align="right" id="TableHeaderCell9" nowrap>Capacity</th>
            <td id="TableCell14">30</td>
     </tr>
     <tr>
         <th align="right" id="TableCell5" nowrap>Title</th>
            <td class="tablealtstyle" id="TableCell8">Introduction to C++</td>
         <th align="right" id="TableCell8a" nowrap>Units</th>
            <td class="tablealtstyle" id="TableCell9">4</td>
    </tr>
    <tr>
        <th align="right" id="TableCell11" nowrap>Time</th>
           <td id="TableCell1">3:00 PM&ndash;4:50 PM&nbsp;&nbsp;&nbsp;TuTh</td>
       <th align="right" id="TableCell15">Building/Room</th>
          <td id="TableCell2">8 52</td>
   </tr>
 </table>
</li>

<li></li>
...

So I want to put that data into a hash table, with the class titles as keys and the class information as values.

{
    'CS128 Section 01' => {
        'Class Nbr' => 11647,
        'Capacity'  => 30,
        'Title'     => 'Introduction to C++',
        'Units'     => 4,
        'Time'      => '3:00PM- 4:50PM TuTh',
        'Room'      => '8 52',
    },
}


Replies are listed 'Best First'.
Re: html to hash table by Corion (Patriarch) on Oct 30, 2016 at 06:51 UTC
If your data is all in a table, HTML::TableExtract could be what you need. If there is other data, or the data is not all tabular, using an HTML parser together with XPath expressions has proven to be a good approach. See for example HTML::TreeBuilder(::LibXML) together with HTML::Selector::XPath.
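A minimal sketch of the HTML::TableExtract route, assuming the module is installed and that each row comes back as alternating header and data cells (th, td, th, td). The sample markup is trimmed from the question, and the `%classes` name is made up for illustration. Because the class title sits outside the table, this sketch keys the result by class number instead.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Sample markup trimmed from the question; in the real script this would be
# the page content returned by WWW::Mechanize.
my $html = <<'HTML';
<ol><li><span class="ClassTitle"><strong>CS 128</strong></span> Section 01
<table id="ClassDetails_TBL">
  <tr><th>Class Nbr</th><td>11647</td><th>Capacity</th><td>30</td></tr>
  <tr><th>Title</th><td>Introduction to C++</td><th>Units</th><td>4</td></tr>
</table></li></ol>
HTML

# Match every table carrying this id attribute (the page reuses the id per class).
my $te = HTML::TableExtract->new( attribs => { id => 'ClassDetails_TBL' } );
$te->parse($html);
$te->eof;

my %classes;    # hypothetical name; keyed by class number
for my $table ( $te->tables ) {
    my %info;
    for my $row ( $table->rows ) {
        # Copy the cells, replacing missing ones with '' and trimming whitespace.
        my @cells = map { defined $_ ? $_ : '' } @$row;
        s/^\s+|\s+$//g for @cells;

        # Consume the row two cells at a time: label, then value.
        while ( my ( $key, $val ) = splice @cells, 0, 2 ) {
            $info{$key} = $val;
        }
    }
    $classes{ $info{'Class Nbr'} } = \%info if defined $info{'Class Nbr'};
}

print "$_: $classes{$_}{Title}\n" for sort keys %classes;
```

If the title is needed as the key, a tree or XPath based parser is probably the better fit, since it can reach the span next to each table.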
Re: html to hash table by duyet (Friar) on Oct 30, 2016 at 09:29 UTC
Based on the content of your @list (this goes inside your if block, after @list is built):

use Data::Dumper;

my $data = {};
foreach my $ol ( @list ) {
    # Each <li> is one class; skip the empty <li></li> entries.
    foreach my $item ( $ol->look_down( _tag => 'li' ) ) {
        my $span = $item->look_down( _tag => 'span' ) or next;
        my $title = $span->as_trimmed_text() . $span->right();

        # Search within this item only, not the whole $tree.
        for my $row ( $item->look_down( _tag => q{tr} ) ) {
            my @keys = $row->look_down( _tag => q{th} );
            my @vals = $row->look_down( _tag => q{td} );
            for ( my $i = 0; $i < scalar @keys; $i++ ) {
                my $key = $keys[ $i ]->as_trimmed_text();
                my $val = $vals[ $i ]->as_trimmed_text();
                $data->{ $title }{ $key } = $val;
            }
        }
    }
}
print 'data: ' . Dumper( $data );

Result:

data: $VAR1 = {
          'CS 128 Section 01 ' => {
                                    'Capacity' => '30',
                                    'Building/Room' => '8 52',
                                    'Title' => 'Introduction to C++',
                                    'Time' => "3:00 PM\x{2013}4:50 PM\x{a0}\x{a0}\x{a0}TuTh",
                                    'Class Nbr' => '11647',
                                    'Units' => '4'
                                  }
        };

Look at HTML::Element for more info.
Re^2: html to hash table by Anonymous Monk on Oct 30, 2016 at 22:59 UTC
Thank you, it works. One more question: if I want to access 'Time' for the 'CS 128 Section 01' class, how would I do that?
Re^3: html to hash table by duyet (Friar) on Oct 31, 2016 at 11:38 UTC
print $data->{'CS 128 Section 01 '}{Time};

The key has to match exactly, including the trailing space shown in the dump above. If you have more data you can loop through it:

foreach my $class ( keys %{ $data } ) {
    print $data->{ $class }{Time};
    # do something else with other items ...
}
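Since the lookup depends on the exact key, here is a small sketch of one way to make it less fragile. It trims the keys once and then normalizes the Time value for printing. The `$data` literal below is hypothetical sample data in the same shape as the dump in this thread, and `%clean` is a name made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical data in the same shape as the Dumper output in this thread.
my $data = {
    'CS 128 Section 01 ' => {    # note the trailing space in the key
        'Class Nbr' => '11647',
        'Time'      => "3:00 PM\x{2013}4:50 PM\x{a0}\x{a0}\x{a0}TuTh",
    },
};

# Rebuild the hash with trimmed keys so lookups do not depend on stray whitespace.
my %clean;
for my $key ( keys %$data ) {
    ( my $trimmed = $key ) =~ s/^\s+|\s+$//g;
    $clean{$trimmed} = $data->{$key};
}

my $class = 'CS 128 Section 01';
my $time  = exists $clean{$class} ? $clean{$class}{Time} : '';

# Work on the copy in $time; the stored value is left untouched.
$time =~ s/\x{a0}+/ /g;     # run of non-breaking spaces -> one plain space
$time =~ s/\x{2013}/-/g;    # en dash -> ASCII hyphen

print "$class meets at $time\n";
```

Checking with exists first avoids autovivifying an empty entry for a class that is not in the hash.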
Re: html to hash table by perl-diddler (Chaplain) on Oct 30, 2016 at 22:41 UTC
It seems you want "actions" to be called based on HTML elements. I don't know how well it works, but HTML::Parser (H::P) has options to call your "callout function" for the opening and closing tags of specific HTML elements, or all of them. When you specify the callout functions, you tell H::P what you want passed to your function. For example, I wanted to see the start/stop/text and non-parsed text (DATA/javascript), so I specified functions for each (mayka is an anon-sub creation routine that included some routine error checks and such):

$p->parser(HTML::Parser->new(
    "api_version" => 3,
    start_h   => [ mayka($p, 6, start_h => \&_start),
                   "tag,skipped_text,attr,attrseq,line,text" ],
    end_h     => [ mayka($p, 4, end_h => \&_end),
                   "tagname,skipped_text,line,offset_end" ],
    text_h    => [ mayka($p, 3, text_h => \&_text),
                   "tag,skipped_text,text" ],
    default_h => [ mayka($p, 3, default_h => \&_dflt),
                   "event, skipped_text, text" ],
    marked_sections => 1,
));

The last parameter of each handler specified what elements I wanted passed to my function, so for tags that had class labels, I could store and nest them.

Note: I may easily be missing some functionality in WWW::Mechanize, but I didn't see the ability to process class or ID values when they started and ended. It does get a bit hairy trying to keep track of them, since nested elements preempt and assign class+id's to children, and when those elements end, the class+id revert to whatever was in place before you encountered that element (i.e. you need to maintain a stack). Hope this helps.
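To make the callback and stack idea concrete, here is a simplified sketch using plain HTML::Parser handlers. It is not the poster's mayka-based setup: the handler names (`on_start`, `on_text`, `on_end`) and the sample markup are assumptions for illustration, and the stack tracks only tag names rather than class and id values.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = <<'HTML';
<table id="ClassDetails_TBL">
  <tr><th>Class Nbr</th><td>11647</td><th>Capacity</th><td>30</td></tr>
  <tr><th>Title</th><td>Introduction to C++</td><th>Units</th><td>4</td></tr>
</table>
HTML

my @stack;        # names of the currently open elements
my %info;         # label => value pairs collected from the table
my $key;          # label from the most recent <th>, waiting for its <td>
my $text = '';    # text collected inside the current cell

sub on_start {
    my ($tag) = @_;
    push @stack, $tag;
    $text = '' if $tag eq 'th' || $tag eq 'td';    # a new cell starts fresh
}

sub on_text {
    my ($chunk) = @_;
    # Collect text only while some enclosing element is a table cell.
    $text .= $chunk if grep { $_ eq 'th' || $_ eq 'td' } @stack;
}

sub on_end {
    my ($tag) = @_;
    return unless grep { $_ eq $tag } @stack;    # ignore stray end tags
    # Pop back to the matching open tag; this also drops unclosed children.
    while (@stack) { last if pop(@stack) eq $tag }

    ( my $value = $text ) =~ s/^\s+|\s+$//g;
    if    ( $tag eq 'th' ) { $key = $value }
    elsif ( $tag eq 'td' && defined $key ) { $info{$key} = $value; undef $key }
}

my $parser = HTML::Parser->new(
    api_version => 3,
    # The argspec string after each handler selects what gets passed to it.
    start_h => [ \&on_start, 'tagname' ],
    text_h  => [ \&on_text,  'dtext' ],
    end_h   => [ \&on_end,   'tagname' ],
);
$parser->parse($html);
$parser->eof;

print "$_ => $info{$_}\n" for sort keys %info;
```

Tracking class or id values as described above would mean pushing those attributes onto the stack alongside the tag name, which the `attr` argspec makes available to the start handler.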