Parsing HTML into various files

Lady_Aleena has asked for the wisdom of the Perl Monks concerning the following question:

Your Mother and bart told me to write this up during a conversation in the Chatterbox after talking about getting data manually out of HTML files. I have looked at a lot of HTML parsers out there, but I can't seem to get a grip on how they work. I would like to parse these files into a csv file and separate description files.

Note: get_data_file is a home-rolled subroutine I wrote to make it easier to get files from my data directory.

I can start the script easily enough...

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;

my @files;
sub wanted {
  # $_ is only the file name; $File::Find::name holds the full path.
  push @files, $File::Find::name if -f;
}
find(\&wanted, 'C:/Documents and Settings/ME/My Documents/fantasy/Role_playing/Magic_items/Spell_scrolls');

for my $file (@files) {
}

The lines in the csv would be...

open(my $spell_csv, '>>', get_data_file('Role_playing','Spell_list.csv'))
  or die "Can't open Spell_list.csv: $!";
# End each record with a newline so every spell gets its own line.
print $spell_csv "$spell_name|$school|$level|$range|$duration|$area_of_effect|$components|$casting_time|$saving_throw|$note\n";
push @spell_list, $spell_name;

The description that follows all of that would be written, with all of its HTML included, into a separate .txt file for each spell. If any lines in the description begin with the word Note, the note goes into the .csv file.

open(my $spell_description, '>', get_data_file('Role_playing/Spell_descriptions',"$spell_name.txt"))
  or die "Can't open description file for $spell_name: $!";
print $spell_description $description;
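The rule about lines beginning with the word Note could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name split_notes and the sample description are made up, and it assumes each note sits on a single line that starts with the literal word Note rather than with an HTML tag.

```perl
use strict;
use warnings;

# Hypothetical helper: separate lines that begin with "Note" from the rest
# of a description. Assumes each note occupies exactly one line.
sub split_notes {
    my ($description) = @_;
    my (@notes, @body);
    for my $line (split /\n/, $description) {
        if ($line =~ /^\s*Note\b/) {
            push @notes, $line;    # destined for the .csv file
        }
        else {
            push @body, $line;     # stays in the .txt description
        }
    }
    return (join(' ', @notes), join("\n", @body));
}

# Made-up sample text, not taken from the real spell files.
my $sample = "<p>By means of this spell...</p>\nNote: a made-up sample note.";
my ($note, $body) = split_notes($sample);
print "note: $note\n";
print "body: $body\n";
```

If a note can contain the | character, it would need escaping or replacing before it goes into the pipe-delimited file.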

After that is all created, get the name of the original file and create a .pl file with the same name in the same directory.

my $html_file = $file;   # the HTML file being processed ($0 would be this script itself)
my $pl_file   = $html_file;
   $pl_file   =~ s!html$!pl!;

open(my $new_pl_file, '>', $pl_file) or die "Can't write $pl_file: $!";
print $new_pl_file q{#!/usr/bin/perl
use strict;
use warnings;

use lib "C:/Documents and Settings/ME/My Documents/fantasy/files/perl/lib";
use RolePlaying::SpellList qw(print_spell_scroll);

print_spell_scroll(}

# Quote each name so the generated call is valid Perl.
.join(',', map { qq{"$_"} } @spell_list).

q{);};

Once the files are parsed and the new Perl files are created, delete the HTML files.
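A cautious way to do the deletion might look like the sketch below. It is not part of the original script: the helper name delete_if_converted is hypothetical, and it assumes the generated .pl file sits next to the HTML file under the same base name.

```perl
use strict;
use warnings;

# Hypothetical helper: delete an HTML file only when a non-empty .pl file
# with the same base name already exists beside it.
sub delete_if_converted {
    my ($html_file) = @_;
    (my $pl_file = $html_file) =~ s/html$/pl/;
    return 0 unless -s $pl_file;    # no generated script yet: keep the HTML
    unlink $html_file
        or warn "Could not delete $html_file: $!\n";
    return !-e $html_file;          # true only if the file is really gone
}
```

Calling it once per entry of the file list, after all output has been written, keeps any HTML file whose conversion did not finish.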

Have a cookie and a very nice day!

Lady Aleena

Replies are listed 'Best First'.
Re: Parsing HTML into various files by bart (Canon) on Aug 25, 2010 at 00:21 UTC
Here's a bit of code that parses one of your HTML files into a hash of hashes, as an example. I used HTML::TokeParser::Simple because I like how it gives me one token (a start tag, end tag or piece of text) at a time, just like one would read one line at a time from a text file.

Now the code itself might look somewhat confusing because I've interwoven the loop of getting the next token with a conditional using `..`, which neatly allows me to extract multiple consecutive tokens from the HTML, for example between a start tag and its associated end tag. That won't work as neatly if you have nested tags of the same type, for example nested `div`s or `table`s; in that case, you would have to count how deep the nesting is to decide whether you have reached the end of it. But luckily that isn't the case here.

The total code is 40-50 lines long, which isn't that bad, I suppose. Enjoy.
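The technique described here, a token loop combined with the `..` range operator, might look something like this minimal sketch. It is not bart's code: the sample HTML and the variable names are made up, and it only collects the text of each `td` cell.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Made-up sample; the real spell files are larger.
my $html = '<table><tr><td>Range:</td><td>Touch</td></tr></table>';

my $parser = HTML::TokeParser::Simple->new(string => $html)
    or die "Could not create a parser";

my @cells;
my $text = '';
while (my $token = $parser->get_token) {
    # The range operator turns true at <td> and stays true through </td>.
    if ($token->is_start_tag('td') .. $token->is_end_tag('td')) {
        $text .= $token->as_is if $token->is_text;
        if ($token->is_end_tag('td')) {
            push @cells, $text;    # one finished cell
            $text = '';
        }
    }
}
print join('|', @cells), "\n";
```

As the reply points out, this only holds up while `td` elements are not nested inside one another; with nested tables a depth counter would be needed instead.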
Re^2: Parsing HTML into various files by Lady_Aleena (Priest) on Aug 25, 2010 at 03:01 UTC
Quick question: do `use strict;` and putting in `my %hash;` change the basic makeup of the script? I am getting the following error after those two changes:

Can't call method "get_token" on an undefined value at C:\Documents and Settings\ME\My Documents\fantasy\files\perl\parser.pl line 11.

Line 11 in my copy of the script is `$parser->get_token('table');` Yes, I expanded the variables in the script. :)

*Have a cookie and a very nice day!*
Lady Aleena
Re^3: Parsing HTML into various files by psini (Deacon) on Aug 25, 2010 at 08:48 UTC
From the error message it looks like $parser is undef. So it is probably the previous line

my $parser = HTML::TokeParser::Simple->new(...);

which fails. Check whether $parser is defined and whether the file name is valid; I don't think that HTML::TokeParser::Simple->new returns an error message, so the most likely cause is an invalid file name.

Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."
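The checks suggested here could be sketched as below, using only Perl's built-in file tests. The helper name input_file_problem and the default file name spells.html are hypothetical; the same idea applies to whatever path is handed to the parser's constructor, whose return value should also be tested before get_token is called on it.

```perl
use strict;
use warnings;

# Hypothetical helper: say why a file cannot be used as parser input.
# Returns an empty string when the file looks usable.
sub input_file_problem {
    my ($file) = @_;
    return "no file name given"          unless defined $file && length $file;
    return "'$file' does not exist"      unless -e $file;
    return "'$file' is not a plain file" unless -f _;
    return "'$file' is not readable"     unless -r _;
    return '';
}

# spells.html is a made-up default for illustration.
my $file = @ARGV ? $ARGV[0] : 'spells.html';
if (my $problem = input_file_problem($file)) {
    warn "Cannot parse: $problem\n";
}
else {
    print "'$file' looks usable\n";
}
```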
Re^3: Parsing HTML into various files by bart (Canon) on Aug 25, 2010 at 10:00 UTC
psini is absolutely right: adding `use strict` and declaring all variables will not change the workings of the script at all. So the only explanation I can think of is that it can't read the file. BTW, in my case I downloaded the file from the URL and put it right next to the script. Did you forget that? If the file is elsewhere, you have to adjust the file path.
Re^4: Parsing HTML into various files by Lady_Aleena (Priest) on Aug 25, 2010 at 17:56 UTC
Re^5: Parsing HTML into various files by bart (Canon) on Aug 25, 2010 at 18:43 UTC
Re^5: Parsing HTML into various files by psini (Deacon) on Aug 25, 2010 at 18:09 UTC
Re: Parsing HTML into various files by wfsp (Abbot) on Aug 25, 2010 at 10:08 UTC
For comparison, this uses HTML::TreeBuilder. You have nine groups of seven rows, so we process seven rows at a time, loading the data into an AoH. I think it reads fairly well, and maintaining it ought to be relatively straightforward if your HTML changes.

#! /usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;

my $file_name = q{la.html};
my $t = HTML::TreeBuilder->new_from_file($file_name)
  or die qq{cant build tree from $file_name: $!};

my @trs = $t->look_down(_tag => q{tr});

my @db;
while (@trs){
  my @fields = splice(@trs, 0, 7);
  my %rec;
  $rec{group} = $fields[0]->as_text;
  $rec{type}  = $fields[1]->as_text;

  my @tds;
  @tds = $fields[2]->look_down(_tag => q{td});
  $rec{level} = $tds[1]->as_text;

  for my $field (3..5){
    @tds = $fields[$field]->look_down(_tag => q{td});
    $rec{$tds[0]->as_text} = $tds[1]->as_text;
    $rec{$tds[2]->as_text} = $tds[3]->as_text;
  }

  $rec{note} = $fields[6]->as_text;
  push @db, \%rec;
}
print Dumper \@db;

An extract of the output, with the note field shortened for brevity:

$VAR1 = [
  {
    'Saving Throw:' => 'None',
    'Casting Time:' => '1',
    'Area of Effect:' => ' 10 ft.×10 ft./level path',
    'Range:' => 'Touch',
    'Duration:' => '3 rds. + 2 rds./level',
    'note' => ' By means of this spell... or mica.',
    'Components:' => 'V, S, M',
    'group' => 'Detect Illusion',
    'level' => '1',
    'type' => '(Divination)(Mentalism)'
  },
  {
    'Saving Throw:' => 'Special',
    'Casting Time:' => 'Special',
    'Area of Effect:' => 'Script reader',
    'Range:' => 'Touch',
    'Duration:' => '1 day/level',
    'note' => ' This spell enables the ....',
    'Components:' => 'V, S, M',
    'group' => 'Illusionary Script',
    'level' => '3',
    'type' => '(Illusion/Phantasm)'
  },
  # ... <snipped>
];

Oh, and btw, 38 lines. :-)
Re^2: Parsing HTML into various files by Lady_Aleena (Priest) on Aug 25, 2010 at 18:25 UTC
For some reason, I am getting the following error. (I checked the file name this time.)

Can't call method "as_text" on an undefined value at C:\..\perl\treebuilder.pl line 29.

Line 29:

$rec{$tds[2]->as_text} = $tds[3]->as_text;

Sorry I couldn't get it to work right away.

Update: Wait, I think I see what might be making things hinky.

Update 2: I was working with the wrong batch of files, but even working with the right batch of files is causing the same error. I think it has something to do with the nested tables in some of the descriptions. For files without the nested tables, this works fine. I am thinking that the following should go first, with some way of having the tables within it written into the string it creates.

$rec{note} = $fields[6]->as_text; #put nested tables in this one.
$rec{group} = $fields[0]->as_text;
$rec{type} = $fields[1]->as_text;

*Have a cookie and a very nice day!*
Lady Aleena
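One possible way around the nested tables, offered as an untested guess rather than as a fix from the thread, is to collect only the rows that belong to an outermost table, so that rows inside a description's nested table do not throw off the groups of seven. The sample HTML below is made up; look_down, look_up and parent are HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample: two outer rows, the second holding a nested table.
my $html = <<'HTML';
<table>
  <tr><td>Outer row 1</td></tr>
  <tr><td>Outer row 2
    <table><tr><td>Nested row</td></tr></table>
  </td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Keep only rows whose nearest enclosing table is not itself inside a table.
my @top_rows = $tree->look_down(
    _tag => 'tr',
    sub {
        my $table  = $_[0]->look_up(_tag => 'table') or return 0;
        my $parent = $table->parent or return 1;
        return !$parent->look_up(_tag => 'table');
    },
);

my @all_rows = $tree->look_down(_tag => 'tr');
printf "%d rows in total, %d at the top level\n",
    scalar @all_rows, scalar @top_rows;

$tree->delete;
```

Calling as_text on an outer row still returns the text of any table nested inside it, so the note field would keep that content, though without its markup.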
Re^3: Parsing HTML into various files by wfsp (Abbot) on Aug 26, 2010 at 09:25 UTC
Ah, I only looked at the first file; I didn't realise some of the others had nested tables. It should be fairly straightforward to accommodate them. I may not have time to look at it today and I'm away for the weekend. I should be able to get back to it on Tuesday.