Lady_Aleena has asked for the wisdom of the Perl Monks concerning the following question:
Your Mother and bart told me to write this up during a Chatterbox conversation about getting data out of HTML files by hand. I have looked at a lot of HTML parsers, but I can't seem to get a grip on how they work. I would like to parse these files into a CSV file plus a separate description file for each spell.
Note: get_data_file is a home-rolled subroutine I wrote to make it easier for me to get files from my data directory.
I can start the script easily enough...
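#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my @files;

# Called by find() once for every entry under the starting directory.
sub wanted {
    push @files, $_;
}

# The path contains spaces, so it has to be a quoted string.
find(\&wanted, 'C:/Documents and Settings/ME/My Documents/fantasy/Role_playing/Magic_items/Spell_scrolls');

for my $file (@files) {
}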
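One thing to watch in the starting script: find() changes directory as it walks, so $_ holds only the base name, and directories are pushed onto @files along with the files. A minimal sketch of a stricter collector follows. The name html_files_in is hypothetical, and the .html/.htm filter is an assumption about how the spell scroll files are named.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the paths of all HTML files under $dir.
# $File::Find::name is the starting directory plus the relative path,
# whereas $_ is only the base name of the current entry.
sub html_files_in {
    my ($dir) = @_;
    my @found;
    find(
        sub {
            return unless -f $_;           # skip directories
            return unless /\.html?\z/i;    # assumption: files end in .html or .htm
            push @found, $File::Find::name;
        },
        $dir,
    );
    my @sorted = sort @found;
    return @sorted;
}
```

Calling html_files_in on the Spell_scrolls directory would then give a list that can be opened from anywhere, not only from inside the directory being walked.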
The lines in the csv would be...
open(my $spell_csv, '>>', get_data_file('Role_playing', 'Spell_list.csv'))
    or die "Cannot open Spell_list.csv: $!";
# One spell per line, fields separated by "|".
print $spell_csv "$spell_name|$school|$level|$range|$duration|$area_of_effect|$components|$casting_time|$saving_throw|$note\n";
push @spell_list, $spell_name;
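Because the fields are joined with "|", a missing value or a stray "|" inside a field would shift the columns. Here is a rough sketch of a row builder that guards against both. The name spell_row is hypothetical, and replacing "|" and line breaks with a space is only one possible policy, assumed here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build one pipe-delimited row from a list of fields.
# Undefined fields become empty strings. Any "|" or line break inside a
# field is replaced with a single space (an assumed policy) so that it
# cannot split the row.
sub spell_row {
    my @fields = @_;
    for my $field (@fields) {
        $field = '' unless defined $field;
        $field =~ s/[|\r\n]+/ /g;
    }
    return join('|', @fields) . "\n";
}
```

With this in place the print line could become print $spell_csv spell_row($spell_name, $school, $level, ...), passing the same ten fields in the same order.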
The description that follows those fields, with all of its HTML included, would be written into a separate .txt file for each spell. If any line in the description begins with the word Note, that note should go into the .csv file.
open(my $spell_description, '>', get_data_file('Role_playing/Spell_descriptions', "$spell_name.txt"))
    or die "Cannot open $spell_name.txt: $!";
print $spell_description $description;
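Pulling out the lines that begin with the word Note could look something like the sketch below. The name split_notes is hypothetical. It assumes the description is plain lines separated by newlines, that a note line starts with "Note" after optional whitespace (not after an HTML tag), and that note lines are taken out of the description once they are captured. Any of those may need adjusting for the real files.

```perl
use strict;
use warnings;

# Hypothetical helper: separate "Note" lines from the rest of a description.
# Returns two strings: the note texts joined by a space, and the remaining
# lines joined by newlines.
sub split_notes {
    my ($description) = @_;
    my (@notes, @rest);
    for my $line (split /\n/, $description) {
        # \b keeps words such as "Notes" or "Noted" from matching.
        if ($line =~ /^\s*Note\b[:\s]*(.*)/) {
            push @notes, $1;
        }
        else {
            push @rest, $line;
        }
    }
    return (join(' ', @notes), join("\n", @rest));
}
```

The first return value would fill $note for the CSV row, and the second would be what gets printed to the description file.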
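After all of that is created, get the name of the original HTML file and create a .pl file with the same base name in the same directory.
use File::Basename;

# $file is the HTML file being processed; $0 would be this script's own name.
my $html_file = basename($file);
my $pl_file = $html_file;
$pl_file =~ s!html$!pl!;

# Quote each spell name with q{} so the generated call is valid Perl.
my $quoted_names = join(',', map { sprintf 'q{%s}', $_ } @spell_list);

open(my $new_pl_file, '>', $pl_file)
    or die "Cannot open $pl_file: $!";
print $new_pl_file q{#!/usr/bin/perl
use strict;
use warnings;
use lib "C:/Documents and Settings/ME/My Documents/fantasy/files/perl/lib";
use RolePlaying::SpellList qw(print_spell_scroll);

print_spell_scroll(} . $quoted_names . q{);
};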
Once the files are parsed and the new Perl files are created, delete the HTML files.
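Since deleting is the one step that cannot be undone, it may be worth checking that the replacement exists first. A small sketch, with the hypothetical name remove_if_converted, and assuming the .pl file sits next to the .html file under the same base name:

```perl
use strict;
use warnings;

# Hypothetical helper: delete an HTML file only if a non-empty .pl file
# with the same base name already exists beside it.
# Returns 1 if the HTML file was deleted, 0 otherwise.
sub remove_if_converted {
    my ($html_file) = @_;
    (my $pl_file = $html_file) =~ s/\.html\z/.pl/
        or return 0;                  # not an .html file, leave it alone
    return 0 unless -s $pl_file;      # no usable .pl file yet, keep the HTML
    unlink($html_file) or do {
        warn "Could not delete $html_file: $!";
        return 0;
    };
    return 1;
}
```

Running this over the list of HTML files after the conversion loop would leave behind any file whose .pl counterpart was never written.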
Re: Parsing HTML into various files
by bart (Canon) on Aug 25, 2010 at 00:21 UTC
by Lady_Aleena (Priest) on Aug 25, 2010 at 03:01 UTC
by psini (Deacon) on Aug 25, 2010 at 08:48 UTC
by bart (Canon) on Aug 25, 2010 at 10:00 UTC
by Lady_Aleena (Priest) on Aug 25, 2010 at 17:56 UTC
by bart (Canon) on Aug 25, 2010 at 18:43 UTC
by psini (Deacon) on Aug 25, 2010 at 18:09 UTC
Re: Parsing HTML into various files
by wfsp (Abbot) on Aug 25, 2010 at 10:08 UTC
by Lady_Aleena (Priest) on Aug 25, 2010 at 18:25 UTC
by wfsp (Abbot) on Aug 26, 2010 at 09:25 UTC