Your Mother and bart told me to write this up during a conversation in the Chatterbox about getting data out of HTML files by hand. I have looked at a lot of the HTML parsers out there, but I can't get a grip on how they work. I would like to parse these files into one CSV file plus a separate description file for each spell.

Note: get_data_file is a home-rolled subroutine I wrote to make it easier for me to get files from my data directory.

I can start the script easily enough...

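#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my @files;

# Collect the full path of every HTML file under the directory.
sub wanted {
    push @files, $File::Find::name if -f && /\.html?$/i;
}

find(\&wanted, 'C:/Documents and Settings/ME/My Documents/fantasy/Role_playing/Magic_items/Spell_scrolls');

for my $file (@files) {
    # parse each file here
}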

The lines in the csv would be...

# Append one pipe-delimited line per spell.
open(my $spell_csv, '>>', get_data_file('Role_playing', 'Spell_list.csv'))
    or die "Can't open Spell_list.csv: $!";
print $spell_csv "$spell_name|$school|$level|$range|$duration|$area_of_effect|$components|$casting_time|$saving_throw|$note\n";
push @spell_list, $spell_name;
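One way the line for the csv might be put together, as a minimal sketch. The sub name spell_row is made up for illustration, and turning a literal pipe into the entity &#124; is only an assumption about how to keep a stray | in a field from breaking the columns.

```perl
use strict;
use warnings;

# Hypothetical helper: build one pipe-delimited line from a list of fields.
sub spell_row {
    my @fields = @_;
    for my $field (@fields) {
        $field = '' unless defined $field;   # a missing value becomes an empty column
        $field =~ s/\s+/ /g;                 # collapse newlines and runs of spaces
        $field =~ s/^\s+|\s+$//g;            # trim both ends
        $field =~ s/\|/&#124;/g;             # assumption: encode literal pipes as an entity
    }
    return join('|', @fields) . "\n";
}
```

It would be used as print $spell_csv spell_row($spell_name, $school, $level, $range, $duration, $area_of_effect, $components, $casting_time, $saving_throw, $note);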

The description that follows those fields in each HTML file, with all of its HTML left in, would be written to a separate .txt file for each spell. If any line in the description begins with the word Note, put that note in the .csv file.

# One description file per spell, named after the spell.
open(my $spell_description, '>', get_data_file('Role_playing/Spell_descriptions', "$spell_name.txt"))
    or die "Can't open $spell_name.txt: $!";
print $spell_description $description;
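Picking the Note lines out of the description could look something like this. It is only a sketch: extract_notes is a made-up name, and it assumes a note sits on its own line and may have tags such as <p> or <b> in front of the word Note.

```perl
use strict;
use warnings;

# Hypothetical helper: return every line of the description that begins
# with the word Note, with the HTML tags stripped out of it.
sub extract_notes {
    my ($description) = @_;
    my @notes;
    for my $line (split /\n/, $description) {
        # Assumption: opening tags may come before the word Note.
        if ($line =~ /^\s*(?:<[^>]+>\s*)*(Note\b.*)$/) {
            my $note = $1;
            $note =~ s/<[^>]+>//g;       # drop any remaining tags
            $note =~ s/^\s+|\s+$//g;     # trim both ends
            push @notes, $note;
        }
    }
    return @notes;
}
```

The result could then go into the csv with something like my $note = join(' ', extract_notes($description));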

After that is all created, get the name of the original file and create a .pl file with the same name in the same directory.

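# $file is the HTML file from the loop; $0 would be this script's own name.
my $html_file = $file;
my $pl_file = $html_file;
$pl_file =~ s!html$!pl!;
open(my $new_pl_file, '>', $pl_file) or die "Can't open $pl_file: $!";

# Spell names are quoted so the generated script compiles under strict.
print $new_pl_file q{#!/usr/bin/perl
use strict;
use warnings;
use lib "C:/Documents and Settings/ME/My Documents/fantasy/files/perl/lib";
use RolePlaying::SpellList qw(print_spell_scroll);

print_spell_scroll(} . join(',', map { qq{"\Q$_\E"} } @spell_list) . q{);
};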

Once the files are parsed and the new Perl files created, delete the HTML files.
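Since deleting is the one step that can't be undone, it might be worth checking that the new files really exist first. A rough sketch, where delete_if_converted is a made-up name and "exists and is not empty" is only an assumed test for a good conversion:

```perl
use strict;
use warnings;

# Hypothetical helper: delete the HTML file only when every output file
# exists and is not empty. Returns 1 if the file was deleted, 0 if it was kept.
sub delete_if_converted {
    my ($html_file, @outputs) = @_;
    for my $output (@outputs) {
        return 0 unless -s $output;   # missing or zero-length output: keep the HTML
    }
    unless (unlink $html_file) {
        warn "Could not delete $html_file: $!\n";
        return 0;
    }
    return 1;
}
```

It would be called with the HTML file first and then the files made from it, such as the new .pl file and the description .txt file.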

Have a cookie and a very nice day!
Lady Aleena

In reply to Parsing HTML into various files by Lady_Aleena
