suzka has asked for the wisdom of the Perl Monks concerning the following question:

I have a simple script to convert HTML files to text, but I need to do this to many files across multiple directories. I would like to open the files, parse the text, and then save each file under a new name, so that star.html becomes star.txt. I've got the parsing part done, but I can't quite figure out a) how to handle multiple files and b) how to change the file extensions. I've read through this site, Programming Perl, and the Perl Cookbook, but I don't see how to apply my package again.
#!/usr/bin/perl -w
use lib "/export/home/s/ssorocza/myperl";

# parse HTML response
package MyParser;

open LOG, '>>out.txt';

use HTML::Parser;
use HTML::Entities qw(decode_entities);
@ISA = qw(HTML::Parser);

sub text {
    my ($self, $text) = @_;
    print LOG decode_entities($text);
}

package main;

while ( <> ) {
    MyParser->new->parse_file($ARGV);
}
Thank you. Suzi

Replies are listed 'Best First'.
Re: processing multiple files
by btrott (Parson) on Nov 14, 2000 at 05:38 UTC
    This should work:
    use HTML::Parser;
    use Symbol;

    my $TXT = gensym;
    my $parser = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { print $TXT @_ }, "dtext" ],
    );

    for my $html (@ARGV) {
        (my $txt = $html) =~ s/\.html$/.txt/;
        open $TXT, ">$txt" or die "Can't open $txt: $!";
        $parser->parse_file($html);
        close $TXT or die "Can't close $txt: $!";
    }
    Yes, I know: I changed a bunch of stuff around. But I find it easier to work with the new HTML::Parser API than the old one. In the new API you set up handlers for specific events, rather than subclassing and adding your own methods.

    So we create a new parser object and register an event to handle text (just like your text method) by using "text_h" (text handler). We give it a subref to run on that event, then a list of arguments to pass to our subref. "dtext" means text that's been passed through decode_entities, so we don't have to do that ourselves anymore.

    That subref is a closure referring to the $TXT handle, which we open and close each time through the loop. There may be a more elegant way to do this, but it seemed to work for me.
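The trick here is that the closure captures the handle *variable*, not any one filehandle value, so reopening the same variable each time through the loop redirects the handler's output. A minimal, self-contained sketch of just that behavior (core Perl only, using in-memory files instead of real ones; the names are made up for illustration):

```perl
use strict;
use warnings;

my $fh;                               # the sub below closes over this variable,
my $emit = sub { print {$fh} @_ };    # not over any one filehandle value

my ($first, $second) = ('', '');

open $fh, '>', \$first or die $!;     # in-memory file (perl 5.8+)
$emit->('star');
close $fh;

open $fh, '>', \$second or die $!;    # reopen the same variable...
$emit->('moon');                      # ...and the closure follows along
close $fh;

print "$first $second\n";             # prints "star moon"
```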

    I wanted to use the -i CL option, but you can't specify the name of the *new* file, only the name of the backup file. So we're stuck with looping through @ARGV and messing about manually with each of the files; but that's not so bad. Unless, of course, I've messed it up. :)

    Each time through the loop we grab the name of the file, then change the .html extension to .txt.
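The rename step on its own looks like this (filenames here are hypothetical; the `(my $txt = $html)` idiom copies the name first so the substitution leaves the original untouched):

```perl
use strict;
use warnings;

for my $html ('star.html', 'docs/index.html') {   # made-up example names
    (my $txt = $html) =~ s/\.html\z/.txt/;        # copy, then substitute
    print "$html -> $txt\n";                      # star.html -> star.txt, etc.
}
```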

    Save this script and run it like this:

    % ./foo.pl file1.html file2.html ...
RE: processing multiple files
by AgentM (Curate) on Nov 14, 2000 at 05:25 UTC
    Nice object-oriented design but I hope this is just a snippet since a sub alone in a package doesn't do much:o). Answers:
    • a) Look at opendir, readdir, and closedir. These functions let you read all of the filenames in a specific directory. With that info, you can easily create your new files with open(FILE,">myfile.txt"); which will create a new file if it doesn't already exist and nuke an existing file of that name (perhaps you should be checking for this- do you REALLY want to haphazardly overwrite files?)
    • b) Are you deleting the old files? All you really need to do here is use open appropriately, as shown above. You're not really changing the file extension, you're changing the entire name. If you are recursing through multiple directories, look into File::Find or File::Recurse and see which one meets your needs. Either will simplify your code and make it more portable.
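    For the recursing-through-directories part, a minimal sketch using the core File::Find module (the directory defaults here are assumptions; feed the resulting list to the parsing loop above):

```perl
use strict;
use warnings;
use File::Find;    # ships with perl

my @roots = @ARGV ? @ARGV : ('.');    # directories to walk, '.' by default
my @html;
find(sub { push @html, $File::Find::name if /\.html\z/ }, @roots);

print "$_\n" for @html;               # every .html file under the roots
```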
    have fun!
    AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.