"Where to put the large number of HTML-files, that need to be parsed: Do i have to call them in the script? How to do that!?"

I assume you mean "Do I have to name them all in a script?" No, you don't. You can put them anywhere you like (but preferably not mixed in with the unrelated rest of your files) and use glob in your script to get a complete list of all those files, either in one directory or possibly even in adjacent directories:
    # all html files in one directory
    my @files = glob 'path/to/dir/*.html';

or

    # all html files in all (direct, sibling) subdirectories of a directory
    my @files = glob 'path/to/dir/*/*.html';
If you need an even more elaborate directory structure, then you can use File::Find or one of its derivatives to find the names of all html files, recursively.
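A minimal sketch of the File::Find approach, in case it helps. The sub name find_html_files is my own invention for illustration, and it assumes the files of interest all end in ".html":

```perl
use strict;
use warnings;
use File::Find;

# Return the paths of all *.html files below $root, at any depth.
# (find_html_files is a hypothetical helper name, not part of File::Find.)
sub find_html_files {
    my ($root) = @_;
    my @files;
    find(
        sub {
            # Inside the callback, $_ is the bare file name and
            # $File::Find::name is the path including $root.
            push @files, $File::Find::name if -f $_ && /\.html$/;
        },
        $root
    );
    @files = sort @files;
    return @files;
}
```

Called as find_html_files('path/to/dir'), this should give you the same kind of list as the glob calls, just without a limit on nesting depth.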
You then continue to parse each file, one at a time.
You can use a regexp substitution such as s/\.html$/.txt/ to produce the name for the text file, if you want to put it right beside the original file. If you want to put the new file in a different directory while preserving the directory structure, you can do a path substitution using abs2rel/rel2abs from File::Spec/File::Spec::Functions:
    use File::Spec::Functions qw(rel2abs abs2rel);
    my $txt = rel2abs(abs2rel($file, $htmlroot), $txtroot);   # relocate
    $txt =~ s/\.html$/.txt/;                                  # extension
If your directory tree is deep, you may have to create the target directory first, for example with mkpath from File::Path, before attempting to open the text file.
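Putting the relocation and the directory creation together could look roughly like this. It is only a sketch: the sub names txt_path_for and open_txt_for are made up for illustration, and $htmlroot and $txtroot are assumed to be the top directories of the source and target trees:

```perl
use strict;
use warnings;
use File::Spec::Functions qw(rel2abs abs2rel);
use File::Basename qw(dirname);
use File::Path qw(mkpath);

# Map an HTML path under $htmlroot to the matching .txt path under $txtroot.
# (hypothetical helper name)
sub txt_path_for {
    my ($file, $htmlroot, $txtroot) = @_;
    my $txt = rel2abs(abs2rel($file, $htmlroot), $txtroot);   # relocate
    $txt =~ s/\.html$/.txt/;                                  # extension
    return $txt;
}

# Create the target directory if it is missing, then open the text file
# for writing. Returns the file handle and the path. (hypothetical helper name)
sub open_txt_for {
    my ($file, $htmlroot, $txtroot) = @_;
    my $txt = txt_path_for($file, $htmlroot, $txtroot);
    my $dir = dirname($txt);
    mkpath($dir) unless -d $dir;
    open my $out, '>', $txt or die "Cannot write $txt: $!";
    return ($out, $txt);
}
```

You would call open_txt_for($file, $htmlroot, $txtroot) once per HTML file and print the extracted text to the returned handle.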
If you want all text files to be in one and the same directory, you can just use File::Basename's basename to strip the directory from the path.
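For the flat layout, a sketch along these lines might do. The sub name flat_txt_path and the $txtdir parameter are assumptions of mine, not anything from your script:

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec::Functions qw(catfile);

# Build the path of the text file inside a single target directory,
# dropping whatever directory the HTML file came from.
# (flat_txt_path is a hypothetical helper name)
sub flat_txt_path {
    my ($file, $txtdir) = @_;
    my $name = basename($file);      # strip the directory part
    $name =~ s/\.html$/.txt/;        # swap the extension
    return catfile($txtdir, $name);
}
```

Be aware that with this layout two HTML files with the same name in different directories end up with the same text file name, so the later one would overwrite the earlier one.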
In reply to Re^8: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
by bart
in thread Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
by Perlbeginner1