Where to put the large number of HTML files that need to be parsed:

Do I have to call them in the script? How to do that!?

I assume you mean "Do I have to name them all in a script?" No, you don't. You can put them anywhere you like (but preferably not mixed up with the rest of your unrelated files) and use glob in your script to get a complete list of all those files in one directory, or possibly even in adjacent directories:
# all html files in one directory
my @files = glob 'path/to/dir/*.html';
or
# all html files in all (direct, sibling) subdirectories of a directory
my @files = glob 'path/to/dir/*/*.html';

If you need an even more elaborate directory structure, then you can use File::Find or one of its derivatives to find the names of all html files, recursively.
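A minimal sketch of the recursive variant with the core File::Find module. The helper name find_html_files is made up for illustration, and the sketch assumes that every file ending in .html below the root should be picked up:

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: collect all *.html files below $root, at any depth.
sub find_html_files {
    my ($root) = @_;
    my @files;
    find(
        sub {
            # Inside the callback, $_ is the bare file name and
            # $File::Find::name is the path including the directory.
            push @files, $File::Find::name if -f $_ and /\.html$/;
        },
        $root,
    );
    my @sorted = sort @files;
    return @sorted;
}

# Usage (the path is a placeholder):
#   my @files = find_html_files('path/to/dir');
```

The result can be used in place of the @files list that glob produces.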

You then continue to parse each file, one at a time.

You can use a regexp substitution such as s/\.html$/.txt/ to produce the name for the text file, if you want to put it right beside the original file. If you want to put the new file in a different directory while preserving the directory structure, you can do a path substitution using abs2rel/rel2abs from File::Spec/File::Spec::Functions:

use File::Spec::Functions qw(rel2abs abs2rel);
my $txt = rel2abs(abs2rel($file, $htmlroot), $txtroot);  # relocate
$txt =~ s/\.html$/.txt/;                                 # extension

If your directory tree is deep, you may have to create the target directory first, for example with mkpath from File::Path, before attempting to open the text file.
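Put together, the relocation and the directory creation could look roughly like this. The helper name target_for is hypothetical, and the sketch assumes that $file really lies below $htmlroot:

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(mkpath);
use File::Spec::Functions qw(rel2abs abs2rel);

# Hypothetical helper: map an HTML path under $htmlroot to the matching
# .txt path under $txtroot, creating the target directory if necessary.
sub target_for {
    my ($file, $htmlroot, $txtroot) = @_;
    my $txt = rel2abs(abs2rel($file, $htmlroot), $txtroot);  # relocate
    $txt =~ s/\.html$/.txt/;                                 # extension
    my $dir = dirname($txt);
    mkpath($dir) unless -d $dir;   # creates missing parent directories too
    return $txt;
}

# Usage (paths are placeholders):
#   my $txt = target_for($file, $htmlroot, $txtroot);
#   open my $out, '>', $txt or die "Cannot write $txt: $!";
```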

If you want all text files to be in one and the same directory, you can just use File::Basename's basename to strip the directory from the path.
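As a sketch of that flat layout (flat_target_for is a made-up helper name, and $txtdir is assumed to exist already):

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec::Functions qw(catfile);

# Hypothetical helper: build the name of a text file that lives directly
# in $txtdir, whatever directory the HTML file came from.
sub flat_target_for {
    my ($file, $txtdir) = @_;
    my $name = basename($file, '.html');   # strip directory and .html suffix
    return catfile($txtdir, "$name.txt");
}

# Usage (paths are placeholders):
#   my $txt = flat_target_for($file, 'path/to/txt');
```

Note that HTML files with the same name in different source directories would map to the same text file here, so this layout only works if the base names are unique.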


In reply to Re^8: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by bart
in thread Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1
