Hello WFSP! Hello roboticus! Hello dear community!


Many thanks for the quick reply, and many thanks to all the other posters as well.
I am very happy to be here. This is a great place to be!
All of this sounds great. I am a beginner on Linux (I run OpenSuse 11.4 Milestone 1, and OpenSuse 11.3 on a second machine). WFSP and roboticus, both of your approaches look very impressive, and I want to try out both. One question comes to mind; perhaps you have already answered it in the code you wrote and I have not seen it, since I am a complete newbie. Here is the question:

Where do I put the large number of HTML files that need to be parsed?

Do I have to name them in the script? How do I do that?
At the moment they are all in one folder (note: more than 10,000 files).
I want to read and extract the content of each HTML file and create a single new txt file with all the results. I am only interested in the content containing the above-mentioned words.
Ah, yes: the anchor tag with the e-mail address is important too. I want to collect this e-mail address as well.
All the output should be written to only one new text file. It is important to have clean output, which means I need the text with line breaks.
WFSP, your approach seems to be great, and the output is exactly what I want.

This above-mentioned format is great, and it is the one I prefer:

Hit 7 out of 120517
name 1
type: one (for example)
Adress: Paris, 3ne Boulevard Saint Lo
Telefon:048 + 334555664 , Fax: 048 + 334555667
MyWeb-Nummer: 222237520031111
Webmaster:
master
Listed since: 20.08.2002


Superb! I need the parsing results written in the format shown above. All the results should be written to only one text file. That is important.

Again, the questions (you can probably see I am new to Linux too):
  • Where do I store the HTML files that need to be parsed?
  • Where are the results going to be written to?

  • Do I have to write these locations into the code, including the place where the results are stored?

    BTW, on a Windows machine it would have to look something like the following, wouldn't it?

    my $HTML_dir = 'C:\htmlperl';               # single quotes keep the backslashes literal
    my $output   = 'C:\htmlperl\output.txt';
    my $file     = $ARGV[0];


    or in general:
    use File::Find::Rule;

    # folder where the HTML files that need to be parsed are stored
    my $html_dir = '/path/to/dir/with/html.files';

    # fetch all .html files from the directory
    my @html_files = File::Find::Rule->file->name('*.html')->in($html_dir);

    for my $file (@html_files) {
        # parse the files
        # store all results that you got from the HTML files in only one txt file
    }
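
To make the question more concrete, here is a minimal sketch of how I imagine the two locations being passed in, assuming the input folder and the output file are given on the command line. The name `collect_html_files` and the extractor callback are hypothetical placeholders of mine, not part of either approach posted in this thread, and the sketch uses only core modules (`opendir`/`readdir`) instead of File::Find::Rule. The crude tag-stripping extractor only stands in for the real parsing step.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: read every *.html file in $html_dir, run the
# $extract callback on its content, and write all results to ONE file.
sub collect_html_files {
    my ($html_dir, $output, $extract) = @_;

    opendir(my $dh, $html_dir) or die "Cannot open $html_dir: $!";
    my @html_files = sort
        grep { /\.html?\z/i && -f File::Spec->catfile($html_dir, $_) }
        readdir($dh);
    closedir($dh);

    open(my $out, '>', $output) or die "Cannot write $output: $!";
    my $count = 0;
    for my $name (@html_files) {
        my $path = File::Spec->catfile($html_dir, $name);
        open(my $in, '<', $path) or die "Cannot read $path: $!";
        my $html = do { local $/; <$in> };    # slurp the whole file
        close($in);
        $html = '' unless defined $html;

        # one block per file, separated by a blank line
        print {$out} $extract->($html), "\n\n";
        $count++;
    }
    close($out) or die "Cannot close $output: $!";
    return $count;
}

# Usage (assumed): perl collect.pl /path/to/html/dir /path/to/output.txt
if (@ARGV == 2) {
    my ($html_dir, $output) = @ARGV;
    my $count = collect_html_files($html_dir, $output, sub {
        my ($html) = @_;
        # Placeholder only: the real parsing would go here.
        (my $text = $html) =~ s/<[^>]+>//g;
        return $text;
    });
    print "Wrote $count results to $output\n";
}
```

If this is roughly right, nothing would have to be hard-coded: the same script could be pointed at any folder and any output file.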

    Sorry for the newbie question! ;-) I am very glad to have found such a superb place to ask all the questions I have in mind. This is a great place to learn! Many thanks to all of you!
    Looking forward to hearing from you.

    best regards
    perlbeginner1

    In reply to [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1
    in thread Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1
