
Hello WFSP! Hello roboticus! Hello dear community!

Many thanks for the quick reply, and many thanks to all the other posters as well.
I am very happy to be here. This is a great place to be!
It all sounds great. I am a beginner on Linux (I run OpenSuse 11.4 Milestone 1, and OpenSuse 11.3 on a second machine). WFSP and roboticus, your approaches look very impressive, and I want to try out both of them. One question comes to mind. Perhaps you have already answered it in the code you wrote and I have not seen it, since I am a complete newbie. Here is the question:

Where do I put the large number of HTML files that need to be parsed?

Do I have to name them in the script? How do I do that?
At the moment they are all in one folder (note: more than 10,000 files).
I want to read each HTML file, extract its content, and create a single new txt file with all the results. I am only interested in the content containing the above-mentioned words. WFSP and roboticus, everything you have written sounds very good.
Ah, yes: the anchor tag with the e-mail address is important too. I want to collect this e-mail address as well.
All the output should be written to one new text file only. It is important to have clean output, which means I need the text with line breaks.
WFSP, your approach seems to be great, and the output is exactly what I want.

This format (mentioned above) is great. It is the one I prefer:

Hit 7 out of 120517
name 1
type: one (for example)
Adress: Paris, 3ne Boulevard Saint Lo
Telefon:048 + 334555664 , Fax: 048 + 334555667
MyWeb-Nummer: 222237520031111
Webmaster:
master
Listed since: 20.08.2002

Superb! I need the results of the parsing written in the format above. All the results should be written to one text file only. That is important.

Again, the question (you can probably see that I am new to Linux too):

Where do I store the HTML files that need to be parsed, and

where are the results going to be written?

Do I have to write these locations into the code, both the place for the HTML files and the place where the results are stored?

BTW: on a Windows machine it would have to look something like the following, wouldn't it?

# single quotes keep the backslashes literal; in double quotes "\h" and "\o" would be treated as escapes
my $HTML_dir = 'C:\htmlperl';
my $output   = 'C:\htmlperl\output.txt';
my $file     = $ARGV[0];

or in general:

use File::Find::Rule;

# folder where the HTML files that need to be parsed are stored
my $html_dir = '/path/to/dir/with/html.files';
# fetch all .html files from that directory
my @html_files = File::Find::Rule->file->name('*.html')->in($html_dir);

for my $file (@html_files) {
    # parse the files
    # store all results that you got from the HTML files in only one txt file
}
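
To make the question concrete, here is a minimal sketch of how I imagine the whole script, under some assumptions of my own: the folder with the HTML files and the name of the output file are both passed on the command line, so neither location is written into the code. The names `collect_records`, `extract_record` and `parse_all.pl` are made up by me for illustration. `extract_record` is only a crude placeholder that strips tags; the real extraction would be the HTML::TreeBuilder code from your replies. I use only core modules here (`File::Glob`) so that it runs without installing anything.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);   # core module; copes with spaces in directory names

# Hypothetical placeholder: stands in for the real HTML::TreeBuilder-based
# extraction from the earlier replies. It only strips tags crudely.
sub extract_record {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;   # drop anything that looks like a tag
    $text =~ s/^\s+|\s+$//g;              # trim leading and trailing whitespace
    return $text;
}

# Read every *.html file in $html_dir and write one record per file
# into a single output file. Returns the number of files processed.
sub collect_records {
    my ($html_dir, $output) = @_;
    my @html_files = sort( bsd_glob("$html_dir/*.html") );
    open my $out, '>', $output or die "Cannot write $output: $!";
    my $hit = 0;
    for my $file (@html_files) {
        open my $in, '<', $file or die "Cannot read $file: $!";
        my $html = do { local $/; <$in> };   # slurp the whole file
        close $in;
        $hit++;
        print {$out} "Hit $hit out of ", scalar(@html_files), "\n";
        print {$out} extract_record($html), "\n\n";
    }
    close $out or die "Cannot close $output: $!";
    return $hit;
}

# Both locations come from the command line, so nothing is hard-coded:
#   perl parse_all.pl /path/to/dir/with/html.files output.txt
if (@ARGV == 2) {
    my $count = collect_records(@ARGV);
    print "Processed $count files\n";
}
```

If this is roughly right, then on Linux and on Windows only the two arguments would change, not the script itself. On Windows I would write the folder with forward slashes (for example `C:/htmlperl`), because the glob treats a backslash as an escape character.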

Sorry for the stupid newbie question! ;-) But I am very glad to have found such a superb place to ask all the questions I have in mind. This is a great place to learn! Many thanks to all of you!
Looking forward to hearing from you...

best regards
perlbeginner1

In reply to [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1
in thread Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1
