Hello WFSP, hello roboticus, hello dear community!
Many thanks for the quick reply, and many thanks to all the other posters as well.
I am very happy to be here. This is a great place to be!
It all sounds great. I am a beginner on Linux (I run OpenSuse 11.4 milestone1, and OpenSuse 11.3 on a second machine). WFSP and roboticus, your approaches look very impressive, and I want to try out both of them. One question comes to mind; perhaps you have already answered it in the code you posted and I have not seen it, since I am a complete newbie. Here is the question:
Where do I put the large number of HTML files that need to be parsed?
Do I have to name them in the script? How do I do that?
At the moment they are all in one folder (note: more than 10,000 files).
I want to read and extract the content of each HTML file and create a single new txt file with all the results. I'm only interested in the content containing the words mentioned above.
WFSP (and roboticus), everything you have written sounds very good and I am convinced.
Ah, yes: the anchor tag with the e-mail address is important too. I want to collect this e-mail address as well.
All the output should be written to only one new text file.
It is important to have clean output. That means I need the text with line breaks.
WFSP, your approach seems to be great, and the output is exactly what I want.
The format from your post is great, and it is the one I prefer:
Hit 7 out of 120517
name 1
type: one (for example)
Adress: Paris, 3ne Boulevard Saint Lo
Telefon:048 + 334555664 , Fax: 048 + 334555667
MyWeb-Nummer: 222237520031111
Webmaster:
master
Listed since: 20.08.2002
Superb! I need the results of the parsing written in the format shown above. All the results should be
written to only one text file. That is important.
Again, the question (you can probably see that I am new to Linux too):
Where do I store the HTML files that need to be parsed? And
where are the results going to be written to?
Do I have to write these locations into the code, both the folder with the HTML files and the place where the results are stored?
BTW, on a Windows machine it would have to look something like the following, wouldn't it? (I use single quotes so that the backslashes in the paths are kept literally.)
my $HTML_dir = 'C:\htmlperl';
my $output = 'C:\htmlperl\output.txt';
my $file = $ARGV[0];
or in general:
use File::Find::Rule;

# folder where the HTML files that need to be parsed are stored
my $html_dir = '/path/to/dir/with/html.files';
# fetch all .html files from the directory
my @html_files = File::Find::Rule->file->name('*.html')->in($html_dir);
for my $file (@html_files) {
    # parse the files
    # store all results from the HTML files in only one txt file
}
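To make the question more concrete, here is a minimal sketch of how I imagine the two locations being passed on the command line instead of being written into the code. It is only a guess at the overall shape, not the approach from WFSP or roboticus. The names `extract_emails` and `collect_into_one_file` are hypothetical, the regex is just a stand-in for a real HTML parser, and I use the core `opendir`/`readdir` functions here rather than File::Find::Rule so that the sketch runs without extra modules (it does not descend into subfolders).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: collect the addresses of all mailto: anchors
# in one HTML string. A regex stand-in, not a real HTML parser.
sub extract_emails {
    my ($html) = @_;
    my @emails = $html =~ /href\s*=\s*["']mailto:([^"'?]+)/gi;
    return @emails;
}

# Read every *.html file in $html_dir (no subfolders) and write one
# block per file into the single text file $output.
# Returns the number of HTML files processed.
sub collect_into_one_file {
    my ($html_dir, $output) = @_;

    opendir my $dh, $html_dir or die "Cannot open directory $html_dir: $!";
    my @names = sort grep { /\.html\z/i } readdir $dh;
    closedir $dh;

    open my $out, '>', $output or die "Cannot write $output: $!";
    my $total = @names;
    my $count = 0;
    for my $name (@names) {
        my $path = File::Spec->catfile($html_dir, $name);
        open my $in, '<', $path or die "Cannot read $path: $!";
        my $html = do { local $/; <$in> };    # slurp the whole file
        close $in;

        $count++;
        print {$out} "File $count of $total: $name\n";
        print {$out} "E-mail: $_\n" for extract_emails($html);
        print {$out} "\n";                    # blank line between records
    }
    close $out or die "Cannot close $output: $!";
    return $total;
}

# Only runs when both locations are given on the command line.
if (@ARGV == 2) {
    my ($html_dir, $output) = @ARGV;
    my $n = collect_into_one_file($html_dir, $output);
    print "Processed $n files, results written to $output\n";
}
elsif (@ARGV) {
    die "Usage: $0 HTML_DIR OUTPUT_FILE\n";
}
```

If this is roughly right, I would call it as `perl collect.pl /path/to/dir/with/html.files output.txt` (the script name `collect.pl` is made up), and on Windows I would pass the `C:\htmlperl` paths in the same two positions.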
Sorry for the stupid newbie question! ;-) But I am very glad to have found such a superb place to ask all the questions that I have in mind. This is a great place to learn. Many thanks to all of you!
Looking forward to hearing from you.
best regards
perlbeginner1