
Dear Monks

Update: sorry for being so rude, but I did not know how to format a post correctly. So here it is again, with many, many thanks for your answers. I am happy that you did not ban me from this site! So now I have retyped it.

Could someone enlighten me? I thought I should ask here, since you are the true experts.
I am working on a solution that uses HTML::Parser.
To be honest, I am a Perl novice, but I have to process a large number of files, which is surely a task for Perl.

I have a large number (more than 14,000) of HTML files in a folder. I want to read each HTML file and extract its content to a new text file. I am only interested in the content that lies between the following words:

from "Hit" up to "Listed since". See the example below!

Note: each file (and dataset) includes a URL in this format: href="http://myWeb.org/222237520031111" (a demo). Since this is, so to speak, a synthetic (and formal) URL, it has to be "translated" into a real URL. How can we do this?

I want to create a single .txt file (one text file instead of many HTML files). I use the Perl module HTML::Parser. On my Linux machine the Perl-HTML-Parser package 3.65-1.10 is installed.
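A minimal sketch of how the extraction described above might look with HTML::Parser, offered as a starting point rather than a finished solution. The sub names (`html_to_text`, `extract_record`, `write_all`) are made up for illustration, and the sketch assumes that every record starts with "Hit" followed by a number, that the "Listed since" line ends at a line break, and that the files are UTF-8 encoded.

```perl
use strict;
use warnings;
use HTML::Parser;

# Strip all tags from an HTML string and return the plain text.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the text with entities already decoded
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $text;
}

# Keep only the part from "Hit <number>" up to the end of the
# "Listed since:" line. Returns undef (in scalar context) if not found.
sub extract_record {
    my ($html) = @_;
    my $text = html_to_text($html);
    return unless $text =~ /(Hit\s+\d+.*?Listed since:[^\n]*)/s;
    my $record = $1;
    $record =~ s/[ \t]+/ /g;       # collapse runs of blanks
    $record =~ s/\s*\n\s*/\n/g;    # drop empty lines and indentation
    return $record;
}

# Append one record per input file to a single output file.
# ASSUMPTION: the HTML files are UTF-8; adjust the layers if they are not.
sub write_all {
    my ($out_path, @html_files) = @_;
    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";
    for my $file (@html_files) {
        open my $in, '<:encoding(UTF-8)', $file
            or die "Cannot read $file: $!";
        my $html = do { local $/; <$in> };   # slurp the whole file
        close $in;
        my $record = extract_record($html);
        if (defined $record) {
            print {$out} $record, "\n\n";    # blank line between records
        }
        else {
            warn "No record found in $file\n";
        }
    }
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

It could be called as `write_all('all_records.txt', glob('html/*.html'));`, where both the output name and the folder are placeholders.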

Here is the code I have so far:
Here is an example of the data, one of more than 14 thousand files; they all look the same!

Note: there is some HTML around this that is not wanted and can be stripped out.

<br><br>
<h2> Hit 7 out of 120517</h2>
<img src="http://myweb.org/images/wappen/ni.gif" class="wappen_pos" width="45" height="53" alt="country" title="countryname" /><br>
<div style="width: 40em;"><br>
<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://myWeb.org/222237520031111"></a></div><br>
<div class="fm_linkeSpalte"><h2>
name 1</h2><br>
<span class="schulart_text">type: one (for example) </span>
<p class="einzel_text">Adress: Paris, 3ne Boulevard Saint Lo
<br /><br>
   Telefon:048 + 334555664  , Fax: 048 + 334555667
   <br />
   MyWeb-Nummer:  222237520031111   <br />
   Webmaster:  <a href="mailto: webmaster@demosite.fr" class="p1">master</a><br /></p>                  </div>
        <div>
        <p class="ta_left einzel_text">
                </p></div>
<br /><div><p class="ta_left einzel_text">Listed since: 20.08.2002</p></div>
   
<br><br><br><br>

What I need:
1. Everything should be written into one big text file.
2. The HTML tags have to be stripped.
3. Each file (and dataset) includes a URL in this format: href="http://myWeb.org/222237520031111" (a demo).

Since this is, so to speak, a synthetic (and formal) URL, it has to be "translated" into a real URL.
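A minimal sketch of how the synthetic URL and its number might be pulled out of one file, assuming the link always has the form shown in the sample. It uses a plain regular expression on the raw HTML instead of a parser, which is only reasonable because the format looks fixed; the sub name `find_myweb_link` is made up for illustration. How the number maps to a real URL is still the open question, so that step is not attempted here.

```perl
use strict;
use warnings;

# Return (url, id) for the first myWeb.org link in the HTML,
# or an empty list if there is none.
# ASSUMPTION: the link is always href="http://myWeb.org/<digits>".
sub find_myweb_link {
    my ($html) = @_;
    # /i because the sample spells the host both myWeb.org and myweb.org
    if ($html =~ m{href="(http://myweb\.org/(\d+))"}i) {
        return ($1, $2);
    }
    return;
}
```

The same number also appears in the "MyWeb-Nummer:" line of the sample, so it could be cross-checked against the text record.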

Question: how can we do this?
If you need more explanation, or if I should describe in more depth what is needed, just ask.
Any and all help would be greatly appreciated.
I would love to hear from you!
Best regards and all the best to you,
Beginner1

In reply to Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1
