Dear Monks

Update: sorry for being so rude. I did not know how to format my post the right way, so here it is again. Many, many thanks for your answers. I am happy that you did not ban me from this site! So now I have to retype it.

Could someone enlighten me? I thought I should contact you, since you are true experts.
I am working on a solution using HTML::Parser.
To be honest, I am a Perl novice, but I have to process a large number of files, which is surely a task for Perl.

I have a large number (more than 14,000) of HTML files in a folder. I want to read each HTML file and extract its content to a new text file. I am only interested in the content between the following words:

from "Hit" up to "Listed since" (see the example below).
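To make the "Hit ... Listed since" idea concrete, here is a minimal sketch of what I have in mind, assuming the tags are already stripped and the date after "Listed since" always consists of digits and dots. The name extract_record is only something I made up for illustration.

```perl
use strict;
use warnings;

# Return the text from "Hit" through the "Listed since: <date>" entry,
# or undef when either marker is missing.
sub extract_record {
    my ($text) = @_;
    # /s lets "." cross line breaks; ".*?" stops at the first "Listed since".
    return $text =~ /(Hit\b.*?Listed since:?\s*[\d.]+)/s ? $1 : undef;
}

my $sample = "menu stuff Hit 7 out of 120517 name 1 Listed since: 20.08.2002 footer";
print extract_record($sample), "\n";
# prints: Hit 7 out of 120517 name 1 Listed since: 20.08.2002
```

If the word "Hit" also shows up earlier in the page (in a menu, say), the match would start too early, so the pattern may need to be tightened to something like "Hit <number> out of".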

Note: each file (and dataset) includes a URL, shown in this format: href="http://myWeb.org/222237520031111" (a demo). Since this is, how to say, a synthetic (and formal) URL, it has to be "translated" into a real URL. How can we do this?

I want to create a .txt file (only one text file instead of many HTML files). I use the Perl module HTML::Parser. On my Linux machine the perl-HTML-Parser package 3.65-1.10 is installed.

Here is the code I have so far:
Here is a sample of the data, one of more than 14 thousand files that all look the same:

Note: there is some HTML around this that is not wanted and can be stripped out.
<br><br> <h2> Hit 7 out of 120517</h2> <img src="http://myweb.org/images/wappen/ni.gif" class="wappen_pos" width="45" height="53" alt="country" title="countryname" /><br> <div style="width: 40em;"><br> <div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://myWeb.org/222237520031111"></a></div><br> <div class="fm_linkeSpalte"><h2> name 1</h2><br> <span class="schulart_text">type: one (for example) </span> <p class="einzel_text">Adress: Paris, 3ne Boulevard Saint Lo <br /><br> Telefon:048 + 334555664 , Fax: 048 + 334555667 <br /> MyWeb-Nummer: 222237520031111 <br /> Webmaster: <a href="mailto: webmaster@demosite.fr" class="p1">master</a><br /></p> </div> <div> <p class="ta_left einzel_text"> </p></div> <br /><div><p class="ta_left einzel_text">Listed since: 20.08.2002</p></div> <br><br><br><br>
What I need:
1. Everything has to be written into one big text file.
2. The HTML tags have to be stripped.
3. Each file (and dataset) includes a URL, shown in this format: href="http://myWeb.org/222237520031111" (a demo).

Since this is, how to say, a synthetic (and formal) URL, it has to be "translated" into a real URL.

Question: How can we do this?
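To show where I am heading with the three points above, here is a rough sketch, not a finished solution. It uses the version 3 interface of HTML::Parser (the package I have installed). The names parse_record and convert_all, the pattern html/*.html and the output file all_records.txt are only my assumptions for illustration. The sketch just captures the myWeb.org link as it stands, because I do not know the rule for translating it into the real URL. Character encodings are not handled either.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Parse one HTML string; return the tag-free record text and the first
# myWeb.org link found in an <a href="...">.
sub parse_record {
    my ($html) = @_;
    my ($text, $url) = ('', undef);
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Collect entity-decoded text; the extra space keeps the words of
        # neighbouring elements apart.
        text_h  => [ sub { $text .= $_[0] . ' ' }, 'dtext' ],
        # Remember the first link of the form http://myWeb.org/<digits>.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'a' && defined $attr->{href};
            $url //= $attr->{href}
                if $attr->{href} =~ m{^http://myweb\.org/\d+$}i;
        }, 'tagname, attr' ],
    );
    $parser->parse($html);
    $parser->eof;
    $text =~ s/\s+/ /g;    # collapse runs of whitespace
    my ($record) = $text =~ /(Hit\b.*?Listed since:?\s*[\d.]+)/s;
    return ($record, $url);
}

# Append the record of every matching file to one text file.
sub convert_all {
    my ($pattern, $out_file) = @_;
    open my $out, '>', $out_file or die "Cannot write $out_file: $!";
    for my $file (sort glob $pattern) {
        open my $in, '<', $file or die "Cannot read $file: $!";
        my $html = do { local $/; <$in> };    # slurp the whole file
        close $in;
        my ($record, $url) = parse_record($html);
        next unless defined $record;          # skip files without markers
        print {$out} $record, "\n", 'URL: ', $url // 'n/a', "\n\n";
    }
    close $out or die "Cannot close $out_file: $!";
}

# Runs only with two arguments, e.g.: perl sketch.pl 'html/*.html' all_records.txt
convert_all(@ARGV) if @ARGV == 2;
```

Each record ends up as one line of text followed by a "URL:" line, so the real URL could be filled in later once the translation rule is known.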
If you need more explanation, or if I should describe in more depth what is needed, just ask.
Any and all help would be greatly appreciated.
I would love to hear from you!
Best regards and all the best to you,
Beginner1

In reply to Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1
