Dear Monks
Update: sorry for being so rude. I did not know how to format the post correctly, so I have retyped it here. Many, many thanks for your answers. I am happy that you did not ban me from this site!
Could someone enlighten me?
I thought I should ask you, since you are the true experts.
I am working on a solution using HTML::Parser.
To be honest, I am a Perl novice, but I have to process a large number of files, which is surely a task for Perl.
I have a large number (more than 14,000) of HTML files in a folder. I want to read each HTML file and extract its content to a new text file. I am only interested in the content between the following words:
from "Hit" up to "Listed since". See the example below.
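A minimal sketch of this extraction step, using only core Perl and a regular expression rather than HTML::Parser. The function name extract_record and the shortened sample are made up for illustration, and the pattern assumes every file contains exactly one "Hit N out of M" heading followed later by a "Listed since:" line, as in the example further down.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text from the "Hit N out of M" heading
# up to the end of the "Listed since: ..." value, or undef if a marker
# is missing.
sub extract_record {
    my ($html) = @_;
    # /s lets "." match newlines; the non-greedy .*? stops at the first
    # "Listed since:", and [^<]* takes the date up to the next tag.
    return unless $html =~ /(Hit\s+\d+\s+out\s+of\s+\d+.*?Listed\s+since:[^<]*)/s;
    return $1;
}

# Shortened sample (an assumption, modelled on the record shown in this post).
my $sample = <<'HTML';
<p>unwanted header</p>
<h2> Hit 7 out of 120517</h2>
<p class="einzel_text">Adress: Paris</p>
<p class="ta_left einzel_text">Listed since: 20.08.2002</p>
<p>unwanted footer</p>
HTML

my $record = extract_record($sample);
print "$record\n" if defined $record;
```

The returned string still contains the HTML tags between the two markers, so tag stripping remains a separate step.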
Note: each file (and record) includes a URL in this format: href="http://myWeb.org/222237520031111" (a demo).
This is, so to speak, a synthetic (formal) URL that has to be "translated" into a real URL.
How can we do this?
I want to create one .txt file (a single text file instead of many HTML files).
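One possible shape for the "many files in, one text file out" part, sketched with core Perl only. The function name merge_html_files, the "== filename ==" separator and the command-line usage are assumptions for illustration; the place where extraction and tag stripping would go is marked with a comment.

```perl
use strict;
use warnings;

# Hypothetical helper: append the contents of every .htm/.html file in
# $dir to the single output file $out. Returns the number of files read.
sub merge_html_files {
    my ($dir, $out) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @files = sort grep { /\.html?$/i } readdir $dh;
    closedir $dh;

    open(my $out_fh, '>', $out) or die "Cannot write $out: $!";
    for my $name (@files) {
        open(my $in_fh, '<', "$dir/$name") or die "Cannot read $dir/$name: $!";
        local $/;                # slurp mode: read the whole file at once
        my $html = <$in_fh>;
        close $in_fh;
        # Placeholder: extraction and tag stripping would happen here.
        print {$out_fh} "== $name ==\n$html\n";
    }
    close $out_fh or die "Cannot close $out: $!";
    return scalar @files;
}

# Usage (paths are examples): perl merge.pl /path/to/html_folder all_records.txt
if (@ARGV == 2) {
    my $count = merge_html_files(@ARGV);
    print "Merged $count files into $ARGV[1]\n";
}
```

Reading one file at a time keeps memory use small even with more than 14,000 input files, since only the current file is held in memory.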
I use the Perl module HTML::Parser. On my Linux machine, the perl-HTML-Parser package 3.65-1.10 is installed.
Here is the code I have so far:
Here is a sample record, one of more than 14 thousand; they all look the same.
Note: there is some HTML around this that is not wanted and can be stripped out.
<br><br>
<h2> Hit 7 out of 120517</h2>
<img src="http://myweb.org/images/wappen/ni.gif" class="wappen_pos" width="45" height="53" alt="country" title="countryname" /><br>
<div style="width: 40em;"><br>
<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://myWeb.org/222237520031111"></a></div><br>
<div class="fm_linkeSpalte"><h2>
name 1</h2><br>
<span class="schulart_text">type: one (for example) </span>
<p class="einzel_text">Adress: Paris, 3ne Boulevard Saint Lo
<br /><br>
Telefon:048 + 334555664 , Fax: 048 + 334555667
<br />
MyWeb-Nummer: 222237520031111 <br />
Webmaster: <a href="mailto: webmaster@demosite.fr" class="p1">master</a><br /></p> </div>
<div>
<p class="ta_left einzel_text">
</p></div>
<br /><div><p class="ta_left einzel_text">Listed since: 20.08.2002</p></div>
<br><br><br><br>
Again, the HTML around this record is not wanted and can be stripped out.
What I need:
1. Everything has to be written into one big text file.
2. The HTML tags have to be stripped.
3. Each file (and record) includes a URL in this format: href="http://myWeb.org/222237520031111" (a demo).
This is, so to speak, a synthetic (formal) URL that has to be "translated" into a real URL.
Question: how can we do this?
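A rough sketch for points 2 and 3, using core Perl and regular expressions as a shortcut; HTML::Parser would be the more robust tool for real-world HTML. The function names extract_url and strip_tags are made up for illustration. How the placeholder URL maps to a "real" URL is not specified here, so the sketch only pulls the href value out of the record.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first href of the form
# http://myWeb.org/<digits>, or undef if there is none.
sub extract_url {
    my ($html) = @_;
    return $html =~ m{href="(http://myWeb\.org/\d+)"}i ? $1 : undef;
}

# Hypothetical helper: crude tag stripper. Replaces every <...> tag
# with a space, then collapses whitespace. Assumes no ">" inside
# attribute values, which holds for the sample record.
sub strip_tags {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;
    $text =~ s/\s+/ /g;          # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;     # trim both ends
    return $text;
}

# Shortened sample taken from the record shown in this post.
my $sample = <<'HTML';
<div class="logo_homepage"><a class="img_inl" href="http://myWeb.org/222237520031111"></a></div>
<h2>
name 1</h2>
<span class="schulart_text">type: one (for example) </span>
HTML

my $url = extract_url($sample);
print "URL:  $url\n" if defined $url;
print "Text: ", strip_tags($sample), "\n";
```

Because the URL sits inside a tag attribute, it has to be captured before the tags are stripped; otherwise it is lost together with the markup.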
If you need more explanation, or if I should describe in more depth what is needed, just ask.
Any and all help would be greatly appreciated.
I would love to hear from you!
Best regards and all the best to you,
Beginner1