Data-Parsing: parsing a huge number of files

Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I want to parse a bunch of HTML files that are stored on my computer and extract a certain set of data from each one: a Perl task. This is the information I want to gather:

country: countryname
name: myname
School-type: Type one
Adress: 20000 New York, Broadway 16
Telefon: 053333052-9899-0, Fax: 053333052-9899-55
index-number: 26666932002
Webmaster: Linus Thorwald
site registered at: 08.03.2010
Website:

The website URL can be rebuilt from the index-number; see the HTML here (more below):

<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div>

I have to extract the index-number and append it to the short URL http://www.the_search_site.org/ (here: 26666932002). How should I proceed to gather the results mentioned above? Below is the shortened HTML of one result:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">


<!-- einzelergebnis.html?Id=26666932002&treffer=2139&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=&kategorie=&region=de&trefferzahlauswahl=alle&trefferzahl=10517&list_anfang=0&sort=-->
<title>result-title: MyName, New York </title>
<img src="Countryname" title="Countryname" />
<div style="width: 40em;">
<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div>
<div class="fm_linkeSpalte"><h2>My name</h2>
<span class="schulart_text">School-type:  Type one</span>
<p class="einzel_text">Adress: 20000 New York, Broadway 16
<br />
   Telefon: 053333052-9899-0, Fax: 053333052-9899-55
   <br />
  index-number:  26666932002   <br />
  Webmaster:  <a href="mailto: webmaster@the-site.com" class="p1">Linus Thorwald</a><br /></p>                  </div>
        <div>
        <p class="ta_left einzel_text">
                </p></div>
<br /><div><p class="ta_left einzel_text">registered at: 08.03.2010</p></div>
    </div>
</div>
</div>
</div>

<!-- einzelergebnis.html?Id=26666932002&treffer=2139&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=&kategorie=&region=de&trefferzahlauswahl=alle&trefferzahl=10517&list_anfang=0&sort=-->
</html>
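A minimal sketch of how the fields listed above could be pulled out of one such page with regular expressions. The helper name `extract_record`, the hash keys, and the patterns are assumptions tailored to the shortened sample, not a tested solution for the real pages. Regexes are brittle against markup changes, so a real HTML parser from CPAN (for example HTML::TreeBuilder) would likely be more robust.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Base of the short URL, as given in the post.
my $BASE_URL = 'http://www.the_search_site.org/';

# Each entry: [ output key => regex with one capture group ].
# The keys are hypothetical names; the labels come from the sample page.
my @FIELDS = (
    [ country      => qr{<img[^>]*\btitle="([^"]*)"}i ],
    [ name         => qr{<h2>\s*(.*?)\s*</h2>}is ],
    [ school_type  => qr{School-type:\s*([^<]+)}i ],
    [ address      => qr{Adress:\s*([^<]+)}i ],
    [ telefon      => qr{Telefon:\s*([^,<]+)}i ],
    [ fax          => qr{Fax:\s*([^<\s]+)}i ],
    [ index_number => qr{index-number:\s*(\d+)}i ],
    [ webmaster    => qr{Webmaster:\s*<a[^>]*>\s*([^<]+)</a>}i ],
    [ registered   => qr{registered at:\s*([\d.]+)}i ],
);

# Return a hash reference holding every field found in $html.
sub extract_record {
    my ($html) = @_;
    my %rec;
    for my $field (@FIELDS) {
        my ($key, $re) = @$field;
        next unless $html =~ $re;
        my $value = $1;
        $value =~ s/\s+/ /g;         # collapse line breaks and runs of spaces
        $value =~ s/^\s+|\s+$//g;    # trim both ends
        $rec{$key} = $value;
    }
    # Rebuild the short URL from the index number, if one was found.
    $rec{url} = $BASE_URL . $rec{index_number} if defined $rec{index_number};
    return \%rec;
}

# Shortened sample, modelled on the page in the post.
my $sample = <<'HTML';
<img src="" title="Countryname" />
<div class="fm_linkeSpalte"><h2>My name</h2>
<span class="schulart_text">School-type:  Type one</span>
<p class="einzel_text">Adress: 20000 New York, Broadway 16
<br />
   Telefon: 053333052-9899-0, Fax: 053333052-9899-55
   <br />
  index-number:  26666932002   <br />
  Webmaster:  <a href="mailto: webmaster@the-site.com" class="p1">Linus Thorwald</a><br /></p>
<p class="ta_left einzel_text">registered at: 08.03.2010</p>
HTML

# Print the fields in the order they are declared, then the rebuilt URL.
my $rec = extract_record($sample);
for my $field (@FIELDS, ['url']) {
    my $key = $field->[0];
    printf "%-13s %s\n", "$key:", defined $rec->{$key} ? $rec->{$key} : '';
}
```

Keeping the patterns in one table means a label that is spelled differently on the real pages only needs a change in one place.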


Replies are listed 'Best First'.
Re: Data-Parsing: parsing a huge number of files by graff (Chancellor) on Sep 22, 2010 at 23:50 UTC
How about if you reply to your own node here, to give us a fresh start on the problem. Please try to structure your next post like this:

Here is a brief sample of my input data:

(brief sample of data, with at least one example of each thing to be captured)

Here is a brief sample of what I want my perl script to produce as output:

(brief sample, with "# comments" if needed, of what you want to create)

Here is the perl code I've tried so far:

#!/usr/bin/perl
use strict;
use warnings;
# whatever you've got in terms of perl code...

Organize your thoughts so that you can pose a clear question.
Re2: Data-Parsing: parsing a huge number of files by Perlbeginner1 (Scribe) on Sep 23, 2010 at 10:50 UTC
I've got 25 thousand files, all stored in one folder. Each page contains addresses (see below), and each data set has a unique ID number.

The first task is to take all 25 thousand HTML files and parse out the address sets they contain. This is a Perl task, sure thing! Here is one data set:

Name: Mister Miller
Adresse:
Telefon:
Fax:
ID-Nummer: 2210202
Mail-Adress: Mister_Miller@hotmail.com
Website: short url: http://www.TheWEBsite.org/ID-Number (here: 2210202)

The second task can also be done with Perl. The last line of each address set holds a short URL that is built from two pieces: http://www.TheWEBsite.org/ plus the ID number (here 2210202). To rebuild the original URL I have to put the two pieces together and request it:

short url: http://www.TheWEBsite.org/ID-Number

How should I do this second task? I look forward to hearing from you.
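One possible shape for the second task, as a sketch under assumptions: every file in the folder ends in .html or .htm, and the ID number follows a label spelled either "ID-Nummer:" or "index-number:", as in the two posts. The helper names (`id_from_html`, `short_url`, `urls_for_dir`, `fetch_page`) are hypothetical. `fetch_page` uses the core module HTTP::Tiny to request a rebuilt URL and is only defined here, not called.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;    # core module since Perl 5.14

# Base of the short URL, as given in the post.
my $BASE_URL = 'http://www.TheWEBsite.org/';

# Return the first ID number found in the text of one page, or undef.
# The two label spellings are taken from the posts; adjust to the real pages.
sub id_from_html {
    my ($html) = @_;
    return $html =~ /(?:ID-Nummer|index-number):\s*(\d+)/i ? $1 : undef;
}

# Put the two pieces of the short URL together.
sub short_url {
    my ($id) = @_;
    return $BASE_URL . $id;
}

# Read every .html/.htm file in $dir and return { file name => short URL }.
sub urls_for_dir {
    my ($dir) = @_;
    my %url_for;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    for my $file (sort grep { /\.html?\z/i } readdir $dh) {
        my $fh;
        unless (open $fh, '<', "$dir/$file") {
            warn "Skipping $file: $!\n";
            next;
        }
        my $html = do { local $/; <$fh> };    # slurp the whole file
        close $fh;
        my $id = id_from_html($html);
        if (defined $id) {
            $url_for{$file} = short_url($id);
        }
        else {
            warn "No ID number found in $file\n";
        }
    }
    closedir $dh;
    return \%url_for;
}

# Request one rebuilt URL; returns the page body, or undef on failure.
sub fetch_page {
    my ($url) = @_;
    my $res = HTTP::Tiny->new(timeout => 30)->get($url);
    return $res->{success} ? $res->{content} : undef;
}

# Usage: perl script.pl /path/to/folder
if (@ARGV && -d $ARGV[0]) {
    my $urls = urls_for_dir($ARGV[0]);
    print "$_\t$urls->{$_}\n" for sort keys %$urls;
}
```

With 25 thousand files it is probably worth writing the file name and URL list to disk first and fetching in a separate pass, pausing between requests so the remote site is not flooded.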