Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I want to parse a bunch of HTML files that are stored on my computer and extract a certain set of data from each one: a task for Perl. This is the information I want to gather:

country: countryname
name: myname
School-type: Type one
Adress: 20000 New York, Broadway 16
Telefon: 053333052-9899-0, Fax: 053333052-9899-55
index-number: 26666932002
Webmaster: Linus Thorwald
site registered at: 08.03.2010
Website:

I can also rebuild a URL from the index-number. See this part of the HTML (the full sample is further below):
<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div>
I have to extract the index-number (here: 26666932002) and append it to the short URL http://www.the_search_site.org/. How should I proceed to gather the results listed above? Below is the shortened HTML of one result:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<!-- einzelergebnis.html?Id=26666932002&treffer=2139&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=&kategorie=&region=de&trefferzahlauswahl=alle&trefferzahl=10517&list_anfang=0&sort= >
<title>result-title: MyName, New York </title>
<img src=""Contryname" title="Contryname" />
<div style="width: 40em;">
<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div>
<div class="fm_linkeSpalte"><h2>My name</h2>
<span class="schulart_text">School-type: Type one</span>
<p class="einzel_text">Adress: 20000 New York, Broadway 16 <br />
Telefon: 053333052-9899-0, Fax: 053333052-9899-55 <br />
index-number: 26666932002 <br />
Webmaster: <a href="mailto: webmaster@the-site.com" class="p1">Linus Thorwald</a><br /></p>
</div>
<div> <p class="ta_left einzel_text"> </p></div>
<br /><div><p class="ta_left einzel_text">registered at: 08.03.2010</p></div>
</div> </div> </div> </div>
<d-- einzelergebnis.html?Id=26666932002&treffer=2139&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=&kategorie=&region=de&trefferzahlauswahl=alle&trefferzahl=10517&list_anfang=0&sort=-->
</html>
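
One possible way to pull the listed fields out of a single result page is sketched here. It uses only core Perl and plain regular expressions, on the assumption that every page follows the layout of the sample above. The helper name `extract_record` and the hash keys are hypothetical, and a real HTML parser module would likely cope better with pages that deviate from this layout.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference with the fields of one page.
# A field that cannot be found is left undefined.
sub extract_record {
    my ($html) = @_;
    my %rec;

    # The country is taken from the title attribute of the <img> tag.
    ($rec{country}) = $html =~ m{<img[^>]*title="([^"]*)"};

    # The name is the <h2> heading of the result block.
    ($rec{name}) = $html =~ m{<h2>\s*(.*?)\s*</h2>}s;

    # The school type sits in its own span.
    ($rec{school_type}) =
        $html =~ m{<span class="schulart_text">\s*School-type:\s*(.*?)\s*</span>}s;

    # Labelled values run up to the next <br />; Telefon stops at the comma.
    ($rec{address}) = $html =~ m{Adress:\s*(.*?)\s*<br}s;
    ($rec{phone})   = $html =~ m{Telefon:\s*([^,<]+)};
    ($rec{fax})     = $html =~ m{Fax:\s*(.*?)\s*<br}s;
    ($rec{index})   = $html =~ m{index-number:\s*(\d+)};

    # The webmaster name is the link text of the mailto anchor.
    ($rec{webmaster}) = $html =~ m{Webmaster:\s*<a[^>]*>\s*(.*?)\s*</a>}s;

    ($rec{registered}) = $html =~ m{registered at:\s*([\d.]+)};

    # The short URL is the href of the link inside the logo_homepage div.
    ($rec{url}) = $html =~ m{class="logo_homepage">\s*<a[^>]*href="([^"]+)"}s;

    return \%rec;
}

# A cut-down copy of the sample page from the post.
my $sample = <<'HTML';
<img src=""Contryname" title="Contryname" />
<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div>
<div class="fm_linkeSpalte"><h2>My name</h2>
<span class="schulart_text">School-type: Type one</span>
<p class="einzel_text">Adress: 20000 New York, Broadway 16 <br />
Telefon: 053333052-9899-0, Fax: 053333052-9899-55 <br />
index-number: 26666932002 <br />
Webmaster: <a href="mailto: webmaster@the-site.com" class="p1">Linus Thorwald</a><br /></p>
</div>
<br /><div><p class="ta_left einzel_text">registered at: 08.03.2010</p></div>
HTML

my $rec = extract_record($sample);
for my $field (sort keys %$rec) {
    printf "%-12s %s\n", $field, $rec->{$field} // '(not found)';
}
```

Keeping one small regex per field makes it easy to see which field failed when a page does not match.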

Replies are listed 'Best First'.
Re: Data-Parsing: parsing a huge number of files
by graff (Chancellor) on Sep 22, 2010 at 23:50 UTC
    How about if you reply to your own node here, to give us a fresh start on the problem. Please try to structure your next post like this:
    Here is a brief sample of my input data:
    (brief sample of data, with at least one example of each thing to be captured)
    Here is a brief sample of what I want my perl script to produce as output:
    (brief sample, with "# comments" if needed, of what you want to create)
    Here is the perl code I've tried so far:
    #!/usr/bin/perl
    use strict;
    use warnings;
    # whatever you've got in terms of perl code...

    Organize your thoughts so that you can pose a clear question.

      I've got 25 thousand files, all stored in one folder.
      Each page contains addresses (see below), and each dataset has a unique ID number.
      The first task is to take all 25 thousand HTML files and parse out the address sets they contain. This is a Perl task, sure thing!

      Here is one dataset:

      Name: Mister Miller
      Adresse:
      Telefon:
      Fax:
      ID-Nummer: 2210202
      Mail-Adress: Mister_Miller@hotmail.com
      Website: short url: http://www.TheWEBsite.org/ID-Number - here 2210202
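
For the first task, here is a minimal sketch. It assumes the files end in `.html` or `.htm` and that the ID follows an `index-number:` or `ID-Nummer:` label, as in the samples in this thread. The helper names `id_from_html` and `scan_folder` are hypothetical. Each file is read and processed on its own, so only one page is held in memory at a time.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the ID found in one page, or undef.
sub id_from_html {
    my ($html) = @_;
    my ($id) = $html =~ m{(?:index-number|ID-Nummer):\s*(\d+)};
    return $id;
}

# Hypothetical helper: maps each ID to the name of the file it came from.
sub scan_folder {
    my ($dir) = @_;
    my %file_for;

    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = sort grep { /\.html?\z/i } readdir $dh;
    closedir $dh;

    for my $name (@names) {
        my $path = "$dir/$name";
        open(my $fh, '<', $path)
            or do { warn "Skipping $path: $!\n"; next };
        my $html = do { local $/; <$fh> };   # read the whole file at once
        close $fh;

        my $id = id_from_html($html // '');
        if (defined $id) { $file_for{$id} = $name }
        else             { warn "No ID found in $path\n" }
    }
    return \%file_for;
}

# Usage: perl scan.pl /path/to/folder   (the script name is an assumption)
if (@ARGV) {
    my $found = scan_folder($ARGV[0]);
    print "$_\t$found->{$_}\n" for sort keys %$found;
}
```

The same loop could call a fuller per-page extraction routine in place of `id_from_html` once the ID lookup works.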


      The second task can also be done with Perl:

      The last line of each address set holds a URL in a short form that is built from two pieces:

      http://www.TheWEBsite.org/ID-Number - here 2210202
      To rebuild the original URL I have to put the two pieces together and call it. The short URL is:
      http://www.TheWEBsite.org/ID-Number

      How should I do this second task?

      I look forward to hearing from you.
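
For the second task, putting the two pieces together is a plain string concatenation. This sketch assumes the base is exactly `http://www.TheWEBsite.org/` as written in the post and that IDs consist of digits only. The name `short_url` is hypothetical.

```perl
use strict;
use warnings;

# Base part of the short URL, copied from the post.
my $BASE = 'http://www.TheWEBsite.org/';

# Hypothetical helper: builds the short URL for one ID.
# Dies when the ID is missing or contains anything other than digits.
sub short_url {
    my ($id) = @_;
    die "Not a numeric ID: " . ($id // 'undef') . "\n"
        unless defined $id && $id =~ /\A\d+\z/;
    return $BASE . $id;
}

print short_url('2210202'), "\n";   # prints http://www.TheWEBsite.org/2210202
```

To actually call the rebuilt URL, one option would be the `get` method of `HTTP::Tiny`, which has shipped with Perl since 5.14. Checking the ID before building the URL keeps a failed extraction from turning into 25 thousand bad requests.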