Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Good evening, dear Monks!

This might sound like a strange question, but bear with me. My mind just went blank when I became aware of this job!

I have 5000 files which have to be parsed in order to strip off the HTML. The good thing is: in each HTML file I only need one line of text.
In line 999 I have the following (see the sample below).

How do I do the parsing job: can I tell the HTML parser that I only need line 999? Note: the data should be stored in a database.

</p><h1>dataset 1:</h1> &nbsp;<table border="0" bgcolor="#EFEFEF" leftmargin="15" topmargin="5"><tr> <td><strong>name:</strong>&nbsp;</td> <td width=500> myname one </td></tr><tr> <td><strong>type:</strong>&nbsp;</td> <td width=500> type_one (04313488) </td></tr><tr> <td><strong>aresss:</strong>&nbsp;</td><td>Friedrichstr. 70,&nbsp;73430&nbsp;Madrid</td></tr><tr> <td><strong>adresse_two:</strong>&nbsp;</td> <td> no_value </td></tr><tr> <td><strong>telefone:</strong>&nbsp;</td> <td> 0000736111/680040 </td></tr><tr> <td><strong>Fax:</strong>&nbsp;</td> <td> 0000736111/680040 </td></tr><tr> <td><strong>E-Mail:</strong>&nbsp;</td> <td> Keine Angabe </td></tr><tr> <td><strong>Internet:</strong>&nbsp;</td><td><a href="http://www.mysite.es" target="_blank">www.mysite.es</a><br></td></tr><tr> <td><strong>the office:</strong>&nbsp;</td> <td><a href="http://www.mysite_two" target="_blank">mysite_two </a><br></td></tr><tr> <td><strong>:</strong>&nbsp;</td><td> no_value </td></tr><tr> <td><strong>officer:</strong>&nbsp;</td> <td> no_value </td></td></tr><tr> <td><strong>employees:</strong>&nbsp;</td> <td> 259 </td></tr><tr> <td><strong>offices:</strong>&nbsp;</td> <td> 8 </td></tr><tr> <td><strong>worker:</strong>&nbsp;</td> <td> no_value </td></tr><tr> <td><strong>country:</strong>&nbsp;</td> <td> contryname </td></tr><tr> <td><strong>the_council:</strong>&nbsp;</td> <td>


So the question is: is it possible to search the 5000 files using this attribute, that line 999 is the one of interest? In other words, can I tell the HTML parser that it has to look at (and extract) exactly line 999?

Look forward to any and all ideas.

Replies are listed 'Best First'.
Re: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..
by morgon (Priest) on Oct 15, 2010 at 23:42 UTC
    I hope I understand you in the right way, here is what I would try:

    1) Copy your files to some working directory.

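    2) Run perl -i.old -ne 'print if $. == 999; close ARGV if eof' *html in that directory. That extracts line 999 from each of your files (closing ARGV at the end of each file resets the line counter $., which otherwise keeps counting across files).

    3) As the files now contain only an HTML fragment, we make it valid HTML again like this: perl -i.old -ne 'print "<html><body>$_</body></html>"' *html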

    4) You have now a collection of html-files consisting only of the previous line 999 that you can parse further with whatever tool you want.
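The extraction in step 2 can also be done without modifying any files. What follows is a minimal sketch under stated assumptions, not morgon's code: the helper name nth_line is made up for illustration, and the files are assumed to be named on the command line. Because every file gets its own filehandle, the line counter $. starts at 1 for each of them.

```perl
use strict;
use warnings;

# Return line number $n (1-based) of the file at $path,
# or undef if the file has fewer than $n lines.
# nth_line is a hypothetical helper name.
sub nth_line {
    my ($path, $n) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        if ($. == $n) {       # $. counts lines read from $fh
            close $fh;
            return $line;
        }
    }
    close $fh;
    return;
}

# Print line 999 of every file named on the command line.
for my $file (@ARGV) {
    my $line = nth_line($file, 999);
    print $line if defined $line;
}
```

Files shorter than 999 lines are silently skipped here; whether that should be an error instead depends on how uniform the 5000 files really are.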

      I'd use HTML::TokeParser::Simple and DBI and put your line directly into your db. Also, if your files are in separate directories, look at find2perl (or File::Find).

      I too would advise against just relying on the line number. However, if you want to do that, it can be done just as easily straight from the command line:

      find . -type f -print0 | xargs -0 -n 1 sed -n 999p | while IFS= read -r string; do mysql -e "insert query $string"; done

      (here "insert query" stands for your actual INSERT statement, and sed runs once per file so that each file's line 999 is printed). That said, the Perl would be faster and TokeParser would allow for more reliable data. The Perl modules I named above are pretty straightforward to use, but since I've just done this (from LWP and not from files) I can pretty much get you a template for this in a few minutes if you need.
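For files spread over several directories, File::Find (a core module) can collect the paths before any parsing starts. This is a minimal sketch, assuming the files end in .html or .htm; html_files is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;
use File::Find;

# Return the sorted paths of all *.html / *.htm files below $dir.
# html_files is a hypothetical helper name.
sub html_files {
    my ($dir) = @_;
    my @found;
    # Inside the callback, $_ is the bare file name and
    # $File::Find::name is the path including the directory.
    find(sub { push @found, $File::Find::name if -f && /\.html?\z/i }, $dir);
    my @sorted = sort @found;
    return @sorted;
}

# List the files below the directory given on the command line.
if (@ARGV) {
    print "$_\n" for html_files($ARGV[0]);
}
```

The resulting list can then be fed to whatever extraction step you settle on, one file at a time.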
        Hello ag4ve, many thanks for the posting!

        I like your idea of using HTML::TokeParser::Simple and DBI. I have little experience with HTML::TokeParser::Simple, though, and doing this task on my own would go over my head, at least at the moment! Note: I also had a look at morgon's ideas, which seem to be an appropriate way too. But at the moment I have trouble determining the corresponding XPath expressions that need to be filled into the Perl program.

        @ag4ve: I think I would love to go your way and do it with HTML::TokeParser::Simple and DBI.

        I guess that I have to do it for the other items too, in order to get the full set of information that is wanted.
        See one of the example sites:

        http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

        In the grey shaded block you see the wanted information: 17 lines that are wanted. Note: I have 5000 different HTML files that are all structured in the very same way!

        That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI. That would be great!!

        I look forward to hearing from you.

        Best regards, perlbeginner1

        Note: see the 5000 sites, with all the info, on an official German government server:
        http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/schnellsuche.php

        Do a search with *.* and you get the result pages. All of it is available to the whole world, and as I work in the field of education, nothing is wrong with doing the parsing job!

        Note: your code just looks great!! Really:

            use strict;
            use HTML::TreeBuilder::XPath;

            my $tree = HTML::TreeBuilder::XPath->new;

            # use real file name here
            open(my $fh, "<", "file.html") or die $!;
            $tree->parse_file($fh);

            my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
            print $name->as_text;
      Hello Morgon, good evening! Many, many thanks for the quick reply!

      You understood right! That is exactly what is wanted.

      I like your advice!

      The fourth step is the point where I am now! Now I have 5000 files, each containing the one line that was extracted.

      You say: 4) You have now a collection of html-files consisting only of the previous line 999 that you can parse further with whatever tool you want.

      How do I proceed? Note: I want to store these results in a db. Can I go and replace some HTML elements to get CSV? Is this doable? Or should I parse the results with HTML::TokeParser? I have little experience, and I guess that this job goes over my head! ;-)

      I look forward to hearing from you.
        You simply have to extract the data-fields now.

        There are several ways to do it.

        As we've gotten rid of a lot of crap by only using one line of the original html, you could use a regular expression here, but that is in general not a good idea for decomposing html.

        Personally I like HTML::TreeBuilder::XPath that you would have to install from CPAN.

        Here is how you would then extract the name from one of the files with it:

            use strict;
            use HTML::TreeBuilder::XPath;

            my $tree = HTML::TreeBuilder::XPath->new;

            # use real file name here
            open(my $fh, "<", "file.html") or die $!;
            $tree->parse_file($fh);

            my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
            print $name->as_text;

        As you can see, you simply use an XPath expression to identify the node you want.

        So how to determine that?

        I use a Firefox-plugin called XPather, that allows you to simply click on a html-element and extract the corresponding xpath.

        So you load the file you want to parse in Firefox, click on the stuff you want, get the xpath and use that in the perl-script.

        Hope that gets you started...

Re: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..
by BrowserUk (Patriarch) on Oct 16, 2010 at 04:32 UTC
    Look forward to any and all ideas.

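    I'd use perl -ne"$. == 999 and print; close ARGV if eof" *.html > all999lines.txt to put all the lines in one file. (The close ARGV if eof resets $. at the end of each file; without it the line counter keeps running across files. cmd.exe does not expand *.html itself, so on Windows the file names have to be supplied another way.)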

    Then something like:

        #! perl -slw
        use strict;
        use Data::Dump qw[pp];

        while( <> ) {
            my %record = m[
                <strong>([^<]+?):</strong>.+?
                >\s*([^<]+?)\s*</(?:a|td)>
            ]xg;
            pp \%record;
        }

    Output:

        c:\test>junk72
        {
          "E-Mail" => "Keine Angabe",
          Fax => "0000736111/680040",
          Internet => "www.mysite.es",
          adresse_two => "no_value",
          aresss => "Friedrichstr. 70,&nbsp;73430&nbsp;Madrid",
          country => "contryname",
          employees => 259,
          name => "myname one",
          officer => "no_value",
          offices => 8,
          telefone => "0000736111/680040",
          "the office" => "mysite_two",
          type => "type_one (04313488)",
          worker => "no_value",
        }

    Once you have the record in a hash, pushing into the db shouldn't be a problem.
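Pushing such a hash into the database could look roughly like the sketch below. It is an illustration under assumptions, not code from the thread: the helper names column_name, insert_sql and store_record are made up, the rule for turning labels into column names is only one possible choice, and $dbh is assumed to be a handle you obtained from DBI->connect for your own database. Only the DBI calls prepare and execute are used, with placeholders so that the values need no manual quoting.

```perl
use strict;
use warnings;

# Turn a field label such as "E-Mail" or "the office" into a plain
# column name (assumed rule: lower-case, non-word runs become "_").
sub column_name {
    my ($label) = @_;
    (my $col = lc $label) =~ s/\W+/_/g;
    return $col;
}

# Build a parameterised INSERT statement for the given columns.
sub insert_sql {
    my ($table, @cols) = @_;
    return sprintf 'INSERT INTO %s (%s) VALUES (%s)',
        $table, join(', ', @cols), join(', ', ('?') x @cols);
}

# Store one record (hash reference) through a connected DBI handle.
sub store_record {
    my ($dbh, $table, $record) = @_;
    my @labels = sort keys %$record;   # fixed order for columns and values
    my $sth = $dbh->prepare(insert_sql($table, map { column_name($_) } @labels));
    $sth->execute(@{$record}{@labels});
    return;
}
```

With a table whose columns follow the same naming rule, a call such as store_record($dbh, 'schools', \%record) inside the while loop would insert one row per input line; the table name schools is hypothetical.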


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hello BrowserUk,

      many thanks for the posting! That looks very, very impressive: I am happy!

      BTW, to verify things, here is the task in a more descriptive way. I decided to use Perl since it is very, very powerful, and I am trying to nail down the issues while using Perl.

      See one of the example sites: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488
      In the grey shaded block you see the wanted information: 17 lines that are wanted. Note: I have 5000 different HTML files that are all structured in the very same way!

      BrowserUk, you gave very, very useful hints. Thanks for everything!

      I would be very happy to have a template that can be run and that stores the results in the MySQL database.

      That would be great!!

      Regards, Perl-Beginner!
Re: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..
by ww (Archbishop) on Oct 16, 2010 at 00:35 UTC
    Perlbeginner1:

    The concept of "line numbers" in html is -- at best -- ill-defined. Just as Perl (by and large) doesn't care about white space, html really doesn't blink an eye (nor flinch) if an entire page is on a single line.

    Consider this (incomplete, but adequate for illustration):

    <html><head><title>Example of one line html</title></head><body><p>The source for this page, which contains multiple paragraphs, is one single line. There are no line-breaks in the source; just one monolithic line with all the tags and body-content run together.</p><p>So if we're looking for line 2, where is it?</p><p>We'll come up with one answer if we rely on the rendering by a browser, and something entirely different (there is no line 2!), if we view the source.</p></body></html>

    And compare that to a somewhat friendlier format:

    <html>
    <head>
    <title>Example of multi-line html</title>
    </head>
    <body>
    <p>The source for this page, which contains multiple paragraphs, is multiple lines. There are line-breaks in the source; which is not just one monolithic line with all the tags and body-content run together.</p>
    <p>So if we're looking for line 2, where is it?</p>
    <p>We'll come up with one answer if we rely on the rendering by a browser, and something entirely different (there is no line 2 in the prior example!), if we view the source.</p>
    </body>
    </html>

    The two will render identically except for the minor changes I made in the renderable text, for the sake of making the statements true in both pages. BTW, any line numbers or plus-signs shown alongside code on this site are not in the source: the line numbers appear as a result of the workings of the Monastery's <c>...</c> tags, and the red plus-signs are artifacts of the rendering here (indicating line continuations where that's not otherwise obvious).

    Go ahead; try it. Download the two code blocks above; save them as "nobreak.html" and "breaks.html" respectively... then open each in your browser.

    Then, go back and rethink your spec. Unless all the 5000 files are produced by some sort of automaton -- a script, for example, with variability provided by arguments from elsewhere -- it's unlikely that your target-of-interest, whatever it may be, will reliably be on line 9, 99, or 999. With html, you need something that has intrinsic meaning.

    And that might be some phrase which will begin each line-of-interest, or the tags uniquely used to format that line, or .... well, most anything that's not based on counting lines in html source (or counting lines in the rendered page, since line 998 just might span 3 rendered lines in one file and only two in another).
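Locating the data by a marker instead of a line number might look like the following minimal sketch. It assumes the wanted table is the first one after a known h1 heading (as in the sample posted at the top of the thread) and that tables are not nested; table_after_heading is a hypothetical helper name, and for anything less regular a real HTML parser is the safer tool.

```perl
use strict;
use warnings;

# Return the first <table>...</table> that follows the <h1> heading with
# the given text, wherever it sits in the document, or undef if absent.
# table_after_heading is a hypothetical helper name.
sub table_after_heading {
    my ($html, $heading) = @_;
    return $html =~ m{<h1>\s*\Q$heading\E\s*</h1>.*?(<table\b.*?</table>)}is
        ? $1
        : undef;
}
```

Given the whole file contents in $html, table_after_heading($html, 'dataset 1:') returns the same fragment whether the source is one long line or spread over many. The heading text has to match the source exactly, including any entities such as &ouml;.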

      Unless all the 5000 files are produced by some sort of automaton
      I think it is safe to assume that - otherwise the question would not make much sense.

        So do I.

        ...But, while my hypothesized automaton can produce a uniform tag framework, my guess is that at least 1 in 5000 ( times n fields) of variable data will vary the linecount. "Otherwise the question would not make much sense" because if all 5000 files are identical, there's not much point in reading more than one of them.

        'Oh, no,' you say. 'The (normalized) data coming out of a DB should be quite consistent.'

        Well, I think OP is putting data (from an unknown origin, received via html pages) INTO a DB. And look at the data: a multi-line fragment of an html table, where some <td> items include multiple adjacent spaces (as a general rule html will render ONLY one of those, ignoring the rest) and such things as line 6 (a long form address -- in a style that could be as few as a dozen characters or so... or could be many tens of characters).

        And if the page is indeed script-generated, someone should fire the programmer, the proofreader, and/or their supervisors: Some of the boiler plate -- ie, renderable text that one might expect to be invariant in its spelling -- is not; viz: "aresss:" in line 6 and "adresse_two:" in line 7.

        Of course, such an error may not change the line count a bit, but human data entry tends to be fallible, and raw data tends not to be normalized.

Re: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..
by locked_user sundialsvc4 (Abbot) on Oct 16, 2010 at 00:06 UTC

    I must admit that I find it quite remarkable that a line number is actually the deciding factor... Are you absolutely sure that this is true, “in the general case?”

    The solution that you were just offered is what we affectionately call, “a one-liner.”   In other words, it absolutely is possible to extract the 999th line from a directory-full of files in just one line of code.   But you should consider this kind of revelation to be illustrative, not necessarily a general solution.

    Please do this:   describe, as best you can, what you really want to get, and, what you really want to do with it.   Believe me when I say these three things:

    1. Whatever it is, Perl can do it ... (fairly) effortlessly.   (In other words, “that is what all the fuss is about!”)
    2. Everyone here has been exactly where you are, and understands your uncomfortable situation implicitly.   (Some might say that we are here because Perl “saved our backsides” in the past, and we never forgot the favor.)
    3. Although we can’t promise to “do your work for you” (and of course, no one seriously expects that you expect such a thing from us...), we would be quite pleased to demonstrate just how quickly, and just how decisively, Perl can put your mind at ease...

    The many people who have observed that “Perl is the Swiss Army Knife® of professional programming,” were 100% accurate.

      Are you absolutely sure that this is true, “in the general case?”
      Of course it is not true in the "general case" and nobody ever claimed that.

      I understand the question to be a one-off task to move the content of 5000 machine-generated html-files into a database.

      That in such a scenario you use all available information to make the task easier is only natural.

        Hello Morgon, hello sundialsvc4,

        thanks for the kind words and for opening my eyes to the power of Perl!

        That is just amazing! I get a headache when I think of browsing 5000 files and doing all the work by hand. That would take several weeks.

        So I decided to use Perl since it is very, very powerful, and I am trying to nail down the issues while using Perl.

        See one of the example sites: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488
        In the grey shaded block you see the wanted information: 17 lines that are wanted. Note: I have 5000 different HTML files that are all structured in the very same way!

        That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.
        That would be great!!

        ... and now I am trying to get more info about HTML::TokeParser::Simple and DBI. I have a DBI manual: the book by Tim Bunce and Alligator xy! At the moment I am on page 25... ;-)

        @Morgon, sundialsvc4: I would love to hear from you.

        Regards,
        perlbeginner1
Re: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..
by ww (Archbishop) on Oct 16, 2010 at 14:16 UTC

    The data you want is contained in a table which follows the (bold-faced boiler-plate) phrase "Allgemeine Daten der Schule / Behörde:". Here's the html:

    </p><h1>Allgemeine Daten der Schule / Beh&ouml;rde:</h1>&nbsp;<table border="0" bgcolor="#EFEFEF" leftmargin="15" topmargin="5"><tr> <td><strong>Schul-/Behördenname:</strong>&nbsp;</td>...

    The above begins at the start of line 989 according to w3c's html validator (which also lists 37 errors, 40-some warnings and an obsolete html doctype which w3c no longer validates).

    The section you want apparently ends with the </table> matching the open above. That's still inside line 989 according to w3c.

    In other words, the general problem is not rooted in line numbers; IMO, you can find (by regex or other tool) the opening table in your desired data and mung the data from there. The approach offered by BrowserUk looks to me like the way to go.

    The table, at least, appears reasonably well-formed if inconsistently formatted. In any case, the last row in which you appear to be interested (ie, just before the </table> mentioned above) is:

    <tr> <td><strong>Schulträger:</strong>&nbsp;</td>  <td> &lt;Verband/Verein&gt; (Verband/Verein) </td></tr>

    (for emphasis) ... still inside line 989.

    There, by gosh, that dead horse has been beaten enough!

    One takeaway might be "Look for patterns in your data." When they exist, they may help you solve your problem.