in reply to HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..

Perlbeginner1:

The concept of "line numbers" in html is -- at best -- ill-defined. Just as Perl (by and large) doesn't care about white space, html really doesn't blink an eye (nor flinch) if an entire page is on a single line.

Consider this (incomplete, but adequate for illustration):

<html><head><title>Example of one line html</title></head><body><p>The source for this page, which contains multiple paragraphs, is one single line. There are no line-breaks in the source; just one monolithic line with all the tags and body-content run together.</p><p>So if we're looking for line 2, where is it?</p><p>We'll come up with one answer if we rely on the rendering by a browser, and something entirely different (there is no line 2!), if we view the source.</p></body></html>

And compare that to a somewhat friendlier format:

<html>
<head>
<title>Example of multi-line html</title>
</head>
<body>
<p>The source for this page, which contains multiple paragraphs, is multiple lines. There are line-breaks in the source; which is not just one monolithic line with all the tags and body-content run together.</p>
<p>So if we're looking for line 2, where is it?</p>
<p>We'll come up with one answer if we rely on the rendering by a browser, and something entirely different (there is no line 2 in the prior example!), if we view the source.</p>
</body>
</html>

The two will render identically except for the minor changes I made in the renderable text, for the sake of making the statements true in both pages. BTW, any line numbers or plus-signs you may see alongside the code as displayed here are not in the source; they are artifacts of the Monastery's <c>...</c> tags (the plus-signs indicate line continuations where that's not otherwise obvious).

Go ahead; try it. Download the two code blocks above; save them as "nobreak.html" and "breaks.html" respectively... then open each in your browser.
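
If you'd rather check this from a script than in a browser, here is a minimal sketch. It is in Python with only the standard library, purely to keep it short; the two strings are hypothetical stand-ins for "nobreak.html" and "breaks.html", and the class and function names are my own. It counts source lines and then collects the text of each paragraph, so you can see the line counts disagree while the content does not.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element, ignoring source line breaks."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None  # None means "not currently inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            # Collapse runs of whitespace, roughly as a browser does.
            self.paragraphs.append(" ".join("".join(self._buf).split()))
            self._buf = None


def paragraphs(html):
    """Return the list of paragraph texts found in an HTML string."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


def line_count(html):
    """Return the number of lines in the HTML source."""
    return len(html.splitlines())


# Hypothetical stand-ins for the two example pages.
NOBREAK = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
BREAKS = """<html>
<body>
<p>First
paragraph.</p>
<p>Second paragraph.</p>
</body>
</html>
"""

if __name__ == "__main__":
    for name, source in (("nobreak", NOBREAK), ("breaks", BREAKS)):
        print(name, line_count(source), paragraphs(source))
```

The source line counts differ between the two strings, yet both yield the same list of paragraphs, which is the whole point: the parser sees structure, not lines.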

Then, go back and rethink your spec. Unless all the 5000 files are produced by some sort of automaton -- a script, for example, with variability provided by arguments from elsewhere -- it's unlikely that your target-of-interest, whatever it may be, will reliably be on line 9, 99, or 999. With html, you need something that has intrinsic meaning.

And that might be some phrase which will begin each line-of-interest, or the tags uniquely used to format that line, or... well, most anything that's not based on counting lines in html source (or counting lines in the rendered page, since line 998 just might span 3 rendered lines in one file and only two in another).
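
To make "intrinsic meaning" concrete, here is a hedged sketch of one such approach: find a table cell by its label text and take the cell that follows it. It is Python (standard library only) for brevity; the same idea should carry over to a token-based Perl parser. The sample markup, the "street:" label, and the names `CellCollector` and `value_after_label` are all assumptions of mine, not anything taken from the OP's files.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buf = None  # None means "not currently inside a <td>"

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            self.cells.append("".join(self._buf).strip())
            self._buf = None


def value_after_label(html, label):
    """Return the text of the cell following the cell equal to `label`, or None."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    for i, text in enumerate(collector.cells[:-1]):
        if text == label:
            return collector.cells[i + 1]
    return None


# Hypothetical fragments: same data, very different line layout.
SAMPLE_ONE_LINE = (
    "<table><tr><td>name:</td><td>Example School</td></tr>"
    "<tr><td>street:</td><td>Main Street 1</td></tr></table>"
)
SAMPLE_MANY_LINES = """<table>
<tr>
<td>name:</td>
<td>Example School</td>
</tr>
<tr>
<td>street:</td>
<td>Main Street 1</td>
</tr>
</table>
"""

if __name__ == "__main__":
    print(value_after_label(SAMPLE_ONE_LINE, "street:"))
    print(value_after_label(SAMPLE_MANY_LINES, "street:"))
```

Both calls find the same value even though the label sits on line 1 in one fragment and line 7 in the other. An exact-match label is the simplest choice; if the labels themselves are misspelled or inconsistent across files, you would need a looser match.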

Re^2: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..
by morgon (Priest) on Oct 16, 2010 at 01:16 UTC
    Unless all the 5000 files are produced by some sort of automaton
    I think it is safe to assume that - otherwise the question would not make much sense.

      So do I.

      ...But, while my hypothesized automaton can produce a uniform tag framework, my guess is that at least 1 in 5000 (times n fields) of the variable data items will vary the line count. "Otherwise the question would not make much sense" because if all 5000 files are identical, there's not much point in reading more than one of them.

      'Oh, no,' you say. 'The (normalized) data coming out of a DB should be quite consistent.'

      Well, I think OP is putting data (from an unknown origin, received via html pages) INTO a DB. And look at the data: a multi-line fragment of an html table, where some <td> items include multiple adjacent spaces (as a general rule html will render ONLY one of those, ignoring the rest) and such things as line 6 (a long form address -- in a style that could be as few as a dozen characters or so... or could be many tens of characters).
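
Since those runs of adjacent spaces would otherwise end up in the DB verbatim, it may be worth normalizing each extracted value first. A minimal sketch, in Python for brevity and assuming ordinary text (not <pre> content): collapse each run of HTML's ASCII whitespace (space, tab, LF, FF, CR) to one space, roughly as a browser does when rendering. The function name `collapse_whitespace` is my own.

```python
import re

# HTML's ASCII whitespace: space, tab, line feed, form feed, carriage return.
# Non-breaking spaces are deliberately left alone, since browsers keep them.
_ASCII_WS = re.compile(r"[ \t\n\f\r]+")


def collapse_whitespace(text):
    """Replace each run of ASCII whitespace with one space and trim the ends."""
    return _ASCII_WS.sub(" ", text).strip(" \t\n\f\r")


if __name__ == "__main__":
    print(repr(collapse_whitespace("Main   Street  1")))
    print(repr(collapse_whitespace("  Example\n  School ")))
```

With this applied, two files whose source differs only in spacing or line breaks produce the same stored value.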

      And if the page is indeed script-generated, someone should fire the programmer, the proofreader, and/or their supervisors: some of the boilerplate -- i.e., renderable text that one might expect to be invariant in its spelling -- is not; viz: "aresss:" in line 6 and "adresse_two:" in line 7.

      Of course, such an error may not change the line count a bit, but human data entry tends to be fallible, and raw data tends not to be normalized.

        Hey WW - thanks for posting.


        Your thoughts were very good and interesting! There are more than 5000 files.

        See one of the example sites: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

        In the grey shaded block you see the wanted information:

        17 lines are wanted. Note: I have 5000 different HTML files, all structured in the very same way!

        That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.

        I would be more than happy to hear from you. You have great ideas.

        pb1