Perlbeginner1:

The concept of "line numbers" in html is -- at best -- ill-defined. Just as Perl (by and large) doesn't care about white space, html really doesn't blink an eye (nor flinch) if an entire page is on a single line.

Consider this (incomplete, but adequate for illustration):

<html><head><title>Example of one line html</title></head><body><p>The source for this page, which contains multiple paragraphs, is one single line. There are no line-breaks in the source; just one monolithic line with all the tags and body-content run together.</p><p>So if we're looking for line 2, where is it?</p><p>We'll come up with one answer if we rely on the rendering by a browser, and something entirely different (there is no line 2!), if we view the source.</p></body></html>

And compare that to a somewhat friendlier format:

<html>
<head>
<title>Example of multi-line html</title>
</head>
<body>
<p>The source for this page, which contains multiple paragraphs, is multiple lines. There are line-breaks in the source; which is not just one monolithic line with all the tags and body-content run together.</p>
<p>So if we're looking for line 2, where is it?</p>
<p>We'll come up with one answer if we rely on the rendering by a browser, and something entirely different (there is no line 2 in the prior example!), if we view the source.</p>
</body>
</html>

The two will render identically except for the minor changes I made in the renderable text, for the sake of making the statements true in both pages. BTW, any line numbers and red plus-signs shown with the code blocks are not in the source: the line numbers come from the workings of the Monastery's <c>...</c> tags, and the plus-signs are artifacts of the rendering here (indicating line continuations where that's not otherwise obvious).

Go ahead; try it. Download the two code blocks above; save them as "nobreak.html" and "breaks.html" respectively... then open each in your browser.
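If you'd rather let Perl do the counting, here is a minimal sketch using only core Perl. The two strings are abbreviated, hypothetical stand-ins for "nobreak.html" and "breaks.html" (not the full pages), and line_of is a helper name invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated markup: same tags and text, two source layouts.
my $nobreak = '<html><body><p>First.</p><p>Second.</p></body></html>';
my $breaks  = join "\n",
    '<html>', '<body>', '<p>First.</p>', '<p>Second.</p>', '</body>', '</html>';

# Return source line number $n (1-based), or undef if there is no such line.
sub line_of {
    my ($src, $n) = @_;
    my @lines = split /\n/, $src;
    return $lines[$n - 1];
}

for my $pair ([ 'nobreak.html', $nobreak ], [ 'breaks.html', $breaks ]) {
    my ($name, $src) = @$pair;
    my @lines = split /\n/, $src;
    my $line2 = line_of($src, 2);
    $line2 = '(there is no line 2)' unless defined $line2;
    printf "%s: %d line(s) of source; line 2 is %s\n", $name, scalar @lines, $line2;
}
```

The first layout has one line of source and no line 2 at all; the second has six lines, with the body tag on line 2. A browser would show the same two paragraphs for both.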

Then, go back and rethink your spec. Unless all the 5000 files are produced by some sort of automaton -- a script, for example, with variability provided by arguments from elsewhere -- it's unlikely that your target-of-interest, whatever it may be, will reliably be on line 9, 99, or 999. With html, you need something that has intrinsic meaning.

And that might be some phrase which will begin each line-of-interest, or the tags uniquely used to format that line, or ... well, most anything that's not based on counting lines in html source (or counting lines in the rendered page, since line 998 just might span three rendered lines in one file and only two in another).
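As a rough sketch of the "phrase that begins the line-of-interest" idea, the following uses HTML::Parser's version 3 handler interface to collect every paragraph whose text starts with a given phrase. The sample markup, the phrase 'Price:', and the name paragraphs_starting_with are all assumptions made up for illustration; your 5000 files will need their own marker. It also assumes each <p> is explicitly closed and not nested.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return the text of every <p> element whose content starts with $phrase.
# Assumes paragraphs are closed with </p> and are not nested.
sub paragraphs_starting_with {
    my ($html, $phrase) = @_;
    my (@found, $buf);
    my $in_p = 0;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Opening <p>: start collecting text into a fresh buffer.
        start_h => [ sub { if ($_[0] eq 'p') { $in_p = 1; $buf = '' } }, 'tagname' ],
        # Text may arrive in several chunks, so append rather than assign.
        text_h  => [ sub { $buf .= $_[0] if $in_p }, 'dtext' ],
        # Closing </p>: normalise white space, then test the leading phrase.
        end_h   => [ sub {
            return unless $_[0] eq 'p' && $in_p;
            $in_p = 0;
            (my $text = $buf) =~ s/\s+/ /g;   # source line breaks become spaces
            $text =~ s/^ | $//g;              # trim one leading/trailing space
            push @found, $text if index($text, $phrase) == 0;
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @found;
}

# Hypothetical pages: same content, one line of source versus many.
my $one_line = '<html><body><p>Intro text.</p><p>Price: 42 EUR</p></body></html>';
my $multi    = "<html>\n<body>\n<p>Intro\ntext.</p>\n<p>Price:\n42 EUR</p>\n</body>\n</html>\n";

print "$_\n" for paragraphs_starting_with($one_line, 'Price:');
print "$_\n" for paragraphs_starting_with($multi,    'Price:');
```

Both calls find the same paragraph, even though it sits on line 1 of one source and is spread over lines 5 and 6 of the other. That is the sense in which the phrase has intrinsic meaning and the line number does not.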


In reply to Re: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files.. by ww
in thread HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files.. by Perlbeginner1
