in reply to HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files..
The concept of "line numbers" in html is -- at best -- ill-defined. Just as Perl (by and large) doesn't care about white space, html really doesn't blink an eye (nor flinch) if an entire page is on a single line.
Consider this (incomplete, but adequate for illustration):
<html><head><title>Example of one line html</title></head><body><p>The source for this page, which contains multiple paragraphs, is one single line. There are no line-breaks in the source; just one monolithic line with all the tags and body-content run together.</p><p>So if we're looking for line 2, where is it?</p><p>We'll come up with one answer if we rely on the rendering by a browser, and something entirely different (there is no line 2!), if we view the source.</p></body></html>
And compare that to a somewhat friendlier format:
<html>
<head>
<title>Example of multi-line html</title>
</head>
<body>
<p>The source for this page, which contains multiple paragraphs, is multiple lines. There are line-breaks in the source; which is not just one monolithic line with all the tags and body-content run together.</p>
<p>So if we're looking for line 2, where is it?</p>
<p>We'll come up with one answer if we rely on the rendering by a browser, and something entirely different (there is no line 2 in the prior example!), if we view the source.</p>
</body>
</html>
The two will render identically, except for the minor changes I made to the renderable text so that its statements are true in both pages. BTW, any line numbers you see beside the code are not in the source; they are added by the Monastery's <c>...</c> tags. Likewise, any red plus-signs are absent from the source and are artifacts of the rendering here (marking line continuations where that's not otherwise obvious).
Go ahead; try it. Download the two code blocks above; save them as "nobreak.html" and "breaks.html" respectively... then open each in your browser.
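If you'd rather check this without a browser, here is a minimal sketch. It is in Python only because its standard library ships an HTML parser; a handler-based parser such as HTML::Parser works along the same lines. The two strings are shortened stand-ins for the pages above, and `count_paragraphs` is a made-up helper name, not part of any library.

```python
from html.parser import HTMLParser

# Shortened, hypothetical stand-ins for the two pages above.
NOBREAK = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
BREAKS = """<html>
<body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body>
</html>"""


class ParagraphCounter(HTMLParser):
    """Counts <p> start tags, regardless of how the source is split into lines."""

    def __init__(self):
        super().__init__()
        self.paragraphs = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs += 1


def count_paragraphs(source):
    """Parse `source` and return the number of <p> elements seen."""
    parser = ParagraphCounter()
    parser.feed(source)
    parser.close()
    return parser.paragraphs


for name, source in (("nobreak", NOBREAK), ("breaks", BREAKS)):
    # The source line count differs between the two; the paragraph count does not.
    print(name, "source lines:", len(source.splitlines()),
          "paragraphs:", count_paragraphs(source))
```

The source line counts disagree while the parsed structure is the same, which is why a line number makes a poor handle on html.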
Then, go back and rethink your spec. Unless all the 5000 files are produced by some sort of automaton -- a script, for example, with variability provided by arguments from elsewhere -- it's unlikely that your target-of-interest, whatever it may be, will reliably be on line 9, 99, or 999. With html, you need something that has intrinsic meaning.
And that might be some phrase which will begin each line-of-interest, or the tags uniquely used to format that line, or .... well, most anything that's not based on counting lines in html source (or counting lines in the rendered page, since line 998 just might span 3 rendered lines in one file and only two in another).
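As a rough illustration of keying on a phrase rather than a line number, the sketch below collects the text of each paragraph and returns the first one that starts with a given phrase. It uses Python's standard-library parser so that it runs as-is. The sample markup, the "Price:" phrase, and the names `ParagraphFinder` and `find_paragraph` are all assumptions made up for the example; your real files will need their own marker.

```python
from html.parser import HTMLParser


class ParagraphFinder(HTMLParser):
    """Collects the text of each <p> element, wherever it sits in the source."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            # Collapse runs of whitespace so source line-breaks don't matter.
            self.paragraphs.append(" ".join("".join(self.buffer).split()))
            self.in_p = False


def find_paragraph(source, phrase):
    """Return the first paragraph whose text starts with `phrase`, else None."""
    finder = ParagraphFinder()
    finder.feed(source)
    finder.close()
    for text in finder.paragraphs:
        if text.startswith(phrase):
            return text
    return None


# Hypothetical sample data: same content, different source layout.
one_line = "<html><body><p>Intro.</p><p>Price: 42 EUR</p></body></html>"
many_lines = "<html>\n<body>\n<p>Intro.</p>\n<p>Price:\n42 EUR</p>\n</body>\n</html>"

print(find_paragraph(one_line, "Price:"))
print(find_paragraph(many_lines, "Price:"))
```

Both calls find the same paragraph even though it sits on line 1 of one source and spans lines 4 and 5 of the other.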