Re^3: Downloading first X bytes of a file

That might be the case, but you could have 100K of JavaScript and/or embedded style sheets first, and nothing guarantees that </head> is not found somewhere in there. And since your HTML file is incomplete, parsing the data for the content of <head> ... </head> could become a very hazardous operation.
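To make the hazard concrete, here is a small Python sketch. The sample markup and the helper name naive_head are invented for illustration; they are not taken from anyone's script in this thread. It cuts a truncated download at the first literal </head> and shows that a script string containing that text hides the real title.

```python
# Hypothetical truncated download: a script that mentions "</head>"
# comes before the real <title>, and the real </head> was never received.
prefix = (
    '<html><head>'
    '<script>var s = "</head>";</script>'
    '<title>Real title</title>'
)


def naive_head(html):
    """Return everything before the first literal '</head>', or None."""
    end = html.find("</head>")
    return None if end == -1 else html[:end]


head = naive_head(prefix)
# The cut lands inside the script string, so the title is lost.
print("<title>" in head)
```

Any approach that treats the first </head> as the end of the header shares this weakness, whatever the language.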

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James


Replies are listed 'Best First'.
Re^4: Downloading first X bytes of a file by moritz (Cardinal) on Jun 08, 2008 at 21:32 UTC
Well, it all depends on your application. If my script were actually an IRC bot, I'd go with a simple regex-based approach as described above. In that application it's important not to download 100K at all (risk of DoS attacks), even if the header is that long.

I know it's evil to parse HTML with regexes, but sometimes it's simple and convenient, especially if you are happy with a solution that works in 95% to 99% of all cases. Note that all markup, even comments, is disallowed in <title>...</title> tags, which simplifies the matter.

Of course things are different for more serious matters: if you want an application that extracts the title of all valid HTML pages (and most invalid ones as well) with an accuracy matching that of the W3C markup validator, you'll have to download it all.
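A minimal sketch of that regex-based approach, written in Python for illustration. The 8192-byte cap, the function names and the UTF-8 assumption are all hypothetical choices, not something specified in this thread. The idea is to read only a bounded prefix of the response and take the first <title>...</title> pair found in it.

```python
import re
import urllib.request

# Hypothetical cap; the thread only says that 100K is too much for a bot.
MAX_BYTES = 8192

# Non-greedy match of the first <title>...</title> pair, case-insensitive,
# allowing newlines inside the title text.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def fetch_prefix(url, limit=MAX_BYTES, timeout=10):
    """Read at most `limit` bytes of the response body, then stop."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read(limit)


def extract_title(prefix):
    """Return the text of the first <title> element in `prefix`, or None."""
    match = TITLE_RE.search(prefix)
    if match is None:
        return None
    # Assumes UTF-8 (an assumption); undecodable bytes are replaced.
    text = match.group(1).decode("utf-8", errors="replace")
    # Collapse runs of whitespace, including newlines, into single spaces.
    return " ".join(text.split())
```

Usage would be something like extract_title(fetch_prefix(url)). If the closing </title> falls beyond the cap, the function returns None rather than a partial title, which seems the safer failure mode for a bot.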
