Re: Re: Re: What is the fastest way to parse HTML?

Well, I have to regularly index a few million documents for a small intranet search engine.

Then you asked the wrong question. The right one is: "What is the fastest way to index a few million documents for a small intranet search engine?"

The answer, as I recently learned from tachyon, is Swish-e. Of course, you'll also want to grab the Perl interface, SWISH, from CPAN.

-sauoq
"My two cents aren't worth a dime.";

Re4: What is the fastest way to parse HTML? by dragonchild (Archbishop) on Jul 23, 2003 at 14:13 UTC
I, too, concur with Swish. Granted, I used it 8 years ago, but it was an excellent tool.

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means: worry only about what you need to implement.

Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.