There was a discussion thread about this a couple of weeks ago.
Short answer:
PS: It won't be that quick. In my experience, HTML::TreeBuilder takes a significant fraction of a second to load an average HTML document, so multiply that by 50_000 files and you are looking at a day or so of processing time. Also you need to explicitly delete html trees once you are done with them, as the data structures generated by HTML::TreeBuilder will not be freed automatically when they go out of scope, and each will consume several megabytes of RAM.
In reply to Re: how to quickly parse 50000 html documents?
by chrestomanci
in thread how to quickly parse 50000 html documents?
by brengo
For: | Use: | ||
& | & | ||
< | < | ||
> | > | ||
[ | [ | ||
] | ] |