
Re: how to quickly parse 50000 html documents?

by afoken (Chancellor)
on Nov 26, 2010 at 10:48 UTC (#873828)

in reply to how to quickly parse 50000 html documents?

So how do I parse these files quickly, reading all these values (stripped of dollar signs, commas, percentages) as quickly as possible?

I guess I'd use File::Slurp to store a file in a scalar, then HTML::TableExtract (How do I get the second occurrence?)? Or should I use a regex (how do I get the second occurrence?)? Or a template (how?)?

Well, I'm tempted to answer "start by parsing one file, then repeat that for the remaining 49,999 files".

No, really. Start with one HTML file, write readable code, and DON'T optimize AT ALL. Use whatever seems reasonable. Don't slurp files yourself if the parsing module has a function to read from a file. Check whether your code works with a second HTML file, and a third. Fix bugs. Still, DON'T optimize.
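A minimal sketch of that first, readable version using HTML::TableExtract. The filename and the table position are assumptions, not known facts about your documents; note that count => 1 selects the *second* table at a given depth (it's zero-indexed), which is one answer to the "how do I get the second occurrence?" question:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# depth => 0, count => 1: the second top-level table in the document.
# Adjust depth/count (or use headers => [...]) to match your real files.
my $te = HTML::TableExtract->new(depth => 0, count => 1);
$te->parse_file('file1.html');    # hypothetical filename

# rows() returns the rows of the first table that matched.
for my $row ($te->rows) {
    # Strip dollar signs, commas, and percent signs from each cell.
    my @clean = map { my $c = $_ // ''; $c =~ tr/$,%//d; $c } @$row;
    print join("\t", @clean), "\n";
}
```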

svn commit. (You may also use CVS, git, whatever, but make sure you can get back old versions of your code.)

Now, install Devel::NYTProf and run perl -d:NYTProf yourscript.pl file1.html, followed by nytprofhtml. Open nytprof/index.html and find out which code takes the most time to run. Look at everything with a red background. Optimize that code, and only that code.

Repeat until you find no more code to optimize.

Repeat with several other HTML files.

Be prepared to find modules (from CPAN) that are far from being optimized for speed. Try to switch to a different module if your script spends most of the time in a third-party module. Run NYTProf again after switching. Compare total time used before and after switching. Use whatever is faster. (For example, I learned during profiling that XML::LibXML was more than 10 times faster than XML::Twig with my problem and my data.)

Repeat profiling with several files at once, and find code that is called repeatedly without need. Eliminate that code if it slows down processing.

Note that HTML and XML are two different things that nevertheless have much in common. Perhaps XML::LibXML can parse your HTML documents (using the parse_html_file() method) well enough to be helpful, and faster than any pure-Perl module could ever run. Check whether XML::LibXML can read your HTML documents at all, then compare the speed using NYTProf.
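A sketch of the XML::LibXML route, assuming (as above) that the values sit in the second table of each document; the filename and XPath are placeholders. The recover option tells libxml2 to tolerate the tag soup found in real-world HTML:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# recover => 2 suppresses errors on malformed markup instead of dying.
my $doc = XML::LibXML->new->parse_html_file(
    'file1.html',                 # hypothetical filename
    { recover => 2 },
);

# (//table)[2] is the second table in the whole document.
for my $cell ($doc->findnodes('(//table)[2]//td')) {
    my $text = $cell->textContent;
    $text =~ tr/$,%//d;           # strip dollar signs, commas, percentages
    print "$text\n";
}
```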

<update>If you have a multi-processor machine, try to run several jobs in parallel. Have a managing process that keeps N (or 2N) worker processes working, where N is the number of CPU cores.</update>
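One way to keep N workers busy is Parallel::ForkManager from CPAN; a sketch, where the glob pattern and worker count are assumptions and process_file() stands in for whatever per-file parsing code you ended up with:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $workers = 4;                  # N (or 2N) = number of CPU cores
my $pm      = Parallel::ForkManager->new($workers);

for my $file (glob '*.html') {
    $pm->start and next;          # parent: spawn a worker, move on
    process_file($file);          # child: do the actual parsing
    $pm->finish;                  # child exits; manager refills the pool
}
$pm->wait_all_children;
```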


Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
