svlada has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all of you Perl Ninjas :) I am new to this community (I am a Java dev). Recently I came up with an idea for a pet project: crawling around 300 web sites on a weekly basis, with every website getting a dedicated crawler. I have a dilemma about which language to use for this task: Perl or Java?

Replies are listed 'Best First'.
Re: Perl spider
by davido (Cardinal) on Apr 21, 2011 at 21:32 UTC

    Whichever language you are most comfortable with. If you already know both languages, you would know that Perl is well suited to such things, especially if you throw in a few well-chosen modules. But if you don't already know Perl, you might just want to get moving with the language you are already proficient in.


    Dave

      Well, I have done a little research and found some interesting modules like LWP and WWW::Mechanize. It looks really easy and simple. My question is actually how efficient Perl is, in terms of memory and CPU utilization, compared to Java for this particular task.
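
      For what it's worth, a minimal fetch with LWP looks roughly like this (the URL and the user-agent string are placeholders, not anything from the thread):

          use strict;
          use warnings;
          use LWP::UserAgent;

          my $ua = LWP::UserAgent->new(
              agent   => 'MySpider/0.1',   # placeholder user-agent string
              timeout => 30,
          );

          my $res = $ua->get('http://example.com/');   # placeholder URL
          if ( $res->is_success ) {
              my $html = $res->decoded_content;        # page body, decoded to Perl characters
              printf "fetched %d characters\n", length $html;
          }
          else {
              warn "fetch failed: ", $res->status_line, "\n";
          }
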
        I don't have much experience with spiders, but don't you think that the network will be the bottleneck?

        I expect Perl to parse HTML much faster than the server responses come in.

        Cheers Rolf

        Perl has modules like Scrappy ("All Powerful Web Harvester, Spider, Scraper fully automated"), WWW::Crawler::Lite, WWW::Spyder, or Gungho.

        No need to amuse yourself with the low-level stuff.
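
        For comparison, here is a rough sketch of the bookkeeping those modules take off your hands, written against plain WWW::Mechanize; the starting URL and the same-host rule are just assumptions for the example:

            use strict;
            use warnings;
            use WWW::Mechanize;
            use URI;

            my $start = 'http://example.com/';           # assumed starting point
            my $host  = URI->new($start)->host;
            my $mech  = WWW::Mechanize->new( autocheck => 0 );

            my %seen;
            my @queue = ($start);
            while ( my $url = shift @queue ) {
                next if $seen{$url}++;                   # skip pages we have already fetched
                $mech->get($url);
                next unless $mech->success;
                # ... extract whatever you need from $mech->content here ...
                for my $link ( $mech->links ) {
                    my $abs = $link->url_abs or next;    # absolute URI object for the link
                    next unless $abs->can('host') && defined $abs->host;
                    push @queue, $abs->as_string if $abs->host eq $host;
                }
            }
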

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Perl spider
by John M. Dlugosz (Monsignor) on Apr 22, 2011 at 05:03 UTC
    Well, slurping through text is what Perl was built for.

    If most of what you need is already on CPAN, it's a sure win.
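
    A trivial illustration of the kind of text-munging meant here, assuming the page has already been saved to a local file named page.html (a placeholder); for anything non-trivial you would reach for a CPAN HTML parser instead:

        use strict;
        use warnings;

        # Slurp a saved page and pull out a couple of fields with plain pattern matches.
        open my $fh, '<', 'page.html' or die "page.html: $!";
        my $html = do { local $/; <$fh> };

        my ($title) = $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is;
        my @links   = $html =~ m{href\s*=\s*"([^"]+)"}ig;

        print "title: ", ( defined $title ? $title : '(none)' ), "\n";
        print "links: ", scalar @links, "\n";
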

Re: Perl spider
by believer (Sexton) on Apr 22, 2011 at 09:02 UTC
    I have some experience with crawling in both languages, and I found that there's not much difference between Java and Perl when it comes to performance.

    I think another choice is more important:

    1) use a browser + Selenium. Heavy on resources, but comes with a lot of features that can dramatically cut development time per website.

    2) use lightweight modules like WWW::Mechanize. Cheap on resources, but you are guaranteed headaches on sites that are heavy on obscure JavaScript.

    Crawling 300 websites once a week is not that heavy, so I would go for option 1 (sketched below).
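
    A rough sketch of option 1, assuming Selenium::Remote::Driver as the binding and a Selenium server already running on localhost:4444 (the URL below is a placeholder):

        use strict;
        use warnings;
        use Selenium::Remote::Driver;

        # Assumes a Selenium server listening on localhost:4444 and Firefox installed.
        my $driver = Selenium::Remote::Driver->new(
            remote_server_addr => 'localhost',
            port               => 4444,
            browser_name       => 'firefox',
        );

        $driver->get('http://example.com/');         # placeholder URL
        my $html = $driver->get_page_source;         # DOM after JavaScript has run
        print "title: ", $driver->get_title, "\n";

        $driver->quit;
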
Re: Perl spider
by CountZero (Bishop) on Apr 22, 2011 at 19:18 UTC
    Of course, real men will program that spider in assembly at the level of the TCP/IP stack of their wireless router.

    No, seriously, why consider only Perl or Java?

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Probably because he already knows Perl and Java -- learning a new language isn't always in the budget.
Re: Perl spider
by anonymized user 468275 (Curate) on Apr 22, 2011 at 14:24 UTC
    The Google white paper at http://infolab.stanford.edu/~backrub/google.html says that Google's crawler started off in Python (which is probably higher-performance than Java), but that anything requiring the best performance is written in C and C++.

    So Java, which inflicts a foreign virtual-machine architecture on every execution, carries the risk that it cannot take advantage of procedural code compiled to native machine code where a high-throughput algorithm demands it.

    Perl, on the other hand, is implemented in C and can even call C routines if really necessary, while it can take care of any object-oriented needs itself.

    So, given that Google's is a tried-and-tested approach, it seems to make sense to use the languages best able to follow in Google's footsteps, i.e. Perl, with C where necessary.
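
    And for the "call C where it matters" part, Inline::C is one well-trodden route; a toy sketch of the mechanism (sum_to is just a made-up example function):

        use strict;
        use warnings;

        # Push a hot inner loop down into C; Inline::C compiles it and binds it as a Perl sub.
        use Inline C => <<'END_C';
        long sum_to(long n) {
            long i, total = 0;
            for (i = 1; i <= n; i++) total += i;
            return total;
        }
        END_C

        print sum_to(10_000), "\n";   # called from Perl like any other sub
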

    One world, one people