Re: Perl spider
by davido (Cardinal) on Apr 21, 2011 at 21:32 UTC
|
Whichever language you are most comfortable with. If you already know both languages, you know that Perl is well suited to such things, especially if you throw in a few well-chosen modules. But if you don't already know Perl, you might just want to get moving with the language you are proficient in.
| [reply] |
|
|
Well, I have done a little research and found some interesting modules, like LWP and WWW::Mechanize. They look really easy and simple. My question is actually how efficient Perl is in terms of memory and CPU utilization compared to Java for this particular task.
| [reply] |
Re: Perl spider
by John M. Dlugosz (Monsignor) on Apr 22, 2011 at 05:03 UTC
|
Well, slurping through text is what Perl was built for.
If most of what you need is already on CPAN, it's a sure win.
| [reply] |
Re: Perl spider
by believer (Sexton) on Apr 22, 2011 at 09:02 UTC
|
I have some experience with crawling in both languages, and I found that there's not much difference between Java and Perl when it comes to performance.
I think another choice is more important:
1) Use a browser + Selenium. Heavy on resources, but it comes with a lot of features that can dramatically cut development time per website.
2) Use lightweight modules like WWW::Mechanize. Cheap on resources, but you are guaranteed headaches with sites that rely heavily on obscure JavaScript.
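To make the trade-off in option 2 concrete, here is a minimal sketch of what a lightweight, non-browser crawler actually sees. It is written in Python with only the standard library, purely for illustration; it is not WWW::Mechanize, and the sample page, the example.com URL, and the names LinkCollector and extract_links are all hypothetical. A static parser collects the links present in the markup, but a link that a script creates at run time never shows up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags present in the static markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> found in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one ordinary link, plus one link that only
# exists after the script has run in a real browser.
SAMPLE_PAGE = """
<html><body>
  <a href="/products">Products</a>
  <script>
    var a = document.createElement('a');
    a.href = '/hidden';
    document.body.appendChild(a);
  </script>
</body></html>
"""

links = extract_links(SAMPLE_PAGE, "http://example.com/index.html")
print(links)
```

Only the /products link is collected here. The /hidden link exists only once the script has executed, which is the kind of page where a browser driven by Selenium (option 1) earns its resource cost.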
Crawling 300 websites once a week is not that heavy, so I would go for option 1.
| [reply] |
Re: Perl spider
by CountZero (Bishop) on Apr 22, 2011 at 19:18 UTC
|
Of course, real men will program that spider in assembly at the level of the TCP/IP stack of your wireless router. No, seriously, why consider only Perl or Java?
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
| [reply] |
|
|
The OP probably knows Perl and Java; learning a new language isn't always in the budget.
| [reply] |
Re: Perl spider
by anonymized user 468275 (Curate) on Apr 22, 2011 at 14:24 UTC
|
The Google white paper at http://infolab.stanford.edu/~backrub/google.html says that Google's crawler started off in Python (which is probably higher performance than Java), but that anything requiring the best performance is written in C and C++. So Java, which imposes a foreign virtual machine architecture on every execution, carries the risk that it cannot take advantage of procedural code compiled into native machine code where a high-throughput algorithm demands it. Perl, on the other hand, is implemented in C and can even call C routines if really necessary, while it can take care of any object-oriented needs itself.
Given that Google's approach is tried and tested, it seems to make sense to use the languages best able to follow in Google's footsteps, using C where necessary, i.e. Perl and C.
| [reply] |