http://qs1969.pair.com?node_id=175910

PyroX has asked for the wisdom of the Perl Monks concerning the following question:

Hey Folks,

I have a query that has been begging me to deal with it.

I, or my associates, need a large-scale search appliance. I would like to end up with functionality similar to that of google.com, but I don't really care who's linking to whom and why.

I need to build a spider. It doesn't have to be very complicated. Basically: open the initial page submitted to be crawled, parse the page's output, gather links and image names and URLs (building full URLs as we walk), grab any non-script/HTML text longer than x characters, add a database entry for that page, check each link gathered on the page against a list of domains we can't leave, discard the bad ones, follow the good ones, and start over again. When we get lost or mess up really badly, we die and start a child to pick up on the next link.

Simple.

Parsing the pages and checking the domains is easy. So is the database portion -- well, all of this is easy.
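To make that concrete, here is roughly the loop I'm picturing -- completely untested, just a sketch; LWP::UserAgent, HTML::LinkExtor and URI are simply the modules I'd reach for first, and the domains and seed URL are stand-ins:

    use strict;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my @allowed = ('example.com');              # stand-in for the domains we can't leave
    my @queue   = ('http://www.example.com/');  # stand-in for the initial page to crawl
    my %seen;

    my $ua = LWP::UserAgent->new(agent => 'MySpider/0.1');

    while (my $url = shift @queue) {
        next if $seen{$url}++;
        my $res = $ua->get($url);
        next unless $res->is_success && $res->content_type eq 'text/html';

        # gather links and image URLs, building full URLs as we walk
        my $extor = HTML::LinkExtor->new(undef, $url);
        $extor->parse($res->content);
        for my $link ($extor->links) {
            my ($tag, %attr) = @$link;
            my $abs = URI->new($attr{href} || $attr{src} || next)->canonical;
            next unless $abs->scheme eq 'http';
            next unless grep { $abs->host =~ /\Q$_\E$/i } @allowed;  # discard the bads
            push @queue, "$abs" if $tag eq 'a';
            # if $tag eq 'img', record the image name/URL for the database instead
        }
        # strip the tags, keep any text longer than x chars, add a database entry here
    }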

My question is: has this been done already? Do you recommend I develop this in Perl, or should I look elsewhere? What are your thoughts/blessings/jeers?

Replies are listed 'Best First'.
Re: Perl Search Appliance
by samtregar (Abbot) on Jun 20, 2002 at 05:00 UTC
    When you say "functionality similar to that of google.com" what do you mean exactly? When I hear that I'm guessing you mean that you need something of similar search result quality and depth of indexing. If that is the case stop right now and go buy the darn thing from Google. You won't get there on your own.

    As a reference point, I once built a search engine combining Apache/mod_perl, MySQL, and Glimpse. Working alone, it took around four months to complete. It indexed all of the Open Directory Project and served most queries in under a second running on a PII/600. The search result format was actually more complicated than Google's - it included the category hierarchy and had advanced tree-limiting features.

    The project was generally successful. However, it never came close to providing something comparable to Google. Why not? The search results sucked, to put it mildly. All it did was a simple partial-word match. Glimpse supported more, but the more advanced features were too slow to use. Also, the indexing was really, really slow. It would never scale to indexing the entire Internet no matter how much hardware you put behind it. As it was, it took around 6 hours to index the Open Directory database (although much of that was in character-set translation).

    So, in short, be very careful about what you attempt here. If you need Google, buy Google (or one of the competitors like Verity, etc.). If you can make do with much less then you might build it yourself. But have no illusions about what you'll end up with.

    -sam

      I mean the good results, the ability to classify and rate results, translate some pages, and provide search stats and so on.

      Alas, I do not have $80,000 to buy Google hardware. (I wish I did.)

      This is not going to map the Internet; it will be used on about 120,000 internal pages, on a couple of domains. (Small pages, under 4k mostly.)

      If you could point to some of your code as examples, I would appreciate it ;)
        If you don't have $80,000 to buy Google's solution then the chances are very good you don't have the money to develop it either. So the question becomes, can you live with what you can build?

        First off, you want to be able to "translate some pages". That's a pretty tall order. Are you at least planning to buy this piece or are you going to build this too?

        Second, you need to get a clearer idea of how good the search results need to be. Google has the best algorithms in the business and they aren't publishing them! People have a pretty good idea how they do it, but replicating it will take a lot of hard work and more than a few brainiacs in the barn. I recommend you look at Glimpse to see what some really smart people have been able to do with quite a lot of time. Google it ain't, but it's not bad either. Maybe you can use it as a backend component the way I did.

        Unfortunately, the search engine I built went straight to /dev/null along with the company that paid for it. They never sold a single copy, to my knowledge. Weren't the '90s fun?

        -sam

      I think you'll find that Harvest provides the web crawling, indexing, and web search UI you're looking for.

      It's very scalable (it can be clustered for both search and crawl, IIRC), pretty mature, and actively developed.

Re: Perl Search Appliance
by inblosam (Monk) on Jun 20, 2002 at 07:00 UTC
      Thank you for the link, I find them very useful -- hey, I just love examples.
Re: Perl Search Appliance
by zakb (Pilgrim) on Jun 20, 2002 at 08:11 UTC
    You might want to look at htDig, which is an open source search engine. It's been around a while and a number of sites use it. However, it may not produce the quality of results you'd expect from Google, for the reasons given in the other replies...
Re: Perl Search Appliance
by mattr (Curate) on Jun 20, 2002 at 13:29 UTC
    Hi. A well-written engine and a well-compiled database will make a Perl-only engine look pretty good, but you may want to use a C-based engine for all of it, or just the backend, if you need sheer power. It depends on your speed requirements and how structured the data is. Do you need multiple cursors so that many people can search concurrently? Does the database have to be updatable while being searched?

    120K pages @ 4KB/page = 480MB. Maybe cut off 100MB or more if it is HTML. This is not Google-scale by magnitudes; rather, it is medium-sized for htdig. I have a mod_perl and htdig system running on a gigabyte of data from 60 websites and it is running without a hitch. It is the engine on www.omron.com. The hardware is a two-year-old $6K 5U RedHat box with 5 RAID disks and a hot standby that provide way more than enough power currently, plus a smaller backup box. Downloading and indexing sequentially takes 17 hours, but the indexing itself is quite fast. I spent a long time making it look not cheesy, which worked at least partly :). It can do fuzzy searching (maybe turned off now) and stems words with morphological analysis, so you hit plurals and gerunds, etc. I also built a Perl-based administration section which proved useful, and they had me add another search site for their corporate news. You can maybe do these things too.

    As for your files, that many 4K pages tells me either you have an awful lot of poetry in your collection, you run a translation company, or the data is probably well-structured. Are you sure you can't get this data into a database and chop it up more? What does it look like and where did the data come from? Some analysis might even make a big hash perform well; memory's cheap. But real text searching involves text analysis (parts of speech, soundex, etc.), some kind of query analysis, and an inverted index, plus maybe caching and threading. Berkeley DB is also a good thing here. This is a significant amount of work even if there are many bits of C and Perl code that can help.

    There is also the matter of how fast this system has to react, i.e. < 1 second. So building this could cost $30-80K of your time. Making something which performs well in the real world is more than a "simple" job. Also, information indexing is one of the interesting research fields out there, not simplistic at all, and you get out exactly what you put into it. If you want some code I could sell you something that works, but it sounds like you want to build it all yourself. In that case have fun! I think you've already got a lot of hints; you have to do some homework on your own and then come back here maybe. I'd definitely recommend trying to code an inverted index for your own fun, to see how it handles, say, 100MB of data.
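    If it helps, the core of an inverted index really is tiny. Here is a toy, in-memory sketch with no ranking -- the document names are made up, and for real data you would keep the postings in Berkeley DB or a database rather than a plain hash:

        use strict;

        my %index;   # word => { doc_id => number of hits }

        sub index_doc {
            my ($doc_id, $text) = @_;
            $index{lc $1}{$doc_id}++ while $text =~ /(\w+)/g;
        }

        # AND together the posting lists of each query word
        sub search {
            my @words = map { lc } @_;
            my %hits;
            for my $word (@words) {
                my $postings = $index{$word} or return;   # a missing word kills the AND
                $hits{$_}++ for keys %$postings;
            }
            return grep { $hits{$_} == @words } keys %hits;
        }

        index_doc('page1.html', 'Omron corporate news and press releases');
        index_doc('page2.html', 'Contact list for corporate procedures');
        print "$_\n" for search('corporate', 'news');   # prints page1.html

    The hard part is everything around it: stemming, ranking, and making the postings hold up against hundreds of megabytes of data.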

    As for other available software, you could consider Glimpse, WAIS, Google, IIS Indexing Server, and even MySQL's SQL searching (haven't tried it on large data). There is also Ultraseek, but having installed that once (mitsubishi.com) I do not recommend it unless it is short, sweet, and you have bucks. The license grows with document (or was it URL) number, there was no developer documentation, and I found myself having to decipher cryptic Python code embedded in HTML a la ASP to do even a minimum of customization. But it works too.

    Translation? Please explain this more. A translation server is not cheap, though maybe you want SYSTRAN? Or are you going to be doing this by hand, or throwing it at the fish?

    UPDATED:

    Thanks for your response.

    In that case I can recommend breaking the search problem into pieces, if possible, to homogenize your data. If you have small, similar pages they may be easier to handle with one routine, for example. You may be interested in htdig.org for a straightforward index with many little bells and whistles; you can also get into the source code yourself. (I confess I had to remove a robot rule at their request. =8| ) But htdig is tough to configure and doesn't provide phrase searching. It is not a Google beater, but maybe a little like AltaVista.

    There is also namazu.org, made mainly for the Japanese language and not as well documented in English, also not as fast, but it's useful and popular in Japan. Incidentally, namazu's indexer is Perl and the search is in C, which may be of interest to you.

    There are a number of simple, weak Perl search programs out there, but I totally do not vouch for them, and one or another probably has a known security hole. I saw some at cgi.resourcindex.com. But there is a big difference between indexing 5MB and 500MB.

    There is an old Wired article I found once at Webmonkey about a simple inverted index program that may help illustrate some concepts. But there are a number of C/C++ engines developed over the past 10 years with Perl interfaces. Some people have mentioned Glimpse, though someone on the htdig mailing list said their Glimpse crashed at 150K documents, two years ago. Also, swish-e is used for searching CPAN, for example, and has a Perl API. Maybe you should look at searchtools.com, which has links to Perl search programs and many others.

    Finally, to be fair, I should mention JuggernautSearch, which has drawn some flaming on the htdig mailing list. It is listed in the Perl section of searchtools.com and mentions indexing a large number of documents, though it is relatively simple in its indexing. So while I have not tried it, it seems possible that a Perl-only search (which I think Juggernaut is?) could index 150,000 pages in realtime. Anybody else know about this?

      I should be more clear. Many of the pages are indeed very small, but a lot are not; some are quite large, containing several hundred contacts or procedures. I am mainly gathering information to "blueprint" the project. I want ideas and opinions, and yours have been helpful. As for translation, I may just cancel that altogether. I want to build this to be as fast and inexpensive as possible and yet provide as many features as I can. I just want to turn out a good product.
Re: Perl Search Appliance
by shotgunefx (Parson) on Jun 20, 2002 at 06:05 UTC
    In addition to the other comments, you might find this technical overview of Google by its founders informative.

    -Lee

    "To be civilized is to deny one's nature."
Re: Perl Search Appliance
by Bluepixel (Beadle) on Jun 20, 2002 at 09:02 UTC
    I'm currently (re)writing a search engine in Perl too.

    I'm using POE for simultaneous HTTP requests to index the sites. The most important thing is to pay attention to your database layout; I had to modify mine three times because of features I decided to add later.
    And make a good plan before you start writing it. Be sure how you want to rate the indexed pages.
    Edit: You might also have a look at the google programming contest group (http://groups.google.com/groups?q=google.public.programming-contest). People have mentioned some ideas there on how to rate a page.
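    For what it's worth, the basic shape of the simultaneous-request part with POE::Component::Client::HTTP looks roughly like this. It is only a sketch, not my actual crawler; the seed URL, the event name, and the indexing stub are placeholders:

        use strict;
        use POE qw(Component::Client::HTTP);
        use HTTP::Request;

        my @seeds = ('http://www.example.com/');   # placeholder seed URLs

        # one client component services many requests concurrently
        POE::Component::Client::HTTP->spawn(Alias => 'ua', Timeout => 30);

        POE::Session->create(
            inline_states => {
                _start => sub {
                    my $kernel = $_[KERNEL];
                    $kernel->post('ua', 'request', 'got_response',
                                  HTTP::Request->new(GET => $_)) for @seeds;
                },
                got_response => sub {
                    my ($req_packet, $res_packet) = @_[ARG0, ARG1];
                    my $request  = $req_packet->[0];
                    my $response = $res_packet->[0];
                    return unless $response->is_success;
                    # parse links here, post more 'request' events, index the text
                    print $request->uri, ': ', length($response->content), " bytes\n";
                },
            },
        );

        POE::Kernel->run();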
Re: Perl Search Appliance
by smitz (Chaplain) on Jun 20, 2002 at 08:32 UTC
    Just a quick tip/opinion for you:
    Try to avoid WWW::Robot as the basis for your spider. For unknown reasons, it tends to lock up after a few crawls; see this for more info. I'm currently talking to the authors at Canon Labs and will update the latter link with any info I get.
    Just my two pennies.

    SMiTZ
Re: Perl Search Appliance
by hacker (Priest) on Jun 20, 2002 at 12:14 UTC
    This sounds very similar to a project I've been working on that spiders webpages, extracts links, and packs them up into a file for installation and viewing on a Palm® handheld. Feel free to grab my code as a starting point, and get back to me with your updates/fixes/etc. Perhaps we can work together to help meet both of our goals.
Re: Perl Search Appliance
by $name (Pilgrim) on Jun 20, 2002 at 15:58 UTC
    Here is a link to most all search tools, appliances, Perl scripts, etc.
    Hope it helps.
    Let me know how it turns out.
    MGW Applications Developer, QuinnTeam Inc.
Re: Perl Search Appliance
by PyroX (Pilgrim) on Jun 20, 2002 at 16:24 UTC
    All the input has been great. Plus, I have the urge to actually write the thing now. I am thinking about going ahead with my idea for an "overlord" of sorts: I would like to be able to log on, see what pages are being looked at, start/stop any of the bots, and discard links I don't want them to follow as they come up.

    I am thinking about a 4-process system: the overlord and 3 bots. I want to make them slow and smart; I don't really care if it takes them a day to get through 1000 pages, or even less.

    My main concern is the response time for the searches.

    What do you (all) recommend for the database? Remember, I will not spend a bunch of $$$ for anything. I know how horrible MySQL is with over 500,000 of anything (i.e. rows in a table). I need more guidance with that portion.
Re: Perl Search Appliance
by mattr (Curate) on Jun 21, 2002 at 09:05 UTC
    About PDF indexing, I've used several converters, including the one mentioned above. I think xpdf should do the trick for you. Be careful which pdf2text you get from Google; you may end up with a much, much bigger system for Japanese PDFs from someone else. There is a PDF resource page, but I forget where.

    Some limitations I have seen in this and/or other converters are Adobe Level 2 fonts, Japanese, and (lack of) image handling.

    Also I haven't tried it but it seems Adobe now has something here that might be applicable.

Re: Perl Search Appliance
by PyroX (Pilgrim) on Jun 20, 2002 at 18:04 UTC
    ALSO:

    I would like to be able to parse PDF files and process the text. Anyone have some working examples for PDF-to-text conversions?
      I'm using pdf2text to convert PDFs to plain text before indexing, but it's not that fast (but then, I only index some 200 PDFs)
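      The conversion step itself is just a shell-out, roughly like this. This is only a sketch, shown here with xpdf's pdftotext (which writes to stdout when given '-'); the filename is just an example, so substitute whatever converter you actually use:

          use strict;

          # sketch only: assumes xpdf's pdftotext is on the PATH,
          # and 'manual.pdf' is just an example filename
          my $pdf  = 'manual.pdf';
          my $text = `pdftotext '$pdf' - 2>/dev/null`;   # '-' sends the text to stdout
          die "conversion failed for $pdf\n" if $?;
          $text =~ s/\s+/ /g;                            # squash whitespace before indexing
          # ...hand $text to the indexer...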
      -- #!/usr/bin/perl for(ref bless[],just'another'perl'hacker){s-:+-$"-g&&print$_.$/}
      As mattr has pointed out, have a look at namazu.org. Their crawler seems to index PDF pages too.
      I would recommend trying out or reading the code of the other search engines mattr mentioned in his post before starting to write your own. You will get a lot of useful ideas from them.

      As for the database, I currently use MySQL (unfortunately -- it's slow). I give each word a unique id and then split the words found in the documents over several tables, so the tables won't get too large.
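      Roughly, the layout looks like this -- a simplified sketch with made-up table and column names, not my exact schema; the real thing splits the word/document table into several pieces keyed on the word id:

          use strict;
          use DBI;

          # made-up names, for illustration only
          my $dbh = DBI->connect('dbi:mysql:searchdb', 'user', 'password',
                                 { RaiseError => 1 });

          $dbh->do(q{
              CREATE TABLE word (
                  word_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                  word    VARCHAR(64)  NOT NULL,
                  UNIQUE (word)
              )
          });

          $dbh->do(q{
              CREATE TABLE word_doc (   -- in practice split into word_doc_0 .. word_doc_N
                  word_id INT UNSIGNED NOT NULL,
                  doc_id  INT UNSIGNED NOT NULL,
                  hits    INT UNSIGNED NOT NULL DEFAULT 1,
                  INDEX (word_id),
                  INDEX (doc_id)
              )
          });

          # a one-word query is then a join, ordered by a crude hit count
          my $docs = $dbh->selectcol_arrayref(q{
              SELECT wd.doc_id
              FROM   word w JOIN word_doc wd ON wd.word_id = w.word_id
              WHERE  w.word = ?
              ORDER  BY wd.hits DESC
          }, undef, 'health');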
Re: Perl Search Appliance
by Cody Pendant (Prior) on Jun 22, 2002 at 04:06 UTC
    It's worth noting that the thing which makes Google so popular is its result matching.

    Their database isn't just "which pages contain which words at which URLs" -- that's easy, ignoring size for a moment! The good thing about Google is that it cross-matches pages with the links that point to them from somewhere else.

    You might have a page without the word "health" in it at all. It might be full of "well-being" and "fitness" and medical words like "viability", but if a million people have linked to it with "here's a good health page", then a search for "health" will have your page at the top anyway.
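    In crawler terms that means indexing each link's text against the page it points to, not just the page it appears on. A hedged sketch of that one piece -- index_term() is a made-up stand-in for whatever your indexer does with a word/document pair:

        use strict;
        use HTML::TokeParser;
        use URI;

        # credit the words in a link's anchor text to the page it points AT
        sub harvest_anchor_text {
            my ($html, $base_url) = @_;
            my $p = HTML::TokeParser->new(\$html);
            while (my $tag = $p->get_tag('a')) {
                my $href   = $tag->[1]{href} or next;
                my $target = URI->new_abs($href, $base_url)->canonical->as_string;
                my $text   = $p->get_trimmed_text('/a');   # e.g. "here's a good health page"
                index_term(lc $_, $target) for $text =~ /(\w+)/g;
            }
        }

        sub index_term {
            my ($word, $doc) = @_;
            # bump the weight of $word for $doc; anchor-text hits get an extra boost
        }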

    ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;