PerlMonks  

Hey Folks,

I have a question that has been begging me to deal with it.

My associates and I need a large-scale search appliance. I would like to end up with functionality similar to that of google.com, but I don't really care who's linking to whom or why.

I need to build a spider. It doesn't have to be very complicated. Basically: open the initial page submitted to be crawled, parse the page's output, gather links, image names and URLs (building full URLs as we walk), grab any non-script/HTML text longer than x chars, add a database entry for that page, check each link gathered on the page against a list of domains that we can't leave, discard the bads, follow the goods, and start over again. When we get lost or mess up real bad, we die and start a child to pick up on the next link.
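To pin down the loop described above, here is a rough sketch, written in Python only because its standard library covers the parsing; it is not a recommendation of language and not a finished spider. The names `LinkCollector`, `in_allowed_domain`, `crawl`, and the `fetch` callable are all made up for illustration, a plain dict stands in for the database, and "die and start a child" is simplified to "skip the page that failed".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute link URLs from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Build the full URL as we walk and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.append(absolute)


def in_allowed_domain(url, allowed_domains):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def crawl(start_url, allowed_domains, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` is a hypothetical callable
    that returns the page's HTML as a string or raises on failure."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}  # stand-in for the database: url -> links found on that page
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # one bad page should not kill the whole crawl
        collector = LinkCollector(url)
        collector.feed(html)
        pages[url] = collector.links
        for link in collector.links:
            # Discard the bads, follow the goods.
            if link not in seen and in_allowed_domain(link, allowed_domains):
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as an argument keeps the network code separate, so the loop can be exercised against a dict of canned pages before it is ever pointed at a real site.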

Simple.

Parsing the pages and checking the domains is easy. So are the database portions; well, all of this is easy.
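For the parsing step, something along these lines is what I have in mind, again sketched in Python purely for illustration. `PageContent` and `min_chars` (the "x chars" threshold) are hypothetical names, and the sketch assumes that only `<script>` and `<style>` contents count as non-text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageContent(HTMLParser):
    """Collects image URLs and visible text runs longer than `min_chars`."""

    SKIP = {"script", "style"}

    def __init__(self, base_url, min_chars=20):
        super().__init__()
        self.base_url = base_url
        self.min_chars = min_chars
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.images = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative image paths against the page URL.
                self.images.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        chunk = " ".join(data.split())  # collapse runs of whitespace
        if self.skip_depth == 0 and len(chunk) > self.min_chars:
            self.text.append(chunk)
```

After calling `feed()` with a page's HTML, `images` and `text` hold what would go into that page's database entry.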

My question is: has this been done already? Do you guys recommend I develop this in Perl, or should I look elsewhere? What are your thoughts/blessings/jeers?

In reply to Perl Search Applicance by PyroX
