rob_au has asked for the wisdom of the Perl Monks concerning the following question:

I have come to a point in the development of an application where I need to bounce a few ideas off my fellow monks. I am building an application that indexes certain web pages. At this stage I have tentatively decided upon Berkeley DB 3.x as the basis for my data store (interfaced specifically via BerkeleyDB::Hash) and am now trying to decide which aspect of the web pages being indexed is best stored as the hash key.

The most direct method, of course, would be to use the escaped URL of the web page as the key (most likely as generated by URI::Escape), but I am wondering if there might be a cleaner and more expansive (read: ordered) way to index such pages. I have also considered using an MD5 hash of either the URL or the page itself as the key, but this seems like overkill given the time involved in generating these MD5 hashes for every subsequent lookup. The priority here is not so much ease and speed of indexing as of the subsequent matching and lookup of the data - note that lookups will again be derived from the location URL.

Should I stick with the idea of an escaped URL as the hash key or do other monks here have a more ordered approach that I can use to index this data?
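For what it's worth, here is a minimal sketch of the two keying schemes I'm weighing against each other. It assumes only the core Digest::MD5 module; the escaping regex merely approximates what URI::Escape's uri_escape() does, and the URL is made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# A made-up URL for illustration only.
my $url = 'http://www.example.com/monks/index.html?q=berkeley db';

# Scheme 1: escape unsafe characters so the URL itself is the key.
# (Roughly what URI::Escape's uri_escape() does for its default
# unsafe set; readable, variable-length keys.)
(my $escaped = $url) =~ s/([^A-Za-z0-9\-_.!~*'()])/sprintf("%%%02X", ord $1)/ge;

# Scheme 2: a fixed-length MD5 digest of the URL as the key.
my $digest = md5_hex($url);

print "escaped key: $escaped\n";
print "md5 key:     $digest (", length($digest), " chars)\n";
```

The digest gives short, uniform keys but is opaque - you can never recover the URL from the key, only recompute the key from a URL.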

 

Ooohhh, Rob no beer function well without!

Replies are listed 'Best First'.
Re: Data indexing in BerkeleyDB hashes
by blakem (Monsignor) on Sep 16, 2001 at 12:40 UTC
    I don't think an MD5 hash of the page itself would work well as a lookup key... how would you regenerate it when you want to look up the information?

    That said, I would think an escaped URI might be the best solution, though I haven't done extensive work with BerkeleyDB... (just a few small int=>data lookup tables)

    -Blake

Re: Data indexing in BerkeleyDB hashes
by thpfft (Chaplain) on Sep 16, 2001 at 15:23 UTC

    am now trying to decide upon the best aspect of the web pages being indexed to store as the hash key.

    ...depends on what you're going to know when you want to get the value out again, no? If you're just going to be looking up pages then I'd say an escaped URL is ideal, though you might need to consider synonyms like / and /index.s?html? and all that.

    I've been in roughly this situation before and settled on using id numbers, out of the misguided conviction that indexing and sorting would be more efficient. I ended up performing far more lookups than were really necessary, mostly just to get the url back :(
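    A rough pass at collapsing those synonyms before using the URL as a key might look like this. The rules and URLs are only illustrative - real canonicalization needs more care:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collapse common URL synonyms to a single canonical form before
# using the result as a key. These rules are illustrative, not
# exhaustive.
sub canonical {
    my ($url) = @_;
    $url =~ s/#.*$//;                        # drop any fragment
    $url =~ s{/index\.s?html?$}{/};          # /index.html, /index.shtml -> /
    $url =~ s{^(https?://[^/]+)$}{$1/};      # bare host -> host with slash
    $url =~ s{^(https?://[^/]+)}{lc $1}e;    # hosts are case-insensitive
    return $url;
}

print canonical('http://Example.com'), "\n";            # http://example.com/
print canonical('http://example.com/index.html'), "\n"; # http://example.com/
```

Both forms now map to the same key, so the index never stores the same page twice under two spellings.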

Re: Data indexing in BerkeleyDB hashes
by perrin (Chancellor) on Sep 16, 2001 at 19:43 UTC
    If the URLs are unique, you can use them. If not, you need something else. Incidentally, you don't have to escape the URLs because any character that is valid in a URL will be valid as a BerkeleyDB key.
Re: Data indexing in BerkeleyDB hashes
by shotgunefx (Parson) on Sep 17, 2001 at 00:54 UTC
    If it is to spider one domain, you could use the escaped relative URL, which could cut the key lengths down quite a bit. (I don't know whether key length is an issue with Berkeley.)
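    Something like this could strip the common prefix for a single-domain spider - the base URL here is just an assumption, and you'd keep it in one place so the full URL can be rebuilt from the key on the way out:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# For a single-domain spider, store only the path-and-query part as
# the key and keep the base in one place. Base and URLs are made up.
my $base = 'http://www.example.com';

sub rel_key {
    my ($url) = @_;
    (my $key = $url) =~ s/^\Q$base\E//;    # strip the known prefix
    $key = '/' if $key eq '';              # bare host -> root
    return $key;
}

sub abs_url {
    my ($key) = @_;
    return $base . $key;                   # reverse the mapping for lookups
}

print rel_key("$base/monks/123.html"), "\n";   # /monks/123.html
```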

    If you want to get really sidetracked, check out The Google Whitepaper

    -Lee

    "To be civilized is to deny one's nature."