in reply to Using filepath method to identify an .html page

Does it have to be an integer? How about a hex string?

$ cat testMD5.pl
use strict;
use Digest::MD5 qw/ md5_hex /;
my $digest=md5_hex("http://www.berghold.net");
printf "%s\n",$digest
which yields:
$ perl testMD5.pl
84b40875c5bc4da7ae368175025a32f9
Just a thought...


Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg

Re^2: Using filepath method to identify an .html page
by Nik (Initiate) on Jan 22, 2013 at 16:14 UTC
    Yes, it has to be a 4-digit number to fit the database table's respective column.

    I just posted what I want to do more clearly here: http://www.perlmonks.org/?node_id=1014708

      OK: How about this:

      $ cat testMD5.pl
      use strict;
      foreach my $url(qw@ /index.html /about/time.html @){
          hashit($url);
      }
      sub hashit {
          my $url=shift;
          my @ltrs=split(//,$url);
          my $hash = 0;
          foreach my $ltr(@ltrs){
              $hash = ( $hash + ord($ltr)) %10000;
          }
          printf "%s: %0.4d\n",$url,$hash
      }
      which yields:
      $ perl testMD5.pl
      /index.html: 1066
      /about/time.html: 1547
      Keep in mind this is hardly bulletproof. You also need a way to detect hash collisions and a rehash algorithm.
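To make the collision-detection and rehash idea concrete, here is a minimal sketch, assuming an in-memory hash (%path_for_pin, a hypothetical stand-in for the database table) and linear probing as the rehash step. Neither assign_pin nor the probing strategy comes from the thread; they are only one possible way to do it.

```perl
use strict;
use warnings;

# Additive hash from the post above: sum of character codes, modulo 10000.
sub hashit {
    my ($url) = @_;
    my $hash = 0;
    foreach my $ltr (split //, $url) {
        $hash = ($hash + ord($ltr)) % 10000;
    }
    return $hash;
}

# Hypothetical stand-in for the database table: pin => path.
my %path_for_pin;

# Assign a pin to a path. On a collision, try the next pin (linear probing).
sub assign_pin {
    my ($path) = @_;
    my $pin = hashit($path);
    foreach my $attempt (1 .. 10000) {
        # Slot is free, or this path already owns it: done.
        if (!exists $path_for_pin{$pin} || $path_for_pin{$pin} eq $path) {
            $path_for_pin{$pin} = $path;
            return $pin;
        }
        $pin = ($pin + 1) % 10000;    # collision detected, probe the next slot
    }
    die "all 10000 pins are in use\n";
}

foreach my $path (qw( /abc.html /bca.html /abc.html )) {
    printf "%s: %04d\n", $path, assign_pin($path);
}
```

With these three calls the sketch should print 0824 for /abc.html, 0825 for /bca.html (its natural slot 0824 is taken), and 0824 again for the repeated /abc.html. Note the consequence: once a rehash is involved, the pin depends on insertion order, so the path has to be stored next to the pin to detect and resolve collisions.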

      This brings to mind "the old days" circa 1974 writing assemblers for 8080 microprocessors. Symbol "folding" and hashing.

      UPDATE: Limiting yourself to four digits may not be very useful if you have a lot of pages to index in your database. The wider your hash, the less likely hash collisions become.


      Peter L. Berghold -- Unix Professional
      Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
        The Perl code will produce the same hash for "abc.html" as for "bca.html", because it only sums the character codes and ignores their order.
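A small sketch of that order problem: additive_hash repeats the posted algorithm, while weighted_hash is a hypothetical position-sensitive variant (multiply the running value by 31 before adding each character). The multiplier 31 and both function names are assumptions for illustration, not something proposed in the thread.

```perl
use strict;
use warnings;

# Additive hash as posted: character order does not affect the sum.
sub additive_hash {
    my ($s) = @_;
    my $h = 0;
    foreach my $ltr (split //, $s) {
        $h = ($h + ord($ltr)) % 10000;
    }
    return $h;
}

# Position-sensitive variant (an assumption, not from the thread):
# scale the running value by 31 before adding each character code.
sub weighted_hash {
    my ($s) = @_;
    my $h = 0;
    foreach my $ltr (split //, $s) {
        $h = ($h * 31 + ord($ltr)) % 10000;
    }
    return $h;
}

foreach my $name (qw( abc.html bca.html )) {
    printf "%-9s additive=%04d weighted=%04d\n",
        $name, additive_hash($name), weighted_hash($name);
}
```

The two names get the same additive value but different weighted values. The weighted variant still has only 10,000 possible results, so it removes the anagram problem without removing collisions in general.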

        In any case, with only 10,000 possible values, the likelihood of a hash collision for any non-trivial website is substantial. If you hash 100 files, you have about a 40% chance of at least one collision.

        If you hash 220 files, the likelihood is about 90%.
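Those figures match the classic birthday-problem calculation. The sketch below assumes pins are spread uniformly over 10,000 values; collision_probability is a hypothetical helper name. The additive hash is probably less uniform than that, so treat the result as an optimistic estimate.

```perl
use strict;
use warnings;

# Probability that at least two of $n items share a pin when pins are
# drawn uniformly from $slots possible values (birthday problem).
sub collision_probability {
    my ($n, $slots) = @_;
    return 1 if $n > $slots;
    my $p_unique = 1;
    foreach my $k (0 .. $n - 1) {
        $p_unique *= ($slots - $k) / $slots;    # item k+1 avoids the k pins already taken
    }
    return 1 - $p_unique;
}

foreach my $files (100, 220) {
    printf "%3d files: %.0f%%\n", $files, 100 * collision_probability($files, 10_000);
}
```

For 100 and 220 files this comes out at roughly 39% and 91%.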
        Now that I am thinking about it more, I don't have to turn the 'number' back into a 'path'.

        So, what I want is a function foo() that does this:

        foo( "some long string" ) --> 1234

        =====================
        1. User requests a specific HTML page (.htaccess gives my script the absolute path for that .html page)
        2. turn the 'path' into a 4-digit number and store it in the database as 'pin' (how?)
        3. I store that number in the database. I DON'T EVEN HAVE TO STORE THE HTML PAGE'S PATH IN THE DATABASE ANYMORE!!! This is just great!

        At some later time I want to check the weblog of that .html page:

        1. request the page as: http://mydomain.gr/index.html?show=log
        2. .htaccess gives my script the absolute path of the requested .html file
        3. turn the 'path' into a 4-digit number (this is what I am asking)
        4. use the 'pin' variable to select all log records for that specific .html page (based on the 'pin' column)


        Since I have the requested 'path', which has been converted to a 4-digit number stored in the database, I know which page I am requesting detailed data for, so I look up the 'pin' column in the database and thus know which records to select. NO NEED to store absolute paths anymore, just a 4-digit number for each .html page.

        No need to turn the number back into a path anymore, just the path into a number, to identify the specific .html page.

        Does your solution, which SEEMS GREAT, APPLY to my specifications?
Re^2: Using filepath method to identify an .html page
by Anonymous Monk on Jan 23, 2013 at 11:00 UTC

    The OP seems a bit unwilling to listen, but if you truncate the hash, you meet his spec.

    use Digest::MD5 qw/ md5_hex /;
    my $digest=md5_hex("http://www.berghold.net");
    printf "%d\n", hex(substr($digest, -4)) % 10_000;

    So the one-liner he's looking for would be hex(substr(md5_hex($someurl), -4)) % 10_000. I'm betting that is somewhat collision resistant. The OP should still consider enlarging his int2 column to int4 or int8 or even char(4).
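Wrapped up as the foo() the OP asked for, the one-liner could look like the sketch below. The name path_to_pin and the sample path are assumptions; the expression itself is the one given above.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# Map a path to a pin in 0..9999: take the last 4 hex digits of the MD5
# digest (0..65535) and reduce them modulo 10000.
sub path_to_pin {
    my ($path) = @_;
    return hex(substr(md5_hex($path), -4)) % 10_000;
}

# The same path always yields the same pin, so nothing has to be
# converted back from number to path.
my $pin = path_to_pin('/index.html');
printf "%04d\n", $pin;
```

Two caveats on this sketch: reducing 0..65535 modulo 10000 makes pins 0000 to 5535 slightly more likely than the rest, and with only 10,000 possible pins different paths will still collide on any site of non-trivial size.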