PerlMonks  

Re: Using filepath method to identify an .html page

by blue_cowdawg (Monsignor)
on Jan 22, 2013 at 16:04 UTC


in reply to Using filepath method to identify an .html page

      What i want to do, is to associate a number to an html page's absolute path

Does it have to be an integer? How about a hex string?

$ cat testMD5.pl
use strict;
use Digest::MD5 qw/ md5_hex /;
my $digest=md5_hex("http://www.berghold.net");
printf "%s\n",$digest
which yields:
$ perl testMD5.pl
84b40875c5bc4da7ae368175025a32f9
Just a thought...


Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg

Replies are listed 'Best First'.
Re^2: Using filepath method to identify an .html page
by Nik (Initiate) on Jan 22, 2013 at 16:14 UTC
    yes it has to be a 4-digit number to fit the database table's respective column

    here i just posted what i want to do more clearly: http://www.perlmonks.org/?node_id=1014708

      OK: How about this:

      $ cat testMD5.pl
      use strict;
      foreach my $url(qw@ /index.html /about/time.html @){
          hashit($url);
      }
      sub hashit {
          my $url=shift;
          my @ltrs=split(//,$url);
          my $hash = 0;
          foreach my $ltr(@ltrs){
              $hash = ( $hash + ord($ltr)) %10000;
          }
          printf "%s: %0.4d\n",$url,$hash
      }
      which yields:
      $ perl testMD5.pl
      /index.html: 1066
      /about/time.html: 1547
      Keep in mind this is hardly bulletproof. You also need a method to detect hash collisions and a rehash algorithm.
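One way the collision detection and rehash mentioned above might look, as a minimal sketch only. The two hashes stand in for the database table, the names additive_hash and assign_pin are made up for illustration, and the "rehash" is plain linear probing (step forward by one until a free pin is found), which is an assumption and not necessarily what the poster had in mind:

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the database table.
my %pin_for_path;    # path => pin
my %path_for_pin;    # pin  => path

# Additive hash from the post above: sum of character codes, modulo 10000.
sub additive_hash {
    my ($path) = @_;
    my $hash = 0;
    $hash = ($hash + ord($_)) % 10000 for split //, $path;
    return $hash;
}

# Return the pin already assigned to a path, or assign a new one.
# On a collision, probe forward by one until an unused pin turns up.
sub assign_pin {
    my ($path) = @_;
    return $pin_for_path{$path} if exists $pin_for_path{$path};
    die "all 10000 pins are in use\n" if keys(%path_for_pin) >= 10000;

    my $pin = additive_hash($path);
    $pin = ($pin + 1) % 10000 while exists $path_for_pin{$pin};

    $path_for_pin{$pin}  = $path;
    $pin_for_path{$path} = $pin;
    return $pin;
}

printf "%s: %04d\n", $_, assign_pin($_) for qw( /abc.html /bca.html /abc.html );
# /abc.html: 0824
# /bca.html: 0825
# /abc.html: 0824
```

Note the consequence: once probing is involved, the pin is no longer a pure function of the path, so the path-to-pin mapping has to be stored somewhere and looked up on every request.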

      This brings to mind "the old days" circa 1974 writing assemblers for 8080 microprocessors. Symbol "folding" and hashing.

      UPDATE: Limiting yourself to four digits may not be very useful if you have a lot of pages that you are trying to index into your database. The wider your hash is the less likely there will be hash collisions.


      Peter L. Berghold -- Unix Professional
      Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
        Now that I am thinking about it more and more, I don't have to turn the 'number' back to a 'path'

        So, what i want is a function foo() that does this:

        foo( "some long string" ) --> 1234

        =====================
        1. User requests a specific html page( .htaccess gives my script the absolute path for that .html page)
        2. turn the 'path' to a 4-digit number and store it to the database as 'pin' (how?)
        3. i store that number to the database. I DONT EVEN HAVE TO STORE THE HTML PAGE'S PATH TO THE DATABASE ANYMORE!!! this is just great!

        At some later time i want to check the weblog of that .html page

        1. request the page as: http://mydomain.gr/index.html?show=log
        2. .htaccess gives my script the absolute path of the requested .html file
        3. turn the 'path' to a 4-digit number (this is what I am asking)
        4. use 'pin' variable to select all log records for that specific .html page (based on the 'pin' column)


        Since I have the requested 'path', which has been converted to the 4-digit number stored in the database, I know which page I am requesting detailed data for, so I look up the 'pin' column in the database and thus know which records to select. NO NEED to store absolute paths anymore, just a 4-digit number for each .html page

        No need, to turn the number back to a path anymore, just the path to a number, to identify the specific .html page

        Does your solution which SEEMS GREAT APPLY to my specifications?
        The perl code will produce the same hash for "abc.html" as for "bca.html"
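The point about "abc.html" and "bca.html" can be checked directly. In the sketch below, additive_hash repeats the sum-of-character-codes hash from earlier in the thread; weighted_hash is a hypothetical order-sensitive variant (multiply the running value by 31 before adding each character) included only for contrast. It is not something anyone in the thread proposed, and it still collides once enough paths share 10,000 slots:

```perl
use strict;
use warnings;

# Sum of character codes modulo 10000, as in the earlier post.
# Addition is commutative, so any reordering of the same letters collides.
sub additive_hash {
    my ($str) = @_;
    my $hash = 0;
    $hash = ($hash + ord($_)) % 10000 for split //, $str;
    return $hash;
}

# Hypothetical variant: scaling the running value by 31 makes the
# result depend on character position as well as character value.
sub weighted_hash {
    my ($str) = @_;
    my $hash = 0;
    $hash = ($hash * 31 + ord($_)) % 10000 for split //, $str;
    return $hash;
}

for my $name (qw( abc.html bca.html )) {
    printf "%s: additive=%04d weighted=%04d\n",
        $name, additive_hash($name), weighted_hash($name);
}
```

Both names come out as 0777 under the additive hash, while the weighted variant separates them.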

        In any case, the likelihood of a hash collision for any non-trivial website is substantial. Even with a hash that spreads values evenly over the 10,000 possible 4-digit numbers, if you hash 100 files you have about a 40% chance of a collision.

        If you hash 220 files, the likelihood is about 90%.
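Those figures follow from the usual "birthday problem" calculation. Here is a small sketch, assuming every path is equally likely to land on any of the 10,000 values (the additive hash above is less evenly spread than that, so its real collision rate should be worse); the function name collision_probability is made up for illustration:

```perl
use strict;
use warnings;

# Probability that at least two of $n items share a value when each item
# independently gets one of $slots values uniformly at random.
sub collision_probability {
    my ($n, $slots) = @_;
    return 1 if $n > $slots;

    # Chance that every item lands on a value nobody has taken yet.
    my $p_unique = 1;
    $p_unique *= 1 - $_ / $slots for 0 .. $n - 1;

    return 1 - $p_unique;
}

printf "%3d files: %.0f%%\n", $_, 100 * collision_probability($_, 10_000)
    for 100, 220;
# 100 files: 39%
# 220 files: 91%
```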
Re^2: Using filepath method to identify an .html page
by Anonymous Monk on Jan 23, 2013 at 11:00 UTC

    The OP seems a bit unwilling to listen, but if you truncate the hash, you meet his spec.

    use Digest::MD5 qw/ md5_hex /;
    my $digest=md5_hex("http://www.berghold.net");
    printf "%d\n", hex(substr($digest, -4)) % 10_000;

    So the one-liner he's looking for would be hex(substr(md5_hex($someurl), -4)) % 10_000. I'm betting that is somewhat collision resistant. The OP still should consider enlarging his int2 column to int4 or int8 or even char(4)
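Wrapped up as a function, the one-liner might be used like this. The name path_to_pin is hypothetical; the body is exactly the expression above:

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# Hypothetical helper name. Takes the last four hex digits of the MD5
# digest (0..65535) and reduces them to the range 0..9999.
sub path_to_pin {
    my ($path) = @_;
    return hex(substr(md5_hex($path), -4)) % 10_000;
}

# The same path always yields the same pin, so nothing but the pin
# needs to be stored to find a page's log records again.
printf "%s: %04d\n", $_, path_to_pin($_) for qw( /index.html /about/time.html );
```

Two caveats on this sketch: since 65536 is not a multiple of 10,000, pins 0000 to 5535 come up slightly more often than the rest, and two different paths can still map to the same pin, so the log records of colliding pages would be mixed together.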
