Re^3: Using filepath method to identify an .html page

in reply to Re^2: Using filepath method to identify an .html page
in thread Using filepath method to identify an .html page

OK: How about this:

$ cat testMD5.pl
use strict;

foreach my $url(qw@ /index.html /about/time.html @){
        hashit($url);
}

sub hashit {
   my $url=shift;
   my @ltrs=split(//,$url);
   my $hash = 0;

   foreach my $ltr(@ltrs){
        $hash = ( $hash + ord($ltr)) %10000;
   }
   printf "%s: %0.4d\n",$url,$hash
   
}
[download]

which yields:

$ perl testMD5.pl 
/index.html: 1066
/about/time.html: 1547
[download]

Keep in mind this is hardly bullet proof. You need to also keep in mind a method to detect hash collisions and and a rehash algorithm.

This brings to mind "the old days" circa 1974 writing assemblers for 8080 microprocessors. Symbol "folding" and hashing.

UPDATE: Limiting yourself to four digits may not be very useful if you have a lot of pages that you are trying to index into your database. The wider your hash is the less likely there will be hash collisions.

Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg

Comment on Re^3: Using filepath method to identify an .html page Select or Download Code

Replies are listed 'Best First'.
Re^4: Using filepath method to identify an .html page by Nik (Initiate) on Jan 22, 2013 at 17:11 UTC
Now that iam thinking of it more and more, i don't have to turn the 'path' back to a 'number' So, what i want is a function foo() that does this: foo( "some long string" ) --> 1234 ===================== 1. User requests a specific html page( .htaccess gives my script the absolute path for that .html page) 2. turn the 'path' to 4-digit number and store it to tha database as 'pin' (how?) 3. i store that number to the database. I DONT EVEN HAVE TO STORE THE HTML PAGE'S PATH TO THE DATABASE ANYMORE!!! this is just great! At some later time i want to check the weblog of that .html page 1. request the page as: http://mydomain.gr/index.html?show=log 2. .htaccess gives my script the absolute path of the requested .html file 3. turn the 'path' to 4-digit number (this is what i'am asking) 4. use 'pin' variable to select all log records for that specific .html page (based on the 'pin' column) Since i have the requested 'path' which has been converted to a database stored 4-digit number, i'am aware for which page i'am requesting detailed data from, so i look upon the 'pin' column in the database and thus i know which records i want to select. NO NEED to store absolute apths anymore, just a 4-digit number for each .html page No need, to turn the number back to a path anymore, just the path to a number, to identify the specific .html page Does your solution which SEEMS GREAT APPLY to my specifications?	[reply]
Re^5: Using filepath method to identify an .html page by blue_cowdawg (Monsignor) on Jan 22, 2013 at 17:33 UTC
( .htaccess gives my script the absolute path for that .html page) How's that? I've shown you a simple hash function to convert an arbitrary string into a four digit number. It's up to you to go from there... Peter L. Berghold -- Unix Professional Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg	[reply]
Re^6: Using filepath method to identify an .html page by Nik (Initiate) on Jan 22, 2013 at 17:48 UTC
And i'am sure it will work great! Here is the .htaccess directives responsible to pass the actual .html page's absolute path to counter.py (please do not get mad at me, i'am trying to code your solution into python code, which is easier to me) `RewriteEngine On RewriteCond %{HTTP_HOST} ^(www\.)?superhost\.gr$ RewriteCond %{REQUEST_FILENAME} -f RewriteRule ^/?(.+\.html) /cgi-bin/counter.py?htmlpage=%{REQUEST_FILEN +AME} [L,PT,QSA]` [download] But my friend, why bother creating a custom function? Wouldn't something like the following work as intended? `pin = int( htmlpage.encode("hex"), 16 ) % 10000` [download]	[reply] [d/l] [select]
Re^7: Using filepath method to identify an .html page by blue_cowdawg (Monsignor) on Jan 22, 2013 at 18:13 UTC
Re^8: Using filepath method to identify an .html page by Nik (Initiate) on Jan 22, 2013 at 18:19 UTC
Some notes below your chosen depth have not been shown here
Re^4: Using filepath method to identify an .html page by Nik (Initiate) on Jan 22, 2013 at 19:23 UTC
The perl code will produce the same hash for "abc.html" as for "bca.html" In any case, the likelihood of a hash collision for any non-trivial website is substantial. If you hash 100 files you have about a 40% chance of a collision. If you hash 220 files, the likelihood is about 90%	[reply]
Re^5: Using filepath method to identify an .html page by blue_cowdawg (Monsignor) on Jan 22, 2013 at 19:34 UTC
The perl code will produce the same hash for "abc.html" as for "bca.html" Which underscores the point I made earlier about adding collision detection and rehashing logic to whatever algorithm you use. One workround I've seen: `\| handwaving here... my @i = split(//,$url); # put each letter in it's own bin my $j=0; # Initailize our my $k=1; # hashing increment values my @m=(); # workspace foreach my $n(@i){ my $q=ord($n); # ASCII for character $k += $j; # Increment our hash offset $q += $k; # add our "old" value $j = $k; # store that. push @m,$q; # save the offsetted value } my $hashval=0; #initialize our hash value # Generate that map { $hashval = ($hashval + $_) % 10000} @m;` [download] Using that method ABC.html and CBA.html now have different values because each letter position's value gets bumped up increasingly from left to right. Peter L. Berghold -- Unix Professional Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg	[reply] [d/l]

In Section Seekers of Perl Wisdom