Re^2: Using filepath method to identify an .html page

Replies are listed 'Best First'.
Re^3: Using filepath method to identify an .html page by blue_cowdawg (Monsignor) on Jan 22, 2013 at 16:50 UTC
OK: How about this:

    $ cat testMD5.pl
    use strict;
    foreach my $url (qw@ /index.html /about/time.html @) {
        hashit($url);
    }
    sub hashit {
        my $url  = shift;
        my @ltrs = split(//, $url);
        my $hash = 0;
        foreach my $ltr (@ltrs) {
            # running sum of character codes, kept to four digits
            $hash = ($hash + ord($ltr)) % 10000;
        }
        printf "%s: %0.4d\n", $url, $hash;
    }

which yields:

    $ perl testMD5.pl
    /index.html: 1066
    /about/time.html: 1547

Keep in mind this is hardly bulletproof. You also need a method to detect hash collisions and a rehash algorithm. This brings to mind "the old days" circa 1974 writing assemblers for 8080 microprocessors. Symbol "folding" and hashing.

UPDATE: Limiting yourself to four digits may not be very useful if you have a lot of pages that you are trying to index into your database. The wider your hash is, the less likely hash collisions become.

Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
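To illustrate the point about wider hashes, here is a minimal sketch that swaps the additive hash for a truncated MD5 digest. It is not what the post above uses. The helper name `path_digest` and the default width of 8 hex digits are assumptions made for the example. `Digest::MD5` ships with Perl.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: return the first $width hex digits of the
# MD5 digest of $path. 8 hex digits give 16**8 (about 4.3 billion)
# possible values instead of 10000.
sub path_digest {
    my ($path, $width) = @_;
    $width = 8 unless defined $width;
    return substr(md5_hex($path), 0, $width);
}

for my $url (qw( /index.html /about/time.html )) {
    printf "%s: %s\n", $url, path_digest($url);
}
```

A wider value only makes collisions rarer; it does not remove the need to detect them if two pages must never share a key.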
Re^4: Using filepath method to identify an .html page by Nik (Initiate) on Jan 22, 2013 at 19:23 UTC
The Perl code will produce the same hash for "abc.html" as for "bca.html". In any case, the likelihood of a hash collision for any non-trivial website is substantial. Even if the hash spread paths evenly over all 10,000 four-digit values, hashing 100 files gives about a 40% chance of at least one collision. If you hash 220 files, the likelihood is about 90%.
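Those figures can be checked with the usual birthday-problem approximation, P = 1 - exp(-n(n-1) / (2N)), for n files and N possible hash values. The sketch below assumes a uniform hash over N = 10,000 values. The additive hash in this thread is not uniform, so its real collision rate would be higher. The function name `collision_probability` is made up for the example.

```perl
use strict;
use warnings;

# Approximate chance that at least two of $n items land on the same
# value, assuming a uniform hash over $buckets possible values.
sub collision_probability {
    my ($n, $buckets) = @_;
    return 1 - exp(-$n * ($n - 1) / (2 * $buckets));
}

for my $n (100, 220) {
    printf "%d files: %.0f%%\n",
        $n, 100 * collision_probability($n, 10_000);
}
```

This comes out at roughly 39% for 100 files and roughly 91% for 220 files, in line with the numbers quoted above.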
Re^5: Using filepath method to identify an .html page by blue_cowdawg (Monsignor) on Jan 22, 2013 at 19:34 UTC
> The perl code will produce the same hash for "abc.html" as for "bca.html"

Which underscores the point I made earlier about adding collision detection and rehashing logic to whatever algorithm you use. One workaround I've seen:

    # handwaving here...
    my @i = split(//, $url);    # put each letter in its own bin
    my $j = 0;                  # initialize our
    my $k = 1;                  # hashing increment values
    my @m = ();                 # workspace
    foreach my $n (@i) {
        my $q = ord($n);            # ASCII code for the character
        $k = ($k + $j) % 10000;     # grow the position weight (1, 2, 4, 8, ...), kept in range
        $q *= $k;                   # scale the character by its position weight
        $j = $k;                    # store that
        push @m, $q;                # save the weighted value
    }
    my $hashval = 0;            # initialize our hash value
    # Generate that
    $hashval = ($hashval + $_) % 10000 for @m;

Using that method ABC.html and CBA.html now have different values because each letter's value is multiplied by a weight that grows from left to right. (Merely adding a per-position offset would not be enough: the offsets sum to the same total whatever the letter order.)
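For the collision detection and rehashing mentioned above, one possible shape is linear probing: if a pin is already taken by a different path, step to the next free one. This is only a sketch under stated assumptions. The `%path_for_pin` table and the `assign_pin` function are hypothetical, and the table lives in memory where a real setup would presumably use a database. Detecting a collision means the path has to be stored next to its pin.

```perl
use strict;
use warnings;

my %path_for_pin;    # hypothetical in-memory table: pin => path

# The simple additive hash from earlier in the thread.
sub additive_hash {
    my ($url) = @_;
    my $hash = 0;
    $hash = ($hash + ord($_)) % 10000 for split //, $url;
    return $hash;
}

# Assign a pin to $url. On a collision with a different path,
# try the next slot (linear probing) until a free one turns up.
sub assign_pin {
    my ($url) = @_;
    my $pin = additive_hash($url);
    for (1 .. 10000) {
        my $owner = $path_for_pin{$pin};
        if (!defined $owner || $owner eq $url) {
            $path_for_pin{$pin} = $url;
            return $pin;
        }
        $pin = ($pin + 1) % 10000;    # rehash: move to the next slot
    }
    die "all 10000 pins are in use\n";
}

printf "%s: %04d\n", $_, assign_pin($_) for qw( abc.html bca.html );
```

Both names hash to 777, so the first keeps 777 and the second is moved to 778. Asking for the same path again returns the pin it was already given.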
Re^4: Using filepath method to identify an .html page by Nik (Initiate) on Jan 22, 2013 at 17:11 UTC
Now that I think about it more, I don't have to turn the 'number' back into a 'path'. So what I want is a function foo() that does this:

    foo( "some long string" ) --> 1234

=====================

1. The user requests a specific HTML page (.htaccess gives my script the absolute path for that .html page).
2. Turn the 'path' into a 4-digit number and store it in the database as 'pin' (how?).
3. I store that number in the database.

I DON'T EVEN HAVE TO STORE THE HTML PAGE'S PATH IN THE DATABASE ANYMORE!!! This is just great!

At some later time I want to check the weblog of that .html page:

1. Request the page as: http://mydomain.gr/index.html?show=log
2. .htaccess gives my script the absolute path of the requested .html file.
3. Turn the 'path' into a 4-digit number (this is what I am asking).
4. Use the 'pin' variable to select all log records for that specific .html page (based on the 'pin' column).

Since I have the requested 'path', which has been converted to the 4-digit number stored in the database, I know which page I am requesting detailed data for, so I look up the 'pin' column in the database and thus know which records to select.

NO NEED to store absolute paths anymore, just a 4-digit number for each .html page. No need to turn the number back into a path anymore, just the path into a number, to identify the specific .html page.

Does your solution, which SEEMS GREAT, APPLY to my specifications?
Re^5: Using filepath method to identify an .html page by blue_cowdawg (Monsignor) on Jan 22, 2013 at 17:33 UTC
> ( .htaccess gives my script the absolute path for that .html page)

How's that?

I've shown you a simple hash function to convert an arbitrary string into a four-digit number. It's up to you to go from there...
Re^6: Using filepath method to identify an .html page by Nik (Initiate) on Jan 22, 2013 at 17:48 UTC
Re^7: Using filepath method to identify an .html page by blue_cowdawg (Monsignor) on Jan 22, 2013 at 18:13 UTC
Some notes below your chosen depth have not been shown here


	PerlMonks