in reply to Bloom Filter or other method to store URLs?

This got me thinking about how much redundancy there is in URLs: something like 90% start with "www" and end in ".com", and there are plenty of index.htmls, cgi-bins and the like out there. If you split the URL up and use each part as a key in a multilevel hash, Perl stores each unique key only once. Searching is fast, and unlike a Bloom filter you get no false positives. Let the code speak ...

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %store;
while (<DATA>) {
    next if /^\s*$/;            # skip blank lines
    chomp;
    s!//!/!g;                   # collapse the '//' after the scheme
    s!:!!g;                     # drop the ':' from 'http:'
    my @bits = split /\.|\//;   # split on dots and slashes
    # quote each key, or hyphenated parts like 'cgi-bin' blow up the eval
    # under strict and those URLs silently vanish
    my $key     = join "'}{'", @bits;
    my $do_this = q($store{') . $key . q('}++);
    eval $do_this;
}
print Dumper(\%store);

__DATA__
http://www.foo.com/this.html
http://www.foo.com/index.html
http://www.foo.com/that.html
http://www.foo.com/cgi-bin/that.cgi
http://www.foo.com/cgi-bin/user.cgi
http://www.fred.com/index.html
http://www.foo.org/index.html

# output
$VAR1 = {
  'http' => {
    'www' => {
      'foo' => {
        'org' => { 'index' => { 'html' => 1 } },
        'com' => {
          'index'   => { 'html' => 1 },
          'that'    => { 'html' => 1 },
          'this'    => { 'html' => 1 },
          'cgi-bin' => {
            'that' => { 'cgi' => 1 },
            'user' => { 'cgi' => 1 }
          }
        }
      },
      'fred' => { 'com' => { 'index' => { 'html' => 1 } } }
    }
  }
};
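
If the string eval makes you itchy, the same trie can be built with a plain reference walk and searched with exists(). A minimal sketch of that idea (add_url and seen are my own names, nothing standard):

#!/usr/bin/perl
use strict;
use warnings;

my %store;

# Build the trie by walking references -- no string eval, so odd
# components like 'cgi-bin' need no special quoting at all.
sub add_url {
    my ($url) = @_;
    $url =~ s!//!/!g;
    $url =~ s!:!!g;
    my @bits = split /[.\/]/, $url;
    my $leaf = pop @bits;
    my $node = \%store;
    $node = $node->{$_} ||= {} for @bits;   # autovivify one level per part
    $node->{$leaf}++;                       # count the final component
}

# Membership test: walk the trie with exists() -- exact, no false positives.
sub seen {
    my ($url) = @_;
    $url =~ s!//!/!g;
    $url =~ s!:!!g;
    my $node = \%store;
    for my $bit (split /[.\/]/, $url) {
        return 0 unless ref $node eq 'HASH' && exists $node->{$bit};
        $node = $node->{$bit};
    }
    return 1;
}

add_url($_) for qw(
    http://www.foo.com/index.html
    http://www.foo.com/cgi-bin/that.cgi
);

print seen('http://www.foo.com/index.html'),   "\n";  # 1
print seen('http://www.foo.com/missing.html'), "\n";  # 0

One caveat either way: seen() on a stored URL's prefix (http://www.foo.com, say) also returns true, because the walk never checks that it ended on a leaf counter.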

Cheers,
R.

Pereant, qui ante nos nostra dixerunt! (Perish those who said our things before us!)