Re: hash substrings

How many hash keys do you have ? Because this is going to scale horribly -- O(N**2).

Rather than crank out a search loop, I'd have a go at getting Perl to do the heavy lifting:

use strict ; use warnings ;

my %h = ('this is a test' => 2,
         'is a test' => 2,
         'a test' => 1,
         'this is' => 1,
         'is a' => 1,
         'unique' => 1,
         'also Unique' => 3,
        ) ;

my @k = sort { length($b) <=> length($a) } keys(%h) ;

my $s = "\0". join("\0", @k)."\0" ;  # Assumption: "\0" doesn't appear
+ in any key
study $s ;

foreach my $k (@k) {
  $s =~ m/(.)$k(.)/ ;
  if (($1 ne "\0") || ($2 ne "\0")) { delete $h{$k} ; } ;
} ;

print join(', ', map "'$_'", sort keys %h), "\n" ;
[download]

(it might save a little time if all keys of the largest length were removed from @k before the foreach).

Update: unless I've misunderstood, and your objective isn't to remove all substrings...

Further Update: Having tried this on some 100,000 keys (chosen at random, but ensuring at least some substrings) I found that the study $s didn't help.

I was also worried about the (.) in the match. I found that:

  my @k = sort { length($b) <=> length($a) } keys(%h2) ;

  my $s = "\0". join("\0", @k)."\0" ;

  my @a = () ;
  my $p = 0 ;
  foreach my $k (@k) {
    $p += length($k) + 1 ;
    push @a, $p ;
  } ;

  foreach my $k (@k) {
    pos $s = 0 ;
    $s =~ m/$k/g ;
    next if pos($s) == shift(@a) ;
    delete $h2{$k} ;
  } ;
[download]

was more than twice as fast (saving 85 seconds on my machine) for 100,000 keys; twice as fast (saving 0.8 seconds) for 10,000 keys; but for 1,000 keys the speed difference was in the noise.

But this:

  my @k = sort { length($b) <=> length($a) } keys(%h) ;

  my $s = '' ;

  foreach my $k (@k) {
    delete($h4{$k}) and next if ($s =~ m/$k/) ;
    $s .= $k . "\0" ;
  } ;
[download]

ran a little faster still, and will work better the more substrings there are -- so this is what I would use.

For completeness:

  my @k = sort { length($a) <=> length($b) } keys(%h3) ;

  my $lx = length($k[-1]) ;
  my $ll = 0 ;
  my $li = 0 ;

  for my $k (@k) {
    if (length($k) != $ll) {
      $ll = length($k) ;
      last if $ll == $lx ;
      ++$li until length($k[$li]) > $ll ;
    } ;

    for my $i ($li..$#k) {
      next if ($k[$i] !~ /$k/) ; 
      delete $h3{$k} ;
      last ;
    } ; 
  } ;
[download]

which cranks the search by hand, took 24.5s for 10,000 keys -- compared to 0.75 for the fastest method !

Comment on Re: hash substrings Select or Download Code

Replies are listed 'Best First'.
Re^2: hash substrings by jwkrahn (Abbot) on Jan 27, 2009 at 10:43 UTC
`foreach my $k (@k) { $s =~ m/(.)$k(.)/ ; if (($1 ne "\0") \|\| ($2 ne "\0")) { delete $h{$k} ; } ; }` [download] You shouldn't use the variables `$1` and `$2` unless the match was successful because they retain their previous value if the match failed and so after the first successful match every key will be deleted. `for my $k ( @k ) { if ( $s =~ /\0$k\0/ ) { delete $h{ $k }; } }` [download]	[reply] [d/l] [select]
Re^3: hash substrings by gone2015 (Deacon) on Jan 27, 2009 at 10:54 UTC
Good general advice... in this case, however, every match will succeed -- the test is required to spot the key matching itself. For the paranoid one could write: `foreach my $k (@k) { next unless $s =~ m/(.)$k(.)/ ; next if (($1 eq "\0") && ($2 eq "\0")) ; delete $h{$k} ; } ;` [download] but the `next unless` is redundant.	[reply] [d/l] [select]
Re^2: hash substrings by perlcat (Novice) on Jan 27, 2009 at 18:08 UTC
thanks for looking into this. I will test the various solutions this weekend. Again, many thanks (to all of you).	[reply]