hash substrings

perlcat has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perlmonks,

I would like to delete hash keys that are a substring of other hash keys.

My data looks like this:

this is a test => 2
is a test => 2
a test => 1
[download]

I get this data from an array in which I count the unique elements and put them in a hash, counting the number of occurrences:

sub count_unique {
 # extract unique elements of array and count occurrences
 my @array = @_;

 map { $count{$_}++ } @array;
 map { "$_ = ${count{$_}}\n"} sort keys(%count);

 return %count;
}
[download]

I then sort the hash by the string length of the key, decreasingly:

sub hashValueDescendingNum {
 # sort hash by key length in descending order
 length($a) <=> length($b)
}
[download]

Where I'm stuck is when I try to find out if element+1 is a substring of the current element. In the example below, a loop should remove everything except 'this is a test'.

this is a test => 2
is a test => 2
a test => 1
[download]

I though of getting the hash's size and then doing a for $i loop inside it, but that's as far as I got.

Thanks in advance for your help.

Larry

Comment on hash substrings Select or Download Code

Replies are listed 'Best First'.
Re: hash substrings by citromatik (Curate) on Jan 27, 2009 at 07:38 UTC
I would rather sort the keys in ascending order (of key length), store them in an array, and compare the elements of that array, something like: `use strict; use warnings; my %h = ( 'this is a test' => 2, 'is a test' => 2, 'a test' => 1 ); my @keys = sort {length $a <=> length $b} keys %h; for ( 0 .. $#keys-1){ if ($keys[$_+1] =~ /$keys[$_]/){ delete $h{$keys[$_]}; } }` [download] Hope this helps citromatik	[reply] [d/l]
Re^2: hash substrings by perlcat (Novice) on Jan 27, 2009 at 09:03 UTC
thanks, it worked like a charm! Larry	[reply]
Re^3: hash substrings by jethro (Monsignor) on Jan 27, 2009 at 15:06 UTC
Why is it enough for you to compare the strings only to the next element? Are keys like `"This is a test" "something" "a test"` [download] impossible in your setting?	[reply] [d/l]
Re: hash substrings by gone2015 (Deacon) on Jan 27, 2009 at 10:08 UTC
How many hash keys do you have ? Because this is going to scale horribly -- O(N2). Rather than crank out a search loop, I'd have a go at getting Perl to do the heavy lifting: `use strict ; use warnings ; my %h = ('this is a test' => 2, 'is a test' => 2, 'a test' => 1, 'this is' => 1, 'is a' => 1, 'unique' => 1, 'also Unique' => 3, ) ; my @k = sort { length($b) <=> length($a) } keys(%h) ; my $s = "\0". join("\0", @k)."\0" ; # Assumption: "\0" doesn't appear + in any key study $s ; foreach my $k (@k) { $s =~ m/(.)$k(.)/ ; if (($1 ne "\0") \|\| ($2 ne "\0")) { delete $h{$k} ; } ; } ; print join(', ', map "'$_'", sort keys %h), "\n" ;` [download] (it might save a little time if all keys of the largest length were removed from `@k` before the `foreach`). Update: unless I've misunderstood, and your objective isn't to remove all substrings... Further Update:** Having tried this on some 100,000 keys (chosen at random, but ensuring at least some substrings) I found that the `study $s` didn't help. I was also worried about the `(.)` in the match. I found that: `my @k = sort { length($b) <=> length($a) } keys(%h2) ; my $s = "\0". join("\0", @k)."\0" ; my @a = () ; my $p = 0 ; foreach my $k (@k) { $p += length($k) + 1 ; push @a, $p ; } ; foreach my $k (@k) { pos $s = 0 ; $s =~ m/$k/g ; next if pos($s) == shift(@a) ; delete $h2{$k} ; } ;` [download] was more than twice as fast (saving 85 seconds on my machine) for 100,000 keys; twice as fast (saving 0.8 seconds) for 10,000 keys; but for 1,000 keys the speed difference was in the noise. But this: `my @k = sort { length($b) <=> length($a) } keys(%h) ; my $s = '' ; foreach my $k (@k) { delete($h4{$k}) and next if ($s =~ m/$k/) ; $s .= $k . "\0" ; } ;` [download] ran a little faster still, and will work better the more substrings there are -- so this is what I would use. For completeness: `my @k = sort { length($a) <=> length($b) } keys(%h3) ; my $lx = length($k[-1]) ; my $ll = 0 ; my $li = 0 ; for my $k (@k) { if (length($k) != $ll) { $ll = length($k) ; last if $ll == $lx ; ++$li until length($k[$li]) > $ll ; } ; for my $i ($li..$#k) { next if ($k[$i] !~ /$k/) ; delete $h3{$k} ; last ; } ; } ;` [download] which cranks the search by hand, took 24.5s for 10,000 keys -- compared to 0.75 for the fastest method !	[reply] [d/l] [select]
Re^2: hash substrings by jwkrahn (Abbot) on Jan 27, 2009 at 10:43 UTC
`foreach my $k (@k) { $s =~ m/(.)$k(.)/ ; if (($1 ne "\0") \|\| ($2 ne "\0")) { delete $h{$k} ; } ; }` [download] You shouldn't use the variables `$1` and `$2` unless the match was successful because they retain their previous value if the match failed and so after the first successful match every key will be deleted. `for my $k ( @k ) { if ( $s =~ /\0$k\0/ ) { delete $h{ $k }; } }` [download]	[reply] [d/l] [select]
Re^3: hash substrings by gone2015 (Deacon) on Jan 27, 2009 at 10:54 UTC
Good general advice... in this case, however, every match will succeed -- the test is required to spot the key matching itself. For the paranoid one could write: `foreach my $k (@k) { next unless $s =~ m/(.)$k(.)/ ; next if (($1 eq "\0") && ($2 eq "\0")) ; delete $h{$k} ; } ;` [download] but the `next unless` is redundant.	[reply] [d/l] [select]
Re^2: hash substrings by perlcat (Novice) on Jan 27, 2009 at 18:08 UTC
thanks for looking into this. I will test the various solutions this weekend. Again, many thanks (to all of you).	[reply]