Re: retain longest multi words units from hash

Here's a solution that takes your posted input and creates your expected output.

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dump;

my %data = (
    'rendition' => '3',
    'automation' => '2',
    'saturation' => '3',
    'mass creation' => 2,
    'automation technology' => 2,
    'automation technology process' => 3,
);  

dd \%data;

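# Only keys containing a space (multi-word keys) can cause deletions.
for my $multi_key (grep y/ / / != 0, keys %data) {
    # Skip keys already deleted on an earlier pass.
    next unless exists $data{$multi_key};
    for my $any_key (keys %data) {
        next if $any_key eq $multi_key;
        # Delete any key that is a leading substring of the multi-word key.
        delete $data{$any_key} if 0 == index $multi_key, $any_key;
    }
}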

dd \%data;

Output:

{
  "automation" => 2,
  "automation technology" => 2,
  "automation technology process" => 3,
  "mass creation" => 2,
  "rendition" => 3,
  "saturation" => 3,
}
{
  "automation technology process" => 3,
  "mass creation" => 2,
  "rendition" => 3,
  "saturation" => 3,
}
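One caveat with the index test: it is a plain string-prefix check, so a key such as 'auto' (which I've made up for illustration) would also be deleted by 'automation technology process'. If only whole-word prefixes should be discarded, a minimal sketch might look like the following. The sub name is_word_prefix is my own invention, and the word-boundary requirement is an assumption about your spec, not something you stated. It still avoids regexes by padding both strings with a trailing space.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper (not part of the code above): true only when
# $short is a proper prefix of $long that ends on a word boundary.
sub is_word_prefix {
    my ($long, $short) = @_;
    return length($short) < length($long)
        && 0 == index("$long ", "$short ");
}

my %data = (
    'auto'                          => 1,   # made-up key for illustration
    'automation'                    => 2,
    'automation technology'         => 2,
    'automation technology process' => 3,
    'mass creation'                 => 2,
);

# Same outer structure as before; tr/ // just counts the spaces.
for my $multi_key (grep tr/ // != 0, keys %data) {
    next unless exists $data{$multi_key};
    for my $any_key (keys %data) {
        delete $data{$any_key} if is_word_prefix($multi_key, $any_key);
    }
}

print "$_ => $data{$_}\n" for sort keys %data;
```

With this data, 'auto' survives alongside 'automation technology process' and 'mass creation'; with the plain index test, 'auto' would have been deleted.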

I saw your post (in isolation) before I logged in, thought it looked like an interesting problem, and wrote my solution before looking at any other replies. My code doesn't use any regexes or sorting, which may help efficiency (see the final dot-point below for a clarification of that statement). Any similarities to components used in other solutions are purely coincidental (although perhaps not surprising: e.g. you'll see delete used quite a bit).

I had a few issues with your spec, and now see I'm not alone. Again, some of these points may already have been raised.

* Your OP words have "discarding the units that are contained in the longest ones"; however, your example only has keys (units) to be discarded that are at the start of the longest one. So how, for instance, would you deal with a key like 'partial saturation', given that it contains, but does not start with, the existing key 'saturation'?
* You appear to assume that there can only be one longest key. How would you want to deal with, for example, 'automation technology special' (given the existing key 'automation technology process', which has the same length)?
* I interpret the spec as meaning that the two keys 'rendition' and 'renditions' would both be kept. Is that indeed what you want?
* You wrote "quite long hashes"; unfortunately, that's rather vague. Descriptors such as "quite long", "fairly short" and so on are highly subjective: often relative to the problem domain and the writer's experience. Actual numbers are much better; including other numbers such as available memory, disk size, etc. is better still (assuming they're relevant).
* There's another issue with "quite long hashes". Does this refer to the number of key-value pairs or the actual length of the keys? Your examples suggest the values are all just small integers, so that's not a size issue: is that correct?
* Another piece of information that would be useful to know is the percentage of "multi word" keys. The single word keys are not actually part of the processing (beyond possibly being deleted). If your hash contains a million keys, do we need to process 10, 1,000, 100,000 or 999,999 multi word keys? This could make a difference to how we formulate a solution. I wrote (above) "... doesn't use any ... sorting which may help efficiency": the information about how many multi word keys there are could help to determine if adding sorting would, in fact, result in better efficiency; it could also help in deciding what type of sorting would be best and where such sorting might occur.
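To make the first and last of those points a bit more concrete, here's a rough sketch of the "contained anywhere" reading with sorting added. It rests on assumptions you'd need to confirm: containment is only counted on word boundaries (so 'rendition' is not contained in 'renditions'), and keys of equal length are both kept. The sub name longest_units and the extra keys are made up for illustration. Sorting longest-first means each key only needs to be checked against keys already kept; whether that actually beats the unsorted version depends on the numbers asked about above.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: returns the keys that survive under the
# "contained anywhere, on word boundaries" interpretation.
sub longest_units {
    my ($data) = @_;
    my @kept;
    # Longest keys first, so a key can only be contained in one already kept.
    for my $key (sort { length($b) <=> length($a) or $a cmp $b } keys %$data) {
        # Padding with spaces restricts matches to whole words; no regex needed.
        next if grep { -1 != index(" $_ ", " $key ") } @kept;
        push @kept, $key;
    }
    return sort @kept;
}

my %data = (
    'saturation'                    => 3,
    'partial saturation'            => 1,   # made-up key (first dot-point)
    'automation technology'         => 2,
    'automation technology process' => 3,
    'automation technology special' => 1,   # made-up key (second dot-point)
    'rendition'                     => 3,
    'renditions'                    => 1,   # made-up key (third dot-point)
);

# Delete everything that did not survive, using a hash slice.
my %keep = map { $_ => 1 } longest_units(\%data);
delete @data{ grep { !$keep{$_} } keys %data };

print "$_ => $data{$_}\n" for sort keys %data;
```

Under these assumptions, 'saturation' and 'automation technology' are discarded; both three-word 'automation technology ...' keys, 'partial saturation', 'rendition' and 'renditions' remain.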

Update (typo): s/beyong/beyond/

— Ken
