Here's a solution that takes your posted input and creates your expected output.

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dump;

my %data = (
    'rendition' => '3',
    'automation' => '2',
    'saturation' => '3',
    'mass creation' => 2,
    'automation technology' => 2,
    'automation technology process' => 3,
);  

dd \%data;

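# Only keys containing at least one space (multi-word keys) can cause deletions.
for my $multi_key (grep y/ / / != 0, keys %data) {
    # Skip multi-word keys that a longer key has already deleted.
    next unless exists $data{$multi_key};
    for my $any_key (keys %data) {
        next if $any_key eq $multi_key;
        # Delete any key that $multi_key starts with (a character-level prefix test).
        delete $data{$any_key} if 0 == index $multi_key, $any_key;
    }
}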

dd \%data;

Output:

{
  "automation" => 2,
  "automation technology" => 2,
  "automation technology process" => 3,
  "mass creation" => 2,
  "rendition" => 3,
  "saturation" => 3,
}
{
  "automation technology process" => 3,
  "mass creation" => 2,
  "rendition" => 3,
  "saturation" => 3,
}

I saw your post (in isolation) before I logged in, thought it looked like an interesting problem, and wrote my solution before looking at any other replies. My code doesn't use any regexes or sorting, which may help efficiency (see the final dot-point below for a clarification of that statement). Any similarities to components used in other solutions are purely coincidental (although perhaps not surprising: you'll see delete used quite a bit, for example).

I had a few issues with your spec, and now see I'm not alone. Again, some of these points may already have been raised.

* Your OP says "discarding the units that are contained in the longest ones"; however, your example only has keys (units) to be discarded that are at the start of the longest one. So how, for instance, would you deal with a key like 'partial saturation', given that it contains, but does not start with, the existing key 'saturation'?
* You appear to assume that there can only be one longest key. How would you want to deal with, for example, 'automation technology special' (given the existing key 'automation technology process', which has the same length)?
* I interpret the spec as meaning that the two keys 'rendition' and 'renditions' would both be kept. Is that indeed what you want?
* You wrote "quite long hashes"; unfortunately, that's rather vague. Descriptors such as "quite long", "fairly short" and so on are highly subjective: often relative to the problem domain and the writer's experience. Actual numbers are much better; including other numbers such as available memory, disk size, etc. is better still (assuming they're relevant).
* There's another issue with "quite long hashes". Does this refer to the number of key-value pairs or the actual length of the keys? Your examples suggest the values are all just small integers, so that's not a size issue: is that correct?
* Another piece of information that would be useful to know is the percentage of "multi word" keys. The single word keys are not actually part of the processing (beyond possibly being deleted). If your hash contains a million keys, do we need to process 10, 1,000, 100,000 or 999,999 multi word keys? This could make a difference to how we formulate a solution. I wrote (above) "... doesn't use any ... sorting which may help efficiency": the information about how many multi word keys there are could help to determine if adding sorting would, in fact, result in better efficiency; it could also help in deciding what type of sorting would be best and where such sorting might occur.
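
To make the first and third points concrete, here's a minimal sketch of how the two readings of "contained in" might be compared. It is not the code posted above and is only one possible interpretation of the spec. The sub name prune_keys, its mode option, and the extra keys 'partial saturation' and 'renditions' (with made-up values) are assumptions for illustration. Unlike the posted loop, it compares whole words rather than characters and works on a copy of the hash.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper for illustration only; returns a pruned copy of the
# hash and leaves the original untouched.
#   mode => 'prefix'   : drop keys that form the leading whole words of a multi-word key
#   mode => 'contains' : drop keys that occur anywhere in it as whole words
sub prune_keys {
    my ($data, %opt) = @_;
    my $mode = $opt{mode} // 'prefix';
    my %kept = %$data;

    for my $multi_key (grep { index($_, ' ') >= 0 } keys %$data) {
        next unless exists $kept{$multi_key};
        for my $any_key (keys %kept) {
            next if $any_key eq $multi_key;
            # Padding both strings with spaces means a match can only
            # start and end on a word boundary.
            my $pos = index " $multi_key ", " $any_key ";
            next if $pos < 0;
            delete $kept{$any_key} if $mode eq 'contains' || $pos == 0;
        }
    }

    return \%kept;
}

# The OP's data plus two assumed keys taken from the points above.
my %data = (
    'rendition'                     => 3,
    'renditions'                    => 1,
    'automation'                    => 2,
    'saturation'                    => 3,
    'partial saturation'            => 1,
    'mass creation'                 => 2,
    'automation technology'         => 2,
    'automation technology process' => 3,
);

for my $mode (qw{prefix contains}) {
    my $kept = prune_keys(\%data, mode => $mode);
    print "$mode: ", join(', ', sort keys %$kept), "\n";
}
# prefix: automation technology process, mass creation, partial saturation, rendition, renditions, saturation
# contains: automation technology process, mass creation, partial saturation, rendition, renditions
```

The only difference between the two modes on this data is 'saturation': it survives a "starts with" reading but not a "contained anywhere" reading. Both 'rendition' and 'renditions' are kept either way, because neither is a multi-word key.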

Update (typo): s/beyong/beyond/

— Ken

In reply to Re: retain longest multi words units from hash by kcott
in thread retain longest multi words units from hash by Anonymous Monk
