At first glance, you'd think this would be a trivial problem, with a relatively straightforward solution like:

    Build new empty hash 'output'
    Build a list of keys from longest to shortest
    For each key in the list
        If key not a fragment of a key in the output hash
            Add key to output hash
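As a rough sketch, the pseudocode above could look like this in Perl if "fragment" is taken to mean a plain substring. That reading is an assumption, and `retain_longest` is just an illustrative name, not something from the original question:

```perl
use strict;
use warnings;

# Hypothetical helper: keep only keys that are not plain substrings
# of a longer key that has already been kept.
sub retain_longest {
    my (%counts) = @_;
    my %output;
    # Longest keys first, so a containing phrase is seen before its fragments.
    for my $key (sort { length($b) <=> length($a) || $a cmp $b } keys %counts) {
        # ASSUMPTION: "fragment" means substring, tested with index().
        next if grep { index($_, $key) >= 0 } keys %output;
        $output{$key} = $counts{$key};
    }
    return %output;
}

my %kept = retain_longest(
    'automation technology process' => 3,
    'automation technology'         => 2,
    'technology process'            => 5,
    'rendition'                     => 3,
);
print "$_ => $kept{$_}\n" for sort keys %kept;
```

Note that a plain substring test ignores word boundaries, so a key like 'mass' would also be dropped because of 'massive creation', which may or may not be what is wanted.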

However, the problem doesn't feel fully specified. How do you recognize a unit contained within another unit? Is it simply a substring match?

Often, when I see problems like this, I try to "break" them by contriving corner cases to throw into the mix. When I saw your problem, I came up with:

    my %phrase_counts = (
        'rendition'                     => '3',
        'automation'                    => '2',
        'saturation'                    => '3',
        'mass creation'                 => 2,
        'automation technology'         => 2,
        'automation technology process' => 3,
        'technology process'            => 5,
        'automation process'            => 2,
    );

I added two phrases so that we have something like 'A B C', 'A B', 'A C', and 'B C'. Obviously 'A B' and 'B C' are contained in 'A B C' and should be discarded. But what about 'A C'? Should that be considered contained in 'A B C'? And what about 'A C B' or 'C B'?
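To make the ambiguity concrete, here is a small sketch of two possible meanings of "contained". Both function names are made up for illustration, and both assume phrases are words separated by single spaces:

```perl
use strict;
use warnings;

# Rule 1 (assumed): the short phrase's words appear as an unbroken run
# of whole words in the long phrase.
sub contained_contiguous {
    my ($short, $long) = @_;
    # Padding with spaces forces matches to start and end on word boundaries.
    return index(" $long ", " $short ") >= 0 ? 1 : 0;
}

# Rule 2 (assumed): the short phrase's words appear in the same order,
# but other words may sit between them.
sub contained_in_order {
    my ($short, $long) = @_;
    my @want = split ' ', $short;
    for my $word (split ' ', $long) {
        shift @want if @want && $word eq $want[0];
    }
    return @want ? 0 : 1;
}

my $long = 'automation technology process';
for my $short ('automation technology', 'technology process', 'automation process') {
    printf "%-22s contiguous=%d in_order=%d\n",
        $short, contained_contiguous($short, $long), contained_in_order($short, $long);
}
```

Under the first rule 'automation process' (the 'A C' case) survives; under the second it is discarded. Neither rule treats 'C B' as contained in 'A B C', so an order-insensitive rule would be yet another variation.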

If you don't think about these corner cases, you can wind up in a frustrating cycle: you come up with an algorithm, code and test it, and submit it. The program runs OK for a bit, until someone hits a corner case and files a bug report. You update your algorithm, code and test it, and submit it again. Then another bug report comes in.

If you try to come up with ugly situations, you can save yourself a lot of time by asking for clarification for those special cases beforehand. Another benefit of coming up with special cases is that it can provide hints towards coming up with better algorithms--which may also prompt you to come up with a few more special cases.

I can think of several variations of this problem and solutions to the variations, but don't know which one(s) to suggest.

...roboticus

When your only tool is a hammer, all problems look like your thumb.


In reply to Re: retain longest multi words units from hash by roboticus
in thread retain longest multi words units from hash by Anonymous Monk
