At first glance, you'd think this would be a trivial problem, with a relatively straightforward solution like:
    Build new empty hash 'output'
    Build a list of keys from longest to shortest
    For each key in the list
        If key not a fragment of a key in the output hash
            Add key to output hash
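A minimal Perl sketch of those steps, assuming "fragment" means a plain substring (the question doesn't confirm that). The name retain_longest and the sample data are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the keys that are not substrings of a
# key already kept. ASSUMPTION: "fragment" means plain substring.
sub retain_longest {
    my (%counts) = @_;
    my %output;

    # Longest keys first; ties broken alphabetically so runs are repeatable.
    for my $key (sort { length($b) <=> length($a) or $a cmp $b } keys %counts) {
        # index() returns -1 when $key does not occur in the kept key.
        next if grep { index($_, $key) >= 0 } keys %output;
        $output{$key} = $counts{$key};
    }
    return %output;
}

# Small sample, not the full data set from the thread.
my %sample = (
    'automation'                    => 2,
    'automation technology'         => 2,
    'automation technology process' => 3,
    'mass creation'                 => 2,
);

my %kept = retain_longest(%sample);
print "$_ => $kept{$_}\n" for sort keys %kept;
```

On this sample only 'automation technology process' and 'mass creation' survive. Each candidate is checked against every kept key, so the work grows roughly with the square of the number of keys, which should be fine for small hashes.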
However, it feels like the problem isn't fully specified. How do you recognize units contained within other units? Is it simply a substring match?
Often, when I see problems like this, I try to "break" them by contriving corner cases to throw into the mix. When I saw your problem, I came up with:
    my %phrase_counts = (
        'rendition'                     => '3',
        'automation'                    => '2',
        'saturation'                    => '3',
        'mass creation'                 => 2,
        'automation technology'         => 2,
        'automation technology process' => 3,
        'technology process'            => 5,
        'automation process'            => 2,
    );
I added two phrases so that we have something like 'A B C', 'A B', 'A C', and 'B C'. Obviously 'A B' and 'B C' are contained in 'A B C' and should be discarded. But what about 'A C'? Should that be considered to be contained in 'A B C'? And what about 'A C B' or 'C B'?
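To make the 'A C' question concrete, here is a rough sketch of three possible meanings of "contained in". All three function names are hypothetical, and nothing in the question says which definition (if any) is the intended one:

```perl
use strict;
use warnings;

# Three hypothetical definitions of "phrase $small is contained in $big".
# ASSUMPTION: none of these is confirmed by the original question.

# 1. Plain substring: also matches inside words ('ration' in 'saturation').
sub is_substring {
    my ($small, $big) = @_;
    return index($big, $small) >= 0;
}

# 2. Contiguous run of whole words (whitespace is normalized first).
sub is_word_run {
    my ($small, $big) = map { join(' ', split(' ', $_)) } @_;
    return index(" $big ", " $small ") >= 0;
}

# 3. Same words in the same order, gaps allowed (a subsequence).
sub is_word_subsequence {
    my ($small, $big) = @_;
    my @want = split ' ', $small;
    for my $word (split ' ', $big) {
        shift @want if @want && $want[0] eq $word;
    }
    return !@want;    # true once every wanted word has been matched
}

my $big = 'automation technology process';    # plays the role of 'A B C'
for my $small ('automation technology', 'automation process', 'process technology') {
    printf "%-22s substring=%d word_run=%d subsequence=%d\n",
        $small,
        is_substring($small, $big)        ? 1 : 0,
        is_word_run($small, $big)         ? 1 : 0,
        is_word_subsequence($small, $big) ? 1 : 0;
}
```

With these definitions, 'automation process' (the 'A C' case) counts as contained only under the subsequence rule, while 'process technology' (a 'C B' case) is rejected by all three. Whichever rule you pick would replace the index() test in the straightforward algorithm.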
If you don't think about these corner cases, then you can wind up in a frustrating cycle: you come up with an algorithm, code and test it, and submit it. Then the program runs OK for a bit, only to get a bug report against it when someone hits a corner case. You then update your algorithm, code and test it, and submit it again. Then another bug report comes in.
If you try to come up with ugly situations, you can save yourself a lot of time by asking for clarification for those special cases beforehand. Another benefit of coming up with special cases is that it can provide hints towards coming up with better algorithms--which may also prompt you to come up with a few more special cases.
I can think of several variations of this problem and solutions to the variations, but don't know which one(s) to suggest.
...roboticus
When your only tool is a hammer, all problems look like your thumb.
In reply to Re: retain longest multi words units from hash by roboticus, in thread retain longest multi words units from hash by Anonymous Monk