PerlMonks
Re: retain longest multi words units from hash

by atcroft (Abbot)
on Jul 30, 2018 at 01:37 UTC ( [id://1219443] )


in reply to retain longest multi words units from hash

Here is my attempt at a solution (also trying my hand at POD, which I badly need to learn!). Comments and corrections appreciated, as always.

Solution overview: This solution processes all loaded entries in order of key length (shortest first), then string order. Each key is split on non-word characters. If no non-word character is present, the entry is copied from the originating hash to the working hash. Otherwise, all possible partial keys are generated; for each partial key that already exists in the working hash, that entry is deleted. Once the partial keys have been processed, the current entry is copied to the working hash.
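The idea above can be sketched in a few lines. This is a minimal illustration, not the full code below: it re-joins sub-units on a single space, so (unlike the separator-preserving version further down) it only evicts space-separated partial keys.

```perl
#!/usr/bin/env perl
# Minimal sketch of the approach: process keys shortest-first; for each
# multi-word key, generate every contiguous run of sub-words and evict
# it from the working hash before storing the key itself.
use strict;
use warnings;

my %original = (
    'rendition'                     => 3,
    'automation'                    => 2,
    'saturation'                    => 3,
    'mass creation'                 => 2,
    'automation technology'         => 2,
    'automation technology process' => 3,
);

my %working;
for my $key ( sort { length $a <=> length $b || $a cmp $b } keys %original ) {
    my @words = split /\W+/, $key;
    if ( @words > 1 ) {
        # Every contiguous sub-sequence of words except the full key itself.
        for my $i ( 0 .. $#words ) {
            for my $j ( $i .. $#words ) {
                next if $i == 0 and $j == $#words;
                delete $working{ join q{ }, @words[ $i .. $j ] };
            }
        }
    }
    $working{$key} = $original{$key};
}

# %working now holds: rendition, saturation, mass creation,
# and automation technology process.
```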

Test results:

  1. Original dataset:
    %data = (
              'original' => {
                              'automation' => 2,
                              'automation technology' => 2,
                              'automation technology process' => 3,
                              'mass creation' => 2,
                              'rendition' => 3,
                              'saturation' => 3
                            },
              'resulting' => {
                             'automation technology process' => 3,
                             'mass creation' => 2,
                             'rendition' => 3,
                             'saturation' => 3
                           }
            );
    
  2. Enhanced data set 1:
    %data = (
              'original' => {
                              'automation' => 2,
                              'automation technology' => 2,
                              'automation technology process' => 3,
                              'bar' => 1,
                              'bar baz' => 1,
                              'bar baz quux' => 1,
                              'baz' => 1,
                              'baz quux' => 1,
                              'foo' => 1,
                              'foo bar' => 1,
                              'mass creation' => 2,
                              'quux' => 1,
                              'quuz' => 1,
                              'rendition' => 3,
                              'saturation' => 3
                            },
              'resulting' => {
                             'automation technology process' => 3,
                             'bar baz quux' => 1,
                             'foo bar' => 1,
                             'mass creation' => 2,
                             'quuz' => 1,
                             'rendition' => 3,
                             'saturation' => 3
                           }
            );
    
  3. Enhanced data set 2:
    %data = (
              'original' => {
                              'automation' => 2,
                              'automation technology' => 2,
                              'automation technology process' => 3,
                              'bar' => 2,
                              'bar baz' => 1,
                              'bar baz quux' => 1,
                              'bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                              'baz' => 1,
                              'baz quux' => 1,
                              'baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                              'corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                              'foo' => 2,
                              'foo bar' => 1,
                              'foo-bar' => 1,
                              'foo-bar-baz' => 1,
                              'foo-bar-baz-qux' => 1,
                              'foo-bar-baz-qux-quux' => 1,
                              'foo-bar-baz-qux-quux-quuz' => 1,
                              'foo-bar-baz-qux-quux-quuz-corge' => 1,
                              'foo-bar-baz-qux-quux-quuz-corge-grault' => 1,
                              'foo-bar-baz-qux-quux-quuz-corge-grault-garply' => 1,
                              'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo' => 1,
                              'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred' => 1,
                              'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh' => 1,
                              'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy' => 1,
                              'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                              'fred-plugh-xyzzy-thud' => 1,
                              'grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                              'mass creation' => 2,
                              'plugh-xyzzy-thud' => 1,
                              'quux' => 1,
                              'quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                              'quuz' => 1,
                              'quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                              'qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                              'rendition' => 3,
                              'saturation' => 3,
                              'thud' => 1,
                              'waldo-fred-plugh-xyzzy-thud' => 1,
                              'xyzzy-thud' => 1
                            },
              'resulting' => {
                             'automation technology process' => 3,
                             'bar baz quux' => 1,
                             'foo bar' => 1,
                             'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                             'mass creation' => 2,
                             'rendition' => 3,
                             'saturation' => 3
                           }
            );
    

Code:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;
use Getopt::Long;
use List::MoreUtils qw{ uniq };

# My preferences regarding Data::Dumper output.
$Data::Dumper::Deepcopy = 1;
$Data::Dumper::Sortkeys = 1;

$| = 1;

my $_DEBUG = 0;    # Toggle debug print statements

my %data;

# Load data from DATA handle
while ( my $line = <DATA> ) {
    chomp $line;
    $line =~ s/\# .* $//msx;                # Remove comments from data
    $line =~ s/ \s* $//msx;                 # Remove trailing whitespace
    next if ( $line =~ m/^ \s* $/imsx );    # Exclude blank lines
    $data{original}{$line}++;
}

%{ $data{resulting} } = retain_longest_mw_units( %{ $data{original} } );

print Data::Dumper->Dump( [ \%data, ], [qw( *data )] ), qq{\n};

# ======================================
# Subroutines
# ======================================

sub retain_longest_mw_units {
    my (%dataset) = @_;
    my %workingset;

    my @seen_key = uniq sort { length $a <=> length $b || $a cmp $b }
        keys %dataset;
    print Data::Dumper->Dump( [ \@seen_key, ], [qw( *seen_key )] ), qq{\n}
        if ($_DEBUG);

    my @key_length;
    foreach my $str (@seen_key) {
        my $str_val  = $dataset{$str};
        my @non_word = $str =~ m/(\W)/gmsx;
        print Data::Dumper->Dump( [ \$str, \@non_word, ],
            [qw( *str *non_word )] ), qq{\n}
            if ($_DEBUG);

        if ( !scalar @non_word ) {
            $workingset{$str} = $dataset{$str};
        }
        else {
            my @separator;
            my $lookup_start = 0;
            push @separator, { boln => 1, eoln => 0, idx => 0, position => 0, };
            foreach my $i ( 0 .. $#non_word ) {
                my $separator_idx = index $str, $non_word[$i], $lookup_start;
                push @separator,
                    {
                    eoln      => 0,
                    idx       => scalar @separator,
                    character => $non_word[$i],
                    position  => $separator_idx,
                    };
                $lookup_start = $separator_idx + 1;
            }
            push @separator,
                { eoln => 1, idx => scalar @separator, len => length $str, };
            print Data::Dumper->Dump( [ \$str, \@separator, ],
                [qw( *str *separator )] ), qq{\n}
                if ($_DEBUG);

            my %intended = ();
            foreach my $i ( 0 .. $#separator ) {
                foreach my $j ( $i .. $#separator ) {
                    next if ( $i == $j );
                    next if ( ( $i == 0 ) and ( $j == $#separator ) );
                    my $intended_key = q{};
                    my $start        = -1;
                    my $str_len      = -1;
                    if ( $separator[$i]{boln} ) {
                        $start = $separator[$i]{position};
                        $str_len =
                              $separator[$j]{eoln}
                            ? $separator[$j]{len}
                            : $separator[$j]{position};
                        $intended_key = substr $str, $start, $str_len;
                    }
                    else {
                        $start = $separator[$i]{position} + 1;
                        $str_len = (
                              $separator[$j]{eoln}
                            ? $separator[$j]{len}
                            : $separator[$j]{position}
                        ) - $start;
                        $intended_key = substr $str, $start, $str_len;
                    }
                    print sprintf join( qq{\t},
                        q{i: %d}, q{j: %d}, q{start: %d}, q{str_len: %d},
                        q{intended_key: %s}, )
                        . qq{\n}, $i, $j, $start, $str_len, $intended_key
                        if ($_DEBUG);
                    $intended{$intended_key}++;
                }
            }
            foreach my $k ( keys %intended ) {
                if ( exists $workingset{$k} ) {
                    delete $workingset{$k};
                }
            }
            $workingset{$str} = $dataset{$str};
            print Data::Dumper->Dump( [ \$str, \%intended, ],
                [qw( *str *intended )] ), qq{\n}
                if ($_DEBUG);
        }
    }
    print Data::Dumper->Dump( [ \@seen_key, \@key_length ],
        [qw( *seen_key *key_length )] ), qq{\n}
        if ($_DEBUG);

    return %workingset;
}

=pod

=head1 NAME

1219394-atcroft_attempt.pl - atcroft's attempt at the problem

=head1 SYNOPSIS

    ./1219394-atcroft_attempt.pl

=head1 DESCRIPTION

=head2 HISTORY

A L<post|https://www.perlmonks.org/?node_id=1219394> was submitted
2018-07-28T00:42Z by someone not logged into the site (referred to as an
I<Anonymous Monk>, and hereafter abbreviated I<AM>). The problem posed was:
given a hash containing multi-word units and their frequency, remove any
units contained within a longer unit. The example provided by the AM was
the following:

    # Input
    $VAR1 = {
        'rendition' => '3',
        'automation' => '2',
        'saturation' => '3',
        'mass creation' => 2,
        'automation technology' => 2,
        'automation technology process' => 3
    };

    # Desired output
    $VAR1 = {
        'rendition' => '3',
        'saturation' => '3',
        'mass creation' => 2,
        'automation technology process' => 3
    };

This code attempts to answer that question.
=head1 DETAILS

L<LanX|https://www.perlmonks.org/?node_id=708738> posted the first
L<response|https://www.perlmonks.org/?node_id=708738>, with the suggestion
of a partial solution by doing the following:

=over 4

=item C<loop over the keys>

=item C<split multi-words into a list>

=item C<if the list contains more than one word, delete single words in that list>

=back

He updated a little while later, noting that his suggestion missed
multi-word duplicates. It was after reading that post (in oldest-first
order) that this author set about trying to accomplish the task himself.

This solution processes all loaded entries in order of key length (shortest
first), then string order. Each key is split on non-word characters. If no
non-word character is present, the entry is copied from the originating
hash to the working hash. Otherwise, all possible partial keys are
generated; for each partial key that exists in the working hash, that entry
is deleted. Once the partial keys have been processed, the current entry is
copied to the working hash.

=head1 DEPENDENCIES

This code uses the L<List::MoreUtils> C<uniq> function, to avoid having to
copy one or roll my own.

=head1 BUGS

Assume they are likely present, and proceed appropriately. If confirmed,
please inform this author.

=over 4

=item It is likely more verbose than necessary.

=item It likely contains evidence of bad programming habits.

=item It likely contains evidence of bad programming style(s).

=item It breaks keys on non-word characters (which may be an issue depending on the reader's expectation regarding word characters).

=item It may contain other limitations and/or restrictions.

=back

=head1 AUTHOR

L<atcroft|https://www.perlmonks.org/?node_id=70929>

=head1 COPYRIGHT AND LICENSE

Copyright 2018 by atcroft

This code is free software; you may redistribute it and/or modify it under
the same terms as Perl itself.
=cut

__DATA__
# Original data (matches original counts)
rendition
automation
mass creation
automation technology
rendition
automation
saturation
mass creation
automation technology process
saturation
automation technology
automation technology process
saturation
rendition
automation technology process
# Included as part of Enhanced Data Sets 1 & 2
foo
bar
baz
quux
quuz
foo bar
bar baz
baz quux
bar baz quux
# Included as part of Enhanced Data Set 2
foo
bar
foo-bar
foo-bar-baz
foo-bar-baz-qux
foo-bar-baz-qux-quux
foo-bar-baz-qux-quux-quuz
foo-bar-baz-qux-quux-quuz-corge
foo-bar-baz-qux-quux-quuz-corge-grault
foo-bar-baz-qux-quux-quuz-corge-grault-garply
foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo
foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred
foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh
foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy
bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
corge-grault-garply-waldo-fred-plugh-xyzzy-thud
grault-garply-waldo-fred-plugh-xyzzy-thud
waldo-fred-plugh-xyzzy-thud
fred-plugh-xyzzy-thud
plugh-xyzzy-thud
xyzzy-thud
thud
foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
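The POD's HISTORY section mentions LanX's partial suggestion and its gap. A minimal sketch (my reading of that suggestion, not LanX's actual code) shows why deleting only single words misses multi-word sub-units:

```perl
#!/usr/bin/env perl
# Sketch of the partial approach: for each multi-word key, delete the
# single words it contains. 'automation' goes away, but the multi-word
# sub-unit 'automation technology' survives -- hence the separator-based
# partial-key pass in the full solution above.
use strict;
use warnings;

my %units = (
    'automation'                    => 2,
    'automation technology'         => 2,
    'automation technology process' => 3,
);

for my $key ( keys %units ) {
    my @words = split /\s+/, $key;
    next if @words < 2;
    delete $units{$_} for @words;    # only single-word keys are evicted
}

# %units still contains 'automation technology', even though it is
# contained within 'automation technology process'.
```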

Hope it helps.
