Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a log file I want to process to extract the commands that were executed. I have a list of tokens; however, the tokens may have been abbreviated when they were used. For example, for the tokens report_time and report_day, the log file could contain report_t or report_d, but not just report (which would be ambiguous).

Example of tokens:

    my @tokens = qw/report_time report_day reset read/;

These can be abbreviated as:

    report_t report_d res rea

Example of log file:

    report_t 14:09:33 PDT
    report_d Fri Jun 12 2015
    res Resetting the time
    report_time 00:00:00
    rea foo.bar
    Info: reading file foo.bar
I would want the Perl script to print something like:

report_t report_d res report_time rea
Thanks

Replies are listed 'Best First'.
Re: Script to reduce tokens to minimal unique characters
by thanos1983 (Parson) on Jun 12, 2015 at 21:28 UTC

    Hello Anonymous,

    Well, this site is about discussion and assisting people with their problems, not about asking people to do your work!

    Having said that, you do not provide us with enough data to answer your question. Based on your data, how many times do you want to read and reduce the tokens: once, or an arbitrary number of times?

    How will the tokens be provided? Do you want to open a file, read its contents, and then process them, or will a user provide the document as input?

    Apart from all these notes, please provide a minimal example of code to show at least the effort you made to solve it yourself.

    Update: Sample of code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @tokens = ("report_time", "report_day", "reset", "read");

    foreach my $singleToken (@tokens) {
        if ($singleToken eq "report_time") {
            print "report_t" . "\n";
        }
        elsif ($singleToken eq "report_day") {
            print "report_d" . "\n";
        }
        elsif ($singleToken eq "reset") {
            print "res" . "\n";
        }
        elsif ($singleToken eq "read") {
            print "rea" . "\n";
        }
    }

    __DATA__
    report_t
    report_d
    res
    rea

    Is this the expected output?

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Script to reduce tokens to minimal unique characters
by QM (Parson) on Jun 15, 2015 at 10:52 UTC
    A quick hack to find all of the unique token abbreviations, construct a regex, and report the matching tokens.
    #!/usr/bin/env perl
    use strict;
    use warnings;

    my @tokens = qw/report_time report_day reset read/;
    my @abbrev;

    for my $token (@tokens) {
        if (1 == grep /^$token/, @tokens) {
            push @abbrev, $token;
        }
        else {
            die "Error: $token is not unique in list of tokens\n";
        }

        # Generate abbreviations that match only one token
        my $abbrev = $token;
        while (1) {
            chop($abbrev);
            last unless (1 == grep /^$abbrev/, @tokens);
            push @abbrev, $abbrev;
        }
    }

    # Show the abbreviations
    print "Abbreviations = (" . join(',', sort @abbrev) . ")\n";

    # Create the regex string to use
    my $regex = join('|', sort @abbrev);
    print "Token regex = \"$regex\"\n\n";

    # Read the file and output tokens
    while (<>) {
        my $match;
        if (($match) = m/^($regex)$/) {
            my @matches = grep /^$match/, @tokens;
            print "$match matched for @matches on line $.\n";
        }
    }

    exit;

    Input file:

    report_t
    14:09:33 PDT
    report_d
    Fri Jun 12 2015

    res
    Resetting the time

    report_time
    00:00:00

    Output:

    Abbreviations = (rea,read,report_d,report_da,report_day,report_t,report_ti,report_tim,report_time,res,rese,reset)
    Token regex = "rea|read|report_d|report_da|report_day|report_t|report_ti|report_tim|report_time|res|rese|reset"
    report_t matched for report_time on line 1
    report_d matched for report_day on line 3
    res matched for reset on line 6
    report_time matched for report_time on line 9

    There's probably a module or two that will do this for you, but I didn't bother to look for it.
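    One likely candidate is the core module Text::Abbrev, which computes exactly this map of unambiguous truncations. A minimal sketch against the OP's sample data (added for illustration, not part of the original reply):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::Abbrev;

    my @tokens = qw/report_time report_day reset read/;

    # abbrev() returns a hash mapping every unambiguous truncation
    # (including the full word) to the token it stands for.
    my %abbrev = abbrev(@tokens);   # e.g. 'report_t' => 'report_time', 'res' => 'reset'

    while (<DATA>) {
        my ($first) = /^(\w+)/ or next;
        print "$first\n" if exists $abbrev{$first};
    }

    __DATA__
    report_t 14:09:33 PDT
    report_d Fri Jun 12 2015
    report (should not show up)
    res Resetting the time
    report_time 00:00:00
    rea foo.bar
    Info: reading file foo.bar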

    Edit to add:

    I also thought of creating regex expressions, as above, but with a single subexpression per unique token. For report_time, the regex is:

    m/^report_t(?:i(?:m(?:e)?)?)?/

    It's a bit more complicated to do this than the above script's method. For a small list of @tokens, it makes little difference. For huge lists, it's probably better to put the original pipecleaner version through an optimizer, which will be somewhat better than this head-scratching fingernails version.
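    As a rough sketch of that idea (added for illustration, not QM's original code), such nested-optional alternations can be generated from each token's shortest unique prefix:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @tokens = qw/report_time report_day reset read/;

    my @alternatives;
    for my $token (@tokens) {
        # Grow the prefix one character at a time until it matches only this token.
        my $len = 1;
        $len++ while $len < length($token)
                  && 1 < grep { index($_, substr($token, 0, $len)) == 0 } @tokens;
        my $prefix = substr($token, 0, $len);
        my $rest   = substr($token, $len);

        # Wrap the remaining characters in nested optional groups,
        # e.g. "ime" becomes "(?:i(?:m(?:e)?)?)?".
        my $tail = '';
        $tail = '(?:' . quotemeta($_) . $tail . ')?' for reverse split //, $rest;

        push @alternatives, quotemeta($prefix) . $tail;
    }

    my $regex = join '|', @alternatives;
    print "$regex\n";
    # report_t(?:i(?:m(?:e)?)?)?|report_d(?:a(?:y)?)?|res(?:e(?:t)?)?|rea(?:d)?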

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Script to reduce tokens to minimal unique characters
by Anonymous Monk on Jun 12, 2015 at 22:44 UTC

    Here's a guess at what you want:

    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1130246
    use strict;
    use warnings;

    my @tokens = qw/report_time report_day reset read/;

    while (<DATA>) {
        /^(\w+)/ or next;
        my $part = $1;
        # print the first word only if it is a prefix of exactly one token
        print "$part\n" if 1 == grep /^$part/, @tokens;
    }

    __DATA__
    report_t 14:09:33 PDT
    report_d Fri Jun 12 2015
    report (should not show up)
    res Resetting the time
    report_time 00:00:00
    rea foo.bar
    Info: reading file foo.bar
Re: Script to reduce tokens to minimal unique characters
by Anonymous Monk on Jun 13, 2015 at 01:56 UTC

    On the other hand, maybe you wanted a hash with all the valid unique prefixes of @tokens, like this:

    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1130246
    use strict;
    use warnings;

    my @tokens = qw/report_time report_day reset read/;

    my %valid;
    # generate hash with valid unique matches: the trailing ^ can never match
    # after (.+) has consumed characters, so the engine backtracks through every
    # leading substring of the token, running the embedded code once per prefix
    /^(.+)(??{$valid{$1}++})^/ for @tokens;
    # drop prefixes shared by more than one token, leaving only the unique ones
    delete @valid{grep $valid{$_} > 1, keys %valid};

    /^(\w+)/ and $valid{$1} and print "$1\n" while <DATA>;

    __DATA__
    report_t 14:09:33 PDT
    report_d Fri Jun 12 2015
    report (should not show up)
    res Resetting the time
    report_time 00:00:00
    rea foo.bar
    Info: reading file foo.bar
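    For comparison (an illustrative sketch, not part of the original reply), the same %valid hash of unique prefixes can be built without the (??{...}) backtracking trick, as a drop-in replacement for the two %valid-building lines above:

    my %seen;
    for my $token (@tokens) {
        # count every leading substring of every token ...
        $seen{ substr($token, 0, $_) }++ for 1 .. length $token;
    }
    # ... and keep only the prefixes that belong to exactly one token
    my %valid = map { $_ => 1 } grep { $seen{$_} == 1 } keys %seen;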
Re: Script to reduce tokens to minimal unique characters
by gator456 (Novice) on Jun 15, 2015 at 16:15 UTC
    Sorry for the anonymous post. I thought I was logged in when I hit submit.

    In response to the first poster, I am not a college student and this is not a homework assignment.

    I work for a CAD company and do software support. I often get log files from customers that I need to debug. I want to write a script to extract the commands the customer ran. I can get a list of the valid commands via help; however, the UI allows the commands to be abbreviated.

    I have over 20 years' experience with Perl, but I did not quite know how to solve this problem. So instead of spinning my wheels, I decided to leverage Perl Monks.

    Thank you, everyone, for the other posts; they were the type of ideas I was looking for.

Re: Script to reduce tokens to minimal unique characters
by AnomalousMonk (Archbishop) on Jun 16, 2015 at 02:51 UTC

    Here's an approach that also generates a hash of unique leading substrings for a set of strings. This hash is then used to generate a set of regexes for unique matching along the lines of QM's final thought above. This is more complex, but IMHO more flexible than some of the other approaches. YMMV.

    Update: Here's a simpler (and I expect faster, but I've done no Benchmark-ing) definition of the LeadingDistinct::diff() subroutine. The (assumed) speed-up won't be significant unless you're processing millions of symbols. Tested.

    sub diff {
        # XOR the two strings: equal leading characters become "\x00" bytes, so
        # the leading run of "\x00" is the common prefix; $+[0] is the offset
        # where that run (and hence the common prefix) ends.
        ($_[0] ^ $_[1]) =~ m{ \A \x00* }xms;
        return $+[0];
    }
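
    For illustration (added, not part of the original post), diff() returns the length of the common leading substring of its two arguments:

    print diff('report_time', 'report_day'), "\n";   # 7 ("report_")
    print diff('reset',       'read'),       "\n";   # 2 ("re")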


    Give a man a fish:  <%-(-(-(-<