longest common substring... almost?

Wodenic has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a script which generates the longest common substring + 1 character for each line in a file.
The existing Algorithm::Diff doesn't support what I'm looking for.
eg a file with lines containing:

RPL AUSTRIA AR REPORTING
RPL AUSTRIA AR REV ACCT
..
..

I'm trying to get the script to return the smallest number of characters to uniquely identify each line.
eg the script should return

RPL AUSTRIA AR REP
RPL AUSTRIA AR REV

the Algorithm::Diff package is too greedy and returns ALL common characters in order
eg RPL AUSTRIA AR RE T

Anyone have any ideas ? or do I feel a new Algorithm coming on :)

Thanks
Wodenic

Comment on longest common substring... almost?

Replies are listed 'Best First'.
Re: longest common substring... almost? by kvale (Monsignor) on Apr 18, 2004 at 18:06 UTC
Text::Abbrev, a core module, can do what you want: `use Text::Abbrev; chomp(my @a = <DATA>); s/\s/_/g foreach @a; my %hash = abbrev @a; print "$_ $hash{$_}\n" foreach sort keys %hash; __DATA__ RPL AUSTRIA AR REPORTING RPL AUSTRIA AR REV ACCT` [download] This yields RPL_AUSTRIA_AR_REP RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPO RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPOR RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPORT RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPORTI RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPORTIN RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REV RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_ RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_A RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_AC RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_ACC RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_ACCT [download] the set of all possible unique abbreviations of your list of words, which is frequently what one wants. If you want just the shortest abbrev, pick the first member for each value from the sorted list above. The underscores are for readablility, you can drop them. -Mark	[reply] [d/l] [select]
Re: longest common substring... almost? by borisz (Canon) on Apr 18, 2004 at 17:47 UTC
use String::Ediff. Boris	[reply]
Re: longest common substring... almost? by haoess (Curate) on Apr 18, 2004 at 19:26 UTC
There's a difference between I'm trying to write a script which generates the longest common substring + 1 character for each line in a file. and I'm trying to get the script to return the smallest number of characters to uniquely identify each line. Assume the following lines: `AA AB ABC ABCDE` [download] In the first case, the output should be `AA AB AB AB` [download] In the second case, output should be `AA AB ABC ABCD` [download] For the first case, this should work: Read more... (890 Bytes) -- Frank ('s first post at perlmonks)	[reply] [d/l] [select]
Re: Re: longest common substring... almost? by ambrus (Abbot) on Apr 19, 2004 at 09:03 UTC
Later in the original question: I'm trying to get the script to return the smallest number of characters to uniquely identify each line. This makes clear that the poster wants the second case.	[reply]
Re: longest common substring... almost? by ambrus (Abbot) on Apr 23, 2004 at 13:35 UTC
See the thread Generate unique initials from a list of names with almost the same question.	[reply]
Re: longest common substring... almost? by Wassercrats (Initiate) on Apr 20, 2004 at 14:06 UTC
I just posted something similar here. I'm not sure it's what you want, but it works for me.	[reply]