Wodenic has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a script which generates the longest common substring + 1 character for each line in a file.
The existing Algorithm::Diff doesn't support what I'm looking for.
eg a file with lines containing:

RPL AUSTRIA AR REPORTING
RPL AUSTRIA AR REV ACCT
..
..

I'm trying to get the script to return the smallest number of characters to uniquely identify each line.
eg the script should return

RPL AUSTRIA AR REP
RPL AUSTRIA AR REV

the Algorithm::Diff package is too greedy and returns ALL common characters in order
eg RPL AUSTRIA AR RE T

Anyone have any ideas ? or do I feel a new Algorithm coming on :)

Thanks
Wodenic

Replies are listed 'Best First'.
Re: longest common substring... almost?
by kvale (Monsignor) on Apr 18, 2004 at 18:06 UTC
    Text::Abbrev, a core module, can do what you want:
    use Text::Abbrev; chomp(my @a = <DATA>); s/\s/_/g foreach @a; my %hash = abbrev @a; print "$_ $hash{$_}\n" foreach sort keys %hash; __DATA__ RPL AUSTRIA AR REPORTING RPL AUSTRIA AR REV ACCT
    This yields
    RPL_AUSTRIA_AR_REP RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPO RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPOR RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPORT RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPORTI RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPORTIN RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REPORTING RPL_AUSTRIA_AR_REV RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_ RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_A RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_AC RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_ACC RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_ACCT RPL_AUSTRIA_AR_REV_ACCT
    the set of all possible unique abbreviations of your list of words, which is frequently what one wants. If you want just the shortest abbrev, pick the first member for each value from the sorted list above. The underscores are for readablility, you can drop them.

    -Mark

Re: longest common substring... almost?
by haoess (Curate) on Apr 18, 2004 at 19:26 UTC

    There's a difference between

    I'm trying to write a script which generates the longest common substring + 1 character for each line in a file.

    and

    I'm trying to get the script to return the smallest number of characters to uniquely identify each line.

    Assume the following lines:

    AA AB ABC ABCDE

    In the first case, the output should be

    AA AB AB AB

    In the second case, output should be

    AA AB ABC ABCD

    For the first case, this should work:

    -- Frank ('s first post at perlmonks)

      Later in the original question:

      I'm trying to get the script to return the smallest number of characters to uniquely identify each line.

      This makes clear that the poster wants the second case.

Re: longest common substring... almost?
by borisz (Canon) on Apr 18, 2004 at 17:47 UTC
Re: longest common substring... almost?
by ambrus (Abbot) on Apr 23, 2004 at 13:35 UTC
Re: longest common substring... almost?
by Wassercrats (Initiate) on Apr 20, 2004 at 14:06 UTC
    I just posted something similar here. I'm not sure it's what you want, but it works for me.