Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Generate unique ids of maximum length

by lima1 (Curate)
on Apr 12, 2010 at 16:11 UTC ( [id://834306]=perlquestion: print w/replies, xml ) Need Help??

lima1 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

this is probably a trivial question but I could not find a good answer - most likely because I've searched with the wrong keywords.

My problem: I have a list of unique ids. I want to shorten the ids so that they have a maximum $length but the ids should still be unique. As another requirement, the new ids should be as similar as possible to the old ones.

The following quick and dirty solution works in my cases, but can fail (for example, set $length to 2 - I was just too lazy to fix this). It could be improved in several ways, for example instead of appending numbers from 1 to n, cat pre- and suffixes (in the DATA example, Lenoc3_caA, Lenoc3_caB instead of Lenoc3_ca1,...).

My question is now is there already something (better) on CPAN? If not, do you think it's worth packaging or do you have better solutions?

#!/usr/bin/perl use strict; use warnings; use Data::Dumper; sub unique_ids { my ( $length, $ids_ref ) = @_; # first check if we need to do something my %u_ids; %u_ids = map { my $s = substr($_, 0, $length); $s => ++$u_ids{$s} +} @{$ids_ref}; if (scalar(keys %u_ids) == scalar @{$ids_ref}) { return @{$ids_ref}; } # fix the non-unique ids my @u_ids; my %incr; for my $id (@{$ids_ref}) { my $s = substr($id, 0, $length); if ($u_ids{$s} > 1) { die 'Length too small' if length($u_ids{$s}) > $length; #this string could be not unique!!! $s = substr($id,0,$length-length($u_ids{$s})) . ++$incr{$s +}; } push @u_ids, $s; } return @u_ids; } my @ids = <DATA>; chomp @ids; my @u_ids = unique_ids(10,\@ids); warn Dumper \@u_ids; __DATA__ A2990_duallayer_1 A2990_duallayer_2 A2990_duallayer_3 A2990_duallayer_4 A2990_duallayer_5 A2990_duallayer_6 A2990_duallayer_7 A2990_duallayer_8 A2990_duallayer_9 A2990_duallayer_10 LXP_01 LXP_02 LXP_03 LXP_04 LXP_05 LXP_06 LXP_07 LXP_08 LXP_09 LXP_10 LXP_11 LXP_12 LXP_13 LXP_14 LXP_15 LXP_16 LXP_17 LXP_18 Normal_1 Normal_2 Normal_3 Normal_4 Normal_5 Normal_6 Lenoc3_carina_A Lenoc3_carina_B Lenoc3_carina_C Lenoc3_duallayer_1 Lenoc3_duallayer_2 Lenoc3_duallayer_3 Lenoc5_carina_1 Lenoc5_carina_2 Lenoc5_carina_3 Lenoc5_duallayer_1 Lenoc5_duallayer_2 Lenoc5_duallayer_3

Replies are listed 'Best First'.
Re: Generate unique ids of maximum length
by ikegami (Patriarch) on Apr 12, 2010 at 18:53 UTC
    A2990_du_1 A2990_du_2 A2990_du_3 A2990_du_4 A2990_du_5 A2990_du_6 A2990_du_7 A2990_du_8 A2990_du_9 A2990_du_10 LXP_01 LXP_02 LXP_03 LXP_04 LXP_05 LXP_06 LXP_07 LXP_08 LXP_09 LXP_10 LXP_11 LXP_12 LXP_13 LXP_14 LXP_15 LXP_16 LXP_17 LXP_18 Len3_ca_A Len3_ca_B Len3_ca_C Len3_du_1 Len3_du_2 Len3_du_3 Len5_ca_1 Len5_ca_2 Len5_ca_3 Len5_du_1 Len5_du_2 Len5_du_3 No_1 No_2 No_3 No_4 No_5 No_6

    can be obtained using the following:

    use strict; use warnings; sub add { my $p = \shift; $p = \( $$p->{$_} ) for @_; $$p->{''} = 1; } sub shorten_unsplit { our $fixed; local *fixed = \$_[0]; our $unsplit; local *unsplit = \$_[1]; for ($unsplit) { if ( s/^([^A-Za-z]+[A-Za-z]?)// ) { $fixed .= $1; redo; } if ( s/^(?=(.))[A-Z]*[a-z]*//s ) { $fixed .= $1; redo; } } } sub shorten { my @results; local *helper = sub { my ($trie, $fixed, $unsplit) = @_; my $single = ( keys(%$trie) == 1 ); shorten_unsplit($fixed, $unsplit) if !$single || exists($trie->{''}); for my $key ( sort {; no warnings 'numeric'; $a <=> $b || $a cmp $b } keys(%$trie) ) { if ($key eq '') { push @results, $fixed; } elsif ($single) { helper($trie->{$key}, $fixed, "$unsplit$key"); } else { helper($trie->{$key}, "$fixed$key", ''); } } }; my $trie; add($trie, /\d+|./sg) for @_; return if !$trie; helper($trie, '', ''); return @results; } { chomp( my @data = <DATA> ); print("$_\n") for shorten(@data); } __DATA__ ...

    Useful test case: Add A2990_dualplayer_10.

    The code can manipulated to remove less when desired.

    Update: Fixed a bug.
    Update: And another.
    Update: Improved sorting (1,2,...,9,10 vs 1,10,2,...,9).
    Update: Input strings can now be substrings of other input strings. Simplified code at the same time.

      Wow, nice code. Took me a while to get it, though!
      However, A2990_duallayerA_1 is not shortened in any way... Nevertheless, nice code. Some comments would still be helpful for faster understanding.

        To shorten "A2990_duallayerA_1" when "A2990_duallayer_1" is also present would require removing from the middle of the word, and that goes against your examples. You didn't specify your spec, so I had to guess a lot.

        To handle this case, you could consider a lowercase followed by an uppercase to be a word break. Change

        if ($key =~ /^[a-zA-Z]\z/) {

        to

        if ( $key =~ /^[a-z]\z/ || $key =~ /^[A-Z]\z/ && $flux !~ /[a-z]\z/ ) {

        You get:

        A2990_duA_1 A2990_du_1 A2990_du_2 A2990_du_3 A2990_du_4 A2990_du_5 A2990_du_6 A2990_du_7 A2990_du_8 A2990_du_9 A2990_du_10 LXP_01 LXP_02 ...

        Fixed in original code.

Re: Generate unique ids of maximum length
by jonadab (Parson) on Apr 12, 2010 at 18:01 UTC

    I think you're trying to use the same field for two things: a unique ID, and a human-readable name. You can do that, but there are going to be trade-offs, and one of them is that your unique ID field, in order to remain human-readable, will be longer than would otherwise be necessary. You can use abbreviations up to a point, but eventually you're going to run into "I don't remember what L stands for, was that Lexmark or Luminox or Lenoc or Linksys or what?"

    Personally, I would use two fields: a unique ID field (which could just be an auto_increment field if you're storing these in a RDBMS) and a separate human-readable name field that doesn't have to be so short. (You can put a must-be-unique constraint on the name field or not, depending on your preference.)

    Just a thought.

      Whould be cool, yes, but in my case unfortunately not possible. A visualization tool I need for downstream analyses only supports IDs of length 10 or less. So it just makes life easier when the new IDs are close to the old ones.
Re: Generate unique ids of maximum length
by CountZero (Bishop) on Apr 12, 2010 at 17:19 UTC
    The "as similar as possible to the old ones" is what makes it difficult. How would you define it in a way that a computer understands?

    And how dissimilar are you allowed to get, if you save on the length?

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Hm, good point. Probably something like the edit distance, but with the preference to keep the prefix and suffix to make the ids readable. So one simple algorithm would optimize the sum of edit distances under the constraint that the ids have to be unique.
Re: Generate unique ids of maximum length
by BioLion (Curate) on Apr 12, 2010 at 17:29 UTC

    I am sure someone will disagree, but i think your approach is fine (apart from the caveats you mention). An alternate approach might be to make a totally unique id, that fits your requrements (I can't remember where i originally found this code - sorry if it is you!):

    sub generate_random_string { my $length_of_randomstring = shift; # the length of the random +string to generate my @chars=('a'..'z','A'..'Z','0'..'9','_'); my $random_string; foreach (1..$length_of_randomstring) { # rand @chars will generate a random # number between 0 and sc +alar @chars $random_string.=$chars[rand @chars]; } return $random_string; }

    Or is should imagine there are random id generators on CPAN...

    And associate that with the readable id using a lookup table or similar(I assume this is why you want to keep them "as similar as possible"?). You can then convert between the two for printing purposes, but for storage, all your entries have a proper (but non-sensical) random and unique id.

    Just my opinion, but i hope it helps...

    Just a something something...

      Why a random string and not simply a sequence number?  The latter would have the advantage of definitely being unique (besides being trivial to implement), while a random string might (in theory) repeat, and thus not necessarily be "totally unique"... (which means you'd need to check against a hash holding already used IDs).  Also, for the reverse lookup a simple array would suffice.

        Good points - especially the array part - I was just thinking about the 'strings of the same length' part. Either way, the point is the same.

        Just a something something...
Re: Generate unique ids of maximum length
by Anonymous Monk on Apr 12, 2010 at 18:02 UTC
    Perhaps feed all the strings into Regexp::Assemble (or Regexp::Assemble::Compressed?), and then analyze the regexp it produces to remove "common" information from your existing strings? I've never used the ::Compressed module, so I don't have any feel for how it is different.
Re: Generate unique ids of maximum length
by jeffreyray (Sexton) on Apr 12, 2010 at 18:14 UTC

    I use String::Random when wanting to create randomly generated identifiers. I like the randregex method - which takes a regex and will generate a random string that will conform to the regex.

    This doesn't solve the problem you are - but I thought you/others may find it useful for solving this type of problem.

    It also doesn't verify that the id is unique among others, but that can be accomplished trivially using a hash with the identifiers as keys.

Re: Generate unique ids of maximum length
by rubasov (Friar) on Apr 13, 2010 at 20:00 UTC

    Another idea probably worth mentioning: embrace tab completion. Let's suppose we can use tab completion on your input, for instance consider this line:

    Lenoc3_duallayer_1 ^^ ^ ^ ^

    You must type the marked characters, but the others are optional, they can be completed by pressing tab. Therefore in the shortening process always keep the mandatory characters, but leave out as many optional as needed starting from the right end.

    limit shortened 10 Lenoc3_du1 ... 6 Len3d1 5 Le3d1 4 not possible

    This is achievable by parsing your input into a suffix tree and operating on that.

    L -> XP_ -> 0 -> ... -> 1 -> ... -> enoc -> 3_ -> carina_ -> ... -> duallayer_ -> ... -> 5_ -> carina -> ... -> duallayer -> ...

    The first characters of the tree node strings are the mandatory ones, the others are optional. This tree structure seems similar to ikegami's code, however I haven't understood that fully yet, so I'm not sure what's the difference. For me the benefit of this approach seems to be this: you don't need to use the concept of word. Another pro for this algorithm is that you don't need to number your shortened identifiers. (But numbering your input lines (in the base of your input character set) is not bad because that produces the shortest possible unique ids.)

    I have some proof of concept code, but take it with a grain of salt: it's unnecessarily complex, employs dirty hacks etc.

    Some explanation: I've started by parsing the input into a character-level suffix tree. This tree uses hashes, the keys are the substrings, the values are hashrefs of what can follow. A possible ending position of the string is marked by "" => undef. Then collapsed it to a substring-level suffix tree and descended into that to shorten the ids.

    Sorry for the late post, yesterday I haven't got enough time to write this.

    p.s.: Am I right this is guaranteed to produce unique ids? I'm not sure, but it seems good.

    update: answered my own question: Yes, the uniqueness of the shortened ids is guaranteed, no matter what set of optional characters are left out.

      This tree structure seems similar to ikegami's code, however I haven't understood that fully yet, so I'm not sure what's the difference.

      My tree is the same as your $ctree:

      A -- ... / \ X -- P -- ... \ / L \ e -- n -- o -- c -- ...

      except numbers are considered atomic in mine.

      I don't bother collapsing into your $stree form:

      A -- ... / \ XP -- ... \ / L \ enoc -- ...

      You must type the marked characters, but the others are optional

      The difference with mine is that I made more character mandatory. The rational is that the OP wanted to the result to resemble the original as much as possible.

      Your mandatory characters:

      • Ambiguous characters.
      Lenoc3_duallayer_1 -> Le3d1 ^^ ^ ^ ^

      My mandatory characters:

      • Ambiguous characters.
      • A sequence of 0+ uppercase letters followed by a sequence of 0+ lowercase letters preceding an ambiguous lowercase letter.
      • First and second unambiguous character of a text sequence
      • Nonletters (digits, underscores)
      Lenoc3_duallayer_1 -> Len3_du_1 ^^^ ^^^^ ^^

        Thanks for the explanation, I've got it. (Then consider my node as an explanation to ikegami's node.)

        It's funny how many different ways people interpret similarity/resemblance, because that was exactly the reason why I chose to keep all the optional characters if the id already fit in the char limit. That way my code always keeps the under-limit ids identical (== more similar). Of course in other cases that's not the optimal choice.

        I also thought of (but not implemented) a more generic way to decide which character to drop from the original id: provide the user a filter callback in which s?he can rate the characters (or substrings) considered, then drop the ones with the lowest rating (still from right to left). For example: [_ ] => 3, [A-Z] => 2, [a-z]=> 1, anything else => 0

        And this is why I've collapsed the char-level suffix tree to the substring-level: to ease the access to substrings for the purpose of rating. And also because the structure of the tree in the substring-level form cannot interfere with the selection of (non-)ambiguous characters (as in choroba's remark above if I get it right).

        Cheers

Re: Generate unique ids of maximum length
by LanX (Saint) on Apr 13, 2010 at 13:56 UTC
    your "new ids should be as similar as possible to the old ones" is poorly defined and -sorry - your code is far too undocumented to be read or published (especially for such a delicate use case)

    I'll give you a concept instead of code:

    For what I understand you need 5 steps

    1. split /_/
    2. split /(\d+)/
    3. count nondigit substrings in a hash %strings
    4. shorten keys of this hash "as similar as possible"
    5. recompose

    The magic is of course in step 4 (and you have to care about about different lengths of numbers in step 2)

    for instance you could start to investigate the first n chars of each \w+ string in a loop.

    As long as a grep(/$shortend\w*/) keys %strings counts more than 1 match you have to try again with a larger $n.

    you can extend this method to  grep(/$pre\w*$post/) ...

    To further shorten the recomposed string you might consider using CamelCase instead of _ as delimiter (i.e. uppercase first character)

    To be clear, all of this doesn't guaranty len(strings) <= $length, so you might be forced to skip readability for some keys in an extra step of shortening.

    Cheers Rolf

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://834306]
Approved by Corion
Front-paged by toolic
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (3)
As of 2024-03-29 05:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found