dwhite20899 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks!

I have a number of strings, made up only of 64 characters: a-zA-Z0-9/+ . I need to collapse these down to just the unique characters in the string. I can do this with a one-liner, but there's got to be a better way to do this.

I've looked at regexes, and map, but I can't figure out anything really clean and fast.

Given input data like

9b/lllqtUst48MMMxwBHz+wluFguNx5h3DnyKxfxFjNazwc0X
9b/lllqtxTElC8GLsftS2RKkAxI1MfQTIuNx5h3P4eoEphA31djsn
9b/lllqtyk/lC8MMMxwwwhcTAxlhNBl1ugoluNx5h3fjxud309RVeCIY
9b/lllqtcxP8MMMxBvFfOh8lLxQTfguNx5h3LKrE0maElw
9b/lllqtdx48MMMxw6h4+mol1ugoluNx5h3AKrdGQ9OCnW
9b/lllqtf2l48MMMxDxaxxz29OAHIuNx5h3wG2TEyFqu3RdBin
9b/lllqthPElC8MMMxw78QQ0bfaPlI14TfguNx5h30beTAmP1cfA+Z
9b/lllqtryA8GL2m2s9OQrxzotuIuNx5h3MUw/45uzMeC5ww

and this one-liner:

perl -lne 'chomp; undef %h; @p=split(//,$_); foreach $k (@p) { $h{$k}=$k; } print (sort keys(%h));' test-strings

I get these correct results:

+/034589BDFHKMNUXabcfghjlnqstuwxyz
/1234589ACEGIKLMNPQRSTbdefhjklnopqstux
/013589ABCIMNRTVYbcdefghjkloqtuwxy
/03589BEFKLMNOPQTabcfghlmqrtuvwx
+/1345689ACGKMNOQWbdghlmnoqrtuwx
/234589ABDEFGHIMNORTabdfhilnqtuwxyz
+/01345789ACEIMNPQTZabcefghlmqtuwx
/234589ACGILMNOQUbehlmoqrstuwxyz

Any pointers are appreciated! Thanks!

Update: Got it, thanks! But the golfing has commenced...

Re: Collapsing a string to unique characters
by ikegami (Patriarch) on Jan 09, 2009 at 14:02 UTC

    In existing order:

    perl -nle"my %seen; print grep !$seen{$_}++, /./g"

    In lexical order:

    perl -nle"my %seen; print sort grep !$seen{$_}++, /./g"

    By the way, the chomp is useless because -nl already chomps.

    >perl -MO=Deparse -nle"foo()"
    BEGIN { $/ = "\n"; $\ = "\n"; }
    LINE: while (defined($_ = <ARGV>)) {
        chomp $_;
        foo();
    }
    -e syntax OK

    Update: Oops, I had my tests inverted. Fixed.
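For anyone who would rather keep this in a script than on the command line, here is a minimal sketch of the same `%seen` idiom. The sub names and the in-memory sample data are made up for illustration; the explicit `chomp` stands in for what `-l` adds to the `-n` loop.

```perl
use strict;
use warnings;

# Unique characters in order of first appearance.
sub uniq_in_order {
    my ($line) = @_;
    my %seen;
    return join '', grep { !$seen{$_}++ } $line =~ /./g;
}

# Unique characters in sorted (code point) order.
sub uniq_sorted {
    my ($line) = @_;
    my %seen;
    my @uniq = grep { !$seen{$_}++ } $line =~ /./g;
    return join '', sort @uniq;
}

# Stand-in for the input file: two lines read from an in-memory filehandle.
my $data = "abcabc\nzzya+a\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
while (defined(my $line = <$fh>)) {
    chomp $line;    # done implicitly by -l in the one-liner
    print uniq_in_order($line), ' ', uniq_sorted($line), "\n";
}
close $fh;
```

The hash is declared inside each sub, so it starts empty for every line, which is what `my %seen` does on each pass of the `-n` loop.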

      ikegami,

      I tried the lexical order method, and it's almost what I need...

      On a mac, I get this output:

      FMMNllltuwwxxxz
      13AEITfhlllstxx
      /39CMMNhhlllllluuwwxxx
      8ELMMfhllllxxx
      49MMdhllllouxx
      2239MMlllquxxxx
      018AMMPPQTbbffhllllx
      /2559Mllrtuuwwxz

      If I change $seen{$_}++ to $seen{$_}+=2 then I get this:

      /034589BDFFHKMMMNNUXabcfghjllllnqsttuuwwwxxxxyzz
      /112334589AACEEGIIKLMNPQRSTTbdeffhhjkllllnopqssttuxxx
      //01335899ABCCIMMMNNRTVYbcdefghhhjkllllllloqtuuuwwwxxxxy
      /035889BEEFKLLMMMNOPQTabcffghhlllllmqrtuvwxxxx
      +/134456899ACGKMMMNOQWbddghhlllllmnooqrtuuwxxx
      /2223345899ABDEFGHIMMMNORTabdfhillllnqqtuuwxxxxxyz
      +/00113457889AACEIMMMNPPPQQTTZabbbcefffghhlllllmqtuwxx
      //2234555899ACGILMMNOQUbehlllmoqrrsttuuuwwwxxyzz
      which DOES list all the chars used, but has duplicates.
      *BRILLIANT* That bang did it. Sweet!
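The three behaviours seen in this sub-thread can be reproduced side by side. This is only a sketch with hypothetical sub names: without the `!`, `grep` keeps the second and later occurrences (the "duplicates only" output); with `+= 2` the test is always true; with `!` in front of the post-increment only the first occurrence passes.

```perl
use strict;
use warnings;

# $seen{$_}++ returns the old count (0 the first time), so the negation
# is true only for the first occurrence of each character.
sub first_occurrences {
    my ($s) = @_;
    my %seen;
    return join '', sort grep { !$seen{$_}++ } split //, $s;
}

# No "!": the test is true only from the second occurrence onwards.
sub repeats_only {
    my ($s) = @_;
    my %seen;
    return join '', sort grep { $seen{$_}++ } split //, $s;
}

# "+= 2" returns the new value, which is never false, so nothing is filtered.
sub everything {
    my ($s) = @_;
    my %seen;
    return join '', sort grep { $seen{$_} += 2 } split //, $s;
}

print join(' ', first_occurrences('llama'), repeats_only('llama'), everything('llama')), "\n";
```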
Re: Collapsing a string to unique characters
by Corion (Patriarch) on Jan 09, 2009 at 13:54 UTC

    If the order of the characters is of no concern, you can do it in one regex and the lookup hash:

    perl -wple "%seen=();s/(.)/$seen{$1}++?'':$1/ge"
      Holy moley. I need the order, but I'll save this for another use. Thanks!
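Spelled out as a sub, the substitution looks roughly like the sketch below (the sub name is made up). Note that it leaves the surviving characters in order of first appearance; it does not sort them.

```perl
use strict;
use warnings;

sub collapse_first_seen {
    my ($s) = @_;
    my %seen;
    # /g visits every character; /e evaluates the replacement as Perl code.
    # A character seen before is replaced by '', a new one by itself.
    $s =~ s/(.)/$seen{$1}++ ? '' : $1/ge;
    return $s;
}

print collapse_first_seen('banana'), "\n";
```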
Re: Collapsing a string to unique characters
by BrowserUk (Patriarch) on Jan 09, 2009 at 14:23 UTC

    Golf:55 (and no sort!)

    perl -ple"local($\",@_);@_[unpack'C*',$_]=split'';$_=qq[@_]" test-strings
    +/034589BDFHKMNUXabcfghjlnqstuwxyz
    /1234589ACEGIKLMNPQRSTbdefhjklnopqstux
    /013589ABCIMNRTVYbcdefghjkloqtuwxy
    /03589BEFKLMNOPQTabcfghlmqrtuvwx
    +/1345689ACGKMNOQWbdghlmnoqrtuwx
    /234589ABDEFGHIMNORTabdfhilnqtuwxyz
    +/01345789ACEIMNPQTZabcefghlmqtuwx
    /234589ACGILMNOQUbehlmoqrstuwxyz

    Unix version is one less: perl -ple'local($",@_);@_[unpack"C*",$_]=split"";$_=qq[@_]'

    Two less (thanks ikegami): perl -ple'local($",@_);@_[unpack"C*",$_]=split"";$_="@_"'

    Update: 54: perl -ple"local(@_);@_[unpack'C*',$_]=split'';$_=join'',@_" test-strings

    50: -ple"@_=();@_[unpack'C*',$_]=split'';$_=join'',@_"
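Ungolfed, the trick is to store each character in the array slot indexed by its own character code, so duplicates overwrite each other and reading the array back gives code point order without a sort. A minimal sketch, assuming single-byte characters as in the OP's 64-character alphabet; the sub name is hypothetical, and the `grep { defined }` replaces the golfed version's reliance on undef joining as an empty string.

```perl
use strict;
use warnings;

sub uniq_by_slot {
    my ($s) = @_;
    my @slot;
    # unpack 'C*' yields the character codes; the slice assignment puts
    # every character at the index equal to its own code.
    @slot[ unpack 'C*', $s ] = split //, $s;
    # Slots that were never assigned are undef; skip them.
    return join '', grep { defined } @slot;
}

print uniq_by_slot('hello+'), "\n";
```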


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Depending on platform and Perl version:
      Unix, pre 5.10: 42
      -nle'@_=();@_[unpack"C*",$_]=/./g;print@_'
      Windows, pre 5.10: 40
      -nle@_=();@_[unpack'C*',$_]=/./g;print@_
      Unix, 5.10+: 40
      -nlE'@_=();@_[unpack"C*",$_]=/./g;say@_'
      Windows, 5.10+: 38
      -nlE@_=();@_[unpack'C*',$_]=/./g;say@_

        38, resp. 36 when using say instead of print

        -nle@_[unpack'C*',$_]=/./g;@_=!print@_

        Update: And 36, resp. 34:

        -nle@_[ord]=$_,for/(.)/g;@_=!print@_

        And if you're using the -E+say, you can shave off one more by leaving off the -l, at 33 strokes:

        -nE@_[ord]=$_,for/(.)/g;@_=!say@_

        Update2: And incorporating BrowserUk and JavaFan's ideas:

        # 34 strokes on Windows, and also Unix if your shell doesn't treat "!" special
        -nE@_[map+ord,/./g]=/./g;@_=!say@_

        Minus 1 on all:

        Windows, pre 5.10: 39
        -nlelocal@_[map+ord,/./g]=/./g;print@_

        You can leave off the -l. You don't need it for the output, as you're using say. And you don't need it for the chomp, as /./ doesn't match the newline.
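The point about `/./` is easy to check in isolation. A small sketch (the sub name is made up): without the `/s` modifier, `.` does not match a newline, so a list-context `/./g` returns every character of the line except the trailing newline.

```perl
use strict;
use warnings;

# In list context, /./g returns every single-character match.
sub chars_of {
    my ($line) = @_;
    return $line =~ /./g;
}

my @chars = chars_of("ab+\n");    # note the unchomped newline
print scalar(@chars), " characters: @chars\n";
```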
      Be kind! It's still morning here, and I'm not done my tea!

      That's spectacular, but it hurts my brain...

Re: Collapsing a string to unique characters
by JavaFan (Canon) on Jan 09, 2009 at 15:07 UTC
    35 (Unix):
    -nE'@_=();@_[ord]=$_ for/./g;say@_'

      Anywhere, any version: 32

      -ple$_=join'',sort/./g;tr/!-~//s
        Anywhere
        Nope, it'll fail on an EBCDIC platform as the ~ is somewhere between the 'r' and the 's'.
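For reference, the sort-then-squeeze approach written out in full. This is a sketch with a hypothetical sub name, and it assumes ASCII input, so the EBCDIC caveat above applies to it as well: once the characters are sorted, duplicates sit next to each other, and `tr` with the `/s` modifier and an empty replacement list squeezes each run down to one character.

```perl
use strict;
use warnings;

sub sort_and_squeeze {
    my ($s) = @_;
    my @chars  = $s =~ /./g;
    my $sorted = join '', sort @chars;
    # Empty replacement list: the search list is reused, so nothing is
    # translated; /s collapses runs of the same character into one.
    $sorted =~ tr/!-~//s;
    return $sorted;
}

print sort_and_squeeze('mississippi'), "\n";
```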

      Applying my same tricks, you can replace the @_=();say@_ initialization by @_=!say@_, which is two chars shorter, yielding 31 on Windows and 33 on Unix:

      -nE@_[ord]=$_,for/./g;@_=!say@_
Re: Collapsing a string to unique characters
by jwkrahn (Abbot) on Jan 09, 2009 at 14:16 UTC
    $ echo "9b/lllqtUst48MMMxwBHz+wluFguNx5h3DnyKxfxFjNazwc0X
    9b/lllqtxTElC8GLsftS2RKkAxI1MfQTIuNx5h3P4eoEphA31djsn
    9b/lllqtyk/lC8MMMxwwwhcTAxlhNBl1ugoluNx5h3fjxud309RVeCIY
    9b/lllqtcxP8MMMxBvFfOh8lLxQTfguNx5h3LKrE0maElw
    9b/lllqtdx48MMMxw6h4+mol1ugoluNx5h3AKrdGQ9OCnW
    9b/lllqtf2l48MMMxDxaxxz29OAHIuNx5h3wG2TEyFqu3RdBin
    9b/lllqthPElC8MMMxw78QQ0bfaPlI14TfguNx5h30beTAmP1cfA+Z
    9b/lllqtryA8GL2m2s9OQrxzotuIuNx5h3MUw/45uzMeC5ww
    " | perl -pe'1 while s/(.)(?=.*\1)//'
    9b/qUst48MBH+lgu5h3DnyKfxFjNazwc0X
    9b/qlC8GLtS2RKkMfQTIuNx5P4eoEphA31djsn
    bqtyk/8MwcTAB1golN5hfjxud309RVeCIY
    9b/qtcPMBvFO8QTfguNx5h3LKr0maElw
    b/qt8Mw64+m1goluNx5h3AKrdGQ9OCnW
    b/tfl48MDaz9OAHINx5hwG2TEyFqu3RdBin
    9/qtECMw78QalI4guNx5h30beTmP1cfA+Z
    blqyA8GLm2s9OQrotINxh3U/4uzMeC5w
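What the lookahead does, as a small sketch (hypothetical sub name): each pass deletes one character that still occurs again later on the line, and the loop repeats until no such character is left. The result therefore keeps the last occurrence of every character, in the order of those last occurrences, which is why the output above is neither sorted nor in first-seen order.

```perl
use strict;
use warnings;

sub keep_last_occurrence {
    my ($s) = @_;
    # (.) captures one character; (?=.*\1) succeeds only if the same
    # character appears again further right. Delete it and try again.
    1 while $s =~ s/(.)(?=.*\1)//;
    return $s;
}

print keep_last_occurrence('banana'), "\n";
```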
Re: Collapsing a string to unique characters
by gone2015 (Deacon) on Jan 09, 2009 at 15:11 UTC

    I tried:

    sub ext_s {     # Returns characters in sorted order
      my ($s) = @_ ;
      my %h ;
      @h{split(//, $s)} = undef ;
      return join('', sort keys %h) ;
    } ;

    sub ext_o {     # Returns characters in original order
      my ($s) = @_ ;
      my @h ;
      return join('', grep { !$h[ord($_)]++ } split(//, $s)) ;
    } ;
    compared to:
    sub ext_dw {
      my ($s) = @_ ;
      my %h ;
      $h{$_} = undef foreach split(//, $s) ;
      return join('', sort keys %h) ;
    } ;

    sub ext_cn {
      my ($s) = @_ ;
      my %h ;
      $s =~ s/(.)/$h{$1}++?'':$1/ge ;
      return $s ;
    } ;
    and benchmarked:
          Rate   cn   dw    s    o
    cn 10101/s   -- -40% -54% -59%
    dw 16949/s  68%   -- -22% -31%
    s  21739/s 115%  28%   -- -11%
    o  24390/s 141%  44%  12%   --
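The harness itself was not posted. A comparison table of this shape is what the core Benchmark module's `cmpthese` prints, so a plausible reconstruction is sketched below; the one-CPU-second run time and the choice of the first sample string as input are assumptions, and the rates will differ by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The four candidates from the post above.
sub ext_s  { my ($s) = @_; my %h; @h{ split(//, $s) } = undef; return join('', sort keys %h) }
sub ext_o  { my ($s) = @_; my @h; return join('', grep { !$h[ord($_)]++ } split(//, $s)) }
sub ext_dw { my ($s) = @_; my %h; $h{$_} = undef foreach split(//, $s); return join('', sort keys %h) }
sub ext_cn { my ($s) = @_; my %h; $s =~ s/(.)/$h{$1}++?'':$1/ge; return $s }

# Assumed input: the first sample string from the original question.
my $str = '9b/lllqtUst48MMMxwBHz+wluFguNx5h3DnyKxfxFjNazwc0X';

# A negative count means "run each candidate for at least that many CPU seconds".
cmpthese(-1, {
    cn => sub { ext_cn($str) },
    dw => sub { ext_dw($str) },
    s  => sub { ext_s($str)  },
    o  => sub { ext_o($str)  },
});
```

Keep in mind that `s` and `dw` return sorted output while `o` and `cn` return first-seen order, so the table compares two slightly different jobs.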
    

      oshalla,

      too nice, you beat me to it. Considering I'm going to be doing this for 100 million strings, that's great to know.

Re: Collapsing a string to unique characters
by JavaFan (Canon) on Jan 10, 2009 at 10:32 UTC

      No good. Hash keys are unordered.

      C:\test>perl -ple$_=join'',sort/./g;y///cs test-strings
      +/034589BDFHKMNUXabcfghjlnqstuwxyz
      /1234589ACEGIKLMNPQRSTbdefhjklnopqstux
      /013589ABCIMNRTVYbcdefghjkloqtuwxy
      /03589BEFKLMNOPQTabcfghlmqrtuvwx
      +/1345689ACGKMNOQWbdghlmnoqrtuwx
      /234589ABDEFGHIMNORTabdfhilnqtuwxyz
      +/01345789ACEIMNPQTZabcefghlmqtuwx
      /234589ACGILMNOQUbehlmoqrstuwxyz

      C:\test>\Perl510\bin\perl5.10.0.exe -nE"@;{/./g}=();%;=!say%;" test-strings
      /aNKjyugtsBHcDqbzUwFxMh0fnX39+8l45
      S/TNKd2Eju1ktesqbIGxQhMCfLAn3P98lp4Ro5
      /TNdYjyu1kgteBcqbIwxVhM0CfA398lR5o
      /aTNKEugtvBcqbwFrxQhM0LfO3Pm98l5
      /NKdu1gtWqbGwrxQhMCA6nO3m9+8l45o
      /TaNdE2yutBHDqbIGzFwxhMfiAnO398l4R5
      /TaN7EZu1gtecqbIwxQMh0CfA3Pm9+8l45
      /N2yutesqbIGzUwrxQMhCLAO3m98l45o

        No good. Hash keys are unordered.
        So? The OP didn't make it a requirement the result was ordered:
        I have a number of strings, made up only of 64 characters: a-zA-Z0-9/+ . I need to collapse these down to just the unique characters in the string.

        Besides, Perlmonks has a long tradition of making small changes to the requirements for the sake of winning at golf. ;-)

      That is too nice, but you will have to explain how it works.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        -nE'@;{/./g}=();%;=!say%;'

        /./g is in list context, so it's a shorthand for /(.)/g and will hence return a list of all characters (without the newline).

        @;{...} is a slice of the hash %;. @;{/./g} = () sets all values in the slice to undef. The keys are the characters of the line.

        say %; prints the hash as key-value pairs. Since the values are undefined, they are printed as empty strings. So, in effect, it prints all the characters of the line, without duplicates.

        %;=!say%; then resets the hash. say will return true, so its negation will be the empty string. So it'll make %; have one element: the empty string as key, and the undefined value as value. This will be printed for the next line, but since they are both printed as empty strings, you won't actually see it.
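The same steps in longhand, as a sketch with made-up names: an ordinary hash `%h` replaces `%;`, and the keys are joined instead of printing the whole hash, which avoids printing the undefined values. The key order is whatever the hash yields, so the result is not sorted.

```perl
use strict;
use warnings;

sub uniq_via_hash_slice {
    my ($line) = @_;
    my %h;
    # Hash slice: one key per character of the line, every value undef.
    # Assigning the same key twice is harmless, which removes duplicates.
    @h{ $line =~ /./g } = ();
    return join '', keys %h;
}

# The reset trick relies on the negation of a true value being the empty string.
my $negated = !1;
print "negated true value has length ", length($negated), "\n";

print join('', sort split //, uniq_via_hash_slice('hello')), "\n";
```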