jzelkowsz has asked for the wisdom of the Perl Monks concerning the following question:

I need to preserve all alphabetic elements of the array but the duplicate numeric elements must be removed. The original array is 45,000+ elements. I am trying to end up with a result like the below (yes, the pipe is required):

20055111|YOUSLAV,YURT,TENWIMPL
20011271|YOUSLAV,WUMARTHE
20011541|YOUSLAV,TENWIMPL
20102741|WEDLOFOU,YOUSLAV,YURT,KUPLYSO,TENWIMPL
20155505|YOUSLAV,YURT,TENWIMPL
20147155|YOUSLAV,KUPLYSO,FRIMA

The original data looked like this:
20055111,YOUSLAV,
20055111,YURT,
20055111,TENWIMPL,
20011271,YOUSLAV,
20011271,WUMARTHE
20011541,YOUSLAV,
20011541,TENWIMPL,
20102741,WEDLOFOU,
20102741,YOUSLAV,
20102741,YURT,
20102741,KUPLYSO,
20102741,TENWIMPL,
20155505,YOUSLAV,
20155505,YURT,
20155505,TENWIMPL,
20147155,YOUSLAV,
20147155,KUPLYSO,
20147155,FRIMA,

I have tried the below but (unfortunately) it removes ALL duplicate elements. I am trying to preserve the alphabetic elements.

sub uniq {
    my %seen;
    grep !$seen{$_}++, @_;
}

my @cert = qw(
    20055111 YOUSLAV 20055111 YURT 20055111 TENWIMPL
    20011271 YOUSLAV 20011271 WUMARTHE
    20011541 YOUSLAV 20011541 TENWIMPL
    20102741 WEDLOFOU 20102741 YOUSLAV 20102741 YURT 20102741 KUPLYSO 20102741 TENWIMPL
    20155505 YOUSLAV 20155505 YURT 20155505 TENWIMPL
    20147155 YOUSLAV 20147155 KUPLYSO 20147155 FRIMA
);

my @filtered = uniq(@cert);
print "@filtered\n";

Below is a sample of the file I am trying to work with. I replace all the commas with spaces in my array:

20055111,YOUSLAV, 20055111,YURT, 20055111,TENWIMPL, 20011271,YOUSLAV, 20011271,WUMARTHE, 20011541,YOUSLAV, 20011541,TENWIMPL, 20102741,WEDLOFOU, 20102741,YOUSLAV, 20102741,YURT, 20102741,KUPLYSO, 20102741,TENWIMPL, 20155505,YOUSLAV, 20155505,YURT, 20155505,TENWIMPL, 20147155,YOUSLAV, 20147155,KUPLYSO, 20147155,FRIMA, 20172145,TENWIMPL, 20172175,TENWIMPL, 20175511,FRIMA, 20174117,TENWIMPL, 20175410,TENWIMPL, 20175554,YOUSAID, 20202011,FRIMATEC, 20214475,CIPWOMAT, 20271275,YOUSLAV, 20271275,YURT, 20271275,TENWIMPL, 20217175,YURT, 20217175,KUPLYSO, 20217175,TENWIMPL, 20217177,WEDLOFOU, 20217177,YOUSLAV, 20217177,YURT, 20217177,YURTRN, 20217177,YURTRN, 20217177,TENWIMPL, 20217177,WEDLOFOU, 20217177,YOUSLAV, 20217177,KUPLYSO, 20217177,TENWIMPL, 20217171,YOUSLAV, 20217171,YURT, 20217171,TENWIMPL, 20217171,YOUSLAV, 20217171,YURT, 20217171,TENWIMPL, 20217110,WEDLOFOU, 20217110,YOUSLAV, 20217110,KUPLYSO, 20217110,TENWIMPL, 20217112,YOUSLAV, 20217112,YOUTESSNO, 20217112,YOUTESSNO, 20217507,YOUSLAV, 20217501,WEDLOFOU, 20217501,YOUSLAV, 20217501,TENWIMPL, 20217512,TENWIMPL, 20217517,YOUSLAV, 20217517,FRIMA, 20217517,YOUSLAV, 20217517,YURT, 20217517,TENWIMPL, 20217511,YOUSLAV, 20217511,SYMKIR, 20217511,TENWIMPL, 20217520,WEDLOFOU, 20217520,YOUSLAV, 20217520,TENWIMPL, 20217521,YOUSLAV, 20217521,TENWIMPL, 20217522,WEDLOFOU, 20217522,YOUSLAV, 20217522,CIPWOMAT, 20217522,TENTMIR, 20217522,TENTMIR, 20217555,YOUSLAV, 20217555,YURT, 20217555,TENWIMPL, 20217557,CODNGSPC, 20217774,YOUSLAV, 20217774,KUPLYSO,


Replies are listed 'Best First'.
Re: How do I remove duplicate numeric elements of an array and preserve alphabetic elements?
by hippo (Archbishop) on Jun 04, 2018 at 14:32 UTC
Re: How do I remove duplicate numeric elements of an array and preserve alphabetic elements?
by BrowserUk (Patriarch) on Jun 04, 2018 at 15:26 UTC

    Try this:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    my( @ordered, %grouped );
    while( <DATA> ) {
        chomp;
        my @pair = split ',', $_;
        $ordered[ @ordered ] = $pair[ 0 ] unless exists $grouped{ $pair[ 0 ] };
        push @{ $grouped{ $pair[0] } }, $pair[1];
    }
    #pp \@ordered, \%grouped;

    print "$_|", join ',', @{ $grouped{ $_ } } for @ordered;

    __DATA__
    20055111,YOUSLAV,
    20055111,YURT,
    20055111,TENWIMPL,
    20011271,YOUSLAV,
    20011271,WUMARTHE
    20011541,YOUSLAV,
    20011541,TENWIMPL,
    20102741,WEDLOFOU,
    20102741,YOUSLAV,
    20102741,YURT,
    20102741,KUPLYSO,
    20102741,TENWIMPL,
    20155505,YOUSLAV,
    20155505,YURT,
    20155505,TENWIMPL,
    20147155,YOUSLAV,
    20147155,KUPLYSO,
    20147155,FRIMA,

    Output:

    C:\test>1215831.pl
    20055111|YOUSLAV,YURT,TENWIMPL
    20011271|YOUSLAV,WUMARTHE
    20011541|YOUSLAV,TENWIMPL
    20102741|WEDLOFOU,YOUSLAV,YURT,KUPLYSO,TENWIMPL
    20155505|YOUSLAV,YURT,TENWIMPL
    20147155|YOUSLAV,KUPLYSO,FRIMA

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
      I like your code++.

      A couple of minor nits:
      I personally try to avoid subscripts, preferring to assign the results of a split to named variables.
      In this case, I'm not sure what the number (or the name) represents; I'm sure the OP knows a better description than we do.
      I would use a push instead of an array assignment to @ordered, just because it seems more natural to me.
      This is minor stuff - no problem at all with your code.

      Update: I guess I don't know what is supposed to happen if say YOUSLAV appeared twice for 20055111 or whether that is even possible to occur. If that is possible, the OP should clarify.

      #! perl -slw
      use strict;
      use Data::Dump qw[ pp ];

      my( @ordered, %grouped );
      while( <DATA> ) {
          chomp;
          my ($number, $name) = split ',', $_;
          push (@ordered, $number ) unless exists $grouped{$number };
          push @{ $grouped{ $number } }, $name;
      }
      #pp \@ordered, \%grouped;

      print "$_|", join ',', @{ $grouped{ $_ } } for @ordered;

      =prints
      20055111|YOUSLAV,YURT,TENWIMPL
      20011271|YOUSLAV,WUMARTHE
      20011541|YOUSLAV,TENWIMPL
      20102741|WEDLOFOU,YOUSLAV,YURT,KUPLYSO,TENWIMPL
      20155505|YOUSLAV,YURT,TENWIMPL
      20147155|YOUSLAV,KUPLYSO,FRIMA
      =cut

      __DATA__
      20055111,YOUSLAV,
      20055111,YURT,
      20055111,TENWIMPL,
      20011271,YOUSLAV,
      20011271,WUMARTHE
      20011541,YOUSLAV,
      20011541,TENWIMPL,
      20102741,WEDLOFOU,
      20102741,YOUSLAV,
      20102741,YURT,
      20102741,KUPLYSO,
      20102741,TENWIMPL,
      20155505,YOUSLAV,
      20155505,YURT,
      20155505,TENWIMPL,
      20147155,YOUSLAV,
      20147155,KUPLYSO,
      20147155,FRIMA,

        You said: "I guess I don't know what is supposed to happen if say YOUSLAV appeared twice for 20055111 or whether that is even possible to occur". It's not possible for the term to appear twice with the same number. His solution appears to work very well!
      I'm very pleased to say your solution is working. I commented out the chomp statement and added two file-handling statements, and now it's doing exactly what I need. I had to install the "Data::Dump" module. Thank you, this is great work and very slick!
Re: How do I remove duplicate numeric elements of an array and preserve alphabetic elements? -- oneliner
by Discipulus (Canon) on Jun 04, 2018 at 15:52 UTC
    Hello jzelkowsz and welcome to the monastery and to the wonderful world of Perl!

    Since you have already received useful answers, I'll propose a short version (pay attention to the Windows double quotes!):

    perl -F"," -lane "push @{$h{$F[0]}},$F[1]}{print map{$_.'|'.(join',',@{$h{$_}}).qq(\n)}keys %h" data.txt

    20155505|YOUSLAV,YURT,TENWIMPL

    20102741|WEDLOFOU,YOUSLAV,YURT,KUPLYSO,TENWIMPL

    20011541|YOUSLAV,TENWIMPL

    20011271|YOUSLAV,WUMARTHE

    20147155|YOUSLAV,KUPLYSO,FRIMA

    20055111|YOUSLAV,YURT,TENWIMPL

    See perlrun for an explanation of all these perl switches, and use -MO=Deparse to see the one-liner expanded into a more readable form (the curly braces in $F[1]}{print are a trick known as the Eskimo greeting):

    perl -MO=Deparse -F"," -lane "push @{$h{$F[0]}},$F[1]}{print map{$_.'|'.(join',',@{$h{$_}}).qq(\n)}keys %h"

    BEGIN { $/ = "\n"; $\ = "\n"; }
    LINE: while (defined($_ = <ARGV>)) {
        chomp $_;
        our(@F) = split(/,/, $_, 0);
        push @{$h{$F[0]};}, $F[1];
    }
    {
        print map({$_ . '|' . join(',', @{$h{$_};}) . "\n";} keys %h);
    }
    -e syntax OK
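
    Note that keys %h returns the numbers in no particular order, which is why the one-liner's output is ordered differently from the input. If the input order matters, one option is to record each number the first time it is seen. What follows is only a minimal sketch written as a short script rather than a one-liner; the group_lines helper and the @order array are illustrative names, not part of the one-liner above.

```perl
use strict;
use warnings;

# Group "number,name," lines by number, remembering first-seen order.
# group_lines and @order are illustrative names for this sketch.
sub group_lines {
    my @lines = @_;
    my ( %h, @order );
    for my $line (@lines) {
        chomp $line;
        my ( $number, $name ) = split /,/, $line;
        next unless defined $number && defined $name;    # skip blank or short lines
        push @order, $number unless exists $h{$number};  # first sighting fixes the order
        push @{ $h{$number} }, $name;
    }
    return map { $_ . '|' . join( ',', @{ $h{$_} } ) } @order;
}

print "$_\n" for group_lines(
    "20055111,YOUSLAV,\n", "20055111,YURT,\n", "20011271,YOUSLAV,\n",
);
```

    The extra array costs one push per distinct number, which should be negligible for a file of 45,000+ lines.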

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      perl -F"," -lane "push @{$h{$F[0]}},$F[1]}{print map{$_.'|'.(join',',@{$h{$_}}).qq(\n)}keys %h" data.txt

      This definitely works too.

      June 8th 2018 Discipulus added code tags

Re: How do I remove duplicate numeric elements of an array and preserve alphabetic elements?
by LanX (Saint) on Jun 04, 2018 at 14:55 UTC
    This puzzles me,

    > preserve all alphabetic elements of the array but the duplicate numeric elements must be removed

    but I think what you want is to parse the data pairwise ($number, $name) and have a unique list of names per number.

    In this case I'd suggest building a hash of hashes (if original order doesn't matter). Just set

    $names_per_num{$number}{$name} = 1

    for each combination.

    After that you'll just need to iterate over all numbers and print the keys of the sub-hash to get your desired output.

    No code yet, we'd love to help you improve your attempts! :)

    HTH!

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    PS: and if original order matters just use the above HoH as a %seen filter while iterating the list.
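
    The hash-of-hashes idea, including the %seen variant from the PS, could be sketched roughly as follows. This is only an illustration under the assumption that the data is already in a flat list of number, name, number, name, ... (as in the OP's @cert array); the unique_names_per_number helper and its variable names are made up for this example.

```perl
use strict;
use warnings;

# unique_names_per_number is an illustrative name for this sketch.
# Takes a flat list (number, name, number, name, ...) and returns
# "number|name,name" strings in first-seen order, dropping any
# (number, name) pair that has already been seen.
sub unique_names_per_number {
    my @flat = @_;
    my ( %seen, %names, @order );
    while ( my ( $number, $name ) = splice @flat, 0, 2 ) {
        last unless defined $name;              # ignore a dangling number at the end
        next if $seen{$number}{$name}++;        # the HoH acts as the %seen filter
        push @order, $number unless exists $names{$number};
        push @{ $names{$number} }, $name;
    }
    return map { $_ . '|' . join( ',', @{ $names{$_} } ) } @order;
}

my @cert = qw(
    20055111 YOUSLAV 20055111 YURT 20055111 YURT
    20011271 YOUSLAV
);
print "$_\n" for unique_names_per_number(@cert);
```

    Here the repeated 20055111/YURT pair is printed only once, while YOUSLAV is kept under both numbers because the filter is keyed on the pair rather than on the name alone.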

      Rolf: Thank you very much for your reply! I will read the link you mentioned. I have found Perl Monks to be very helpful!
Re: How do I remove duplicate numeric elements of an array and preserve alphabetic elements?
by BillKSmith (Monsignor) on Jun 04, 2018 at 15:31 UTC
    Do not think of your problem as removing numeric data. Look at the problem as one of combining all the alpha data that belongs to the same numeric 'key'. In this view, store the data as a hash-of-arrays with the numbers as the keys.
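    use strict;
    use warnings;
    use autodie;    # the pragma name is lowercase

    open my $FH, '<', 'jzelkowsz.dat';
    my %data;
    # Read records separated by ', ' (comma plus space) rather than by newlines.
    while (my $pair = do{ $/ = ', ';<$FH>}) {
        my ($numeric, $alpha) = split qr/,/, $pair;
        push @{$data{$numeric}}, $alpha;    # hash of arrays keyed by number
    }
    foreach my $num (sort keys %data) {
        $" = ',';     # join interpolated array elements with commas
        $\ = "\n";    # append a newline to every print
        print "$num|@{$data{$num}}";
    }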
    Bill
      Thank you, Bill. I appreciate a different way of looking at the problem.