donkost has asked for the wisdom of the Perl Monks concerning the following question:

I receive a text file that contains multiple copies of data strings. I need to be able to capture each unique string and assign it to a variable. Here is an example of the file:
NCLEGEND-11-20
NCLEGEND-11-20
NCLEGEND-1-10
NCLEGEND-1-10
NCLEGEND-1-10
NCLEGEND-1-20
NCLEGEND-1-20
.......
Each entry can be duplicated an unknown number of times. Also, the numbers can be any combination; the only thing that is consistent is the text NCLEGEND and the dashes. So what I need to do is capture each unique entry and assign it to a variable. For example:
$nc1 = NCLEGEND-1-10
$nc2 = NCLEGEND-11-20
$nc3 = NCLEGEND-1-20
...
Any advice on how to do this would be greatly appreciated. Thanks,
don...

Replies are listed 'Best First'.
Re: Capturing Unique Data
by kennethk (Abbot) on Aug 18, 2009 at 17:21 UTC
    This is a FAQ. For multiple good ideas on how to do this (including code) once you've loaded your data into an array, see How can I remove duplicate elements from a list or array?

    In your post, you specifically request "I need to be able to capture each unique string and assign it to a variable." Is there a specific and compelling reason to use an initially unknown number of scalars as opposed to an array or hash? You could do it with Symbolic references, but those will generally cause more problems than they solve. In fact, using them in a case like this is generally considered a classic example of poor form.
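    For instance, here is a minimal sketch of the array-plus-hash approach (the file name legend.txt and handle $fh are just placeholders for however you read your data):

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $fh, '<', 'legend.txt' or die "Cannot open legend.txt: $!";

    my %seen;       # entries already encountered
    my @unique;     # unique entries, in order of first appearance
    while (my $line = <$fh>) {
        chomp $line;
        push @unique, $line unless $seen{$line}++;
    }
    close $fh;

    # $unique[0], $unique[1], ... now play the role of $nc1, $nc2, ...
    print "$_\n" for @unique;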

      Thank you. An array would be easier. I have never used a hash, so I'll have to look that up. d...
        If you'll spend any significant amount of time working with Perl, you'll definitely want to learn how to effectively use hashes - Perl's hash implementation is one of its greatest features. For some introductory material, see Perl variable types.
        Scalar, hash, and array: if you're going to use Perl, these are the bare minimum. As consolation, note that Perl lacks the usual strong-typing demons, so building any complex data structure is relatively hassle-free.
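        As a small illustration of that last point (the data and variable names below are made up), a nested structure simply springs into existence the first time you store into it, with no declarations beyond the top-level hash:

        use strict;
        use warnings;

        my %count_by_prefix;
        for my $entry (qw( NCLEGEND-1-10 NCLEGEND-1-10 NCLEGEND-11-20 )) {
            my ($prefix, $major, $minor) = split /-/, $entry;
            $count_by_prefix{$prefix}{"$major-$minor"}++;   # inner hash autovivifies
        }
        # %count_by_prefix now holds ( NCLEGEND => { '1-10' => 2, '11-20' => 1 } )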
Re: Capturing Unique Data
by ikegami (Patriarch) on Aug 18, 2009 at 17:56 UTC
    my @nc;
    my %seen;
    while (<$fh>) {
        chomp;
        push @nc, $_ if !$seen{$_}++;
    }

    Or if your input is guaranteed to be sorted,

    my @nc;
    my $last;
    while (<$fh>) {
        chomp;
        push @nc, $last = $_
            if !defined($last) || $last ne $_;
    }
      Thanks! This worked perfectly!
Re: Capturing Unique Data
by bichonfrise74 (Vicar) on Aug 19, 2009 at 01:22 UTC
    This might get you started.
    #!/usr/bin/perl
    use strict;

    my %record;
    $record{$_}++ while (<DATA>);
    print sort keys %record;

    __DATA__
    NCLEGEND-11-20
    NCLEGEND-11-20
    NCLEGEND-1-10
    NCLEGEND-1-10
    NCLEGEND-1-10
    NCLEGEND-1-20
    NCLEGEND-1-20
Re: Capturing Unique Data
by Marshall (Canon) on Aug 18, 2009 at 19:00 UTC
    I took a guess here since these 11-20 and 1-10 things look suspiciously like dates. Now maybe these are chapter numbers or something like that? I'm not sure.

    One point is that the normal alpha-numeric sort works if you have leading zeroes. Otherwise, the sort order will not be numeric. So I just added a leading zero for the single digits; this is what you would need to sort chapters or dates easily.

    Then I used a hash to count the number of occurrences of each digit combo, sorted by that number combo, and printed the result.

    #!/usr/bin/perl -w
    use strict;

    my %date_hash;
    my @data = qw( NCLEGEND-11-20 NCLEGEND-11-20 NCLEGEND-11-2
                   NCLEGEND-1-10  NCLEGEND-1-10  NCLEGEND-1-10
                   NCLEGEND-1-20  NCLEGEND-1-20 );

    foreach my $line (@data)
    {
        chomp ($line);              # needed if @data is a file handle
        $line =~ s/-(\d)-/-0$1-/;   # add leading zero for month
        $line =~ s/-(\d)$/-0$1/;    # add leading zero for date
        my $date = ($line =~ m/NCLEGEND\-([\d-]+)/)[0];   # get the "num" part
        $date_hash{$date}++;
    }

    foreach my $date (sort keys %date_hash)
    {
        print "$date $date_hash{$date}\n";
        # just print "$date\n"; if no need for counter value
    }

    __END__
    Prints:
    01-10 3
    01-20 2
    11-02 1
    11-20 2
    Update: This is such an important part of being able to easily sort reports by date that some "amplification" is justified: "2009-08-05" is FAR superior to just "2009-8-5", because the natural alpha sort order will do the right thing for the longer string with leading zeroes. For times, the same thing goes: 01:25 is FAR better than 01:25 AM, and if you mean 01:25 PM, use 24-hour time, 13:25. "2009-08-05 13:25" can be sorted against "2008-08-15 01:25" with just the basic sort in Perl.
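    For example (the sample timestamps here are invented), a plain string sort is all it takes once everything is zero-padded and ordered year-month-day:

    use strict;
    use warnings;

    my @stamps = ('2009-08-05 13:25', '2008-08-15 01:25', '2009-08-05 01:25');
    print "$_\n" for sort @stamps;   # default string sort yields chronological order

    # Prints:
    # 2008-08-15 01:25
    # 2009-08-05 01:25
    # 2009-08-05 13:25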
Re: Capturing Unique Data
by sanku (Beadle) on Aug 25, 2009 at 04:56 UTC
    use strict;
    use warnings;

    open(my $fh, '<', 'vv.txt') or die $!;
    my @file111 = <$fh>;
    close $fh;
    chomp @file111;   # so the printed values don't carry trailing newlines

    my %ss;
    my @unique = grep { !$ss{$_}++ } @file111;

    foreach my $i (0 .. $#unique) {
        my $nc = "nc" . ($i + 1);
        print "$nc = $unique[$i]\n";
    }