afasch01 has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract unique values from one file and create a new file with these values only.
2 Questions,
one...How do I specify unique values? The script below pulls out the correct info, but does not remove duplicate values. My search pattern looks for data that is preceded by a tab, then a "C", then one or more "M", then any number of digits, and returns only the digits. Some digits are duplicated; I need only unique numbers.

and two...I can get the file renamed, but I cannot get only the unique data into the file. All the data is moved into the file; I only need the selected data.

Here is the script (be kind, I am just learning Perl):

$file = "project.txt";
open(FILE,"$file");
open(OUTPUT,">> project.out");
while(my $a=<FILE>) {
    if($a=~/\tCM+(\d*)/io) {
        print "$1\n";
        print OUTPUT $a;
    }
}
close FILE;
close OUTPUT;

update (broquaint): added formatting

Re: creating a new file with unique values
by Gilimanjaro (Hermit) on Jan 20, 2003 at 17:42 UTC
    Assuming the order of the occurrences doesn't matter, the easiest way would be to store the digits as keys in a hash. Keys are always unique. You could use any value as your hash value:

    $file = "project.txt";
    open(FILE,"$file");
    my %hash = ();
    while(my $a=<FILE>) {
        if($a=~/\tCM+(\d*)/io) {
            print "$1\n";
            $hash{$1}=undef;
        }
    }
    close FILE;

    open(OUTPUT,">> project.out");
    # List assignment keeps the loop going even for a key of "0" or ""
    while(my ($key) = each %hash) {
        print OUTPUT "$key\n";
    }
    close OUTPUT;
    Or how I would probably write it:

    my %done=();
    open INPUT, "<$inputfile";
    open OUTPUT, ">>$outputfile";
    while(<INPUT>) {
        next unless /\tCM+(\d+)/io;
        next if exists $done{$1};
        print OUTPUT "$1\n";   # no comma after the filehandle
        $done{$1}=undef;
    }
    close INPUT;
    close OUTPUT;

    Haven't tested it, but it should work and be fast. When <HANDLE> is the only thing in a while condition, the line read is stored in the default variable ($_), and a regular expression with no explicit target matches against $_ as well. This way you can bypass the use of $a.
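
The "hash as a seen-list" idea above can be sketched on its own, away from the file handling. This is only an illustration: the helper name unique_ids and the sample lines are made up, and it uses \d+ so that an empty capture is never stored. Unlike printing the keys of a hash afterwards, it keeps the values in the order they first appear.

```perl
use strict;
use warnings;

# Hypothetical helper: return the digit strings captured after a tab and
# "CM...", keeping only the first occurrence of each, in input order.
sub unique_ids {
    my @lines = @_;
    my %seen;
    my @ids;
    for my $line (@lines) {
        next unless $line =~ /\tCM+(\d+)/i;
        # $seen{$1}++ is false the first time a value shows up, true afterwards
        push @ids, $1 unless $seen{$1}++;
    }
    return @ids;
}

# Made-up sample lines in the shape the question describes.
my @sample = ("x\tCM100 a\n", "y\tCMM200 b\n", "z\tcm100 c\n", "no match\n");
print "$_\n" for unique_ids(@sample);   # prints 100, then 200
```

The post-increment does the membership test and the bookkeeping in one step, which is the same trick the one-liners in this thread rely on.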

Re: creating a new file with unique values
by BrowserUk (Patriarch) on Jan 20, 2003 at 17:55 UTC

    perl -nle "print $1 if /^CM(\d+)/i and ++$h{$1} ==1; " numbers >unique

    (Adjust quotes to system reqs.)


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: creating a new file with unique values
by hardburn (Abbot) on Jan 20, 2003 at 17:35 UTC

    First, please use <CODE> tags around your code. Second, always use strict and warnings.

    Now for your problem. If you have the memory, put your data into a hash. Don't print it to the output file until you've read all the data from the input file.

    my $file = "project.txt";

    # Use three-argument form of open() (available in perl 5.6.0)
    # and check the return value.
    open(FILE, '<', $project) or die "Can't open $project: $!\n";

    my %input;
    while(my $line = <FILE>) {
        chomp $line; # Get rid of whitespace (newline) at the end of the string
        if($line =~ /\tCM+(\d*)/io) {
            $input{$1}++;
        }
    }
    close(FILE);

    After the above code runs, %input will contain the digits as keys, with the value being the number of times that key shows up in the input file. Printing to the output file is even easier. Just check whether the value in the %input hash is greater than 1 before printing. Note that this skips any number that appears more than once, so only numbers seen exactly once are written:

    open(OUT, '>>', 'project.out') or die "Can't open project.out for writing: $!\n";
    foreach my $i (keys %input) {
        print OUT "$i\n" unless $input{$i} > 1;
    }
    close OUT;
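
Whether "unique" means "seen exactly once" or "each distinct value listed once" changes what ends up in the output file. The following is a minimal sketch of the difference, assuming a made-up list of captured values; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Made-up captured values, in the order they might appear in the input.
my @values = qw(100 200 100 300);

# Count how often each value occurs.
my %count;
$count{$_}++ for @values;

# Values that occur exactly once (what a "skip if count > 1" test keeps).
my @singletons = sort { $a <=> $b } grep { $count{$_} == 1 } keys %count;

# Every distinct value, with duplicates collapsed to a single entry.
my @distinct = sort { $a <=> $b } keys %count;

print "singletons: @singletons\n";   # singletons: 200 300
print "distinct:   @distinct\n";     # distinct:   100 200 300
```

If the goal is the second list, the count test can simply be dropped and every key printed.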
      Correction
      open(FILE, '<', $file) or die "Can't open $file: $!\n";
      poj
      Why would you want to delay writing the file? If no hash-entry exists yet, you know it can be written anyway...

      Also, using foreach/keys to loop through a hash is very inefficient, especially with big hashes; perl has to traverse the entire hash to collect all the keys, and when obtaining the value in the loop body, it has to look up the key in the hash again.

      The preferred method would be to use a while/each loop. Your code would then look like:

      while(my ($i,$count) = each %input) { print OUT "$i\n" unless $count>1; }

      Using the while/each construct would also be a lot cleaner if the hash happened to be something like a tied database query result hash, if said hash supported database row cursors... But that actually has nothing to do with the topic... :)

      Happy coding, G.

Re: creating a new file with unique values
by jmcnamara (Monsignor) on Jan 20, 2003 at 18:08 UTC

    Here is a one-liner that should do it. The command line options are explained in perlrun:
    perl -lne '/\tCM+(\d+)/; print $1 if defined $1 and not $seen{$1}++' file1 > file2

    --
    John.

Re: creating a new file with unique values
by afasch01 (Initiate) on Jan 23, 2003 at 22:30 UTC
    Thank you all for your help! It's not very pretty, it could use some work, but...it actually DOES work! Here is my final if anyone is interested, and thanks again!

    my %hash=();
    open INPUT, "<project.txt";
    open OUTPUT, ">>project.out";
    while(<INPUT>) {
        next unless /PW\#/io;
        next if exists $hash{$'};
        print OUTPUT "$'";
        $hash{$'}=undef;
    }
    close INPUT;
    close OUTPUT;

    afasch01
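
For anyone adapting the final script, here is one possible variation, offered as a sketch rather than a replacement. It captures the text after "PW#" with parentheses instead of reading $', uses lexical filehandles, and checks that open succeeded. The helper name copy_unique_after_marker and the sample data are made up, and in-memory filehandles stand in for project.txt and project.out so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the text following "PW#" on each matching line
# from $in to $out, writing each distinct value only once.
sub copy_unique_after_marker {
    my ($in, $out) = @_;
    my %seen;
    while (my $line = <$in>) {
        # $1 is the rest of the line after the marker, without the newline
        next unless $line =~ /PW#(.*)/i;
        print {$out} "$1\n" unless $seen{$1}++;
    }
    return;
}

# Demonstration on in-memory filehandles with made-up data.
my $input  = "a PW#123\nb PW#456\nc pw#123\nno marker\n";
my $output = '';
open(my $in,  '<', \$input)  or die "Can't read in-memory input: $!\n";
open(my $out, '>', \$output) or die "Can't write in-memory output: $!\n";
copy_unique_after_marker($in, $out);
close $in;
close $out;
print $output;   # prints 123, then 456
```

With real files, the two open calls would take 'project.txt' and 'project.out' in place of the scalar references.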