Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This is probably an easy question, but I have two files: file1 with about 490,000 numbers and file2 with about 24,000. All 24,000 numbers in file2 also exist in file1. I want to take these out of file1, which would leave about 466,000 in file1. How would I do this? Here is what I've tried.

#!/usr/bin/env perl
use strict;

open(FILE1, "num_consol.orig");
open(FILE2, "num_status_43.out");
open(OUT, ">>new_num.lst");

for $file1 (<FILE1>){
    chomp $file1;
    open(FILE2, "num_status_43.out");
    for file2 (<FILE2>){
        chomp $file2;
        if ($file1 ne '$file2'){
            $cnt++;
            print OUT "$file1\n";
        }
    }
}
close OUT;

Replies are listed 'Best First'.
Re: compare two files
by monarch (Priest) on Jul 24, 2007 at 13:18 UTC
    This is my guess of how to implement my interpretation of your query:
    use strict;

    # get parameters from command line
    my $fname1   = shift;
    my $fname2   = shift;
    my $fnameout = shift;

    # read in all numbers from file 2 into a hash
    open( FIN, "<$fname2" ) or die( "Cannot open $fname2: $!" );
    my %exclude_num = ();
    while ( defined( my $line = <FIN> ) ) {
        # remove trailing newlines
        $line =~ s/[\r\n]+\z//s;

        # store number in hash
        $exclude_num{$line} = 1;
    }
    close( FIN );

    # read in numbers from file 1, skipping excluded numbers
    open( FIN, "<$fname1" ) or die( "Cannot open $fname1: $!" );
    open( FOUT, ">$fnameout" ) or die( "Cannot create $fnameout: $!" );
    while ( defined( my $line = <FIN> ) ) {
        # remove trailing newlines
        $line =~ s/[\r\n]+\z//s;

        # skip excluded numbers
        next if ( $exclude_num{$line} );

        print( FOUT "$line\n" );
    }
    close( FOUT );
    close( FIN );
Re: compare two files
by wojtyk (Friar) on Jul 24, 2007 at 13:48 UTC
    If you're trying to reduce file1 to the set difference of the two files (file1 minus file2), you could use this to do it in place, without creating a temp file:
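    use Tie::File;

    my %seen;
    tie my @file1, 'Tie::File', 'file1' or die;
    tie my @file2, 'Tie::File', 'file2' or die;

    # Tie::File strips the record separator by default (autochomp),
    # so no chomp is needed here
    foreach (@file2) {
        $seen{$_}++;
    }

    # keep only the lines of file1 that do not appear in file2
    @file1 = grep { !$seen{$_} } @file1;

    untie(@file1);
    untie(@file2);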

      Once you have both files as lists, you can use List::MoreUtils to pick out the unique values:

      use Tie::File;
      use List::MoreUtils qw(uniq);

      tie my @file1, 'Tie::File', 'file1' or die;
      tie my @file2, 'Tie::File', 'file2' or die;

      # prints every distinct line from the two files exactly once
      print join "\n", uniq(@file1, @file2);

      citromatik

Re: compare two files
by dsheroh (Monsignor) on Jul 24, 2007 at 14:46 UTC
    If the files are sorted, your quickest option will probably be to take the first line from each file, print the line from file1 if they're different or advance a line in file2 if they're the same, and then advance a line in file1. This is also very space-efficient, since you only need to have one line from each file in memory at a time.

    A basic implementation of this would be:
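
    What follows is only a minimal sketch of the merge described above, not the poster's original code. It assumes both files are sorted in the same order, hold one value per line, and satisfy the conditions noted just below. The helper name merge_exclude and the script name merge_exclude.pl are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sketch only. Assumptions: both inputs are sorted in the same order,
# contain no duplicates, and every value in the second input also
# occurs in the first. merge_exclude() is a hypothetical helper name.
sub merge_exclude {
    my ($in1, $in2, $out) = @_;    # three already-open filehandles

    # Current line from file2, i.e. the next value to exclude.
    my $skip = <$in2>;
    chomp $skip if defined $skip;

    while (defined(my $line = <$in1>)) {
        chomp $line;
        if (defined $skip && $line eq $skip) {
            # Same value in both files: drop it and advance file2.
            $skip = <$in2>;
            chomp $skip if defined $skip;
        }
        else {
            # Different value (or file2 exhausted): keep the file1 line.
            print {$out} "$line\n";
        }
    }
    return;
}

# Run as: perl merge_exclude.pl sorted_file1 sorted_file2 output_file
if (@ARGV == 3) {
    my ($fname1, $fname2, $fnameout) = @ARGV;
    open(my $in1, '<', $fname1)   or die "Cannot open $fname1: $!";
    open(my $in2, '<', $fname2)   or die "Cannot open $fname2: $!";
    open(my $out, '>', $fnameout) or die "Cannot create $fnameout: $!";
    merge_exclude($in1, $in2, $out);
    close($out) or die "Cannot close $fnameout: $!";
}
```

    Only one line from each file is held in memory at a time, which is where the space efficiency comes from.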

    Note that this implementation assumes that there are no values in file2 which are not also present in file1 and that neither file contains any duplicates.

    If the files are not sorted (and you're not going to be using them repeatedly), then a hash-based solution such as others have proposed would probably be faster than sorting them and using this method.

Re: compare two files
by citromatik (Curate) on Jul 24, 2007 at 15:33 UTC

    If you just need to get the job done, one line of shell is enough (or two lines if the files are not sorted):

    # if the files are not sorted, sort them
    # (join expects plain lexical sort order, so do not sort numerically)
    $ sort file1 > file1.sorted
    $ sort file2 | join -v 1 file1.sorted - > file1.uniq

    citromatik

Re: compare two files
by leocharre (Priest) on Jul 24, 2007 at 13:26 UTC

    I am guessing your files would be:

    File a:

    123123123

    File b:

    123123

    Where do you want to take the numbers from? The start, or the end? What generates these files? Are only \d digits present in the files? Should your program freak out if non-digit characters are present?