
With only 700 lines, you can simply read one file into a hash and then check each line of the other file against that hash.
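For what it's worth, the hash approach could look roughly like the sketch below. The function name compare_files_with_hash is made up for illustration, and the sketch assumes that both files fit comfortably in memory and that it does not matter how often a line is repeated.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original post: finds the lines
# that are present in only one of the two files, using a hash lookup.
# Returns two array references: lines only in $filename1 (sorted) and
# lines only in $filename2 (in file order).
sub compare_files_with_hash {
    my ($filename1, $filename2) = @_;

    # Remember every line of the first file; 0 means "not yet seen in file 2".
    my %seen;
    open(my $fh1, '<', $filename1)
        or die "$0: Could not open $filename1: $!\n";
    while (my $line = <$fh1>) {
        chomp $line;
        $seen{$line} = 0;
    }
    close $fh1;

    # Check each line of the second file against the hash.
    my @in2only;
    open(my $fh2, '<', $filename2)
        or die "$0: Could not open $filename2: $!\n";
    while (my $line = <$fh2>) {
        chomp $line;
        if (exists $seen{$line}) {
            $seen{$line} = 1;           # present in both files
        }
        else {
            push @in2only, $line;       # present in file 2 only
        }
    }
    close $fh2;

    # Whatever was never marked exists in file 1 only.
    my @in1only = sort grep { !$seen{$_} } keys %seen;

    return (\@in1only, \@in2only);
}
```

Unlike the merge approach, this does not care whether the files are sorted, at the cost of holding all of the first file in memory.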

With larger files (1 MB and up), you can save a lot of memory by taking advantage of the fact that the files are sorted (alphabetically, it seems). Also, in this case most of the lines will be present in both files, so storing the differing rows will not consume insane amounts of memory :-)

Here is a mergesort-ish way to do it:

=head1 compare_sorted_files_by_line($filename1, $filename2)

Finds lines that are present in only one of the files, whose names are
given as arguments. This function assumes that the lines in the files are
in alphabetical order.

Returns the unique rows of each file as two array references. The first one
points to an array containing the rows that are present in $filename1 only,
and the second one does the same for $filename2.

Returns an empty list if either of the files could not be opened for reading.

=cut

sub compare_sorted_files_by_line( $$ )
{
    my($filename1, $filename2) = @_;

    my(@in1only, @in2only); # The unique rows ("matches") are stored in these

    # Three-argument open, so odd characters in a filename cannot change the mode
    unless(open(FILE1, '<', $filename1))
    { warn "$0: Could not open $filename1: $!\n"; return (); }
    unless(open(FILE2, '<', $filename2))
    { warn "$0: Could not open $filename2: $!\n"; close FILE1; return ();}

    my $line1 = <FILE1>;
    my $line2 = <FILE2>;

    while(defined($line1) and defined($line2))
    {
        my $compare = $line1 cmp $line2;
        if($compare == 0)
        {
            $line1 = <FILE1>;
            $line2 = <FILE2>;
            next;
        }
        elsif($compare > 0)
        {
            push(@in2only, $line2);
            $line2 = <FILE2>;
            next;
        }
        else
        {
            push(@in1only, $line1);
            $line1 = <FILE1>;
        }
    }
    # were there differences at end of file?
    if(defined($line1))
    {
        push(@in1only, $line1);
        push(@in1only, $_) while(<FILE1>);
    }
    if(defined($line2))
    {
        push(@in2only, $line2);
        push(@in2only, $_) while(<FILE2>);
    }
    close FILE1;
    close FILE2;

    # we happen to like strings without newlines.
    chomp(@in1only);
    chomp(@in2only);

    return(\@in1only, \@in2only);
}

-Bass

In reply to Re: Comparison Of Files by Anonymous Monk
in thread Comparison Of Files by ImpalaSS
