
With 700 lines you can easily read one file into a hash and then compare each row in the other file against that hash.
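For what it's worth, the hash approach could look roughly like the sketch below. The name compare_files_by_hash is made up for this illustration, and the sketch assumes you only care about which lines occur, not how many times: duplicate lines within one file are collapsed.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the thread).
# Returns two array references: lines found only in the first file and
# lines found only in the second file. Line order in the files is irrelevant.
sub compare_files_by_hash {
    my ($filename1, $filename2) = @_;

    open(my $fh1, '<', $filename1)
        or die "$0: Could not open $filename1: $!\n";
    open(my $fh2, '<', $filename2)
        or die "$0: Could not open $filename2: $!\n";

    # Remember every line of the first file; value 1 means "not matched yet".
    my %seen;
    while (my $line = <$fh1>) {
        chomp $line;
        $seen{$line} = 1;
    }
    close $fh1;

    # Check each line of the second file against the hash.
    my %in2only;
    while (my $line = <$fh2>) {
        chomp $line;
        if (exists $seen{$line}) {
            $seen{$line} = 2;       # present in both files
        }
        else {
            $in2only{$line} = 1;    # unknown to the first file
        }
    }
    close $fh2;

    # Whatever is still marked 1 never showed up in the second file.
    my @in1only = sort grep { $seen{$_} == 1 } keys %seen;
    my @in2only = sort keys %in2only;

    return (\@in1only, \@in2only);
}
```

The whole first file sits in memory here, which is no problem at 700 lines.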

With larger files (1 MB and up), you can save a lot of memory by taking advantage of the fact that the files are sorted (alphabetically, it seems). Also, in this case most of the lines will be present in both files, so storing the differing rows will not consume insane amounts of memory :-)

Here is a mergesort-ish way to do it:

=head1 compare_sorted_files_by_line($filename1, $filename2)

Finds lines that are present in only one of the files, whose names are
given as arguments. This function assumes that the lines in the files are
in alphabetical order.

Returns the unique rows in each file, in two list references. The first one
points to an array containing the rows that are present in $filename1 only,
and the second one similarly for $filename2.

Returns an empty list if either of the files could not be opened for reading.

=cut

sub compare_sorted_files_by_line( $$ )
{
    my($filename1, $filename2) = @_;

    my(@in1only, @in2only); # The unique rows ("matches") are stored in these

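    # Three-argument open with lexical filehandles, so that odd characters
    # in a filename are never interpreted as part of the open mode.
    my($file1, $file2);
    unless(open($file1, '<', $filename1))
    { warn "$0: Could not open $filename1: $!\n"; return (); }
    unless(open($file2, '<', $filename2))
    { warn "$0: Could not open $filename2: $!\n"; close $file1; return (); }

    my $line1 = <$file1>;
    my $line2 = <$file2>;

    # Walk both files in step, always advancing the side whose current
    # line sorts first.
    while(defined($line1) and defined($line2))
    {
        my $compare = $line1 cmp $line2;
        if($compare == 0)
        {
            # Same line in both files: skip it on both sides.
            $line1 = <$file1>;
            $line2 = <$file2>;
        }
        elsif($compare > 0)
        {
            # $line2 sorts before $line1, so file 1 cannot contain it.
            push(@in2only, $line2);
            $line2 = <$file2>;
        }
        else
        {
            # $line1 sorts before $line2, so file 2 cannot contain it.
            push(@in1only, $line1);
            $line1 = <$file1>;
        }
    }
    # One file may still have lines left; all of them are unique to it.
    if(defined($line1))
    {
        push(@in1only, $line1);
        push(@in1only, $_) while(<$file1>);
    }
    if(defined($line2))
    {
        push(@in2only, $line2);
        push(@in2only, $_) while(<$file2>);
    }
    close $file1;
    close $file2;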

    # we happen to like strings without newlines.
    chomp(@in1only);
    chomp(@in2only);

    return(\@in1only, \@in2only);
}
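Since the function above silently gives wrong answers when a file is not sorted the way cmp expects, it may be worth checking that assumption first. Below is a minimal sketch of such a check. The name file_is_sorted is hypothetical, and the sketch assumes plain cmp without "use locale", which compares character codes, so uppercase letters sort before lowercase ones. A file sorted by a locale-aware tool may therefore fail this check.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): returns 1 if no line of the
# file sorts before its predecessor under cmp, and 0 otherwise.
# An empty file counts as sorted.
sub file_is_sorted {
    my ($filename) = @_;

    open(my $fh, '<', $filename)
        or die "$0: Could not open $filename: $!\n";

    my $previous;
    while (my $line = <$fh>) {
        # A line that sorts before the previous one breaks the order.
        # The lexical filehandle is closed automatically on early return.
        return 0 if defined($previous) and ($previous cmp $line) > 0;
        $previous = $line;
    }
    close $fh;

    return 1;
}
```

If the check fails for either file, you could sort the files first or fall back to the hash approach, which does not care about order.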

-Bass

In reply to Re: Comparison Of Files by Anonymous Monk
in thread Comparison Of Files by ImpalaSS
