PerlMonks
Comparison Of Files

by ImpalaSS (Monk)
on Dec 05, 2000 at 20:11 UTC ( #45003=perlquestion )

ImpalaSS has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a weird question. I have two Perl scripts that do exactly the same thing, except one creates a file called "newsite1.txt" and the other "newsite2.txt", and each is run on alternating days; newsite1.pl runs on Sunday, Tuesday, Thursday, and Saturday. Each script reads a directory which has the names of all the cellsites in the Nextel system. My boss wants me to create a tool that tells you which cellsite came up overnight. So if someone clicks the button today (Tuesday) they would get the sites that came up Monday night through Tuesday morning. Here is a little bit of what the text file looks like:
PHI_R10K_2
PHI_TDAP_3
de0040Newark
de0042Wilmington
de0053Christana
de0053Christiana
de0101Odessa
de0102Clayton
de0103Dover
de0180Claymont
de0184Wooddale
de0187Woodcreek
de0205ChestnutKnoll
de0267Chapman
de0314Glasgow
de0348Millside
de0371SouthDover

Only there happen to be about 700 of them. Now, my question is: how can I compare newsite1.txt to newsite2.txt and print the differences most efficiently? I thought about opening the one file and comparing every entry against the second file, but if the second file has more cell names, then some would get neglected.
Thanks in advance

Dipul

Replies are listed 'Best First'.
Re: Comparison Of Files
by chipmunk (Parson) on Dec 05, 2000 at 21:22 UTC
    Here's a solution that reports the changes in both directions, using a hash for each file. It has been tested.
    sub compare_sites {
        my($old, $new) = @_;
        if (-M $new > -M $old) { ($old, $new) = ($new, $old); }
        open(OLD, $old) or die "Can't open $old: $!\n";
        open(NEW, $new) or die "Can't open $new: $!\n";
        my(%old, %new);
        while (<OLD>) { chomp; $old{$_} = 1; }
        while (<NEW>) { chomp; $new{$_} = 1; }
        close(OLD);
        close(NEW);
        my @old = keys %old;
        delete @old{keys %new};
        delete @new{@old};
        for (sort keys %old) { print "$_ went down overnight.\n"; }
        for (sort keys %new) { print "$_ came up overnight.\n"; }
    }

      OK. I stole a lot of chipmunk's code (because I'm lazy and didn't want to type...). Instead, I trade speed for memory by using one hash for both old and new.

      sub compare_sites {
          my($old, $new) = @_;
          if (-M $new > -M $old) { ($old, $new) = ($new, $old); }
          open(OLD, $old) or die "Can't open $old: $!\n";
          open(NEW, $new) or die "Can't open $new: $!\n";
          my %hash;
          my $omsk = 1;
          my $nmsk = 2;
          while (<OLD>) { chomp; $hash{$_} |= $omsk; }
          while (<NEW>) { chomp; $hash{$_} |= $nmsk; }
          close(OLD);
          close(NEW);
          for (sort keys %hash) {
              if ($hash{$_} == 1) { print "$_ went down overnight.\n"; }
          }
          for (sort keys %hash) {
              if ($hash{$_} == 2) { print "$_ went up overnight.\n"; }
          }
      }
      compare_sites("newsite1.txt", "newsite2.txt");
      Now you could just go through the hash once and push into arrays as well, but then we might as well use the chipmunk hash. (Dance or Dish?)
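      [Editor's note] The one-pass variant mentioned above could look like this (an untested sketch; the sample %hash contents are made up, using the same bitmask convention as in the parent node):

```perl
use strict;
use warnings;

# Walk the combined hash once and collect both result lists, instead
# of looping over it twice. Bitmask convention as above:
# 1 = old file only, 2 = new file only, 3 = present in both.
# The contents here are made-up sample data.
my %hash = ( siteA => 1, siteB => 2, siteC => 3, siteD => 2 );

my ( @down, @up );
for ( sort keys %hash ) {
    push @down, $_ if $hash{$_} == 1;    # only in the old file
    push @up,   $_ if $hash{$_} == 2;    # only in the new file
}

print "$_ went down overnight.\n" for @down;
print "$_ went up overnight.\n"   for @up;
```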
Re: Comparison Of Files
by gaspodethewonderdog (Monk) on Dec 05, 2000 at 20:17 UTC
    Well, this is untested code, but maybe something like this:
    compare_sites("newsite1.txt", "newsite2.txt") if $day eq "monday...";  # pseudo code
    compare_sites("newsite2.txt", "newsite1.txt") if $day eq "tuesday..."; # pseudo code

    sub compare_sites {
        my $file1 = shift;
        my $file2 = shift;
        open IN1, $file1;
        while (<IN1>) {
            chomp;
            $newsite1{$_}++;
        } # while
        close IN1;
        open IN2, $file2;
        while (<IN2>) {
            chomp;
            print "$_ is a new site\n" if not defined($newsite1{$_});
        } # while
        close IN2;
    } # compare_sites
    Hopefully memory isn't a problem... but otherwise this shouldn't be too bad a solution.

    UPDATE:

    changed the code to a function as per tye's suggestion (and, so he doesn't think I'm trying to make him look like an idiot, I added this comment)... and added pseudo code for calling the function based on dates... :P

    UPDATE:

    changed the does not exist line to print the site name and not a number... plus it identifies that it is a new site...

      One thing. I would change this:
      while (<IN2>) {
          chomp;
          print "$_ is a new site\n" if not defined($newsite1{$_});
      } # while
      to this:
      while (<IN2>) {
          chomp;
          print "$_ is a new site\n" unless exists $newsite1{$_};
      } # while
      Checking for the existence of a key is quicker than looking up the value of a hash element.
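      [Editor's note] Beyond speed, exists and defined can also disagree, which matters if a key is ever stored with an undef value. A small illustration with made-up data:

```perl
use strict;
use warnings;

# exists only asks whether the key is present; defined also fetches
# the value and tests it, so it does strictly more work -- and gives a
# different answer for a key stored with an undef value.
# The hash contents here are made up for illustration.
my %newsite1 = ( 'de0040Newark' => 1, 'de0103Dover' => undef );

my $e = exists  $newsite1{'de0103Dover'};   # true: the key is present
my $d = defined $newsite1{'de0103Dover'};   # false: its value is undef
```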
      How about something like this to choose the order of files:
      if ( -M 'newsite1.txt' > -M 'newsite2.txt' ) {
          compare_sites('newsite1.txt', 'newsite2.txt');
      }
      else {
          compare_sites('newsite2.txt', 'newsite1.txt');
      }
      That way if for some reason the update doesn't run one day, you'll still be comparing the files in proper order.

      --
      I'd like to be able to assign to an luser

      Hey, I would use that kind of solution, but each file gets updated on alternating days. So on one day newsite1.txt might be the newest file, and on the next newsite2.txt might be. I need a way to see whether each name in one file is in the other, and if it's not, print it. So if newsite2.txt has a site that newsite1.txt does not, print it, and vice versa.
      Thanks

      Dipul

        Then make it into a subroutine and reverse the arguments to it (the names of the files) depending on the day.

                - tye (but my friends call me "Tye")
Re: Comparison Of Files
by mdillon (Priest) on Dec 05, 2000 at 21:42 UTC

    since the data files are effectively presorted...

    ObAlgorithmDiff:
    #!/usr/bin/perl -w
    use strict;
    use Algorithm::Diff qw(traverse_sequences);

    die unless @ARGV == 2 and $ARGV[0] ne $ARGV[1];
    my @files = @ARGV;

    my %data;
    push @{$data{$ARGV}}, $_ while <>;

    my $a = $data{$files[0]};
    my $b = $data{$files[1]};

    my (@additions, @deletions);
    traverse_sequences $a, $b, {
        DISCARD_A => sub { push @deletions, $a->[$_[0]] },
        DISCARD_B => sub { push @additions, $b->[$_[0]] },
    };

    if (@deletions) {
        print "Deletions:", $/;
        print for @deletions;
        print $/;
    }
    if (@additions) {
        print "Additions:", $/;
        print for @additions;
        print $/;
    }
(tye)Re3: Comparison Of Files
by tye (Sage) on Dec 05, 2000 at 21:54 UTC

    My first idea was a "merge sort" (well, at least the "merge" part of it). Luckily you say that the files are already sorted, so this is easy. But I still don't think it is as easy as extending what gaspodethewonderdog came up with (since you said in the chatterbox that you wanted both additions and deletions).

    sub compare {
        my( $old, $new )= @_;
        open OLD, "< $old" or die "Can't read $old: $!\n";
        open NEW, "< $new" or die "Can't read $new: $!\n";
        my %old;
        while( <OLD> ) { chomp; $old{$_}++; }
        close OLD;
        my @new;
        # A site is new unless it was in the old file; delete also
        # removes the common entries so %old ends up holding only
        # the sites that went away.
        while( <NEW> ) { chomp; push @new, $_ unless delete $old{$_}; }
        close NEW;
        my @old= sort keys %old;
        print "New sites:\n\t", join("\n\t",@new), $/;
        print "Old sites:\n\t", join("\n\t",@old), $/;
    }

    Then use Alabanach's idea for deciding which order to compare the files in.

            - tye (but my friends call me "Tye")
Re: Comparison Of Files
by wardk (Deacon) on Dec 05, 2000 at 21:25 UTC

    Perl Cookbook Recipe 4.7, "Finding Elements in One Array but Not Another", is a solution that may help.

    From the Cookbook... "Build a hash of the keys of @B and use it as a lookup table, then check each element in @A to see if it is in @B."

    Compare site1 to site2, then site2 to site1, to get both sets of missing sites.
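    [Editor's note] A minimal sketch of that recipe, run in both directions (the site names here are made up):

```perl
use strict;
use warnings;

# @a and @b stand in for the lines of newsite1.txt and newsite2.txt.
my @a = qw( de0040Newark de0042Wilmington de0045Concord );
my @b = qw( de0040Newark de0042Wilmington de0103Dover );

# Build a lookup hash of @b, then test each element of @a against it.
my %in_b = map { $_ => 1 } @b;
my @only_in_a = grep { !$in_b{$_} } @a;   # sites missing from file 2

# Repeat in the other direction for the other set of differences.
my %in_a = map { $_ => 1 } @a;
my @only_in_b = grep { !$in_a{$_} } @b;   # sites missing from file 1
```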

Re: Comparison Of Files
by decnartne (Beadle) on Dec 05, 2000 at 20:43 UTC
    if you have no aversion to diff, how about:

    #!/usr/bin/perl -w
    use strict;

    open(INP, "/usr/bin/diff ./newsite1.txt ./newsite2.txt |")
        or die "pipe: $!\n";
    while (<INP>) {
        print substr($_, 2) if (/^> /);
        print substr($_, 2) if (/^< /);
    }
    close(INP);

    decnartne ~ entranced

      Two problems. First, you may need to sort both files before you do this. If the order of entries might change between days, then "diff" isn't a great solution.

      Second, you'll probably end up printing several lines as being both added and deleted. "diff" isn't great at doing a set difference. It is looking for document edits and so can easily report a big chunk of the "bigger" file as being changed and then show the subset of that chunk that was already there in the "smaller" file (and didn't change).

              - tye (but my friends call me "Tye")
        The two files are created from a directory listing of all the sites in the system, so they will automatically be sorted, in exactly the same order; however, new sites will be placed within that order. So the files could look like this:
        newsite1:
        PHI_R10K_2
        PHI_TDAP_3
        de0040Newark
        de0042Wilmington
        de0053Christana
        de0053Christiana
        de0101Odessa
        de0102Clayton
        de0103Dover
        de0180Claymont
        de0184Wooddale
        de0187Woodcreek
        de0205ChestnutKnoll
        de0267Chapman
        de0314Glasgow
        de0348Millside
        de0371SouthDover
        newsite2:
        PHI_R10K_2
        PHI_TDAP_3
        de0040Newark
        de0042Wilmington
        de0045Concord       # <======= new site
        de0053Christana
        de0053Christiana
        de0101Odessa
        de0102Clayton
        de0103Dover
        de0180Claymont
        de0184Wooddale
        de0187Woodcreek
        de0205ChestnutKnoll
        de0267Chapman
        de0314Glasgow
        de0348Millside
        de0371SouthDover
        at which point I would want de0045Concord returned.
        I hope this helps.
        Thanks again


        Dipul
        Ouch! You're right... I did some further testing, and let's just say it's pretty ugly...

        decnartne ~ entranced

Re: Comparison Of Files
by gt8073a (Hermit) on Dec 06, 2000 at 05:12 UTC
    My boss wants me to create a tool that tells you which cellsite came up over night. So if someone clicks the button today (tuesday) they would get the sites that came up monday night through tuesday morning.
    use Time::Local;

    my $sec   = 0;
    my $min   = 0;
    my $close = 17;              ## 5:00 pm
    my $open  = 9;               ## 9:00 am
    my $spd   = 24 * 60 * 60;

    my $yes     = time - timelocal( $sec, $min, $close,
                      (localtime( time - $spd ))[ 3, 4, 5 ] );
    my $morning = time - timelocal( $sec, $min, $open,
                      (localtime)[ 3, 4, 5 ] );

    my $dir = '/';               ## where the files reside (rem ending slash)
    my @cellsites;

    opendir CELLSITES, $dir or die "opendir";
    @cellsites = map  { $_->[0] }
                 grep { $_->[1] <= $yes && $_->[1] >= $morning }
                 map  { [ $_, ( ( -M "$dir$_" ) * $spd ) ] }
                 readdir CELLSITES;
    closedir CELLSITES;

    ## print @cellsites to a file
    ## or to screen, or something

      Nice first post. (: I liked the completely different approach and I liked the way you did it.

      I thought you might appreciate another alternative for the final part:

      @cellsites = grep {
          my $age = $spd * -M $dir.$_;
          $age <= $yes && $morning <= $age
      } readdir CELLSITES;
      which I think is a slight improvement. Keep up the good work.

              - tye (but my friends call me "Tye")
Re: Comparison Of Files
by chipmunk (Parson) on Dec 06, 2000 at 08:51 UTC
    If you decide to go with a utility solution rather than a Perl solution, an alternative to using `diff` is `comm`, which finds common lines in sorted files. Each line is put into one of three columns depending on whether it is in the first file, the second file, or both files. (There is no column for lines that are in neither file.) The command line arguments let you turn off columns you don't want.

    comm -23 newsite1.txt newsite2.txt will print lines that are only in newsite1.txt, and
    comm -13 newsite1.txt newsite2.txt will print lines that are only in newsite2.txt.

    comm -3 newsite1.txt newsite2.txt will print lines that are only in newsite1.txt in the first column, and lines that are only in newsite2.txt in the second column.

Re: Comparison Of Files
by extremely (Priest) on Dec 06, 2000 at 03:50 UTC
    This is off from the main question, but why run two different scripts on alternating days? Just run one script: have it copy the older file back and then create a new file.
    use File::Copy;
    move '/path/newsite2.txt', '/path/newsite3.txt';   # just in case
    move '/path/newsite1.txt', '/path/newsite2.txt';   # back up
    # create newsite1.txt

    Really, maintaining just one script is worth the extra line or two...

    --
    $you = new YOU;
    honk() if $you->love(perl)

Re: Comparison Of Files
by Anonymous Monk on Dec 05, 2000 at 23:46 UTC

    With 700 lines you can easily read one file into a hash and then compare each row in the other file against that hash.

    With larger files (1 MB and up), you may wish to save a lot of memory by noticing that the files seem to be sorted alphabetically. Also, in this case most of the lines will be present in both files, so storing only the differing rows will not consume insane amounts of memory :-)

    Here is a merge-sort-ish way to do it:
    =head1 compare_sorted_files_by_line($filename1, $filename2)
    
    Finds lines that are present in only one of the files, whose names are
    given as arguments. This function assumes that the lines in the files are
    in alphabetical order.
    
    Returns the unique rows in each file, in two list references. The first one
    points to an array containing the rows that are present in $filename1 only,
    and the second one similarly for $filename2.
    
    Returns an empty list if either of the files could not be opened for reading.
    
    =cut
    
    sub compare_sorted_files_by_line( $$ )
    {
        my($filename1, $filename2) = @_;
    
        my(@in1only, @in2only); # The unique rows ("matches") are stored in these
    
        unless(open(FILE1, "< $filename1"))
        { warn "$0: Could not open $filename1: $!\n"; return (); }
        unless(open(FILE2, "< $filename2"))
        { warn "$0: Could not open $filename2: $!\n"; close FILE1; return ();}
    
        my $line1 = <FILE1>;
        my $line2 = <FILE2>;
    
        while(defined($line1) and defined($line2))
        {
            my $compare = $line1 cmp $line2;
            if($compare == 0)
            {
                $line1 = <FILE1>;
                $line2 = <FILE2>;
                next;
            }
            elsif($compare > 0)
            {
                push(@in2only, $line2);
                $line2 = <FILE2>;
                next;
            }
            else
            {
                push(@in1only, $line1);
                $line1 = <FILE1>;
            }
        }
        # were there differences at end of file?
        if(defined($line1))
        {
            push(@in1only, $line1);
            push(@in1only, $_) while(<FILE1>);
        }
        if(defined($line2))
        {
            push(@in2only, $line2);
            push(@in2only, $_) while(<FILE2>);
        }
        close FILE1;
        close FILE2;
    
        # we happen to like strings without newlines.
        chomp(@in1only);
        chomp(@in2only);
    
        return(\@in1only, \@in2only);
    }
    
    -Bass
Re: Comparison Of Files
by 2501 (Pilgrim) on Dec 05, 2000 at 22:50 UTC
    I like chipmunk's idea to use hashes. Read through both files, basically saying:
    $masterlist{$fileline} = 1;
    then do a foreach on the keys of %masterlist to get all the records. Because hashes can't have duplicate keys, you should be OK. From what you wrote of your attempts, it sounds like you are not as interested in REMOVING items, because time will take care of that.
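    [Editor's note] One way to flesh that idea out (an untested sketch with made-up site names): record which file each line was seen in, rather than a flat 1, then walk the merged key list once to find both additions and removals.

```perl
use strict;
use warnings;

# Made-up data standing in for the lines of the two files.
my @file1 = qw( siteA siteB siteC );   # yesterday's listing
my @file2 = qw( siteB siteC siteD );   # today's listing

# Merge both files into one master list, tagging each line with the
# file(s) it appeared in.
my %masterlist;
$masterlist{$_}{old} = 1 for @file1;
$masterlist{$_}{new} = 1 for @file2;

# One pass over the merged keys finds both kinds of difference.
my ( @came_up, @went_down );
for my $site ( sort keys %masterlist ) {
    push @came_up,   $site unless $masterlist{$site}{old};
    push @went_down, $site unless $masterlist{$site}{new};
}
```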
Re: Comparison Of Files
by weingart (Acolyte) on Dec 05, 2000 at 23:31 UTC
    If they are already sorted,

    diff -u file1 file2 | sed -e '/^[^+]/d' -e 's/^+//'

    will do the job handily... Of course, using an oldie tool will do the job even more easily, but only if the files are sorted first:

    comm -2 file1 file2

    --Toby.
Re: Comparison Of Files
by Anonymous Monk on Dec 06, 2000 at 03:02 UTC
    Use the diff command, outputting the results to a file.
Re: Comparison Of Files
by belg4mit (Prior) on Dec 07, 2000 at 02:45 UTC
