in reply to Are two lines in the text file equal

If you want it really fast (and have a lot of memory, say 1-4 GB), use a hash, thusly:

my $log1 = '/var/log/httpd/access_log';
my $log2 = '/var/log/httpd/access_log.1';
my %hash;
open FH1, $log1 or die $!;
open FH2, $log2 or die $!;
$hash{$_}++ while <FH1>;
$hash{$_}++ while <FH2>;
close FH1;
close FH2;
for (keys %hash) {
    print if $hash{$_} > 1;
}

Any key with a count > 1 is a match.

Expect this to use about 200-400 MB of memory per million lines as a ballpark (at least double, maybe even triple the raw data size). With Perl you spend memory for speed. Because of the timestamps, IPs, etc., you may want to parse out just the part you actually want to compare (which will also save memory, as the keys will be much shorter).
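
A minimal sketch of that key-trimming idea, assuming Apache common/combined log format and assuming the quoted request string is the part worth comparing. The helper names line_key and repeated_keys are made up for illustration; swap in whatever field you actually care about.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a log line to the part we want to compare.
# Assumption: the first double-quoted field is the request string,
# e.g. "GET /index.html HTTP/1.0". Falls back to the whole line.
sub line_key {
    my ($line) = @_;
    chomp $line;
    return $line =~ /"([^"]*)"/ ? $1 : $line;
}

# Count the trimmed keys from an open filehandle and return,
# sorted, every key that was seen more than once.
sub repeated_keys {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        $count{ line_key($line) }++;
    }
    my @repeated = sort grep { $count{$_} > 1 } keys %count;
    return @repeated;
}
```

Call it as repeated_keys($fh) on a handle opened for reading. The hash then holds one short key per distinct request rather than one full line per distinct log entry.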

cheers

tachyon

Re^2: Are two lines in the text file equal (!count)
by tye (Sage) on Nov 13, 2003 at 02:38 UTC
    Any key with a count > 1...

    ...might just have appeared in one file more than once but never in the other.

    my $log1 = '/var/log/httpd/access_log';
    my $log2 = '/var/log/httpd/access_log.1';
    my %hash;
    open FH, $log1 or die $!;
    $hash{$_}++ while <FH>;
    close FH;
    open FH, $log2 or die $!;
    while( <FH> ) {
        print if $hash{$_};
    }
    close FH;

    At least, that is how I interpret the question.

    Update: BrowserUk has more singular words than I have plural words to go on and so is probably right. It appears tachyon also thought two files were involved, so I won't feel too bad.

                    - tye

      Am I the only one who interprets this question to relate to finding duplicates in a single file?

      ...a given text file....anywhere in the file...The file contains...

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      Hooray!
      Wanted!

        Yes, the goal is to find the duplicates (triplicates??), if any. Unfortunately, a machine with 1-4 GB of memory is not at my disposal, so generating a giant hash is not truly an option.
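
One lower-memory direction, offered only as a sketch and not something proposed in this thread: if the file is sorted first (for example with the system sort utility, which is an assumption about the environment), equal lines end up adjacent and can be found by comparing each line with the one before it, with no hash at all. The sub name adjacent_duplicates is hypothetical.

```perl
use strict;
use warnings;

# Assumption: the handle yields lines that are ALREADY sorted, so equal
# lines sit next to each other. Returns a list of [line, count] pairs,
# one per line that occurs more than once. Only the current run and the
# result list are held in memory.
sub adjacent_duplicates {
    my ($fh) = @_;
    my @dups;
    my ($prev, $run) = (undef, 0);
    while (my $line = <$fh>) {
        chomp $line;
        if (defined $prev && $line eq $prev) {
            $run++;                                 # same line again
        }
        else {
            push @dups, [$prev, $run] if $run > 1;  # close the previous run
            ($prev, $run) = ($line, 1);             # start a new run
        }
    }
    push @dups, [$prev, $run] if $run > 1;          # close the final run
    return @dups;
}
```

This trades the speed of the hash for a sort pass up front. If the list of duplicates itself could be large, printing each pair as it is found instead of collecting it would keep memory use flat.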