Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

One file contains values like

2000/01/03/aaa/xyz.xml
2002/04/01/bbb/abc.xml

The second file contains

2000/01/03/xyz

How can I read the two files simultaneously and print each line of the first file that is not present in the second, after removing the 4th field and the .xml suffix?
#!/usr/bin/perl
$data_file="/tmp/input_ext_id.txt";
open(DAT, $data_file) || die("Could not open file!");
@datas=<DAT>;
foreach $data(@datas){
    $data_orig=$data;
    @data=split("/",$data);
    $data[4] =~ s/\.xml//ig;
    chomp($data_orig);
    push(@finalarr,"$data_orig#$data[0]/$data[1]/$data[2]/$data[4]");
    push(@finalarr, "$data[0]/$data[1]/$data[2]/$data[4]");
}
@sorted_list = sort(@finalarr);
The sorted list is redirected to a file. Then I take the difference of the two files (the first file after removing the 4th field and .xml). The lines that differ are passed in as <FILE2>.
@datas=<FILE2>;
while ($line = <FILE1>) {
    $position=rindex($line,"#")+1;
    $lines = substr($line,$position);
    #print $lines;
    $lines =~ s/^\s//gi;
    if ( grep( /$lines/,@datas) ) {
        print "$line";
    }
}
When the files contain about 230K lines, it takes far too long and prints the message "OUT of memory". The output should be:
2002/04/01/bbb/abc.xml

Replies are listed 'Best First'.
Re: Read the two files
by GrandFather (Saint) on Jul 22, 2009 at 11:45 UTC

    For large files (230K lines is at the small end of large) slurping files is a bad idea. In this case reading the second file first and building a lookup hash, then reading the first file and testing against the hash is probably the way to go. Consider:

    use strict;
    use warnings;

    my $file1Str = <<END_FILE1;
    2000/01/03/aaa/xyz.xml
    2002/04/01/bbb/abc.xml
    END_FILE1

    my $file2Str = <<END_FILE2;
    2000/01/03/xyz
    END_FILE2

    my %lookup;

    open my $file2In, '<', \$file2Str or die ("Could not open file!");
    while (my $line = <$file2In>) {
        my @parts = split '/', $line;
        my $key = join '/', @parts[0 .. 3];
        ++$lookup{$key};
    }
    close $file2In;

    open my $file1In, '<', \$file1Str or die ("Could not open file!");
    while (my $line = <$file1In>) {
        my @parts = split '/', $line;
        $parts[-1] =~ s/\.xml//i;
        my $key = join '/', @parts[0 .. 2, 4];
        next if exists $lookup{$key};
        print $line;
    }
    close $file1In;

    Prints:

    2002/04/01/bbb/abc.xml

    Update: Fixed wording - thanks jethro


    True laziness is hard work
Re: Read the two files
by jethro (Monsignor) on Jul 22, 2009 at 11:44 UTC

    If you have big files, don't slurp them in in one big chunk. Instead of

    @datas=<DAT>; foreach ...

    use

    while ($data= <DAT>) {
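
    As a rough illustration of the line-by-line loop, here is a minimal sketch of the key-building step from the question rewritten so that only the current line is held in memory. The helper names key_for and print_keys are made up for this example, and the input is an in-memory string standing in for the real file.

```perl
use strict;
use warnings;

# Turn "2000/01/03/aaa/xyz.xml" into "2000/01/03/xyz":
# drop the 4th field and the ".xml" suffix.
# (key_for is a hypothetical helper name, not from the thread.)
sub key_for {
    my ($line) = @_;
    chomp $line;
    my @parts = split m{/}, $line;
    return unless @parts == 5;    # ignore lines without exactly five fields
    (my $name = $parts[4]) =~ s/\.xml\z//i;
    return join '/', @parts[0 .. 2], $name;
}

# Read one line at a time from $in and write the reduced key to $out.
# Unlike @datas = <DAT>, the whole file is never stored in an array.
sub print_keys {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        my $key = key_for($line);
        next unless defined $key;
        print {$out} "$key\n";
    }
    return;
}

# Demo on an in-memory "file"; a real run would open the input path instead.
my $sample = "2000/01/03/aaa/xyz.xml\n2002/04/01/bbb/abc.xml\n";
open my $in, '<', \$sample or die "Could not open in-memory file: $!";
print_keys($in, \*STDOUT);
close $in;
```

    With the sample above this prints 2000/01/03/xyz and 2002/04/01/abc, one per line.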

    The usual method to compare two files in Perl is to read one file into a hash, then read the other file line by line and check whether each line is in the hash.

    If the hash gets too big to fit into memory (very likely in your case), a module like DBM::Deep will store your hash transparently in a file.
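
    The two paragraphs above could be sketched roughly as follows. This is only an outline under some assumptions: print_missing is a made-up helper name, the sample data comes from the question, and DBM::Deep is tied in only if it happens to be installed (otherwise the sketch falls back to an ordinary in-memory hash). The temporary database path is likewise just for the demo.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Print each line from $in whose reduced key (4th field and ".xml"
# removed) is not a key of %$seen. $seen may be a plain or a tied hash.
# (print_missing is a hypothetical helper name, not from the thread.)
sub print_missing {
    my ($in, $seen, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        my @parts = split m{/}, $line;
        next unless @parts == 5;    # ignore lines without five fields
        (my $name = $parts[4]) =~ s/\.xml\z//i;
        my $key = join '/', @parts[0 .. 2], $name;
        print {$out} "$line\n" unless exists $seen->{$key};
    }
    return;
}

my %seen;
# Assumption: use DBM::Deep only if it is installed, so the hash
# lives in a file; otherwise %seen stays an ordinary in-memory hash.
if (eval { require DBM::Deep; 1 }) {
    my $dir = tempdir(CLEANUP => 1);
    tie %seen, 'DBM::Deep', "$dir/seen.db";
}

# Load the second file (the already reduced keys) into the hash.
my $file2 = "2000/01/03/xyz\n";
open my $in2, '<', \$file2 or die "Could not open file 2: $!";
while (my $line = <$in2>) {
    chomp $line;
    $seen{$line} = 1;
}
close $in2;

# Stream the first file and print the lines that have no match.
my $file1 = "2000/01/03/aaa/xyz.xml\n2002/04/01/bbb/abc.xml\n";
open my $in1, '<', \$file1 or die "Could not open file 1: $!";
print_missing($in1, \%seen, \*STDOUT);
close $in1;

untie %seen;    # no effect if %seen was never tied
```

    Because print_missing only uses exists on a hash reference, the same loop works unchanged whether the hash is in memory or backed by a file.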

Re: Read the two files
by rovf (Priest) on Jul 22, 2009 at 11:19 UTC

    I'm surprised that you get Out Of Memory with a file size of only 230K. Aside from this, I would start with the second file, build a hash of its entries (such as '2000/01/03/xyz'), then process the first file line by line (not slurping it into memory at once), and for each element in file 1, see whether or not it appears in the hash.

    -- 
    Ronald Fischer <ynnor@mm.st>
      I am getting the error because I have stored the lines in an array. It takes more than 5 hours to get the lines which are not in the first file.

        Well, that's why I have suggested not storing in an array.

        -- 
        Ronald Fischer <ynnor@mm.st>
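
        To illustrate the point about arrays: grep over an array walks the whole list for every line tested, so if both lists are near 230K entries that is on the order of 230,000 x 230,000 comparisons, while a hash is built once and each test is a single exists. The sketch below uses made-up sample keys and hypothetical helper names (in_array, in_hash). It compares with eq rather than the question's /$lines/, since an interpolated pattern treats characters such as "." as regex metacharacters and also matches substrings.

```perl
use strict;
use warnings;

# Made-up sample keys standing in for the lines of the second file.
my @keys = map { "2000/01/03/file$_" } 1 .. 1000;

# Array scan, as in the question: every lookup walks the whole list.
# Returns the number of exact matches.
sub in_array {
    my ($needle, $list) = @_;
    return scalar grep { $_ eq $needle } @$list;
}

# Hash lookup: the hash is built once, then each test is one exists.
my %lookup = map { $_ => 1 } @keys;

sub in_hash {
    my ($needle, $hash) = @_;
    return exists $hash->{$needle} ? 1 : 0;
}

for my $needle ('2000/01/03/file500', '2000/01/03/missing') {
    printf "%s: array=%d hash=%d\n",
        $needle, in_array($needle, \@keys), in_hash($needle, \%lookup);
}
```

        Both helpers give the same answers; only the amount of work per lookup differs.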