Read the two files

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

One file contains values like:

2000/01/03/aaa/xyz.xml
2002/04/01/bbb/abc.xml

The second file contains:

2000/01/03/xyz

How can I read the two files together and print each line of the first file that is not present in the second, after removing the 4th field and the .xml extension?

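#!/usr/bin/perl
use strict;
use warnings;

my $data_file = "/tmp/input_ext_id.txt";
open(DAT, '<', $data_file) || die("Could not open $data_file: $!");
my @datas = <DAT>;
close(DAT);

my @finalarr;
foreach my $data (@datas) {
    my $data_orig = $data;
    my @data = split("/", $data);
    # strip the extension; the trailing newline stays on $data[4]
    $data[4] =~ s/\.xml$//i;
    chomp($data_orig);
    # original line, then '#', then the key without the 4th field
    push(@finalarr, "$data_orig#$data[0]/$data[1]/$data[2]/$data[4]");
    # the key on its own
    push(@finalarr, "$data[0]/$data[1]/$data[2]/$data[4]");
}
my @sorted_list = sort(@finalarr);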

The sorted output is redirected to files. I then take the difference of the two files (using the first file after removing the 4th field and .xml). The lines that differ are read as <FILE2>.

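@datas=<FILE2>;

while ($line = <FILE1>) {
    # the key follows the last '#' on the line
    $position = rindex($line, "#") + 1;
    $lines = substr($line, $position);

    #print $lines;

    $lines =~ s/^\s+//;

    # scans all of @datas for every line of FILE1; \Q...\E keeps
    # characters in the key from being treated as regex syntax
    if ( grep( /\Q$lines\E/, @datas) ) {
        print "$line";
    }
}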

When the files contain 230K lines, this takes far too long and ends with the message "Out of memory". The output should be:

2002/04/01/bbb/abc.xml


Replies are listed 'Best First'.
Re: Read the two files by GrandFather (Saint) on Jul 22, 2009 at 11:45 UTC
For large files (230K lines is at the small end of large) slurping files is a bad idea. In this case reading the second file first and building a lookup hash, then reading the first file and testing against the hash is probably the way to go. Consider:

use strict;
use warnings;

my $file1Str = <<END_FILE1;
2000/01/03/aaa/xyz.xml
2002/04/01/bbb/abc.xml
END_FILE1

my $file2Str = <<END_FILE2;
2000/01/03/xyz
END_FILE2

my %lookup;

open my $file2In, '<', \$file2Str or die ("Could not open file!");
while (my $line = <$file2In>) {
    my @parts = split '/', $line;
    my $key = join '/', @parts[0 .. 3];
    ++$lookup{$key};
}
close $file2In;

open my $file1In, '<', \$file1Str or die ("Could not open file!");
while (my $line = <$file1In>) {
    my @parts = split '/', $line;
    $parts[-1] =~ s/\.xml//i;
    my $key = join '/', @parts[0 .. 2, 4];
    next if exists $lookup{$key};
    print $line;
}
close $file1In;

Prints:

`2002/04/01/bbb/abc.xml`

Update: Fixed wording - thanks jethro

True laziness is hard work
Re: Read the two files by jethro (Monsignor) on Jul 22, 2009 at 11:44 UTC
If you have big files, don't slurp them in one big chunk. Instead of `@datas=<DAT>; foreach ...` use `while ($data= <DAT>) {`.

The usual method to compare two files in perl is to read one file into a hash and then read the other file line by line and check if that line is in the hash.

If the hash gets too big to fit into memory (very likely in your case), a module like DBM::Deep will store your hash transparently in a file.
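As a rough illustration of keeping the lookup hash on disk: DBM::Deep is not a core module, so this sketch swaps in `SDBM_File`, which ships with Perl, and is not the module named above. The helper name `open_disk_hash` and its path argument are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;
use Fcntl;        # exports O_RDWR and O_CREAT
use SDBM_File;    # core module; creates "$path.pag" and "$path.dir"

# Hypothetical helper: return a reference to a hash whose contents
# live in files under $path instead of in memory.
sub open_disk_hash {
    my ($path) = @_;
    tie my %hash, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $path: $!";
    return \%hash;
}
```

The returned reference is used like any other hash: set `$lookup->{$key} = 1` while reading the second file, then test `exists $lookup->{$key}` while reading the first. DBM::Deep documents a similar `tie` interface, so the surrounding code should need little change if that module is installed.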
Re: Read the two files by rovf (Priest) on Jul 22, 2009 at 11:19 UTC
I'm surprised that you get Out Of Memory with a file size of only 230K. Aside from this, I would start with the second file, build a hash of its entries (such as `'2000/01/03/xyz'`), then process the first file line by line (not slurping it into memory at once), and for each element in file 1, see whether or not it appears in the hash.

-- Ronald Fischer <ynnor@mm.st>
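A minimal sketch of the approach described above, reading from files on disk. The subroutine names `make_key` and `lines_missing_from` are illustrative assumptions, not part of the original posts. The key format (the three date fields plus the file name without `.xml`) is inferred from the sample data in the question.

```perl
use strict;
use warnings;

# Reduce "2000/01/03/aaa/xyz.xml" to "2000/01/03/xyz":
# keep the three date fields, drop the 4th field, strip ".xml".
# Returns undef for lines with fewer than five fields.
sub make_key {
    my ($line) = @_;
    chomp $line;
    my @parts = split m{/}, $line;
    return if @parts < 5;
    (my $name = $parts[4]) =~ s/\.xml\z//i;
    return join '/', @parts[0 .. 2], $name;
}

# Return the lines of $file1 whose key does not appear in $file2.
sub lines_missing_from {
    my ($file1, $file2) = @_;

    # Pass 1: every line of the second file becomes a hash key.
    my %seen;
    open my $in2, '<', $file2 or die "Cannot open $file2: $!";
    while (my $line = <$in2>) {
        chomp $line;
        $seen{$line} = 1;
    }
    close $in2;

    # Pass 2: stream the first file, one hash lookup per line.
    my @missing;
    open my $in1, '<', $file1 or die "Cannot open $file1: $!";
    while (my $line = <$in1>) {
        my $key = make_key($line);
        next unless defined $key;      # skip malformed or blank lines
        next if exists $seen{$key};
        chomp $line;
        push @missing, $line;
    }
    close $in1;
    return @missing;
}
```

It could be called as `print "$_\n" for lines_missing_from($file1, $file2);` with the two paths supplied by the caller. Each line of the first file costs one hash lookup instead of a scan over an array, and only the second file's keys are held in memory. For very large inputs, printing inside the second loop would avoid collecting `@missing` at all.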
Re^2: Read the two files by Anonymous Monk on Jul 22, 2009 at 11:45 UTC
I am getting the error because I stored the file in an array. It takes more than 5 hours to get the lines which are not in the first file.
Re^3: Read the two files by rovf (Priest) on Jul 22, 2009 at 11:46 UTC
Well, that's why I have suggested not storing in an array.

-- Ronald Fischer <ynnor@mm.st>