tester786 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Dealing with large files in Perl
by salva (Canon) on May 15, 2005 at 21:30 UTC
    Break the files into smaller sorted ones, then use a merge sort to combine them, unifying duplicated entries when required.

    I recently released Sort::Key::Merger on CPAN; it can merge-sort data from several files efficiently:

    use Sort::Key::Merger qw(filekeymerger);

    my @last;

    # read lines from the (already sorted) part files, merged in key (id) order
    my $sorted = filekeymerger { (split)[0] } @file1_parts, @file2_parts;

    while (defined(my $line = &$sorted)) {
        chomp($line);
        my ($id, $up, $down) = split ' ', $line;
        if (!@last or $last[0] eq $id) {
            # first line, or same id as the current group: accumulate
            push @last, $id, $up, $down;
        }
        else {
            # new id: emit the merged group and start a new one
            print "@last\n";
            @last = ($id, $up, $down);
        }
    }
    print "@last\n" if @last;
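
    The splitting step itself isn't shown above; a minimal sketch of it might look like the code below (the make_chunks() helper and the chunk size are illustrative, not part of Sort::Key::Merger):

        use strict;
        use warnings;

        # split a big unsorted file into smaller files, each sorted by its
        # first field (the id), and return the list of chunk file names
        sub make_chunks {
            my ($file, $lines_per_chunk) = @_;
            my @chunk_files;
            my @buf;
            my $n = 0;
            open my $in, '<', $file or die "$file: $!";
            while (my $line = <$in>) {
                push @buf, $line;
                if (@buf >= $lines_per_chunk or eof $in) {
                    my $chunk = "$file.chunk" . $n++;
                    open my $out, '>', $chunk or die "$chunk: $!";
                    print {$out} sort {
                        (split ' ', $a)[0] cmp (split ' ', $b)[0]
                    } @buf;
                    close $out;
                    push @chunk_files, $chunk;
                    @buf = ();
                }
            }
            close $in;
            return @chunk_files;
        }

        my @file1_parts = make_chunks('file1', 100_000);
        my @file2_parts = make_chunks('file2', 100_000);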
Re: Dealing with large files in Perl
by Animator (Hermit) on May 15, 2005 at 21:18 UTC

    Some comments about this code:

    First of all, I find open (FILTER, "$out1") or die ("cannot open input file out1\n") more readable than the code you have.

    Second, you really should be using open (FILTER, "<", $out1), and perhaps you should include $! in your die string; it contains the error (such as file not found, permission denied, ...).
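
    For example (just a sketch, keeping your variable name $out1 for the input file):

        open( FILTER, "<", $out1 ) or die "cannot open input file '$out1': $!\n";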

    Third, indenting your code makes it much more readable (or did something go wrong in pasting it here?); perhaps you should read the perlstyle POD...

    Now the problem itself:

    You read FILTER, you split it, but you don't do anything with it...

    And it might be easier if you post some sample input and output... Also, is there a certain order in the file? Is one file usually smaller than the other one, or are they about the same size?

Re: Dealing with large files in Perl
by chas (Priest) on May 15, 2005 at 21:43 UTC
    There are some things I don't understand in your code (such as what $newfile is used for), but perhaps that's because not all the code is shown. The main thing I see is that when you are looping over $out1, you are constantly overwriting the variables $myid, etc., with your split, so you are going to get results concerning the last line only... and I don't see any attempt to search for a line of a given form (containing some "unique value").
    chas
Re: Dealing with large files in Perl
by graff (Chancellor) on May 16, 2005 at 04:47 UTC
    I think I understand this part of your description:
    if value exist in file1 and found in file2 than take both lines containing that value and merge into one single file
    But there's nothing in your code to support this kind of operation. What is the value that you're looking for in the two input files?

    It looks like both input files are lists of table-like data, with three fields per line ("id up down"). If the value you're looking for is unique to one "cell" in each table (that is, it would occur only once per input file, if at all), then you're really talking about doing a "grep" operation. In fact, if you're using a unix-like OS, just use the "grep" command-line utility; if you're using MS-Windows, there are versions of grep available for free.

    But if you want to see how it's done in perl, here's one way:

    #!/usr/bin/perl
    use strict;

    my $Usage = "Usage: $0 value file1 file2\n";
    die $Usage unless ( @ARGV == 3 and -f $ARGV[1] and -f $ARGV[2] );

    my $value = shift;   # removes first element from @ARGV
    my @match;           # will hold matching line from each file

    for my $file ( @ARGV ) {   # loop over remaining two ARG's
        open( IN, $file ) or die "$file: $!";
        while (<IN>) {
            if ( /$value/ ) {
                chomp;
                push @match, $_;
                last;
            }
        }
    }
    print join( " ", @match ), "\n";
    Now, if you were to try using the unix "grep" command, it would be:
    grep value file1 file2
    Note that both the perl script and the grep command shown above will output the matching lines to STDOUT (the grep command will not join the two into a single line -- it will also include the file name at the beginning of each line, to show where the line came from).
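
    For example, using the sample lines quoted later in this thread, the grep output would look something like this (illustrative):

        file1:00e06f16b25 41000 306000
        file2:00e06f16b25 389 5000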

    Also, your output might not be what you expect, if the value you're searching for contains characters that have special meanings in a regex (period, plus-sign, asterisk, question mark, brackets, braces, parens, "^", "$", "@" or "%", some others, depending on context). For such things, put "\Q" and "\E" around $value in the perl script.
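
    For example, in the script above the match line would become:

        if ( /\Q$value\E/ ) {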

    If you want the matches to be saved in a separate file, just use redirection on the command line:

    perlscript value file1 file2 > matched.lines
    # or
    grep value file1 file2 > matched.lines
      You're absolutely right. Your code matches exactly what I'm looking for; however, I'm not getting the result I expect. So here's the output after executing what you listed:
       
      <snip>
      00e06f16b25 41000 306000 00112f9486bf 412 1696
      </snip>
      What I'm looking for is to search for this value, 00e06f16b25, match it against file2, then take both matching lines from file1 and file2 and merge them into file3. So the result should be:
       
      00e06f16b25 41000 306000 00e06f16b25 389 5000
        If you really ran the code exactly as I posted it, and your first command-line arg (assigned to $value) was really "00e06f16b25", then I just don't see how you could come up with the output that you cited inside your "snip" tags. Please double-check that you didn't alter the code, and that you ran it as intended.

        But now that you have provided more information about your data -- that the value you want to match is the first token on each data line, and this consists of a long hex number -- you can speed things up and make it more trustworthy by using "substr" and "eq" instead of a regex match:

        use strict;

        my $Usage = "Usage: $0 value file1 file2\n";
        die $Usage unless ( @ARGV == 3 and -f $ARGV[1] and -f $ARGV[2] );

        my $value  = shift;   # removes first element from @ARGV
        my $chklen = length( $value );
        my @match;            # will hold matching line from each file

        for my $file ( @ARGV ) {   # loop over remaining two ARG's
            open( IN, $file ) or die "$file: $!";
            while (<IN>) {
                if ( substr( $_, 0, $chklen ) eq $value ) {
                    chomp;
                    push @match, $_;
                    last;
                }
            }
            close IN;   # (this was implicit in the earlier version)
        }
        print join( " ", @match ), "\n";

        Note that in either version, if the value you provide on the command line turns out to be shorter than the initial hex number on each line of the input files, there's a chance that you'll get a "false alarm" match.

        For example, in the initial regex version, if the search value on the command line was just "6b" or "00", this could explain why the record from the second file was not right -- "6b" and "00" are found in both records.
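
        If you want to guard against that, one option (a sketch, not part of the version above) is to compare the whole first field instead of a prefix, e.g. replacing the substr test with:

            my ($first) = split ' ', $_;
            if ( defined $first and $first eq $value ) {
                chomp;
                push @match, $_;
                last;
            }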