soura_jagat has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I'm new to Perl and new to this site too, so don't mind my silly question. I have a problem: there are two large files (one is 37 MB, the other 5 MB). For each line of the big file I have to take a substring and compare it with a substring of every line in the small file. Whenever there's a match, the whole line from the big file, appended with some portion of the matching line from the small file, goes to a separate file, say file1; if there is no match, the line goes to another file, say file2. Using a simple, straightforward approach I got this working, but it took around 22-30 minutes, while an equivalent shell+awk+sed script took around 15 minutes. I can't run the shell script on Windows, so I have to do this in Perl. My question is: is there any way (or trick, maybe) by which I can reduce the total processing time considerably? And one more thing: it's best if I can do this with the standard libraries that come with Perl, as installing extra ones on a restricted network could be a problem. Please help!

Replies are listed 'Best First'.
Re: Managing huge data in Perl
by BrowserUk (Patriarch) on Apr 17, 2009 at 14:52 UTC
    I got this working, but the time took around 22-30 minutes.

    Hm. Then you coded it wrong.

    I just ran the following, which reads a 37MB file, extracts one word from the middle of each of its 380,000 lines, looks it up in a hash built from a 250,000-line/5MB file, and writes the matching and non-matching lines to two different files.

    No tricks and no modules. It uses ~13MB of RAM and takes under 4 seconds.

    Of course it doesn't do exactly what you want to do, but you expected to have to do some of the work yourself didn't you?

    #! perl -sw
    use 5.010;
    use strict;

    my %smallBits;

    open SMALL, '<', '758205.small' or die $!;
    $smallBits{ (split)[ 0 ] } = $_ while <SMALL>;
    close SMALL;

    open GOOD, '>', '758205.good' or die $!;
    open BAD,  '>', '758205.bad'  or die $!;

    open BIG, '<', '758205.big' or die $!;
    while( <BIG> ) {
        chomp;
        my $substr = (split)[ 5 ];
        if( exists $smallBits{ $substr } ) {
            say GOOD "$_ : $smallBits{ $substr }";
        }
        else {
            say BAD $_;
        }
    }
    close BIG;
    close GOOD;
    close BAD;

    __END__
    [15:41:05.05] C:\test>758205.pl

    [15:41:08.85] C:\test>dir 758205.*
     Volume in drive C has no label.
     Volume Serial Number is 8C78-4B42

     Directory of C:\test

    17/04/2009  15:41            16,855 758205.bad
    17/04/2009  15:26        37,200,976 758205.big
    17/04/2009  15:41        46,067,745 758205.good
    17/04/2009  15:40               529 758205.pl
    17/04/2009  15:28         5,422,408 758205.small
                   5 File(s)     88,708,513 bytes
                   0 Dir(s)  419,386,847,232 bytes free

    [15:41:35.14] C:\test>wc -l 758205.small
    266000 758205.small

    [15:43:32.02] C:\test>wc -l 758205.big
    380000 758205.big

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I'm trying to figure out why you required Perl 5.10, and am failing.
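      Presumably it's for the `say` builtin used throughout the script, which is only enabled under `use 5.010` (or the `feature` pragma). A quick check from the command line (assuming perl is on your PATH):

```shell
# With 5.10 features enabled, `say` works:
perl -e 'use 5.010; say "ok"'
# prints: ok

# Without it, the same one-liner is a compile error
# ("String found where operator expected"):
perl -e 'say "ok"'
```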
Re: Managing huge data in Perl
by hsinclai (Deacon) on Apr 17, 2009 at 12:56 UTC
Re: Managing huge data in Perl
by marto (Cardinal) on Apr 17, 2009 at 13:04 UTC

    Hi soura_jagat,

    Welcome to the Monastery. Those file sizes are not considered huge by modern standards. Have you written any Perl code to do this? If so, please post what you have (see Writeup Formatting Tips). What does the content of these files look like?

    Hope this helps

    Martin