TheFarsicle has asked for the wisdom of the Perl Monks concerning the following question:

Hello perlmonks,

I am a newbie to Perl and am working on a Perl script to perform an action similar to Excel's VLOOKUP.

So,

As input I have some large text files, around 200 MB each. These text files are to be searched for all the records present in another file, say Reference.txt (this file is normally not more than 1 MB).

I have written a script to find all the lines in these large files that contain the text (string values) listed in Reference.txt. All the matching records are then written to a new file for each large file processed.

The script works fine for files of normal size, like 30-40 MB, but when the file grows beyond 100 MB or so, it throws an out-of-memory error.

I have written these operations as subroutines and call them from the main script.

The code goes something like this...

open (FILE, $ReferenceFilePath) or die "Can't open file";
chomp (@REFFILELIST = (<FILE>));
open OUTFILE, ">$OUTPUTFILE" or die $!;
foreach my $line (@REFFILELIST) {
    open (LARGEFILE, $LARGESIZEDFILE) or die "Can't open File";
    while (<LARGEFILE>) {
        my $Result = index($_, $line);
        if ($Result > 0) {
            open(my $FDH, ">>$OUTPUTFILE");
            print $FDH $_;
        }
    }
    close(LARGEFILE);
}
close(OUTFILE);
close(FILE);

Can you please guide me on where I am going wrong and what would be the best way to address this issue?

Thanks in advance.

FR

Replies are listed 'Best First'.
Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by marinersk (Priest) on Apr 24, 2015 at 14:00 UTC

    I'd second GotToBTru on the file open/close thing.

    I had expected to see you slurping the files, but you only slurp the file containing the list of files to examine, so that's not it.

    You definitely need to close $FDH (the >>$OUTPUTFILE handle) inside the same braces where it's opened. I'd be concerned about you possibly causing unexpected buffering, especially as you continually re-open it.
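    Roughly what I mean, reusing the names from your pseudo-code (untested, and purely a sketch of the structure):

        while (<LARGEFILE>) {
            # Note: index() returns 0 for a match at the very start of the
            # line and -1 for no match, so testing >= 0 is the safe check.
            if (index($_, $line) >= 0) {
                open(my $FDH, '>>', $OUTPUTFILE) or die "Can't append to $OUTPUTFILE: $!";
                print $FDH $_;
                close($FDH);    # close in the same braces where it was opened
            }
        }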

    Nothing else jumps out at me -- but then, it seems pretty evident you have quickly typed pseudo-Perl and not shown us your actual code. One small typo in your actual script could cause the issue and be completely hidden in this summarization.

    I'd suggest moving the close to its proper location, and if the problem persists, post actual code demonstrating the problem (with such large data files, we'll probably have to forgo the usual request for input data -- maybe a few lines as a sample?).

Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by GotToBTru (Prior) on Apr 24, 2015 at 13:24 UTC

    Consider how, when and where you open and close files. I don't understand the purpose of $FDH. You repeatedly re-open LARGEFILE but don't close it until the end. I can't point to anything I know is causing your memory error, but I see a general sloppiness, and if that isn't causing this error, it is going to cause another one later!

    Dum Spiro Spero
Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by thargas (Deacon) on Apr 24, 2015 at 17:52 UTC

    Rather than read $LARGEFILE once for each line in $REFFILELIST, wouldn't it be more efficient to read it once and check each line against each line of $REFFILELIST?

    Something like:

    open (FILE, $ReferenceFilePath) or die "Can't open file";
    chomp (@REFFILELIST = (<FILE>));
    close(FILE);
    open OUTFILE, ">$OUTPUTFILE" or die $!;
    open (LARGEFILE, $LARGESIZEDFILE) or die "Can't open File";
    while (<LARGEFILE>) {
        foreach my $line (@REFFILELIST) {
            print OUTFILE $_ if index($_, $line) >= 0;
        }
    }
    close(LARGEFILE);
    close(OUTFILE);

    N.B. untested since the original is incomplete and doesn't provide any data.

      Hi. With the limited information given, I have put together a script. Maybe you can use it to improve yours.

      use strict;
      use warnings;

      open( my $fh, '<', "input.txt" ) or die "Cannot open input file: $!";
      chomp ( my @input_data = <$fh> );
      close($fh);

      open( my $frh, '<', "reference.txt" ) or die "Cannot open reference file: $!";
      chomp ( my @ref_data = <$frh> );
      close ($frh);

      my @output = map {
          my $value = $_;
          grep { $value eq $_ } @ref_data;
      } @input_data;

      open ( my $wh, '>', "output.txt" ) or die ( "Cannot open the output file. $!");
      print {$wh} "$_\n" for @output;    # add back the newline that chomp removed
      close($wh);

      Oh, sheesh, thargas -- your post made me realize I'd missed something basic in the original post. The first file he opens isn't the list of files -- it's the list of strings.

      On a gut feeling I'd say he's buffering a Cartesian product of lines per file x lines in REFFILE. Can't prove it without the actual source code -- but it sure would fit the memory consumption pattern being presented.

      This only enhances what everyone has been saying -- post the actual code, not this mock-up of it -- there's something structurally wrong and we'll need to see the steel to find the rust.

Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by Marshall (Canon) on Apr 25, 2015 at 00:45 UTC
    I am not sure about what you are trying to accomplish.
    A few data lines and expected output would help quite a bit!
    Here is one problem that I see:
    #!/usr/bin/perl -w
    use strict;

    my $inPath    = "somepath";
    my $outPath   = "outfile";
    my $largefile = "large file";

    open (INFILE,  "<", $inPath)    or die "Can't open $inPath for read";
    open (OUTFILE, ">", $outPath)   or die "Can't open $outPath for write";
    open (LARGE,   ">", $largefile) or die "Can't open $largefile for write"; # I don't see why this is necessary!

    # do something here....
    while (<INFILE>) {
        # ....
        print OUTFILE "some_data\n";
    }

    # In general, open the files that you need to use at the beginning
    # of the program and then use those file handles.
    # A "re-open" of a file handle for append is a very "expensive" thing
    # within a loop in terms of performance. Don't do that.

      Thanks Marshall for the reply.

      Actually, the requirement is more like the VLOOKUP functionality in Excel, except that instead of two columns, we have two files.

      There is a file A, which is large, in the range of 150-200 MB. File A contains information about work orders (Order No, Order Name, Supplier No, Supplier Name, Created Date, and so on).

      There is another file B, which contains only the Supplier Nos for a particular region. This file is generally less than 1 MB, around 700 KB or so.

      Now, I have to write to file C (a new file, a kind of output file) those records from file A whose Supplier No matches a Supplier No in file B.

      So, if you look at the code that I have written, I read the file B contents into a list, and then for each Supplier No in file B, I iterate through the large file A line by line and check whether the Supplier No is present in the line. If so, I write the line to file C.

      Can you please suggest where I am going wrong?

        I read the file B contents into a list, and then for each Supplier No in file B, I iterate through the large file A line by line and check whether the Supplier No is present in the line. If so, I write the line to file C.

        You are doing it the wrong way around. You are having to process your entire 200MB fileA, for every line in fileB. That's O(N²).

        Guessing that your fileB contains 10-digit Supplier No records (roughly 70,000 of them in a 700KB file), your processing will end up reading 70,000 * 200MB ~= 14 terabytes (14,000 GB). Very slow.

        Now invert your logic. Place the Supplier Nos from fileB into a hash.

        Then read a line from fileA, extract the Supplier No and look to see whether it exists in the hash (an O(1) operation); if it does, write the record to fileC.

        This way you read fileB once and fileA once. Just 201MB to read from disk, and ~ 70,000x faster.
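        A minimal sketch of that approach, assuming (since we haven't seen real data) that fileA is tab-separated with the Supplier No in the third column, and using made-up file names -- adjust the extraction and names to your actual layout (untested):

            use strict;
            use warnings;

            # Build the lookup hash from the small reference file (fileB).
            my %suppliers;
            open my $refFH, '<', 'fileB.txt' or die "Can't open fileB.txt: $!";
            while ( my $supplier = <$refFH> ) {
                chomp $supplier;
                $suppliers{ $supplier } = 1;    # one hash key per Supplier No
            }
            close $refFH;

            # Single pass over the large file (fileA), O(1) lookup per record.
            open my $bigFH, '<', 'fileA.txt' or die "Can't open fileA.txt: $!";
            open my $outFH, '>', 'fileC.txt' or die "Can't open fileC.txt: $!";
            while ( my $record = <$bigFH> ) {
                # ASSUMPTION: tab-separated fields, Supplier No in the 3rd column.
                my $supplierNo = ( split /\t/, $record )[2];
                print {$outFH} $record
                    if defined $supplierNo and exists $suppliers{ $supplierNo };
            }
            close $bigFH;
            close $outFH;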


        I think that I posted a relevant reply.

        In general you want to read the input file(s) once. That is because this is an "expensive" operation in terms of I/O performance.

        If you wind up with a scenario where for each line of an input file B, you have to re-read each line of input file A, that is very inefficient. And it will take a lot of MIPs (N*N).

Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by Laurent_R (Canon) on Apr 24, 2015 at 17:51 UTC
    This is not immediately related to your question, but if we knew more about your data (both the large file and the reference file), we might be able to suggest a solution where you would not need to read the large file so many times, but only once, leading to much better performance.

    Je suis Charlie.
      This code will not compile - not enough code shown.
      use strict;
      use warnings;
      # REFFILELIST is not the same as $ReferenceFilePath
      What the OP wrote:
      open (FILE, $ReferenceFilePath) or die "Can't open file";
      chomp (@REFFILELIST = (<FILE>));
      The correct way is to iterate over the opened input largefile file handle.

      while (<FILE>) {
          chomp;
          # do something
      }
        $ReferenceFilePath is the name of the file, FILE is the file handle opened on this file, and @REFFILELIST is the array in which to store the lines of this file. Even though I would rather use lexical file handles and the three-argument syntax for open, I do not consider the syntax of this part of the code to be really wrong (just outdated and slightly deprecated).

        I also do not consider storing the reference data in an array to be wrong (a hash might be better, but we don't know enough about the data to be sure). But storing the data in an array and then looping over that array is not good; it would be better to iterate over the lines. My view is that it makes sense to store the reference data in memory if you then iterate over the large file and use the in-memory data structure to look something up. But again, we don't know enough about the data and about the real intent of the program.

        Je suis Charlie.
Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by Anonymous Monk on Apr 24, 2015 at 16:08 UTC

    Nothing jumps out at me either. But an easy change would be to open each file just once at the beginning, rather than every time you access it. Aside from moving the open() statements, the only change I think this entails is doing a

    seek LARGEFILE, 0, 0;

    at the beginning of the loop that traverses the records in @REFFILELIST.

    You might get a speed improvement by restructuring your script to only read through LARGEFILE once. But this changes the order of your output, and maybe you need the output in the order your script provides.

    This is not relevant to your problem as far as I can tell, but if you plan to develop your Perl skills, you might want to develop the habit of using three-argument open() and lexical file handles everywhere (like you did for your output file).
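    Roughly what that restructuring might look like, keeping the original loop order and the variable names from the pseudo-code above (a sketch only, untested):

        open my $ref_fh, '<', $ReferenceFilePath or die "Can't open $ReferenceFilePath: $!";
        chomp( my @REFFILELIST = <$ref_fh> );
        close $ref_fh;

        open my $large_fh, '<', $LARGESIZEDFILE or die "Can't open $LARGESIZEDFILE: $!";
        open my $out_fh,   '>', $OUTPUTFILE     or die "Can't open $OUTPUTFILE: $!";

        foreach my $line (@REFFILELIST) {
            seek $large_fh, 0, 0;    # rewind instead of re-opening the large file
            while (<$large_fh>) {
                print {$out_fh} $_ if index( $_, $line ) >= 0;
            }
        }

        close $large_fh;
        close $out_fh;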

Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by Anonymous Monk on Apr 24, 2015 at 17:47 UTC
    You have obviously provided some sanitized code here. Post some actual code that exhibits the issue.