TheFarsicle has asked for the wisdom of the Perl Monks concerning the following question:

Hello perlmonks,

I am a newbie to Perl and am working on a Perl script to perform an action similar to Excel's VLOOKUP.

So,

As input I have some large text files, around 200 MB each. These text files are to be searched for all the records present in another file, say Reference.txt (this file is normally not more than 1 MB).

I have written a script to find all the lines in these large files that contain the text (string values) listed in Reference.txt. All the matching records are then written to a new file for each large file processed.

The script works fine for files of normal size, like 30-40 MB, but when the file grows beyond 100 MB or so, it throws an out-of-memory error.

I have written these operations as subroutines and call them from the main script.

The code goes something like this...

open (FILE, $ReferenceFilePath) or die "Can't open file";
chomp (@REFFILELIST = (<FILE>));
open OUTFILE, ">$OUTPUTFILE" or die $!;
foreach my $line (@REFFILELIST) {
    open (LARGEFILE, $LARGESIZEDFILE) or die "Can't open File";
    while (<LARGEFILE>) {
        my $Result = index($_, $line);
        if ($Result > 0) {
            open(my $FDH, ">>$OUTPUTFILE");
            print $FDH $_;
        }
    }
    close(LARGEFILE);
}
close(OUTFILE);
close(FILE);

Can you please guide me on where I am going wrong and what would be the best way to address this issue?

Thanks in advance.

FR

Replies are listed 'Best First'.
Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by marinersk (Priest) on Apr 24, 2015 at 14:00 UTC

    I'd second GotToBTru on the file open/close thing.

    I had expected to see you slurping the files, but you only slurp the file containing the list of files to examine, so that's not it.

    You definitely need to close $FDH (the >>$OUTPUTFILE handle) inside the same braces where it's opened. I'd be concerned about you possibly causing unexpected buffering, especially as you continually re-open it.
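    Roughly what I mean, reusing the names from your pseudo-code (untested, and purely a sketch of the structure):

        while (<LARGEFILE>) {
            # Note: index() returns 0 for a match at the very start of the
            # line and -1 for no match, so testing >= 0 is the safe check.
            if (index($_, $line) >= 0) {
                open(my $FDH, '>>', $OUTPUTFILE) or die "Can't append to $OUTPUTFILE: $!";
                print $FDH $_;
                close($FDH);    # close in the same braces where it was opened
            }
        }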

    Nothing else jumps out at me -- but then, it seems pretty evident you have quickly typed pseudo-Perl and not shown us your actual code. One small typo in your actual script could cause the issue and be completely hidden in this summarization.

    I'd suggest moving the close to its proper location, and if the problem persists, post actual code demonstrating the problem (with such large data files, we'll probably have to forgo the usual request for input data -- maybe a few lines as a sample?).

Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by GotToBTru (Prior) on Apr 24, 2015 at 13:24 UTC

    Consider how, when and where you open and close files. I don't understand the purpose of $FDH. You repeatedly re-open LARGEFILE but don't close it until the end. I can't point to anything I know is causing your memory error, but I see a general sloppiness, and if that isn't causing this error, it is going to cause another one later!

    Dum Spiro Spero
Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by thargas (Deacon) on Apr 24, 2015 at 17:52 UTC

    Rather than read $LARGEFILE once for each line in $REFFILELIST, wouldn't it be more efficient to read it once and check each line against each line of $REFFILELIST?

    Something like:

    open (FILE, $ReferenceFilePath) or die "Can't open file";
    chomp (@REFFILELIST = (<FILE>));
    close(FILE);
    open OUTFILE, ">$OUTPUTFILE" or die $!;
    open (LARGEFILE, $LARGESIZEDFILE) or die "Can't open File";
    while (<LARGEFILE>) {
        foreach my $line (@REFFILELIST) {
            print OUTFILE $_ if index($_, $line) >= 0;
        }
    }
    close(LARGEFILE);
    close(OUTFILE);

    N.B. untested since the original is incomplete and doesn't provide any data.

      Hi. With the limited information given, I have put together a script. Maybe you can use it to improve yours.

      use strict;
      use warnings;

      open( my $fh, '<', "input.txt" ) or die "Cannot open input file: $!";
      chomp ( my @input_data = <$fh> );
      close($fh);

      open( my $frh, '<', "reference.txt" ) or die "Cannot open reference file: $!";
      chomp ( my @ref_data = <$frh> );
      close ($frh);

      my @output = map {
          my $value = $_;
          grep { $value eq $_ } @ref_data;
      } @input_data;

      open ( my $wh, '>', "output.txt" ) or die ( "Cannot open the output file. $!");
      print {$wh} "$_\n" for @output;    # add back the newline that chomp removed
      close($wh);

      Oh, sheesh, thargas -- your post made me realize I'd missed something basic in the original post. The first file he opens isn't the list of files -- it's the list of strings.

      On a gut feeling I'd say he's buffering a Cartesian product of lines per file x lines in REFFILE. Can't prove it without the actual source code -- but it sure would fit the memory consumption pattern being presented.

      This only enhances what everyone has been saying -- post the actual code, not this mock-up of it -- there's something structurally wrong and we'll need to see the steel to find the rust.

Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by Marshall (Canon) on Apr 25, 2015 at 00:45 UTC
    I am not sure about what you are trying to accomplish.
    A few data lines and expected output would help quite a bit!
    Here is one problem that I see:
    #!/usr/bin/perl -w
    use strict;

    my $inPath    = "somepath";
    my $outPath   = "outfile";
    my $largefile = "large file";

    open (INFILE,  "<", $inPath)    or die "Can't open $inPath for read";
    open (OUTFILE, ">", $outPath)   or die "Can't open $outPath for write";
    open (LARGE,   ">", $largefile) or die "Can't open $largefile for write"; # I don't see why this is necessary!

    # do something here....
    while (<INFILE>) {
        # ....
        print OUTFILE "some_data\n";
    }

    # In general, open the files that you need to use at the beginning
    # of the program and then use those file handles.
    # A "re-open" of a file handle for append is a very "expensive" thing
    # within a loop in terms of performance. Don't do that.

      Thanks Marshall for the reply.

      Actually, the requirement is more like the VLOOKUP functionality in Excel, except that instead of two columns, we have two files.

      There is a file A, which is large, in the range of 150-200 MB. File A contains information about work orders (Order No, Order Name, Supplier No, Supplier Name, Created Date, and so on).

      There is another file B, which contains only the Supplier Nos for a particular region. This file is generally less than 1 MB, around 700 KB or so.

      Now, I have to write to file C (a new file, a kind of output file) those records from file A whose Supplier No matches a Supplier No in file B.

      So, if you look at the code that I have written, I read the file B contents into a list, and then for each Supplier No in file B, I iterate through the large file A line by line and check whether the Supplier No is present in the line. If so, I write the line to file C.

      Can you please suggest where I am going wrong?

        I read the file B contents into a list, and then for each Supplier No in file B, I iterate through the large file A line by line and check whether the Supplier No is present in the line. If so, I write the line to file C.

        You are doing it the wrong way around. You are having to process your entire 200MB fileA, for every line in fileB. That's O(N²).

        Guessing that your fileB contains 10-digit Supplier No records (roughly 70,000 of them in a 700KB file), your processing will end up reading 70,000 * 200MB ~= 14 terabytes (14,000 GB). Very slow.

        Now invert your logic. Place the Supplier Nos from fileB into a hash.

        Then read a line from fileA, extract the Supplier No and look to see whether it exists in the hash (an O(1) operation); if it does, write the record to fileC.

        This way you read fileB once and fileA once. Just 201MB to read from disk, and ~ 70,000x faster.
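        A minimal sketch of that approach, assuming (since we haven't seen real data) that fileA is tab-separated with the Supplier No in the third column, and using made-up file names -- adjust the extraction and names to your actual layout (untested):

            use strict;
            use warnings;

            # Build the lookup hash from the small reference file (fileB).
            my %suppliers;
            open my $refFH, '<', 'fileB.txt' or die "Can't open fileB.txt: $!";
            while ( my $supplier = <$refFH> ) {
                chomp $supplier;
                $suppliers{ $supplier } = 1;    # one hash key per Supplier No
            }
            close $refFH;

            # Single pass over the large file (fileA), O(1) lookup per record.
            open my $bigFH, '<', 'fileA.txt' or die "Can't open fileA.txt: $!";
            open my $outFH, '>', 'fileC.txt' or die "Can't open fileC.txt: $!";
            while ( my $record = <$bigFH> ) {
                # ASSUMPTION: tab-separated fields, Supplier No in the 3rd column.
                my $supplierNo = ( split /\t/, $record )[2];
                print {$outFH} $record
                    if defined $supplierNo and exists $suppliers{ $supplierNo };
            }
            close $bigFH;
            close $outFH;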


        I think that I posted a relevant reply.

        In general you want to read the input file(s) once. That is because this is an "expensive" operation in terms of I/O performance.

        If you wind up with a scenario where for each line of an input file B, you have to re-read each line of input file A, that is very inefficient. And it will take a lot of MIPs (N*N).

Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by Laurent_R (Canon) on Apr 24, 2015 at 17:51 UTC
    This is not immediately related to your question, but if we knew more about your data (both the large file and the reference file), we might be able to suggest a solution where you would not need to read the large file so many times, but only once, leading to much better performance.

    Je suis Charlie.
      This code will not compile - not enough code shown.
      use strict;
      use warnings;
      # REFFILELIST is not the same as $ReferenceFilePath
      What the OP wrote:
      open (FILE, $ReferenceFilePath) or die "Can't open file";
      chomp (@REFFILELIST = (<FILE>));
      The correct way is to iterate over the opened input largefile file handle.

      while (<FILE>) {
          chomp;
          # do something
      }
        $ReferenceFilePath is the name of the file, FILE is the file handle opened on this file, and @REFFILELIST is the array in which to store the lines of this file. Even though I would rather use lexical file handles and the three-argument syntax for open, I do not consider the syntax of this part of the code to be really wrong (just outdated and slightly deprecated).

        I also do not consider storing the reference data in an array to be wrong (a hash might be better, but we don't know enough about the data to be sure). But storing the data in an array and then looping over that array is not good; it would be better to iterate over the lines. My view is that it makes sense to store the reference data in memory if you then iterate over the large file and use the in-memory data structure to look something up. But again, we don't know enough about the data and about the real intent of the program.

        Je suis Charlie.
Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by Anonymous Monk on Apr 24, 2015 at 16:08 UTC

    Nothing jumps out at me either. But an easy change would be to open each file just once at the beginning, rather than every time you access it. Aside from moving the open() statements, the only change I think this entails is doing a

    seek LARGEFILE, 0, 0;

    at the beginning of the loop that traverses the records in @REFFILELIST.

    You might get a speed improvement by restructuring your script to only read through LARGEFILE once. But this changes the order of your output, and maybe you need the output in the order your script provides.

    This is not relevant to your problem as far as I can tell, but if you plan to develop your Perl skills, you might want to develop the habit of using three-argument open() and lexical file handles everywhere (like you did for your output file).
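    Roughly what that restructuring might look like, keeping the original loop order and the variable names from the pseudo-code above (a sketch only, untested):

        open my $ref_fh, '<', $ReferenceFilePath or die "Can't open $ReferenceFilePath: $!";
        chomp( my @REFFILELIST = <$ref_fh> );
        close $ref_fh;

        open my $large_fh, '<', $LARGESIZEDFILE or die "Can't open $LARGESIZEDFILE: $!";
        open my $out_fh,   '>', $OUTPUTFILE     or die "Can't open $OUTPUTFILE: $!";

        foreach my $line (@REFFILELIST) {
            seek $large_fh, 0, 0;    # rewind instead of re-opening the large file
            while (<$large_fh>) {
                print {$out_fh} $_ if index( $_, $line ) >= 0;
            }
        }

        close $large_fh;
        close $out_fh;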

Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
by Anonymous Monk on Apr 24, 2015 at 17:47 UTC
    You have obviously provided some sanitized code here. Post some actual code that exhibits the issue.