in reply to Re: Out of Memory Error : V-Lookup on Large Sized TEXT File
in thread Out of Memory Error : V-Lookup on Large Sized TEXT File

Thanks Marshall for the reply.

Actually, the requirement is like the V-Lookup functionality in Excel, except that instead of two columns we have two files.

That is, there is a file A, which is large (in the range of 150-200 MB). File A contains information about work orders (Order No, Order Name, Supplier No, Supplier Name, Created Date, and so on).

There is another file B, which contains only the Supplier Nos for a particular region. This file is generally smaller than 1 MB, around 700 KB.

Now, I have to write to file C (a new output file) those records of file A whose Supplier No matches a Supplier No in file B.

So, if you look at the code that I have written, I read the contents of file B into a list, and then for each Supplier No in file B, I iterate over the large file A line by line and check whether the Supplier No is present in the line. If so, I write the line to file C.

Can you please suggest where I am going wrong?


Re^3: Out of Memory Error : V-Lookup on Large Sized TEXT File
by BrowserUk (Patriarch) on May 03, 2015 at 09:32 UTC
    I take the file B contents in a list & then for each Supplier No in file B, I iterate the large file A line by line & check if Supplier No is present in the line. If so, I write the line into file C.

    You are doing it the wrong way around. You have to process your entire 200MB fileA for every line in fileB. That's O(N^2).

    Guessing that your fileB contains 10-digit Supplier No records, 700KB is roughly 70,000 records, so your processing will end up reading 70,000 * 200MB ~= 14 terabytes (14,000GB). Very slow.
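
The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: the record size is the guess stated in the post (10 bytes per record, line endings ignored), the file sizes are the approximate figures from the question, and the variable names are made up for illustration.

```python
# Back-of-envelope estimate of the data read by the nested-loop approach.
# All three figures are assumptions taken from the thread, not measurements.
FILE_B_BYTES = 700 * 1000   # fileB is about 700 KB
BYTES_PER_RECORD = 10       # guessed: one 10-digit Supplier No per record
FILE_A_MB = 200             # fileA is up to about 200 MB

# Number of Supplier Nos in fileB, i.e. how many times fileA gets re-read.
records_in_b = FILE_B_BYTES // BYTES_PER_RECORD

# Total volume read when fileA is scanned once per Supplier No.
total_mb_read = records_in_b * FILE_A_MB
total_tb_read = total_mb_read / 1_000_000

print(records_in_b)    # 70000
print(total_tb_read)   # 14.0
```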

    Now invert your logic. Place the Supplier Nos from fileB into a hash.

    Then read a line from fileA, extract the Supplier No, and check whether it exists in the hash (an O(1) lookup). If it does, write the record to fileC.

    This way you read fileB once and fileA once. Just 201MB to read from disk, and ~ 70,000x faster.
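
The inverted logic might look roughly like the sketch below. It is written in Python, where a `set` plays the role of the hash. The thread does not state the layout of fileA, so the `|` delimiter and the position of the Supplier No column are assumptions, and `filter_orders` is a hypothetical name.

```python
def filter_orders(orders_path, suppliers_path, output_path,
                  delimiter="|", supplier_column=2):
    """Copy lines of the orders file whose Supplier No is in the suppliers file.

    Assumptions (not from the thread): fields are separated by `delimiter`
    and the Supplier No is the field at index `supplier_column`.
    Returns the number of lines written.
    """
    # Read the small file once and keep its Supplier Nos in a set,
    # which gives constant-time membership tests on average.
    with open(suppliers_path) as suppliers:
        wanted = {line.strip() for line in suppliers if line.strip()}

    matched = 0
    # Stream the large file once, line by line, so memory use stays flat.
    with open(orders_path) as orders, open(output_path, "w") as out:
        for line in orders:
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) > supplier_column and fields[supplier_column].strip() in wanted:
                out.write(line)
                matched += 1
    return matched
```

Only the Supplier Nos from the small file are held in memory. The large file is never loaded whole, which should also avoid the out-of-memory condition in the thread title.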


Re^3: Out of Memory Error : V-Lookup on Large Sized TEXT File
by Marshall (Canon) on Jun 07, 2015 at 12:18 UTC
    I think that I posted a relevant reply.

    In general you want to read the input file(s) once. That is because this is an "expensive" operation in terms of I/O performance.

    If you wind up with a scenario where, for each line of input file B, you have to re-read every line of input file A, that is very inefficient. It will also take a lot of CPU time (N*N).
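
To make the N*N cost concrete, here is a small in-memory simulation that counts how many lines each strategy touches. It is a sketch with synthetic data: the comma-separated layout, the function names, and the sizes (1,000 lines in A, 100 keys in B) are all made up for illustration.

```python
def nested_scan(a_lines, b_keys):
    """Re-scan every line of A for every key of B; return (reads, matches)."""
    reads, matches = 0, []
    for key in b_keys:
        for line in a_lines:
            reads += 1                      # one line of A read again
            if line.split(",")[0] == key:
                matches.append(line)
    return reads, matches


def single_pass(a_lines, b_keys):
    """Read B once into a set, then read A once; return (reads, matches)."""
    wanted = set(b_keys)
    reads = len(b_keys)                     # each key of B is read once
    matches = []
    for line in a_lines:
        reads += 1                          # each line of A is read once
        if line.split(",")[0] in wanted:
            matches.append(line)
    return reads, matches


# Synthetic input: the first field stands in for the Supplier No.
a_lines = [f"S{i % 500},order-{i}" for i in range(1000)]
b_keys = [f"S{i}" for i in range(100)]

nested_reads, nested_matches = nested_scan(a_lines, b_keys)
single_reads, single_matches = single_pass(a_lines, b_keys)

print(nested_reads)   # 100000
print(single_reads)   # 1100
```

Both strategies find the same records. The nested version emits them grouped by key, while the single pass keeps them in file A order.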

Re^3: Out of Memory Error : V-Lookup on Large Sized TEXT File
by Marshall (Canon) on May 09, 2015 at 02:40 UTC