in reply to Re: segmentation fault (core dumped!)
in thread segmentation fault (core dumped!)

My input file (2.txt) is nearly 3 GB, and 1.txt would be nearly 1 GB.

Re^3: segmentation fault (core dumped!)
by davido (Cardinal) on Jul 03, 2012 at 05:49 UTC

    The code, as you have it, is reading the entire "2.txt" file into memory, and then making another copy of it in memory as it's converted from an array to a scalar. So your memory footprint is a lot bigger than it has to be. But depending on your system, it may not help to simply avoid making that second copy. You may need to come up with an algorithm that doesn't pull the entire 3 GB file into memory all at once.

    Here are three distinct alternatives that you might consider:

    • Find a way to process 2.txt in chunks (see the first sketch after this list).
    • As you read 2.txt into memory, convert ATGC from bytes to bits; if A=0, T=1, G=2, C=3, then you can store each base in two bits instead of eight (see the second sketch after this list).
    • Keep 2.txt on disk, and do a lot of seeking and telling.
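
    For the "chunks" option, something along these lines keeps memory use at the size of a single buffer no matter how big 2.txt grows. This is only a sketch: process_chunk() is a made-up placeholder, and whether chunking works at all depends on whether your algorithm can tolerate chunk boundaries.

        use strict;
        use warnings;

        open my $fh, '<', '2.txt' or die "Can't open 2.txt: $!";
        my $bufsize = 16 * 1024 * 1024;            # 16 MB per read
        my $buf;
        while ( read( $fh, $buf, $bufsize ) ) {
            $buf =~ tr/\n//d;                      # drop newlines if you want raw sequence
            process_chunk($buf);                   # hypothetical per-chunk work
        }
        close $fh;

        sub process_chunk { my ($chunk) = @_; }    # stub for illustration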
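
    For the two-bits-per-base option, Perl's vec() does the packing for you. Again, just a sketch, using the A=0, T=1, G=2, C=3 mapping from the list above; pack_dna() is a name I made up for illustration.

        use strict;
        use warnings;

        my %code = ( A => 0, T => 1, G => 2, C => 3 );

        sub pack_dna {
            my ($seq) = @_;
            my $packed = '';
            my $i      = 0;
            for my $base ( split //, uc $seq ) {
                next unless exists $code{$base};     # skip newlines, ambiguity codes, etc.
                vec( $packed, $i++, 2 ) = $code{$base};
            }
            return ( $packed, $i );                  # packed string plus base count
        }

        my ( $packed, $n ) = pack_dna('ATGCATGC');
        my @base = ( 'A', 'T', 'G', 'C' );
        print join( '', map { $base[ vec( $packed, $_, 2 ) ] } 0 .. $n - 1 ), "\n";
        # prints ATGCATGC, held in 2 bytes instead of 8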

    There are surely other strategies, but these are at least options you can consider.

    Each of these has implications with respect to complexity and performance. You know more about your problem than we do, and frankly, I'm not too interested in implementing a seek/tell or transcoding solution for you. But both are possible (albeit a pain in the backside).
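
    Just to show the shape of a seek/tell version without fleshing it out: 2.txt stays on disk and you only ever hold one window of it in memory. The read_window() helper and the offsets here are invented for illustration.

        use strict;
        use warnings;

        open my $fh, '<', '2.txt' or die "Can't open 2.txt: $!";

        sub read_window {
            my ( $fh, $offset, $length ) = @_;
            seek $fh, $offset, 0 or die "seek failed: $!";     # 0 means SEEK_SET
            defined( read $fh, my $buf, $length ) or die "read failed: $!";
            return $buf;
        }

        my $window = read_window( $fh, 1_000_000, 4096 );      # 4 KB starting at byte 1e6
        print length($window), " bytes read; file position is now ", tell($fh), "\n";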


    Dave

      Using sed will make chunks in a few minutes, but the thing is I need to have the entire data, and the server has 512 GB of memory. Is there any problem if I store everything (2.txt, containing 3 GB of data) in a single scalar variable? Can you just check my code?

        Well, I thought it was clear that I did check your code. The issue of not checking the return value of open could be allowing a silent failure, but that wouldn't be anything like a core dump. I asked what error message you were getting, and you haven't answered that yet. I'm assuming, given the size of the files, that you're getting an "Out of memory!" error. Even if the server has copious amounts of RAM, a 32-bit build of Perl can't address more than 2 GB (I think). A 64-bit build shouldn't have that restriction. So you could probably get your script to run under a 64-bit Perl if it's built right.
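
        Two quick checks to rule those possibilities out: verify open's return value, and verify that your perl is a 64-bit build (the Config module below shows one way; running perl -V:ptrsize from the shell works too).

            use strict;
            use warnings;
            use Config;

            open my $fh, '<', '2.txt' or die "Can't open 2.txt: $!";   # no more silent failures

            print "pointer size: $Config{ptrsize} bytes\n";            # 8 on a 64-bit perl
            print "use64bitall:  ", ( $Config{use64bitall} || 'no' ), "\n";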

        I provided a suggestion for minimizing the memory footprint (I even supplied some code demonstrating how), by eliminating a second in-memory copy of the data, and by storing the large file only in a single scalar rather than in an array and a scalar. That's a bigger savings than you might think, because each array element consumes as much memory as a scalar (which is more than a dozen bytes each). By eliminating the array altogether and holding the data in a single scalar you're reducing your memory footprint to about the same size as the file itself, plus a relatively small amount of overhead.
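
        Roughly, the single-scalar version looks like this (a re-sketch of the idea, not necessarily the exact code from the earlier reply):

            use strict;
            use warnings;

            my $dna = do {
                open my $fh, '<', '2.txt' or die "Can't open 2.txt: $!";
                local $/;                  # slurp mode: read the whole file in one go
                <$fh>;
            };
            $dna =~ tr/\n//d;              # drop newlines if you need one contiguous sequence
            print length($dna), " bytes held in a single scalar\n";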

        You mentioned you need to have the entire data. So I'll assume that you've done your research and your due diligence, and that there really is no algorithm that would allow you to work on the data in chunks instead of all at once. That's fine. So if using a 64-bit Perl still doesn't give you enough wiggle room, then you have to start looking at a random-access file (seek/tell), or transcoding (converting each byte to its smallest possible representation, possibly two bits per [ACGT]).


        Dave

        Even if the server does actually have 512 GB of RAM, you may only have access to a portion of that. You may want to check with your sysadmin to see whether there are per-user or per-process limits on resource usage.