in reply to modification of the script to consume less memory with higher speed

Start by describing the problem. How many files, how many paragraphs? What is the significance of '\t' in your paragraphs? Must the paragraphs match character-for-character or is only the second line important? Actually, scratch that.

Give us a sample of your "paragraph". Describe what you want to do, not how.


Re^2: modification of the script to consume less memory with higher speed
by Anonymous Monk on Jul 30, 2016 at 05:00 UTC
    I have multiple fastq files in the following format, giving the reads and the number of times each read occurs in a file, separated by tabs:
    data1.txt:
    @NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + =CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt
    @NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt
    @NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt

    data2.txt:
    @NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA</AE<EE 1 :data2.txt
    @NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + AAAAA#E/<EEEEEEEEEEAEEEEEEEEA/EAAEEEEEEEEEEEE/EEEE/A6<E<EEE 2 :data2.txt
    @NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAE 2 :data2.txt
    I want to sum the occurrences of a read across all the files if the second line (the read sequence) of each record matches, i.e. the output for the above two files should look like this:
    @NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + =CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt 1 :data2.txt count:2
    @NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt 2 :data2.txt count:5
    @NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt 2 :data2.txt count:4
    My code works with files up to 10GB, but if the input exceeds that size it hangs. I want my script to handle input of any size. Any help will be appreciated.
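The tally described above is commonly done with a hash keyed on the read sequence. Here is a minimal sketch, not the poster's actual code, assuming one tab-separated record per line with the sequence in the second field and the count in the fifth, as in the samples:

```perl
use strict;
use warnings;

# Sketch only. Assumes each line is one tab-separated record:
#   @header  sequence  +  quality  count  :filename
my (%first_record, %total);

# Fold one record line into the running tallies.
sub add_line {
    chomp(my $l = shift);
    my @f = split /\t/, $l;
    return unless @f >= 5;            # skip malformed lines
    my ($seq, $n) = ($f[1], $f[4]);
    $first_record{$seq} //= $l;       # keep the first full record seen
    $total{$seq} += $n;               # sum counts across all files
}

# One merged output line per distinct read sequence.
sub report {
    return map { "$first_record{$_}\tcount:$total{$_}" } sort keys %total;
}

# Streaming driver: holds one input line at a time.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    add_line($_) while <$fh>;
    close $fh;
}
print "$_\n" for report();
```

Reading with `while (<$fh>)` keeps only one input line in memory at a time, but `%first_record` and `%total` still grow with the number of distinct reads, which is exactly the kind of memory pressure that makes a script stall on very large inputs.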
      My code is working with file upto 10GB but if the file exceed this size it hangs.

      Is that 10GB all the files together, or just one of the files?

      How much memory does your machine have?


        Yes, that 10GB is all the files put together. I want my script to consume less memory because we need to deal with large data files.
      Hi,

      since scalability is one of your top priorities, consider using a key-value database like Redis, MemcacheDB or ...

      You appear to keep the first record that is seen, in full, while subsequent matching records are only tallied by their count. Is that right?

      Now, the remaining question is, do you want the output records to keep the order in which they are processed, or is it acceptable if they appear in random order?

      If any output order will do, then the simplest way to process your job is to divide it into parts. For example, you can dump the records into temporary files according to the first few letters of the key. Let's say the intermediate files are TACA.tmp, CATT.tmp, AGAT.tmp, etc. After that, process each temp file individually, appending the output to the final result. Questions?
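That two-pass scheme can be sketched as follows. This is an illustration, not a tested solution: the record layout is assumed from the samples above, the four-letter prefix is an arbitrary choice (longer prefixes mean smaller buckets), and the bucket files live in a File::Temp directory that is cleaned up automatically:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Pass 1: write each record into a temp file named after the first four
# bases of its read, so the hash in pass 2 only ever holds the distinct
# reads of one bucket at a time. Sequences are ACGTN, so the prefix is
# safe to use as a file name.
sub partition_and_tally {
    my (@input_files) = @_;
    my $dir = tempdir(CLEANUP => 1);
    my %fh;                                   # prefix => output handle
    for my $file (@input_files) {
        open my $in, '<', $file or die "Cannot open $file: $!";
        while (my $l = <$in>) {
            chomp $l;
            my @f = split /\t/, $l;
            next unless @f >= 5;              # skip malformed lines
            my $prefix = substr $f[1], 0, 4;  # e.g. TACA, CATT, AGAT
            unless ($fh{$prefix}) {
                open $fh{$prefix}, '>', "$dir/$prefix.tmp"
                    or die "Cannot create bucket: $!";
            }
            print { $fh{$prefix} } "$l\n";
        }
        close $in;
    }
    close $_ for values %fh;

    # Pass 2: tally each small bucket independently, appending to the
    # final result. Output order is random across buckets, which the
    # poster said is acceptable.
    my @result;
    for my $tmp (glob "$dir/*.tmp") {
        my (%first, %count);
        open my $in, '<', $tmp or die "Cannot open $tmp: $!";
        while (my $l = <$in>) {
            chomp $l;
            my @f = split /\t/, $l;
            $first{$f[1]} //= $l;             # first full record wins
            $count{$f[1]} += $f[4];           # sum counts
        }
        close $in;
        push @result, map { "$first{$_}\tcount:$count{$_}" } sort keys %first;
    }
    return @result;
}
```

Peak memory is now bounded by the largest bucket rather than by the whole input, at the cost of writing the data out once and reading it back.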

        I am sorry, but I am unable to follow your suggestion as I am a beginner in Perl. It would be helpful if you could explain it with an example or a modification to my script if possible. The output records are acceptable in random order, but the complete second line should match across all files, with the count given accordingly.