in reply to modification of the script to consume less memory with higher speed

Start by describing the problem. How many files, how many paragraphs? What is the significance of '\t' in your paragraphs? Must the paragraphs match character-for-character or is only the second line important? Actually, scratch that.

Give us a sample of your "paragraph". Describe what you want to do, not how.


Re^2: modification of the script to consume less memory with higher speed
by Anonymous Monk on Jul 30, 2016 at 05:00 UTC
    I have multiple fastq files in the following format, giving the reads and the number of times each read occurs in a file, separated by tabs:
    data1.txt:
    @NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + =CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt
    @NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt
    @NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt

    data2.txt:
    @NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA</AE<EE 1 :data2.txt
    @NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + AAAAA#E/<EEEEEEEEEEAEEEEEEEEA/EAAEEEEEEEEEEEE/EEEE/A6<E<EEE 2 :data2.txt
    @NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAE 2 :data2.txt
    I want to sum the occurrences of a read across all the files if the second line (the read sequence) of each record matches, i.e. the output for the above two files should look like this:
    @NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + =CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt 1 :data2.txt count:2
    @NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt 2 :data2.txt count:5
    @NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt 2 :data2.txt count:4
    My code works with files up to 10GB, but if the input exceeds that size it hangs. I want my script to handle input of any size. Any help will be appreciated.
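The tally described above is commonly done with a hash keyed on the read sequence. Here is a minimal sketch, not the poster's actual code, assuming one tab-separated record per line with the sequence in the second field and the count in the fifth, as in the samples:

```perl
use strict;
use warnings;

# Sketch only. Assumes each line is one tab-separated record:
#   @header  sequence  +  quality  count  :filename
my (%first_record, %total);

# Fold one record line into the running tallies.
sub add_line {
    chomp(my $l = shift);
    my @f = split /\t/, $l;
    return unless @f >= 5;            # skip malformed lines
    my ($seq, $n) = ($f[1], $f[4]);
    $first_record{$seq} //= $l;       # keep the first full record seen
    $total{$seq} += $n;               # sum counts across all files
}

# One merged output line per distinct read sequence.
sub report {
    return map { "$first_record{$_}\tcount:$total{$_}" } sort keys %total;
}

# Streaming driver: holds one input line at a time.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    add_line($_) while <$fh>;
    close $fh;
}
print "$_\n" for report();
```

Reading with `while (<$fh>)` keeps only one input line in memory at a time, but `%first_record` and `%total` still grow with the number of distinct reads, which is exactly the kind of memory pressure that makes a script stall on very large inputs.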
      My code is working with file upto 10GB but if the file exceed this size it hangs.

      Is that 10GB all the files together, or just one of the files?

      How much memory does your machine have?


        Yes, that 10GB is all the files put together. I want my script to consume less memory because we need to deal with large data files.
      Hi,

      since scalability is one of your top priorities, consider using a key-value database like Redis, MemcacheDB or ...

      You appear to keep the first record that is seen, in full, while subsequent matching records are only tallied by their count. Is that right?

      Now, the remaining question is, do you want the output records to keep the order in which they are processed, or is it acceptable if they appear in random order?

      If any output order will do, then the simplest way to process your job is to divide it into parts. For example, you can dump the records into temporary files according to the first few letters of the key. Let's say the intermediate files are TACA.tmp, CATT.tmp, AGAT.tmp, etc. After that, process each temp file individually, appending the output to the final result. Questions?
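That two-pass scheme can be sketched as follows. This is an illustration, not a tested solution: the record layout is assumed from the samples above, the four-letter prefix is an arbitrary choice (longer prefixes mean smaller buckets), and the bucket files live in a File::Temp directory that is cleaned up automatically:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Pass 1: write each record into a temp file named after the first four
# bases of its read, so the hash in pass 2 only ever holds the distinct
# reads of one bucket at a time. Sequences are ACGTN, so the prefix is
# safe to use as a file name.
sub partition_and_tally {
    my (@input_files) = @_;
    my $dir = tempdir(CLEANUP => 1);
    my %fh;                                   # prefix => output handle
    for my $file (@input_files) {
        open my $in, '<', $file or die "Cannot open $file: $!";
        while (my $l = <$in>) {
            chomp $l;
            my @f = split /\t/, $l;
            next unless @f >= 5;              # skip malformed lines
            my $prefix = substr $f[1], 0, 4;  # e.g. TACA, CATT, AGAT
            unless ($fh{$prefix}) {
                open $fh{$prefix}, '>', "$dir/$prefix.tmp"
                    or die "Cannot create bucket: $!";
            }
            print { $fh{$prefix} } "$l\n";
        }
        close $in;
    }
    close $_ for values %fh;

    # Pass 2: tally each small bucket independently, appending to the
    # final result. Output order is random across buckets, which the
    # poster said is acceptable.
    my @result;
    for my $tmp (glob "$dir/*.tmp") {
        my (%first, %count);
        open my $in, '<', $tmp or die "Cannot open $tmp: $!";
        while (my $l = <$in>) {
            chomp $l;
            my @f = split /\t/, $l;
            $first{$f[1]} //= $l;             # first full record wins
            $count{$f[1]} += $f[4];           # sum counts
        }
        close $in;
        push @result, map { "$first{$_}\tcount:$count{$_}" } sort keys %first;
    }
    return @result;
}
```

Peak memory is now bounded by the largest bucket rather than by the whole input, at the cost of writing the data out once and reading it back.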

        I am sorry, but I am unable to follow your suggestion as I am a beginner in Perl. It would be helpful if you could explain it with an example or a modification to my script if possible. The output records are acceptable in random order, but the complete second line should match across all files, with the count given accordingly.