in reply to write to Disk instead of RAM without using modules

Do you really need to load all these files into memory at the same time? Maybe you can load only some of them, and then the others. You don't give us enough information to figure that out, but think carefully about it: any solution writing to disk is likely to be much slower.

Otherwise, Data::Dumper, a standard core module, can stringify a data structure for storage in a file, and you can get the data structure back with a string eval. I can't say whether this will be fast enough, but at least it is a module that is already there; you don't need to install anything.
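Roughly, the round trip could look like this (the data structure, the counts.dump file name and the $data variable name are made up for the example):

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical data structure standing in for whatever you build from your files
my %counts = (
    seq_AAAC => { 'R1.txt' => 3, 'R2.txt' => 1 },
    seq_TTTG => { 'R1.txt' => 2 },
);

# Stringify the structure and write it to disk
{
    local $Data::Dumper::Purity = 1;    # keep the output eval-able even with cross-references
    open my $out, '>', 'counts.dump' or die "Cannot write counts.dump: $!";
    print {$out} Data::Dumper->Dump( [ \%counts ], ['data'] );
    close $out;
}

# Later (possibly in a different run), read the file and rebuild the structure
my $data;
{
    open my $in, '<', 'counts.dump' or die "Cannot read counts.dump: $!";
    local $/;                           # slurp the whole file
    my $code = <$in>;
    close $in;
    eval $code;                         # string eval re-creates the structure in $data
    die "Could not rebuild the data: $@" if $@;
}
print "R1.txt count for seq_AAAC: $data->{seq_AAAC}{'R1.txt'}\n";    # prints 3

Whether this is workable for you depends on how often you need to dump and reload; the stringify/eval cycle itself is not free.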

Re^2: write to Disk instead of RAM without using modules
by Anonymous Monk on Oct 22, 2016 at 07:24 UTC
    I need to compare all the files simultaneously, so how can I load some files and then the others? I don't understand.
      I can't answer your question because you don't give enough details. But I do a lot of file comparisons at $work, most of the time with very large files, and various strategies make it possible to avoid loading all of them into memory. A lot depends on the details, though. For example, are you looking for what we call "orphans", i.e. records that are in file 1 and not in file 2, or the other way around? Or are you rather looking for differences between records that share the same identifying key? Or both? Are you looking for common records, or for differences? The answers to these questions may lead to entirely different strategies.

      Sometimes you can load just one file into memory and then scan the other files one by one, each of them line by line, without ever loading any of them entirely into memory. As a second step, you can compare the generated files containing the differences between each of the other files and file 1; depending on the shape of your data, these may (or may not) be much smaller than the original files.
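      For instance, a bare-bones sketch of that first step might look like this (the file names come from @ARGV, and I am assuming, purely for the example, that the comparison key is the second line of each blank-line-separated record):

      #!/usr/bin/env perl
      use strict;
      use warnings;

      # First file: load only its comparison keys into a hash
      my ( $first, @others ) = @ARGV;
      my %in_first;
      {
          open my $fh, '<', $first or die "Cannot open $first: $!";
          local $/ = "";                      # paragraph mode: one record per read
          while ( my $rec = <$fh> ) {
              my @lines = split /\n/, $rec;
              next unless defined $lines[1];
              $in_first{ $lines[1] } = 1;     # remember the key line only
          }
      }

      # Other files: stream them record by record, never loading them whole
      for my $file (@others) {
          open my $fh, '<', $file or die "Cannot open $file: $!";
          local $/ = "";
          while ( my $rec = <$fh> ) {
              my @lines = split /\n/, $rec;
              my $key   = $lines[1];
              next unless defined $key;
              if ( exists $in_first{$key} ) {
                  print "common with $first: $key (seen in $file)\n";
              }
              else {
                  print "orphan: $key (in $file but not in $first)\n";
              }
          }
          close $fh;
      }

      Only the keys of the first file live in memory; every other file is read one record at a time.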

      Another approach (especially if the files are truly huge) is to sort the files on the comparison key before the comparison, and then to read all of your files line by line in parallel. There is a penalty for sorting the files first, but it is often worth the cost, because the multi-file comparison is then much faster. And, depending on where your files are coming from, some of them may already be sorted.
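      Roughly, once the files are sorted on the comparison key, the parallel read could look like this (simplified to two files with one tab-separated key per line; both simplifications are mine):

      #!/usr/bin/env perl
      use strict;
      use warnings;

      die "Usage: $0 sorted_file_A sorted_file_B\n" unless @ARGV == 2;
      my ( $file_a, $file_b ) = @ARGV;
      open my $fha, '<', $file_a or die "Cannot open $file_a: $!";
      open my $fhb, '<', $file_b or die "Cannot open $file_b: $!";

      # Read one line and return its comparison key (first tab-separated field)
      sub read_key {
          my ($fh) = @_;
          my $line = <$fh>;
          return unless defined $line;
          chomp $line;
          return ( split /\t/, $line )[0];
      }

      my $key_a = read_key($fha);
      my $key_b = read_key($fhb);
      while ( defined $key_a and defined $key_b ) {
          if ( $key_a lt $key_b ) {
              print "only in $file_a: $key_a\n";
              $key_a = read_key($fha);
          }
          elsif ( $key_a gt $key_b ) {
              print "only in $file_b: $key_b\n";
              $key_b = read_key($fhb);
          }
          else {
              print "common: $key_a\n";
              $key_a = read_key($fha);
              $key_b = read_key($fhb);
          }
      }
      # Anything left over in either file has no counterpart in the other
      while ( defined $key_a ) { print "only in $file_a: $key_a\n"; $key_a = read_key($fha); }
      while ( defined $key_b ) { print "only in $file_b: $key_b\n"; $key_b = read_key($fhb); }

      At any moment only one line of each file is held in memory, which is why this scales to files of essentially any size once the sorting is done.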

      Each case is different, so there is no general strategy that can be applied blindly to your specific problem, and this is why I can't suggest a solution without knowing in detail what you're really comparing and what kind of differences (or common records) you're looking for.

        I have multiple FASTQ files in the following format. I want to print the total count if the second line, i.e. the sequence, matches in all files.
        R1.txt:
        @NS500278:42:HC7M3AFXX:3:21604:26458:18476 2:N:0:AGTGGTCA
        AAAAAAAAACAGATATTTGCACTAGGCATTATAAATAACATCAATTAAGTAAAAAAATTA
        +
        AAAAAEEEEAEEEEEEEEEE/AEEEEEEEEEEEE    1:R1.txt

        R2.txt:
        @NS500278:42:HC7M3AFXX:3:21604:26458:18476 2:N:0:AGTGGTCA
        AAAAAAAAACAGATATTTGCACTAGGCATTATAAATAACATCAATTAAGTAAAAAAATTA
        +
        AAAAAEEEEAEEEEEEEEEE    1:R2.txt

        The output I want is:
        @NS500278:42:HC7M3AFXX:3:21604:26458:18476 2:N:0:AGTGGTCA
        AAAAAAAAACAGATATTTGCACTAGGCATTATAAATAACATCAATTAAGTAAAAAAATTA
        +
        AAAAAEEEEAEEEEEEEEEE/AEEEEEEEEEEEE    1:R1.txt    1:R2.txt    count:2
        My code is:

        #!/usr/bin/env perl
        use strict;
        use warnings;
        no warnings qw( numeric );

        my %seen;
        $/ = "";
        while (<>) {
            chomp;
            my ( $key, $value ) = split( '\t', $_ );
            my @lines = split /\n/, $key;
            my $key1  = $lines[1];
            $seen{$key1} //= [$key];
            push @{ $seen{$key1} }, $value;
        }

        foreach my $key1 ( sort keys %seen ) {
            my $tot        = 0;
            my $file_count = @ARGV;
            for my $val ( @{ $seen{$key1} } ) {
                $tot += ( split /:/, $val )[0];
            }
            if ( @{ $seen{$key1} } >= $file_count ) {
                print join( "\t", @{ $seen{$key1} } );
                print "\tcount:" . $tot . "\n\n";
            }
        }
        This works well with a few files, but when I compare more files it hangs. I think it is because of a memory issue. I want to modify this script, without using any modules, so that it can work with any number of files.

      Show your existing code.

      Also, consider tools like diff, diff3, TortoiseMerge.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)