Re: write to Disk instead of RAM without using modules
by Corion (Patriarch) on Oct 21, 2016 at 11:07 UTC
|
Yes, see tie, and especially Tie::File and DB_File.
If you don't want to use a module, just take the code that these modules contain and use that code.
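As a minimal sketch of what the tie approach looks like with Tie::File (a core module): the array below is backed by a file on disk rather than held in RAM. The temporary file and the sample records are assumptions made only for this illustration.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Temporary file used only for illustration; use your own data file instead.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

# Tie the array to the file: lines are fetched from disk on demand
# instead of being slurped into memory all at once.
tie my @lines, 'Tie::File', $file
    or die "cannot tie $file: $!";

push @lines, "aaa 1", "abc 1";   # appended to the file on disk
$lines[0] = "aaa 2";             # rewrites the first line in the file
my $count = scalar @lines;       # number of lines currently in the file

untie @lines;                    # release the file
print "$count lines written to $file\n";
```

Every array access turns into file I/O, so expect this to be considerably slower than a plain in-memory array.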
| [reply] |
Re: write to Disk instead of RAM without using modules
by marto (Cardinal) on Oct 21, 2016 at 11:12 UTC
|
| [reply] |
Re: write to Disk instead of RAM without using modules
by Laurent_R (Canon) on Oct 21, 2016 at 21:34 UTC
|
Do you really need to load all these files into memory at the same time? Maybe you can load only some of them, and then the others. You don't give enough information for us to figure that out, but think carefully about it: any solution that writes to disk is likely to be much slower.
Otherwise, Data::Dumper, a standard core module, can stringify a data structure for storage in a file, and you can get the data structure back with a string eval. I can't say whether this will be fast enough but, at least, the module is already there; you don't need to install it.
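A minimal sketch of that round trip, assuming a small hash and a temporary file chosen only for illustration. The variable name `restored` passed to Dump is likewise an arbitrary choice. Note that a string eval runs whatever the file contains, so only use it on files you wrote yourself.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempfile);

my %data = ( two => 1, four => 2, seven => 2 );

# Temporary file used only for illustration.
my ($out, $file) = tempfile(UNLINK => 1);

# Write the structure to disk as Perl source: "$restored = { ... };"
print {$out} Data::Dumper->Dump([ \%data ], ['restored']);
close $out or die "cannot close $file: $!";

# Later: read the text back and rebuild the structure with a string eval.
open my $in, "<", $file or die "cannot open $file: $!";
my $text = do { local $/; <$in> };   # slurp the whole file
close $in;

my $restored;                        # assigned by the eval'd code
eval $text;
die "cannot restore data: $@" if $@;

print "$_ => $restored->{$_}\n" for sort keys %$restored;
```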
| [reply] [d/l] |
|
|
I need to compare all the files simultaneously, so how can I load only some files and then the others? I don't understand.
| [reply] |
|
|
I can't answer your question because you don't give enough details. But I do a lot of file comparisons at $work, most of the time with very large files. Various strategies make it possible to avoid loading all of them into memory, but a lot depends on the details. For example, are you looking for what we call "orphans", i.e. records that are in file 1 and not in file 2, or the other way around? Or are you rather looking for differences between records that have the same identifying key? Or both? Are you looking for common records, or are you looking for differences? The answers to these questions may lead to entirely different strategies.
Sometimes, you can load just one file into memory and then scan the other files one by one and, for each file, line by line, without ever loading the other entire files into memory. And, as a second step, compare the generated files containing the differences between the other files and file 1, which may (or may not) be much smaller than the original files, depending on your data shape.
Another approach (especially if the files are truly huge) is to sort the files according to the comparison key prior to the comparison and then read all of your files line by line in parallel. There is a penalty in sorting the files before the comparison, but it is often worth the cost, because the multifile comparison is then much faster. And, depending on where your files are coming from, some of them may already be sorted.
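To illustrate the sorted, in-parallel read, here is a minimal sketch for two files only. It assumes the whole line is the comparison key, that both inputs are sorted in plain string order, and that no line is repeated within a file. The sub names and the in-memory handles standing in for real sorted files are hypothetical.

```perl
use strict;
use warnings;

# Read one line and strip the newline; returns undef at end of file.
sub next_line {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# Walk two sorted handles in parallel. Only one line per file is held
# in memory at any time. $report is called as $report->($where, $line).
sub compare_sorted {
    my ($fh1, $fh2, $report) = @_;
    my ($l1, $l2) = (next_line($fh1), next_line($fh2));
    while (defined $l1 or defined $l2) {
        # An exhausted file compares as "greater" than any remaining line.
        my $cmp = !defined $l1 ? 1 : !defined $l2 ? -1 : $l1 cmp $l2;
        if ($cmp == 0) {
            $report->(both => $l1);
            ($l1, $l2) = (next_line($fh1), next_line($fh2));
        }
        elsif ($cmp < 0) {
            $report->(only_1 => $l1);      # orphan from the first file
            $l1 = next_line($fh1);
        }
        else {
            $report->(only_2 => $l2);      # orphan from the second file
            $l2 = next_line($fh2);
        }
    }
}

# Sample data, sorted here for the demo; real files would be pre-sorted.
my $data1 = join "", map { "$_\n" } sort qw(two four six seven);
my $data2 = join "", map { "$_\n" } sort qw(one three four seven eight);
open my $fh1, "<", \$data1 or die "cannot open in-memory file: $!";
open my $fh2, "<", \$data2 or die "cannot open in-memory file: $!";

my %result;
compare_sorted($fh1, $fh2, sub {
    my ($where, $line) = @_;
    push @{ $result{$where} }, $line;
});
print "$_: @{ $result{$_} }\n" for sort keys %result;
```

The same idea extends to more than two files by always advancing the handle (or handles) holding the smallest current key.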
Each case is different, so there is no general strategy that can be applied blindly to your specific problem. This is why I can't suggest a solution without knowing in detail what you're really comparing and what kind of differences (or common records) you're looking for.
| [reply] |
Re: write to Disk instead of RAM without using modules
by Laurent_R (Canon) on Oct 25, 2016 at 16:09 UTC
|
You're looking for records common to a group of files.
Suppose you have one file containing the following words (one word per line): "two, four, six, seven", a second file containing "one, three, four, seven, eight", and a third file containing "two, four, seven, nine".
Read the first file and store the words in a hash with a value 1. You get a hash looking like this:
( two => 1, four => 1, six => 1, seven => 1);
Read the second file line by line and, for each line, check whether the line is in the hash. If it isn't there, just discard the line: it wasn't in the first file, so it cannot be in all files. If it is in the hash, just increment the counter for it. You end up with something like this:
( two => 1, four => 2, six => 1, seven => 2);
Repeat the same process with the third file, and you get:
( two => 1, four => 3, six => 1, seven => 3);
Notice that your hash is not growing, even though you've read three files. Only one file has ever been loaded into memory. Now you know that the records common to all files are those whose value is 3, and you can just print them or do whatever you want with them.
| [reply] [d/l] [select] |
|
|
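Here is some quick Perlish pseudocode for the process explained in my post just above:
use strict;
use warnings;

my %seen;
# Load the first file into the hash; every record starts with a count of 1.
open my $FH, "<", "file1.txt" or die "cannot open file1.txt: $!";
while (<$FH>) {
    chomp;
    $seen{$_} = 1;
}
close $FH;
# Scan the other files line by line; only the counters are updated.
for my $other_file (qw/ file2.txt file3.txt file4.txt /) {    # ... and so on
    open my $FH, "<", $other_file or die "cannot open $other_file: $!";
    while (<$FH>) {
        chomp;
        if (exists $seen{$_}) {
            $seen{$_}++;
        }
    }
    close $FH;
}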
Only the first file is ever loaded into memory. When you read the other files, you just update counters. Even if you have hundreds of files, it will work provided the first file can be loaded into the hash.
Update: moved the second chomp line to the right place (within the while loop, not just before).
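The last step, picking out the records common to all files, could be sketched as follows. The hash literal reuses the three-file example from my previous post, and $file_count is a hypothetical name for the total number of files read. This assumes each record appears at most once per file; otherwise a counter can overshoot.

```perl
use strict;
use warnings;

# Counters as they stand after reading the three example files.
my %seen = ( two => 1, four => 3, six => 1, seven => 3 );
my $file_count = 3;    # the first file plus the two other files

# A record is common to all files if it was counted once per file.
my @common = sort grep { $seen{$_} == $file_count } keys %seen;
print "$_\n" for @common;
```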
| [reply] [d/l] [select] |
|
|
Your suggestion will discard the records that are file-specific, i.e. not present in the first file but possibly present in the others. I want those records to be printed too, with their values.
for example:
file1:
aaa 1
abc 1
acb 2
file2:
aaa 2
abb 1
acb 1
file3:
acb 2
aaa 3
abc 1
output:
aaa 1 2 3
abc 1 0 1
acb 2 1 2
abb 0 1 0
| [reply] [d/l] |