Re: load a file in memory and extract parts
by aaron_baugher (Curate) on May 05, 2015 at 21:26 UTC
|
In general, when you want to use keys from one file to lookup values in another file, you load the keys from one file into a hash as its keys, then loop through the other file checking to see if each line's key exists in the hash, and doing something with it if it is. Unless there's a reason to do otherwise, it's usually best to load the smaller file (in this case your 5K one) into the hash, then loop through the other file. So in pseudo-code:
open 5k file
foreach line
get key from line and put it in hash as key=1
close 5k file
open 100M file
foreach line
get key from line
if key is in hash from other file
do stuff with the line
close 100M file
Once you have some code which attempts to do that, show it to us along with a few lines of sample input and output data, and we can guide you further if you need it.
Aaron B.
Available for small or large Perl jobs and *nix system administration; see my home node.
| [reply] [d/l] |
|
|
| [reply] |
Re: load a file in memory and extract parts
by ww (Archbishop) on May 05, 2015 at 21:18 UTC
|
You might want to clarify your statement of your initial quantities.
I (perhaps carelessly) first read "100mb 60k rows)" as 1 x 106 (aka '1.e+6' aka 'one million') rows, each with 10 columns. Unless most of the columns contain only a single character (or something close thereto), you're looking at a lot to slurp ... which see using Super Search here.
But a "mb" can mean 10242 bytes or 1,000,000 (decimal) bytes (often used by manufacturers of hard disks and computers to exagerate the size of their product)... and in any case, it's not conventionally used to [state|discuss|evaluate] a pure number.
| [reply] [d/l] |
Re: load a file in memory and extract parts
by GotToBTru (Prior) on May 05, 2015 at 22:03 UTC
|
Build an index using the key as the hash index, and the line number as the value. Use Tie::File to keep from having to slurp all 60k records at once.
| [reply] |
Re: load a file in memory and extract parts
by locked_user sundialsvc4 (Abbot) on May 05, 2015 at 23:22 UTC
|
Obviously, the first notion that pops into my mind is: “this would be a great application for an SQLite database file!”
SQLite, as you know, is a powerful SQL system that stores everything into just one file. It already has existing facilities to import a CSV-file. It can quite-effortlessly import your file into a table, then create indexes on that table to allow the key (which consists of two columns ...) to be retrieved very fast. You can, if you like, even import the incoming file into another table, and then your entire problem has basically been reduced to an INNER JOIN.
Food for thought, anyway. Personally, I love to stumble-upon situations where I don’t have to write a single line of “programming” in order to solve a problem. Maybe this will work for you; maybe not. But if it does, you could chomp this entire problem in a matter of minutes.
| |
|
|
sub TestArray {
my @file_array;
my $InputFile = "/home/test/test_file.dat"; # '|' separator file
my $column;
my $row;
tie @file_array, 'Tie::File', $InputFile or die $!;
shift @file_array;
foreach $row (0..@file_array-1)
{
foreach $column (0..@{$file_array[$row]}-1)
{
$LOG->info( "Element [$row][$column] = $file_array[$row][$
+column]\n");
}
}
}
At this point I'm trying to load the file into a multidimentional array. The file has 10 columns and 50,000 rows. How can I extract data from this array. When I run this, I'm getting the following error:
Can't use string ("DEAL_NAME|TRANCHE_NAME|GROUP_NO|") as an ARRAY ref while "strict refs" in use at ./test_feeds.pl
| [reply] [d/l] |
|
|
Tie::File lets you access each line of your file as an array element. It does not load the file into a multidimensional array. The reason for the warning is that you are trying to access an element in an Array-of-Arrays data structure, but you just have a simple array (perldsc).
Maybe split will help:
use warnings;
use strict;
my @file_array = qw(a|b|c d|e|f);
foreach my $row (0 .. @file_array - 1) {
foreach my $column (split /\|/, $file_array[$row]) {
print "$row $column\n";
}
}
__END__
0 a
0 b
0 c
1 d
1 e
1 f
| [reply] [d/l] |
|
|
|
|