healingtao has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a CSV file that is about 100 MB, with 60k rows and 10 columns. The first 2 columns are the key. I need to load it into memory and, based on keys passed in externally, find the matching rows and extract some data from them to do some calculations. The external input will come in with about 5k keys. I'm curious what the most efficient method for getting this done is. Thanks in advance


Replies are listed 'Best First'.
Re: load a file in memory and extract parts
by aaron_baugher (Curate) on May 05, 2015 at 21:26 UTC

    In general, when you want to use keys from one file to look up values in another file, you load the keys from the first file into a hash, then loop through the other file, checking whether each line's key exists in the hash and doing something with the line if it does. Unless there's a reason to do otherwise, it's usually best to load the smaller file (in this case your 5k one) into the hash, then loop through the larger file. So in pseudo-code:

    open 5k file
    foreach line
        get key from line and put it in hash as key=1
    close 5k file

    open 100M file
    foreach line
        get key from line
        if key is in hash from other file
            do stuff with the line
    close 100M file
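    Fleshed out, that might look something like this minimal sketch (the file names keys.txt and data.csv, the comma separator, and the one-key-per-line layout of the 5k file are all assumptions):

    use strict;
    use warnings;

    my ($key_file, $big_file) = ('keys.txt', 'data.csv');

    # Pass 1: load the ~5k keys into a hash.
    my %wanted;
    open my $kfh, '<', $key_file or die "$key_file: $!";
    while (my $line = <$kfh>) {
        chomp $line;
        $wanted{$line} = 1;               # assumes one "col1,col2" key per line
    }
    close $kfh;

    # Pass 2: stream the big file and act only on rows whose key matches.
    open my $dfh, '<', $big_file or die "$big_file: $!";
    while (my $line = <$dfh>) {
        chomp $line;
        my ($c1, $c2, @rest) = split /,/, $line;
        if ($wanted{"$c1,$c2"}) {
            print "$c1,$c2: @rest\n";     # do your calculations here instead
        }
    }
    close $dfh;

    Because the big file is streamed line by line, only the 5k keys ever live in memory; at 100 MB you could just as well invert the pattern and hold the big file in the hash instead.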

    Once you have some code which attempts to do that, show it to us along with a few lines of sample input and output data, and we can guide you further if you need it.

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

      Tux seems to be offline, so I'll link to Text::CSV_XS:

      • Text::CSV_XS takes care of reading and writing CSV files. Unlike most "five lines of perl" attempts, it handles most, if not all, of the nasty edge cases; see the short sketch after this list.
      • DBD::CSV sits on top of Text::CSV_XS and allows SQL access to CSV files. It may be slower than the SQLite approach proposed by sundialsvc4 below, but it avoids converting the CSV to SQLite.
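      A minimal Text::CSV_XS sketch of loading such a file into a hash keyed on the first two columns (the file name data.csv and the comma separator are assumptions, not anything from the OP):

      use strict;
      use warnings;
      use Text::CSV_XS;

      my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
      open my $fh, '<', 'data.csv' or die "data.csv: $!";

      my %row_for;
      while (my $row = $csv->getline($fh)) {
          my $key = join ',', @{$row}[0, 1];    # first two columns form the key
          $row_for{$key} = $row;                # arrayref holding all 10 fields
      }
      close $fh;

      # Look up an externally supplied key pair later on.
      my $hit = $row_for{'KEY1,KEY2'};
      print "@$hit\n" if $hit;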

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: load a file in memory and extract parts
by ww (Archbishop) on May 05, 2015 at 21:18 UTC

    You might want to clarify your initial quantities.

    I (perhaps carelessly) first read "100mb 60k rows" as 1 x 10^6 (aka '1.0e+6', aka one million) rows, each with 10 columns. Unless most of the columns contain only a single character (or something close thereto), you're looking at a lot to slurp (on which, see Super Search here).

    But a "mb" can mean 10242 bytes or 1,000,000 (decimal) bytes (often used by manufacturers of hard disks and computers to exagerate the size of their product)... and in any case, it's not conventionally used to [state|discuss|evaluate] a pure number.
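    Either way, a quick sanity check on the arithmetic: if the OP really does mean 100 MB over 60,000 rows, that is roughly 100,000,000 / 60,000 ≈ 1.7 KB per row, which is plausible for 10 columns and still small enough to slurp on most modern machines.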

Re: load a file in memory and extract parts
by GotToBTru (Prior) on May 05, 2015 at 22:03 UTC

    Build an index using the key as the hash key and the line number as the value. Use Tie::File to avoid having to slurp all 60k records at once.
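    A rough sketch of that idea, assuming a comma-separated data.csv with a header line and the key in the first two columns (the names are made up):

    use strict;
    use warnings;
    use Tie::File;

    tie my @lines, 'Tie::File', 'data.csv' or die "data.csv: $!";

    my %index;                          # "col1,col2" => line number
    for my $i (1 .. $#lines) {          # skip the header at line 0
        my ($k1, $k2) = split /,/, $lines[$i];
        $index{"$k1,$k2"} = $i;
    }

    # Fetch a matched row on demand, without slurping the whole file.
    if (defined(my $n = $index{'KEY1,KEY2'})) {
        my @fields = split /,/, $lines[$n];
        print "@fields\n";
    }

    Building the index still costs one pass over the file, but Tie::File keeps its memory use bounded rather than holding all 60k lines at once.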

    Dum Spiro Spero
Re: load a file in memory and extract parts
by sundialsvc4 (Abbot) on May 05, 2015 at 23:22 UTC

    Obviously, the first notion that pops into my mind is: “this would be a great application for an SQLite database file!”

    SQLite, as you know, is a powerful SQL system that stores everything in just one file. It already has facilities to import a CSV file. It can quite effortlessly import your file into a table, then create indexes on that table so that rows can be retrieved very fast by the key (which consists of two columns ...). You can, if you like, even import the incoming file into another table, and then your entire problem has basically been reduced to an INNER JOIN.
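    A sketch of how that could look from Perl via DBI and DBD::SQLite, assuming the CSV has already been imported into a table named big (for example with the sqlite3 shell's .mode csv and .import dot-commands) and the 5k keys into a table named wanted; the table, column, and file names are all made up:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=data.db', '', '',
                           { RaiseError => 1 });

    # Index the two-column key so lookups and the join are fast.
    $dbh->do('CREATE INDEX IF NOT EXISTS big_key ON big (col1, col2)');

    # The whole matching problem reduces to one INNER JOIN.
    my $rows = $dbh->selectall_arrayref(
        'SELECT big.* FROM big
         INNER JOIN wanted ON wanted.col1 = big.col1
                          AND wanted.col2 = big.col2'
    );
    print join('|', @$_), "\n" for @$rows;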

    Food for thought, anyway. Personally, I love to stumble upon situations where I don’t have to write a single line of “programming” in order to solve a problem. Maybe this will work for you; maybe not. But if it does, you could chomp this entire problem in a matter of minutes.

      I tried the following code:

      sub TestArray {
          my @file_array;
          my $InputFile = "/home/test/test_file.dat";   # '|' separator file
          my $column;
          my $row;
          tie @file_array, 'Tie::File', $InputFile or die $!;
          shift @file_array;
          foreach $row (0 .. @file_array - 1) {
              foreach $column (0 .. @{$file_array[$row]} - 1) {
                  $LOG->info("Element [$row][$column] = $file_array[$row][$column]\n");
              }
          }
      }

      At this point I'm trying to load the file into a multidimensional array. The file has 10 columns and 50,000 rows. How can I extract data from this array? When I run this, I get the following error: Can't use string ("DEAL_NAME|TRANCHE_NAME|GROUP_NO|") as an ARRAY ref while "strict refs" in use at ./test_feeds.pl

        Tie::File lets you access each line of your file as an array element. It does not load the file into a multidimensional array. The reason for the error is that you are trying to access an element of an Array-of-Arrays data structure, but you just have a simple array of strings (see perldsc). Maybe split will help:
        use warnings;
        use strict;

        my @file_array = qw(a|b|c d|e|f);
        foreach my $row (0 .. @file_array - 1) {
            foreach my $column (split /\|/, $file_array[$row]) {
                print "$row $column\n";
            }
        }

        __END__

        0 a
        0 b
        0 c
        1 d
        1 e
        1 f