Re: Solve the large file size issue

probably a bit late, but to throw a database in (DBD::CSV):

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect('dbi:CSV:', undef, undef, {f_dir => '/tmp', f_e
+xt => '.txt'}) or die $DBI::errstr;
my $select = $dbh->do('CREATE TABLE CombineDie1Die2 AS SELECT WL, BL, 
+Die1, Die_2 as Die2 from Die1_10k a, Die2_10k b Where a.WL = b.WL and
+ a.BL = b.BL') or die "Failed to execute - " . $dbh->errstr;
[download]

The part "Die_2 as Die2" is because of the space in your File 2's field name
This is a "inner join", thus WL,BL combinations are skipped if they exist in only one of the files
for syntax see → SQL::Statement::Syntax (as http://sqlmonks.org seems to be dormant since its inception) :-)

Post Scriptum: OK, this example is convoluted, but the general advantage of SQL is: you can concentrate more on what you want to do with the data instead of how.

Comment on Re: Solve the large file size issue Select or Download Code

Replies are listed 'Best First'.
Re^2: Solve the large file size issue by Anonymous Monk on Apr 15, 2015 at 15:44 UTC
Hi Soonix, I am not understood most of the syntax stated. Is that meant that you still in Perl but you are calling SQL type of structure? Also, I tried to copy your code and run in command prompt. However, I only able to get one line of output as below. `wl,bl,die1,die_2` Could you guide me on the coding side if I wish to modify this script to able open 2 input files name "Die1.txt" and "Die2.txt". Then let it extract all the 8 millions of line from both file and output to the "CombineDie1Die2.txt". Appreciate if you could explain the logic too.	[reply] [d/l]
Re^3: Solve the large file size issue by soonix (Chancellor) on Apr 16, 2015 at 07:37 UTC
Well, we stay in perl plus use an additional language (SQL), the disadvantage being, obviously, that we now have two languages. However, I find this (for some applications, not for everything) better suited. The parameters to the `connect` have the information to interface your environment (plus perl, of course) with the "database". The parameters in my example say that the tables are `.txt` files, so the OP's file "Die1_10k.txt" is a table "Die1_10k" (my example looks for it in the `/tmp` folder/directory). Different mappings of file names to table names are possible, this is described in → DBD::CSV. Possible difficulties you may encounter: SQL is partially case insensitive, your file system maybe not (most Linux/Unix file systems). I prefer all lowercase names because this makes the least problems. spaces and other characters in file names - what makes a valid file name in your file system may not make a valid table name in SQL similiar considerations for field/column names (DBD::CSV tries to get the field names from the first row, but you can instruct it otherwise) I should have written the SQL a bit tidier, e.g. `CREATE TABLE CombineDie1Die2 AS SELECT WL, BL, Die1, Die_2 as Die2 FROM Die1_10k a, Die2_10k b WHERE a.WL = b.WL AND a.BL = b.BL` [download] so that it is more readable. The OP's data contains column headings, with one column named "`Die 2`" (with space in between), DBD::CSV sees this as Die_2 (with underscore), but we want it to be simply "Die2". If your column names are different (or your data doesn't have a header line), you have to decide where to adjust: the first line of your data, the info in the `connect` statement, or the names in the SQL statement itself, like some "renaming" as I did with that field or the tables (`a` and `b`). What the program printed for you, was the heading only, meaning that no data was `SELECT`ed, either because your data does not match (what I wrote about "inner join" at the end of my previous posting), or your columns are named differently (with the effect, that they don't match) ... Update: to find out more, you could try out a simple SQL statement such as: `use strict; use warnings; use Data::Dumper; use DBI; my $dbh = DBI->connect('dbi:CSV:', undef, undef, { f_dir => '/tmp', # where your files live f_ext => '.txt', }) or die $DBI::errstr; my $select = $dbh->prepare( 'SELECT * FROM Die1_10k' ) or die "Failed to prepare - " . $dbh->errstr; $select->execute or die "Failed to execute - " . $select->errstr; print Dumper $select->{NAME}, $select->fetchrow_hashref();` [download] This prints out your column names, and the first row of data (or `undef` if it doesn't find data). Update 2: I don't have real big CSV files, so I don't know how well this will work out, but if it is 2 files x 8 Million lines x 12 Bytes (let's assume some Overhead, so 128 Bytes) → 2 Gig, so, depending on how "modern" your Computer is, it might fail. Anyway, you did not yet tell us, wether - your data is sorted - your data is "in sync" (as assumed in GrandFather's example) The database solution would work for unsorted data also, but if it is, this would be an advantage for the other solutions.	[reply] [d/l] [select]