Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This is more of a "Which way to do it" question than a "How to do it" question.
The idea is to create a "pairs" data set from two existing data sets.
I have two CSV files which I am accessing via DBI and DBD::CSV. They look like this:

Date, Open, High, Low, Close, Volume
19990101, 10, 11, 9, 10.5, 500000
19990102, 11, 13, 9, 10.5, 500000
19990103, 10, 11, 9, 10.5, 500000
etc...


I want to iterate through one data set, make sure that the second data set has an entry for that date, and create the third data set by combining the values of the two existing ones.

I would create a hash of hashes with the date being the key for each data set. I would then iterate through one hash, see if the date exists in the second hash, and then do my calculations by referencing the values I want per hash per date and create a third hash with the new values. Easy enough (I think).
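As a rough illustration, here is a minimal sketch of that plan (untested against the real files; %set1, %set2, and the averaging of Close are hypothetical stand-ins for the actual data sets and calculations):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical hash-of-hashes data sets keyed by date, as described
# above; real rows would hold all six CSV columns.
my %set1 = ( 19990101 => { Close => 10.5 }, 19990105 => { Close => 12 } );
my %set2 = ( 19990101 => { Close => 11.5 }, 19990104 => { Close => 8.5 } );

my %union = (%set1, %set2);    # every date seen in either set
my %combined;
for my $date (keys %union) {
    if (exists $set1{$date} && exists $set2{$date}) {
        # both sets have the date: combine (placeholder calculation)
        $combined{$date} =
            { Close => ($set1{$date}{Close} + $set2{$date}{Close}) / 2 };
    }
    else {
        # date present in only one set: carry that row through as-is
        $combined{$date} = exists $set1{$date} ? $set1{$date} : $set2{$date};
    }
}
```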

Here is the hairy part. If one hash has a date that doesn't exist in the other hash, I want to have that date in the third hash regardless of which hash it exists in. To illustrate my concern:

Data Set 1

Date, Open, High, Low, Close, Volume
19990101, 10, 11, 9, 10.5, 500000
19990102, 11, 13, 9, 10.5, 500000
19990103, 10, 11, 9, 10.5, 500000
-----------> missing date
19990105, 11, 13, 10, 12, 50000

Data Set 2

Date, Open, High, Low, Close, Volume
19990101, 10, 11, 9, 10.5, 500000
19990102, 11, 13, 9, 10.5, 500000
19990103, 10, 11, 9, 10.5, 500000
19990104, 8, 9, 7, 8.5, 500000
-----------> missing date
19990106, 11, 13, 10, 12, 500000

New Data Set

Date, Open, High, Low, Close, Volume
19990101, 10, 11, 9, 10.5, 500000
19990102, 11, 13, 9, 10.5, 500000
19990103, 10, 11, 9, 10.5, 500000
19990104, 8, 9, 7, 8.5, 500000
19990105, 11, 13, 10, 12, 50000
19990106, 11, 13, 10, 12, 500000


Here is what I have as far as code is concerned, just for constructive criticism purposes.
#!/usr/bin/perl -w
use strict;
use DBI;

my $pair1 = shift;
my $pair2 = shift;

my $dbh = DBI->connect("DBI:CSV:")
    or die "Cannot connect: " . $DBI::errstr;
$dbh->{RaiseError} = 1;

my $query1 = "SELECT * FROM $pair1";
my $query2 = "SELECT * FROM $pair2";

my $getpair1 = $dbh->prepare($query1)
    or die "Cannot prepare: " . $dbh->errstr();
$getpair1->execute() or die "Cannot execute";

my $getpair2 = $dbh->prepare($query2)
    or die "Cannot prepare: " . $dbh->errstr();

# selectall_hashref executes the handle itself; keying on
# column 1 (Date) gives a hash of rows indexed by date
my $pair2_hash = $dbh->selectall_hashref($getpair2, 1);

while (my $pair1_row = $getpair1->fetchrow_arrayref()) {
    if (exists $pair2_hash->{ $pair1_row->[0] }) {
        #
        # Call function to write new data set
        #
    }
}

$getpair1->finish();
$getpair2->finish();

Re: Data Set Combination
by jZed (Prior) on Aug 26, 2005 at 20:32 UTC
    I think you can do this directly in DBD::CSV with a combination of joins, user-defined functions, and the new CREATE TABLE AS SELECT ... syntax. The last two features require SQL::Statement 1.14 or later.

    Here's an example of a simple calculation with a User Defined Function

    #!/usr/bin/perl -w
    use strict;
    use DBI;

    my $dbh = DBI->connect("dbi:CSV(RaiseError=1):csv_eol=\n");

    sub MyCalc { $_[3] * 2 }
    $dbh->do("CREATE FUNCTION MyCalc");

    my $sth = $dbh->prepare(q{
        SELECT sales, MyCalc(sales) FROM Test
    });
    $sth->execute;
    $sth->dump_results();

    __END__
    # given sales values of 2 and 5, this script outputs
    #   2,4
    #   5,10

    See SQL::Statement::Functions for more on creating User Defined Functions.

    To create a new table from a join of two existing tables, use something like:
    CREATE TABLE new AS SELECT $cols FROM old1 JOIN old2 WHERE ...
      Interesting idea of using a function within the SQL statement to combine the sets into a third table on the fly. I have no experience with this, but it seems very doable. I will have to do the required reading before starting. Thank you for the cool option.
Re: Data Set Combination
by sgifford (Prior) on Aug 26, 2005 at 19:48 UTC
    Do the rows always correspond one-to-one, except in the case where data is missing from one or the other?

    If so, a solution is straightforward. You'll find elements in the first hash but not the second as you iterate over it. If you delete elements from the second hash as you find them, anything left in the second hash will be items that weren't in the first hash.
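    A minimal sketch of that delete-as-you-go bookkeeping, assuming both data sets are already loaded into hashes keyed by date (%set1 and %set2 are hypothetical names; the values are array refs of the OHLCV fields):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data sets keyed by date; values are OHLCV array refs.
my %set1 = ( 19990101 => [10, 11, 9, 10.5, 500000],
             19990105 => [11, 13, 10, 12, 50000] );
my %set2 = ( 19990101 => [10, 11, 9, 10.5, 500000],
             19990104 => [8, 9, 7, 8.5, 500000] );

my %combined;
for my $date (keys %set1) {
    if (exists $set2{$date}) {
        # both sets have this date: combine (here we just pair the rows)
        $combined{$date} = [ $set1{$date}, delete $set2{$date} ];
    }
    else {
        # only set 1 has this date
        $combined{$date} = [ $set1{$date}, undef ];
    }
}
# anything still left in %set2 never appeared in %set1
$combined{$_} = [ undef, $set2{$_} ] for keys %set2;
```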

    If there can be a many-to-one correspondence between the two, you can pull the first array into a hash with selectall_hashref, find elements in the first but not the second while iterating over the first array, then iterate over the second array to search for items which aren't in the first.

    You can also have the SQL engine do most of the work for you by JOINing the tables in an appropriate way. For example, SELECT ... FROM table1 LEFT JOIN table2 ... will return all rows of table1, joining in the data from table2 where possible and filling in NULLs where it is not. SELECT ... FROM table2 LEFT JOIN table1 WHERE ... AND table1.something IS NULL will return all rows in table2 with no corresponding row in table1. Maybe a better SQL hacker than I would know one query to efficiently do the whole thing.

      No, the data sets are from different markets, with different values. What I want to do is create a third data set by performing calculations on the values of each set and putting the result in the new data set. If a row exists in one set but not the other, I would simply take the previous row from the data set that is missing the date and perform the calculations using those values. So, joining the values via SQL statements is not what I am looking for. Thanks for your input though.
        Ah, I guess I don't understand your problem then. So you're saying that if you have a missing day in one of your data sets, you want to reuse the previous day's data for calculations on that row? If so, sort the two data sets by date, then move through both in parallel. Track the "current row" and "next row" for both data sets, and loop over a series of dates. For each data set, if the date of "next row" equals the date you're currently looking at, set "current row" to "next row" and read another item into "next row". Then perform calculations based on each data set's "current row", which will contain either the data from the current day or from the last day for which data is available.

        How to deal with a missing first row would have to be a special case.

        Hope this helps; no time to write up any sample code right now.
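        For reference, a minimal sketch of that current-row/next-row loop (the data and the [date, close] row layout are hypothetical stand-ins for the full six-column rows):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical date-sorted data sets; rows trimmed to [date, close].
my @set1 = ( [19990101, 10.5], [19990103, 10.5] );
my @set2 = ( [19990101, 10.5], [19990102, 10.5], [19990103, 11.0] );

# the union of dates from both sets, sorted
my %seen;
my @dates = sort grep { !$seen{$_}++ } map { $_->[0] } @set1, @set2;

my (@out, $cur1, $cur2);
my ($i, $j) = (0, 0);
for my $date (@dates) {
    # advance each "current row" only when its "next row" matches
    $cur1 = $set1[$i++] if $i < @set1 && $set1[$i][0] == $date;
    $cur2 = $set2[$j++] if $j < @set2 && $set2[$j][0] == $date;
    # $cur1/$cur2 now hold today's row, or the most recent earlier
    # row when this date is missing from that set
    push @out, [ $date, $cur1->[1], $cur2->[1] ];
}
```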

Re: Data Set Combination
by injunjoel (Priest) on Aug 27, 2005 at 04:05 UTC
    Greetings,
    #!/usr/bin/perl -w
    use strict;
    use DBI;

    my $pair1 = shift;
    my $pair2 = shift;

    my $dbh = DBI->connect("DBI:CSV:")
        or die "Cannot connect: " . $DBI::errstr;
    $dbh->{RaiseError} = 1;

    my $getpair1 = $dbh->prepare("SELECT * FROM $pair1");
    my $getpair2 = $dbh->prepare("SELECT * FROM $pair2");

    my %pair1_hash = do {
        my %h;
        $getpair1->execute();
        while (my $row = $getpair1->fetchrow_hashref()) {
            $h{ $row->{Date} } = $row;
        }
        %h;
    };
    my %pair2_hash = do {
        my %h;
        $getpair2->execute();
        while (my $row = $getpair2->fetchrow_hashref()) {
            $h{ $row->{Date} } = $row;
        }
        %h;
    };

    my %combo_hash = map {
        if (!defined $pair1_hash{$_}) {
            ($_, $pair2_hash{$_});
        }
        elsif (!defined $pair2_hash{$_}) {
            ($_, $pair1_hash{$_});
        }
        else {
            # computations here to combine the contents of
            # %pair1_hash and %pair2_hash. Utilize the fact that
            # the last statement evaluated will be the return value.
            # Just as an example, average two of the fields,
            # referring to specific columns within each row:
            my %t;
            $t{Open} = sprintf("%.2f",
                ($pair1_hash{$_}{Open} + $pair2_hash{$_}{Open}) / 2);
            $t{High} = sprintf("%.2f",
                ($pair1_hash{$_}{High} + $pair2_hash{$_}{High}) / 2);
            # So finally we return the key and a reference to
            # our computed hash.
            ($_, \%t);
        }
    } do {
        # a hash slice collects the unique dates from both sets
        my %u;
        undef @u{ keys %pair1_hash, keys %pair2_hash };
        sort keys %u;
    };

    The above is untested, but the important parts are the creation of hashes keyed by dates (%pair1_hash and %pair2_hash). With the do block we use a hash slice to collect all unique values from the combined keys of both hashes, in effect giving us every unique date across both data sets. Then in the map statement we take advantage of the fact that the last statement evaluated is the return value, so the if/elsif/else construct acts as our return logic. This works because each key must be present in at least one of the hashes that contributed to the list from the do block. I did a sprintf on the average of two of the example columns from your posting above, but you will need to do whichever calculations make sense for your purposes. The strategy is to construct an intermediate hash to hold your computed values, then return a reference to it under the key (in effect, the date) that the map is currently set to. Whew!
    Is that what you are looking for?

    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
      THIS looks like what I had in mind. Because I am more familiar with this code, I will work with this before trying the SQL table join above. Thank you for all the replies, I will try this out now and let you know how it turns out.

      Peace be unto you.
Re: Data Set Combination
by Anonymous Monk on Aug 26, 2005 at 20:22 UTC
    If you use arrays instead of hashes, you can sort your records by date. Then set a "cursor" at each market. If the date is the same at both cursors, do your calculation and slide both cursors forward. If the dates differ, the lower one is the missing date: do whatever you need to do for it and slide only the lower cursor forward.
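    A minimal sketch of that two-cursor merge on date-sorted arrays (rows trimmed to hypothetical [date, close] pairs):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical date-sorted markets; rows trimmed to [date, close].
my @m1 = ( [19990101, 10.5], [19990103, 9.8] );
my @m2 = ( [19990101, 11.0], [19990102, 10.0], [19990103, 9.9] );

my ($i, $j) = (0, 0);
my @merged;
while ($i < @m1 && $j < @m2) {
    my ($d1, $d2) = ($m1[$i][0], $m2[$j][0]);
    if ($d1 == $d2) {
        # dates match: calculate from both rows, slide both cursors
        push @merged, [ $d1, $m1[$i][1], $m2[$j][1] ];
        $i++; $j++;
    }
    elsif ($d1 < $d2) {
        # date missing from market 2: handle it, slide only cursor 1
        push @merged, [ $d1, $m1[$i][1], undef ];
        $i++;
    }
    else {
        # date missing from market 1: handle it, slide only cursor 2
        push @merged, [ $d2, undef, $m2[$j][1] ];
        $j++;
    }
}
# drain whichever market still has rows left
push @merged, [ $m1[$_][0], $m1[$_][1], undef ] for $i .. $#m1;
push @merged, [ $m2[$_][0], undef, $m2[$_][1] ] for $j .. $#m2;
```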
Re: Data Set Combination
by dragonchild (Archbishop) on Aug 27, 2005 at 02:47 UTC
    Is one of your requirements "You must do this task in one pass"? If not, why not massage the incoming files into an intermediate stage that's more suitable for your needs? For example, I'd add all the missing dates so that the files corresponded line by line. Then I'd create the final file from those temporary ones.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?