in reply to Importing data to build an array

Hi, I don't know if you can keep both (or more?) files in memory... So I fiddled around with an algorithm that only needs to keep the significant data of the first file in memory, plus the result for the file currently being compared. Basically it works like this:

  1. The first file is read. The expected order of keys is learned plus the mappings from the first column (key) to the last column (value).
  2. For each file to be compared against the first file:
    1. read a CSV line and analyse it
    2. if the line matches the expected key (e.g. DISTINGUERE TRA), save the result and advance to the next line/key
    3. if not, add a zero to the result for each expected key that was skipped, until the expected key matches the current line; then save the result and advance to the next line/key
    4. add more zeros to the result in case an EOF occurred before the last expected key was encountered
    5. perform a lot of sanity checks along the way
  3. Print the results.

Output:

Entries: DISTINGUERE TRA;MANCANTE DI;APPLICARE SU;MONTATO IN;IMPIEGATO IN;RAGGRUPPARE IN
File 1: 9,18246152003019; 7,18246152003019; 6,9898164420878; 6,70441422322555; 12,9959266915162; 6,22163211731087
File 2: 9,18246152003019; 0; 6,9898164420878; 6,70441422322555; 12,9959266915162; 0
File 3: 9,18246152003019; 0; 0; 0; 0; 6,22163211731087
File 4: 0; 0; 0; 6,70441422322555; 0; 0
item found that is not in first file: MONTATO INorOUTorWhatever / 6,70441422322555
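
The zero padding in the rows above can be tried in isolation. This is only a minimal sketch of the sync step, not the script itself: align_to_order is a hypothetical helper name, it takes the key/value pairs as an in-memory list instead of reading them line by line, and unlike the script it dies on an out-of-order key.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script in this post): aligns the
# key/value pairs of one file with the key order learned from the first file.
sub align_to_order {
    my ($ordered, $pairs) = @_;   # array ref of keys, array ref of [key, value]
    my @result;
    my $expect = 0;               # index of the next expected key
    for my $pair (@$pairs) {
        my ($key, $value) = @$pair;
        # pad with zeros until the expected key matches the current one
        while ($expect < @$ordered && $ordered->[$expect] ne $key) {
            push @result, 0;
            $expect++;
        }
        die "unexpected or out-of-order key: $key\n" if $expect >= @$ordered;
        push @result, $value;
        $expect++;
    }
    # input ended before the last expected key: pad the rest with zeros
    push @result, (0) x (@$ordered - $expect);
    return \@result;
}

my @order = ('DISTINGUERE TRA', 'MANCANTE DI', 'APPLICARE SU');
my $row   = align_to_order(\@order, [ [ 'APPLICARE SU', '6,9898164420878' ] ]);
print join('; ', @$row), "\n";    # prints "0; 0; 6,9898164420878"
```

Only the key order of the first file has to be known here; the values of the other files are taken as they come.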

Maybe this is just too complicated (premature optimisation), but I hope it helps as a starter...

# UPDATE!!!! - After downloading, I noticed that this script didn't
#              work any longer. Problem was that additional CRs
#              were added to the __DATA__ section.
#              Adding s/\x0d//g; to extract_csv_entries() fixed
#              this problem under Linux.
# TODO:
# [ ] use files instead of DATA
# [ ] use a CPAN module to parse CSV files
# [ ] simplify / it is much easier if all files can be held in memory:
#     - read second file to $file2_data{$key}=$value
#     - then do something along:
#       @result = map { exists $file2_data{$_} ? $file2_data{$_}
#                                               : "0"
#                     } @ordered_items;

use strict;

my %file1_data;                   # $file1_data{key} = value of last column
my @ordered_items;                # preserves order of keys from first file
my $expected_no_of_entries = 10;  # expected entries in a CSV line

sub extract_csv_entries {
    # in a real world program, one would use a CSV module from CPAN...
    my $line = shift;

    # NOTE: maybe you need to remove the next line under Windows/Mac?
    $line =~ s/\x0d//g;

    my @csv_items = split /;/, $line;
    if (@csv_items != $expected_no_of_entries) {
        die "illegal number of entries: $line ... \n";
    }
    return @csv_items[0,-1];      # first & last entry
}

sub compare_file {
    # here, we simulate reading from files...
    my $expect = 0;
    my @result;

    while (<DATA>) {
        next if /^\s*$/;
        last if /^EOF/;
        chomp;
        my ($key,$value) = extract_csv_entries($_);

        if (exists $file1_data{$key}) {
            # Update: uncomment these lines to ensure that
            #         entries are the same as in 1st file...
            # if ($file1_data{$key} ne $value) { # ensure val's didn't change
            #     die "entries for $key differ: $file1_data{$key} <=> $value";
            # }
        } else {
            die "item found that is not in first file: $key / $value\n";
        }

        # advance expected key until match
        while ($key ne $ordered_items[$expect++]) {
            push(@result,0);
        }
        push(@result, $value);    # finally in sync.
    }

    # EOF before last expected key? Pad with zeros...
    push(@result,0) for ($expect..$#ordered_items);

    if (@result != @ordered_items) { # paranoia
        die("internal error: " . join(";", @result));
    }
    return \@result;
}

sub print_result {
    my ($file_no, $aref) = @_;
    print "File $file_no: ", join("; ", @{$aref}), "\n";
}

# Step 1 - learn the key/value pairs and key-order
# read the first "file" (emulated here)
while (<DATA>) {
    next if /^\s*$/;              # skip empty lines
    last if /^EOF/;               # emulate eof
    chomp;
    my ($key,$value) = extract_csv_entries($_);
    push @ordered_items, $key;    # learn order from first file
    $file1_data{$key} = $value;   # finally learn key/value
}

# print the list of items for 1st file in original order
print "Entries: ", join (";", @ordered_items), "\n";
print_result(1, [ (map { $file1_data{$_} } @ordered_items) ]);

# now compare some sample files...
for my $file_no (2..5) {
    print_result($file_no, compare_file() );
}

__DATA__
DISTINGUERE TRA;1;14;507;0,000000242475382686773;0,00000339465535761482;0,000122935019022194;0,00000000041732202096217;9,18246152003019;9,18246152003019
MANCANTE DI;1;56;507;0,000000242475382686773;0,0000135786214304593;0,000122935019022194;0,00000000166928808384868;7,18246152003019;7,18246152003019
APPLICARE SU;1;64;507;0,000000242475382686773;0,0000155184244919535;0,000122935019022194;0,00000000190775781011278;6,9898164420878;6,9898164420878
MONTATO IN;1;78;507;0,000000242475382686773;0,0000189130798495683;0,000122935019022194;0,00000000232507983107495;6,70441422322555;6,70441422322555
IMPIEGATO IN;2;180;507;0,000000484950765373545;0,0000436455688836191;0,000122935019022194;0,00000000536556884094218;6,49796334575812;12,9959266915162
RAGGRUPPARE IN;1;109;507;0,000000242475382686773;0,0000264298167128582;0,000122935019022194;0,00000000324915002034832;6,22163211731087;6,22163211731087
EOF of first file
DISTINGUERE TRA;1;14;507;0,000000242475382686773;0,00000339465535761482;0,000122935019022194;0,00000000041732202096217;9,18246152003019;9,18246152003019
APPLICARE SU;1;64;507;0,000000242475382686773;0,0000155184244919535;0,000122935019022194;0,00000000190775781011278;6,9898164420878;6,9898164420878
MONTATO IN;1;78;507;0,000000242475382686773;0,0000189130798495683;0,000122935019022194;0,00000000232507983107495;6,70441422322555;6,70441422322555
IMPIEGATO IN;2;180;507;0,000000484950765373545;0,0000436455688836191;0,000122935019022194;0,00000000536556884094218;6,49796334575812;12,9959266915162
EOF of second file
DISTINGUERE TRA;1;14;507;0,000000242475382686773;0,00000339465535761482;0,000122935019022194;0,00000000041732202096217;9,18246152003019;me differs!
RAGGRUPPARE IN;1;109;507;0,000000242475382686773;0,0000264298167128582;0,000122935019022194;0,00000000324915002034832;6,22163211731087;6,22163211731087
EOF of dummy third file
MONTATO IN;1;78;507;0,000000242475382686773;0,0000189130798495683;0,000122935019022194;0,00000000232507983107495;6,70441422322555;6,70441422322555
EOF of dummy fourth file
MONTATO INorOUTorWhatever;1;78;507;0,000000242475382686773;0,0000189130798495683;0,000122935019022194;0,00000000232507983107495;6,70441422322555;6,70441422322555
EOF of illegal fifth file with illegal entry
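
If every file does fit in memory, the simplification mentioned in the TODO comment might look roughly like this. It is a sketch under that assumption only: values_in_order is a hypothetical helper name, and the sample hash stands in for a second file that has already been parsed into key/value pairs.

```perl
use strict;
use warnings;

# Hypothetical helper: look up each key of the first file in the hash
# built from another file; keys missing from that file become "0".
sub values_in_order {
    my ($ordered, $data) = @_;    # array ref of keys, hash ref key => value
    return [ map { exists $data->{$_} ? $data->{$_} : '0' } @$ordered ];
}

my @ordered_items = ('DISTINGUERE TRA', 'MANCANTE DI', 'APPLICARE SU');
my %file2_data    = (
    'DISTINGUERE TRA' => '9,18246152003019',
    'APPLICARE SU'    => '6,9898164420878',
);
print join('; ', @{ values_in_order(\@ordered_items, \%file2_data) }), "\n";
# prints "9,18246152003019; 0; 6,9898164420878"
```

With a hash lookup the order of lines in the second file no longer matters, and values that differ from the first file are simply copied through.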

Update: patched as requested

Re^2: Importing data to build an array
by remluvr (Sexton) on Feb 22, 2009 at 08:47 UTC
    Thanks to both of you. I'm basically new to Perl. I'll try what you suggest. Thanks!
Re^2: Importing data to build an array
by remluvr (Sexton) on Feb 22, 2009 at 10:31 UTC
    It works, the only problem is this line:
    if (exists $file1_data{$key}) { if ($file1_data{$key} ne $value) { # ensure val's didn't change
    I don't need to ensure that the values don't change. I need to write different arrays that maintain the order of the elements, even if their values differ. I played around with this line for a while, but I'm not able to change it. Any suggestions?