Re: Importing data to build an array

Hi, I don't know if you can keep all two (or more?) files in memory... So I fiddled around with an algorithm that roughly needs to keep the significant data of the first file in memory plus the result for the next file to be compared. Basically it works like this:

The first file is read. The expected order of keys is learned plus the mappings from the first column (key) to the last column (value).
For each file to be compared against the first file:

read a CSV line and analyse it
if the line matches the expected key (e.g. DISTINGUERE TRA), save the result and advance to the next line/key
if not, add a zero to the result unless the current line matches the expected key; advance to the next line/key
add more zeros to the result in case an EOF occurred before the last expected key was encountered
perform a lot of sanity checks along the way

Print the results.

Output:

Entries: DISTINGUERE TRA;MANCANTE DI;APPLICARE SU;MONTATO IN;IMPIEGATO IN;RAGGRUPPARE IN
File 1: 9,18246152003019; 7,18246152003019; 6,9898164420878; 6,70441422322555; 12,9959266915162; 6,22163211731087
File 2: 9,18246152003019; 0; 6,9898164420878; 6,70441422322555; 12,9959266915162; 0
File 3: 9,18246152003019; 0; 0; 0; 0; 6,22163211731087
File 4: 0; 0; 0; 6,70441422322555; 0; 0
item found that is not in first file: MONTATO INorOUTorWhatever / 6,70441422322555

Maybe this is just too complicated (premature optimisation), but I hope it helps as a starter...


# UPDATE!!!!  - After downloading, I noticed that this script didn't 
#               work any longer. Problem was, that additional CR's 
#               were added to the __DATA__ section.
#               Adding s/\x0d//g; to extract_csv_entries() fixed 
#               this problem under Linux.  
               
# TODO:
# [ ] use files instead of DATA
# [ ] use a CPAN module to parse CSV files
# [ ] simplify / it is much easier if all files can be hold in memory:
#      - read second file to $file2_data{$key}=$value
#      - then do something along:
#         @result = map { exists $file2_data{$_} ? $file2_data{$_} 
#                                                : "0" 
#                       } @ordered_items;
 
use strict;

my %file1_data;      # $file1_data{key} = value of last column
my @ordered_items;   # preserves order of keys from first file
my $expected_no_of_entries = 10; # expected entries in a CSV line


sub extract_csv_entries {
  # in a real world program, one would use a CSV module from CPAN...
  my $line = shift;

  # NOTE: maybe you need to remove the next line under Windows/Mac?
  $line =~ s/\x0d//g; 
  my @csv_items = split /;/, $line;

  if (@csv_items != $expected_no_of_entries) {
     die "illegal number of entries: $line ... \n";
  }

  return @csv_items[0,-1]; # first & last entry
}


sub compare_file { # here, we simulate to read from files...
  my $expect  = 0;
  my @result;

  while (<DATA>) {
    next if /^\s*$/;
    last if /^EOF/;
    chomp;

    my ($key,$value) = extract_csv_entries($_);
    if (exists $file1_data{$key}) {
# Update: uncomment these lines to ensure that
#         entries are the same as in 1st file...
#       if ($file1_data{$key} ne $value) { # ensure val's didn't chang
+e
#          die "enties for $key differ: $file1_data{$key} <=> $value";
#       }
    } else {
       die "item found that is not in first file: $key / $value\n";
    }
    
    # advance expected key until match 
    while ($key ne $ordered_items[$expect++]) {
      push(@result,0);
    }

    push(@result, $value); # finally in sync.
  }

  # EOF before last expected key? Pad with zeros...
  push(@result,0) for ($expect..$#ordered_items);

  if (@result != @ordered_items) { # paranoia
     die("internal error: " . join(";", @result));
  }

  return \@result;
}


sub print_result {
  my ($file_no, $aref) = @_;
  print "File  $file_no: ", join("; ", @{$aref}), "\n";
}


# Step 1 - learn the key/value pairs and key-order
# read the first "file" (emulated here)
while (<DATA>) {
  next if /^\s*$/;   # skip empty lines
  last if /^EOF/;    # emulate eof
  chomp;
  
  my ($key,$value)  = extract_csv_entries($_);
  push @ordered_items, $key;   # learn order from first file
  $file1_data{$key} = $value;  # finally learn key/value
}


# print the list of items for 1st file in original order
print "Entries: ", join (";", @ordered_items), "\n";
print_result(1, [ (map { $file1_data{$_} } @ordered_items) ]);

# now compare some sample files...
for my $file_no (2..5) { 
  print_result($file_no, compare_file() );
}


__DATA__

DISTINGUERE TRA;1;14;507;0,000000242475382686773;0,0000033946553576148
+2;0,000122935019022194;0,00000000041732202096217;9,18246152003019;9,1
+8246152003019
MANCANTE DI;1;56;507;0,000000242475382686773;0,0000135786214304593;0,0
+00122935019022194;0,00000000166928808384868;7,18246152003019;7,182461
+52003019
APPLICARE SU;1;64;507;0,000000242475382686773;0,0000155184244919535;0,
+000122935019022194;0,00000000190775781011278;6,9898164420878;6,989816
+4420878
MONTATO IN;1;78;507;0,000000242475382686773;0,0000189130798495683;0,00
+0122935019022194;0,00000000232507983107495;6,70441422322555;6,7044142
+2322555
IMPIEGATO IN;2;180;507;0,000000484950765373545;0,0000436455688836191;0
+,000122935019022194;0,00000000536556884094218;6,49796334575812;12,995
+9266915162
RAGGRUPPARE IN;1;109;507;0,000000242475382686773;0,0000264298167128582
+;0,000122935019022194;0,00000000324915002034832;6,22163211731087;6,22
+163211731087
EOF of first file

DISTINGUERE TRA;1;14;507;0,000000242475382686773;0,0000033946553576148
+2;0,000122935019022194;0,00000000041732202096217;9,18246152003019;9,1
+8246152003019
APPLICARE SU;1;64;507;0,000000242475382686773;0,0000155184244919535;0,
+000122935019022194;0,00000000190775781011278;6,9898164420878;6,989816
+4420878
MONTATO IN;1;78;507;0,000000242475382686773;0,0000189130798495683;0,00
+0122935019022194;0,00000000232507983107495;6,70441422322555;6,7044142
+2322555
IMPIEGATO IN;2;180;507;0,000000484950765373545;0,0000436455688836191;0
+,000122935019022194;0,00000000536556884094218;6,49796334575812;12,995
+9266915162
EOF of second file

DISTINGUERE TRA;1;14;507;0,000000242475382686773;0,0000033946553576148
+2;0,000122935019022194;0,00000000041732202096217;9,18246152003019;me 
+differs!
RAGGRUPPARE IN;1;109;507;0,000000242475382686773;0,0000264298167128582
+;0,000122935019022194;0,00000000324915002034832;6,22163211731087;6,22
+163211731087
EOF of dummy third file

MONTATO IN;1;78;507;0,000000242475382686773;0,0000189130798495683;0,00
+0122935019022194;0,00000000232507983107495;6,70441422322555;6,7044142
+2322555
EOF of dummy fourth file

MONTATO INorOUTorWhatever;1;78;507;0,000000242475382686773;0,000018913
+0798495683;0,000122935019022194;0,00000000232507983107495;6,704414223
+22555;6,70441422322555
EOF of illegal fifth file with illegal entry
[download]

Update: patched as requested

Comment on Re: Importing data to build an array Download Code

Replies are listed 'Best First'.
Re^2: Importing data to build an array by remluvr (Sexton) on Feb 22, 2009 at 08:47 UTC
Thanks to both of you. I'm basically new to perl. I'll try what you suggest.. Thanks!	[reply]
Re^2: Importing data to build an array by remluvr (Sexton) on Feb 22, 2009 at 10:31 UTC
It works, the only problem is this line: `if (exists $file1_data{$key}) { if ($file1_data{$key} ne $value) { # ensure val's didn't change` [download] I don't need to ensure vals don't change. I need to write different arrays which maintains the order of elements, even if they're different. I played around this line for a while, but I'm not able to change it. Any suggestions?	[reply] [d/l]