stevemayes has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have two files containing user data: I'm trying to do a line by line partial merge: that is to say I take file one, line one which concerns user A N Other and search through file two until I find the line relating to him. Then I want take a single entry from that line (it happens to be the first entry which makes the split relatively simple) and add that entry to the 1st file.

In my own simplistic fashion I simply used a nested loop, splitting the lines in both files to appropriate variable names then printed these out to a third file that should contain all the rows in file one with one extra item.

The third file contains only the first five lines and not the other 550 lines. Despite using -w warnings, strict, and diagnostics, there are no errors which leads me to conclude that the script is doing what I'm telling it to I'm just telling it to do something other than what I intended.

open(OUT1, ">> resultsfile.csv")||die("Failed to open resultsfile.csv: + $!\n"); open(IR, "< file1.csv")|| die("failed to open file1"); open(BS, "< file2.csv")||die ("failed to open file2"); while (defined ($line=<IR>)) { chomp $line; ($username, $firstname, $lastname, $lastlogin, $tokenexpiration) = + split(",", $line); while (defined ($line2=<BS>)) { if ($line2 =~ /$lastname/ && $line2 =~ /$firstname/) { ($bs, undef) = split(",", $line2, 2); print OUT1 "$bs $username,$firstname,$lastname,$lastlogin, +$tokenexpiration\n"; last; # go to next name in outer loop. } } }
__Data__ From file 1 Alan,Bloggs,06/11/2009,11/30/2011 David,Smith,06/08/2009,09/30/2012 Rosario,Anotherone,06/05/2009,11/30/2011 Angela,Madeupname,06/11/2009,07/31/2010 Ugochukwu,Smith,06/01/2009,10/31/2012 Amarjit,Patel,08/19/2008,11/30/2011 Julie,Schmidt,05/01/2009,09/30/2012 Waseem,Alder,06/11/2009,11/30/2011 ... for another 580 lines. File 2 abc,Alan,Bloggs,06/11/2009,11/30/2011,morefields,... cde,David,Smith,06/08/2009,09/30/2012,morefields,... abc,Rosario,Anotherone,06/05/2009,11/30/2011,morefields,... acd,Angela,Madeupname,06/11/2009,07/31/2010,morefields,... cde,Ugochukwu,Smith,06/01/2009,10/31/2012,morefields,... tla,Julie,Schmidt,05/01/2009,09/30/2012,morefields,... tl,Waseem,Alder,06/11/2009,11/30/2011,morefields,... __End of Data__

Note that the script is taking the correct data from each file, but it is only iterating through the first five lines where the desired result is to iterate through all the lines in file1 (correction: it only appears to iterate through the first five lines).

If I remove the line last; # go to next name in outer loop then the output file only has one name in it (the first line).

If I start the outer loop like this LABEL: while (defined ($line=<IR>)) { and replace the 'last;' line with next LABEL;

then the final file contains the first five entries (as with the unlabelled code).

Help in understanding what is going on and a fix is obviously what I'm looking for but comments on the approach are welcome too. Could I have used an array or a hash (yes but should I have and how)? Are there more efficient or clearer ways of achieving the desired result?

oh and while I'm asking...the regex match  if ($line2 =~ /$lastname/ && $line2 =~ /$firstname/) is clunky, how can it be rewritten? I tried  if ($line2 =~ /$lastname/ && /$firstname/) but got unitialised variable errors and if ($line2 =~ /$lastname&&$firstname/) just returns an empty file with no errors registered.

Update

As noted by SuicideJunkie, Transient and Si_lence, the problem was that the pointer was sitting at the bottom of the second file and not as I had imagined, re-searching that file on each succeeding iteration. Placing the line:

seek(BS,0,0);

immediately before the inner loop fixed the issue. (Of course that allows me to move on to the next twelve gazillion issues but that was some very useful information - thank you, I'll vote your responses up with my next available votes).

Update 2

So my final, working solution (for now) which pulls out the data I want, spits it into a file after comparing each line in file one with file two.

Well I swallowed my noobish fears and decided to follow Suicide Junkies et al.'s advice and simply slurped the files into memory (similar to Bichonfrise74's code I think). File one made sense as a hash: {name => value} key pair. File two was an array of arrays including name. So by iterating through file two and pulling out the name each time I was able to pull the value and spit them all out into a results file.

I'm already aware of various changes and improvements I want to make in v.2 but for now this code does what is needed. Thank you all.

Yet another update

I replaced all the individual declarations with two simple ones: my "($..., %..., etc);" and "our ...;"

!/usr/bin/perl -w use Time::Local; use strict; use warnings; use diagnostics; use Data::Dumper; # declare stuff to avoid errors my (%user_bs $user_bs @temp, @temp2, @user_expiration_inf, $firstname, + $current, $username, $lastname, $lastlogin, $tokenexpiration, $month +, $day, $year, $edate, $line, $line2, $bs, $b1, $y, $z, $name, $name2 +); our $/; $current = time(); # slurp up the first file into one huge hash ( name => value ) open(BSFILE, "< file1.csv") || die("Can't open file1.csv: $!\n"); while (defined ($line=<BSFILE>)) { next if $line =~ /^Data here/; chomp $line; ($b1, $y, $z) = split(/,/, $line); $name = "$z $y"; $user_bs{ $name } = $b1; } close(BSFILE); # Now to whack file two into an array of arrays. open(USERLIST, $ARGV[0]) || die("Can't open $ARGV[0]: $!\n"); while (<>) { next if /^Default Login/; # ignore first line ($username, $lastname, $firstname, $lastlogin, $tokenexpiration) = + split(",", $_))[0, 1, 2, 5, 8]; ($month, $day, $year) = split('/', $tokenexpiration); $edate = timelocal(00,00,00,$day,$month-1,$year-1900); next if $edate < $current; # remove from list all users whose toke +ns expired before today $name2 = "$firstname $lastname"; @temp2 = ($username, $name2, $lastlogin, $tokenexpiration); push @user_expiration_inf, [@temp2]; } close(USERLIST); open(OUT, ">>results.csv") || die("Can't open results.csv: $!\n"); print OUT "User, Data Here, Username, Last login, Token expiration dat +e\n"; # nice while loop to iterate through the array of arrays my $i = 0; while ($user_expiration_inf[$i][1]) { print OUT "$user_expiration_inf[$i][1], $user_bs{$user_expirati +on_inf[$i][1]}, $user_expiration_inf[$i][0], $user_expiration_inf[$i] +[2], $user_expiration_inf[$i][3]\n"; $i++; }

I'll look at Txt::CSV and Txt::CSV_XS soon. as well as DBI::RAM. Continued suggestions for improvement are totally welcome. This code is posted in the hope that it will help other total noobs.

Replies are listed 'Best First'.
Re: nested loops to compare 2 files is only looping a limited number of times.
by SuicideJunkie (Vicar) on Jun 23, 2009 at 13:11 UTC
    Consider what happens when something in the first file does not match anything in the second file...

    Your inner loop spins through the second file until it reaches EOF. None of the other lines in the first file are going to match, because there is nothing more to match against.

    One option would be to seek back to the beginning of the second file after each match, to ensure you are comparing against all the possibilities.
    You might also consider loading both files into arrays so that you have less disk I/O to slog through. The downside is that it will guzzle down memory to have the entire file read in at once.

      yes, the loops do break on the first failed match of file 1 against file 2.

      I can see that that is what is breaking it now that you've pointed it out but I was under the impression that when the inner loop had gone to the end of it's file it would break to the next iteration of the outer loop.

      do I need to label the out loop (e.g. LABEL:) and then do something like  if (/EOF/) { next LABEL;} then?

        When the inner loop goes to the end of it's file, it stays there. It doesn't reset the file pointer every time you enter that while loop.

        adding a seek BS, 0, 0 before your inner while loops should get your on the right track.
Re: nested loops to compare 2 files is only looping a limited number of times.
by Transient (Hermit) on Jun 23, 2009 at 13:19 UTC
    First thing I noticed is that you split file 1 like so:
    ($username, $firstname, $lastname, $lastlogin, $tokenexpiration) = spl +it(",", $line);
    but the data looks like this:
    Alan,Bloggs,06/11/2009,11/30/2011
    meaning that $username is set to Alan, $firstname is set to Bloggs, etc.

    Update: To avoid problems, I'd change your regex checking against the entire line ($line2 =~ /$lastname/ ... ) to something more precise.
    my ( $bs, $firstname_line_2, $lastname_line_2 ) = split (/,/,$line2); if ( $lastname eq $lastname_line_2 && $firstname eq $firstname_line_2 +) { ...
      oops, yes, I removed the username and obfuscated the surnames but forgot to remove the variable from the code. sorry.

      Excellent observations. And the fact that the file pointer doesn't automatically reset to the top of the file on each iteration is something that I periodically get hammered by, too.

      I think I would not have bothered with regexs at all. In the OP's posting, the lines from File 1 are already split and it would be trivial to do the same with File 2. I would think it would be clearer and more straightforward to do both splits and then do a simple logical compare on the name-parts.

      The regex's, of course, will work (especially given some of the other suggestions from responders to the OP's post); but it seems like the logical compares would be more readable and maintainable, IMHO. But that's just my perspective.

      ack Albuquerque, NM
Re: nested loops to compare 2 files is only looping a limited number of times.
by si_lence (Deacon) on Jun 23, 2009 at 13:37 UTC
    As noted above by SuicideJunkie you don't start from the beginning of the second file for each pass trough the outer loop, i.e. for each line of file1.

    If your second file is not too big (compared to memory available) I would suggest to read the second file into a hash before the first loop and then loop through file1 and look for matches. You only need one loop then and not the two nested ones.

    cheers

    si_lence

      I especially like si_lence's hash idea. That is what I usually do for exactly the reason that si_lence notes: since the files are not that long, the hash approach allows you to (1) dispense with one of the loops in the nest and (2) makes for very quick locating of matching entries.

      I am embarassed that I didn't see that when I posted my response earlier to the excellent stevemayes response.

      ack Albuquerque, NM
Re: nested loops to compare 2 files is only looping a limited number of times.
by bibliophile (Prior) on Jun 23, 2009 at 14:12 UTC
Re: nested loops to compare 2 files is only looping a limited number of times.
by bichonfrise74 (Vicar) on Jun 23, 2009 at 17:47 UTC
    This might be a moot point now but here a possible solution...
    #!/usr/bin/perl use strict; my $file_1 = <<FILE_1; Alan,Bloggs,06/11/2009,11/30/2011 David,Smith,06/08/2009,09/30/2012 Rosario,Anotherone,06/05/2009,11/30/2011 Angela,Madeupname,06/11/2009,07/31/2010 Ugochukwu,Smith,06/01/2009,10/31/2012 Amarjit,Patel,08/19/2008,11/30/2011 Missing,Man,08/12/2007,12/01/2013 Julie,Schmidt,05/01/2009,09/30/2012 Waseem,Alder,06/11/2009,11/30/2011 FILE_1 my %file_1; open( my $fh, "<", \$file_1 ) or die "Could not open file_1\n"; while (<$fh>) { chomp; my @cols = split ","; $file_1{"$cols[0] $cols[1]"} = $_; } close( $fh ); while (<DATA>) { chomp; my @cols = split ","; my $temp_name = "$cols[1] $cols[2]"; if (grep /$temp_name/, (keys %file_1)) { my $result = "$cols[0]," . join ",", $file_1{$temp_name}; print "$result\n"; } } __DATA__ abc,Alan,Bloggs,06/11/2009,11/30/2011,morefields,... cde,David,Smith,06/08/2009,09/30/2012,morefields,... abc,Rosario,Anotherone,06/05/2009,11/30/2011,morefields,... acd,Angela,Madeupname,06/11/2009,07/31/2010,morefields,... cde,Ugochukwu,Smith,06/01/2009,10/31/2012,morefields,... tla,Julie,Schmidt,05/01/2009,09/30/2012,morefields,... tl,Waseem,Alder,06/11/2009,11/30/2011,morefields,...

      It's not moot. I'm a beginner, and I specifically asked for comments on the approach as well as alternatives.

      I am looking at your code and, if I'm honest, going, huh? at certain points, but I guess I'm just going to run it, stick in some evals or possibly learn Data::Dumper and try and work it out.

      so a genuine thank you for taking the time to write out an alternative approach

      Steve

      (and in response to Bloodnok's sig - I firmly believe that initiate vastly overstates my abilities)