sjd6 has asked for the wisdom of the Perl Monks concerning the following question:

I have one tsv file which is exported from some cable testing software. I have to distinguish between the data that was already in the file and the data newly added during that week, which I have done. I now have 2 tsv files, one that has the old data in it and one that has the newly added data.
I need to compare each line in the newfile to each line in the oldfile and remove duplicates. It is elements 2, 3 & 6 in each line that need to be compared, as the others contain unique data.

e.g.
0 <t> 1 <t> 2 <t> 3 <t> 4 <t> 5 <t> 6
"12345" <t> "cdf34" <t> "l1" <t> "r1" <t> "notes" <t> "" <t> "l1:r1 l2:r2"

Once a dupe has been found, the part no. (0) needs to be added to the notes field (4) of the line it duplicates in the oldfile. If a dupe isn't found, an 'A' needs to be added to the front of the part no. (to make the line approved) and then that line is to be added to the bottom of the oldfile.
I have attempted this by extracting the data from the files into arrays and splitting each line at the \t to make each line an array. I've then used a for loop inside another, taking one line from the newfile and comparing its elements to the elements in every line in the oldfile. The problem is I can't seem to edit the information in the files, and the lines are printed several times as it loops through.
I hope someone can help me with this. I'm struggling a bit as I've only been learning Perl for the last 2 weeks.

thanks, steve.

Replies are listed 'Best First'.
Re: Comparing files and and elements of arrays
by dragonchild (Archbishop) on Oct 02, 2003 at 14:41 UTC
    What is the exact code you have been using? It's hard to help you fix something we can't see ...

    As for basic issues, you cannot edit a file directly from Perl. You (generally) have to:

    1. Open a filehandle to read the file
    2. Read the file into some data structure
    3. Close the read-only filehandle
    4. Edit the data structure
    5. Open a filehandle to overwrite the file
    6. Write the data structure to the file
    7. Close the write-only filehandle

    I suspect you're missing the first close and second open. That's a common mistake among newbies. To match the above 7 steps to your needs, you'll need to repeat steps 1-3 for both files.
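    The seven steps above can be sketched roughly as follows. This is only an illustration: `rewrite_file` is a hypothetical helper name, and it assumes the file is small enough to hold in memory as a list of lines.

```perl
use strict;
use warnings;

# Hypothetical helper: read a file, let the caller edit the lines,
# then write the edited lines back over the same file.
sub rewrite_file {
    my ( $path, $edit ) = @_;

    # Steps 1-3: open for reading, read into a data structure, close.
    open my $in, '<', $path or die "Can't read $path: $!";
    my @lines = <$in>;
    close $in;
    chomp @lines;

    # Step 4: edit the in-memory copy (the callback returns the new list).
    @lines = $edit->(@lines);

    # Steps 5-7: open for overwriting, write the data back, close.
    open my $out, '>', $path or die "Can't write $path: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "Can't close $path: $!";

    return scalar @lines;
}
```

    For example, `rewrite_file( $path, sub { map { "A$_" } @_ } )` would prefix every line of the file named in `$path` with an 'A'. For two files, the read part (steps 1-3) is simply done once per file before any editing starts.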

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Comparing files and and elements of arrays
by BrowserUk (Patriarch) on Oct 03, 2003 at 06:07 UTC

    This sort of does something close to what I think you described. There are a couple of obvious (from the output) errors, but the basic mechanisms are there. As supplied, it generates some dummy data into two files, 'old' & 'new', and then processes them.

    This isn't simple code and will require you to read the comments and code carefully and do some research to understand it, given your brief flirtation with perl, but hopefully it will give you a running start.

    #! perl -slw
    use strict;
    use Tie::File;

    ## Initialise the filenames here
    my( $old, $new ) = ( 'old', 'new' );

    ## Comment this out, once you specify your real files above!!
    genTestData( $old, $new );

    ## Use the files as arrays. See Tie::File.
    tie my @old, 'Tie::File', $old;
    tie my @new, 'Tie::File', $new;

    ## Build a lookup table into the old array (file)
    ## keyed by the catenation of fields 2, 3 & 6;
    my %old;
    $old{ join $;, ( split "\t", $old[ $_ ] )[ 2, 3, 6 ] } = $_ for 0 .. $#old;

    ## Remove duplicates from the new file, if any.
    ## Not sure if this was a requirement, your wording was ambiguous.
    my %seen;
    @new = map{
        ++$seen{ join( $;, (split '\t' )[ 2, 3 ,6 ] ) } == 1 ? $_ : ()
    } @new;

    ## Now process the new file line by line
    for my $lineno ( 0 .. $#new ) {
        ## Split the TSV data into an array.
        my @fields = split "\t", $new[ $lineno ];

        ## And strip the quotes from the partno for later.
        $fields[0] =~ s["([^\x22]*)"][$1];

        ## Catenate the 3 key fields and do a lookup in the old data table.
        if( exists $old{ join $;, @fields[ 2, 3, 6 ] } ) {
            ## If it exists, edit the line in the old file
            $old[ $old{ join $;, @fields[ 2, 3, 6 ] } ]
                ## locate the notes field
                =~ s[ ("[^\x22]*") (?= (?: \t [^\t]*? ){2}$ ) ] {
                    ## Make a modifiable copy
                    my $notes = $1;
                    ## and append the partno to it
                    $notes =~ s[(?<=")(.*)(?=")][$1:$fields[0]];
                    ## And return the modified field for substitution
                    ## into the old file record.
                    $notes;
                }xe;
            print "Updating old line ", $old[ $old{ join $;, @fields[ 2, 3, 6 ] } ];
        }
        ## Else append the new record to the old file
        ## prefixed with an 'A'
        else {
            push @old, 'A' . $new[ $lineno ];
            print "Adding new line '", $new[ $lineno ], "' to old file";
        }
    }

    exit(0); ## This updates the files to disk and closes them.

    ## Everything from here is for generating test data.
    sub genTestData {
        my( $old, $new ) = @_;
        srand( 1);
        open OLD, '>', $old or die $!;
        print OLD genLine() for 1 .. 100;
        close OLD;
        open NEW, '>', $new or die $!;
        print NEW genLine() for 1 .. 20;
        close NEW;
    }

    sub genLine{
        join "\t", map{ '"' . $_ . '"' }
            10000 + int rand 90000,
            'dummy',
            ('l','r')[ rand() < 0.5 ] . int rand(9),
            ('l','r')[ rand() < 0.5 ] . int rand(9),
            'notes',
            'dummy',
            '';
    }
    __END__
    P:\test>295919.pl8
    Adding new line '"13471" "dummy" "r1" "r1" "notes" "dummy" ""' to old file
    Adding new line '"65827" "dummy" "l8" "r7" "notes" "dummy" ""' to old file
    Updating old line "31648" "dummy" "l8" "r8" "notes:68098" "dummy" ""
    Adding new line '"69773" "dummy" "r5" "l3" "notes" "dummy" ""' to old file
    Adding new line '"94869" "dummy" "l5" "l2" "notes" "dummy" ""' to old file
    Adding new line '"45724" "dummy" "r0" "l1" "notes" "dummy" ""' to old file
    Updating old line "97885" "dummy" "r5" "r1" "notes:16325" "dummy" ""
    Updating old line "95152" "dummy" "l4" "l6" "notes:24029" "dummy" ""
    Adding new line '"49715" "dummy" "l3" "l5" "notes" "dummy" ""' to old file
    Adding new line '"27962" "dummy" "l7" "r8" "notes" "dummy" ""' to old file
    Adding new line '"26677" "dummy" "l5" "r5" "notes" "dummy" ""' to old file
    Adding new line '"73764" "dummy" "r2" "r3" "notes" "dummy" ""' to old file
    Updating old line "90568" "dummy" "l6" "l3" "notes:90576" "dummy" ""
    Adding new line '"45765" "dummy" "l2" "r5" "notes" "dummy" ""' to old file
    Updating old line "75975" "dummy" "l6" "l6" "notes:41819" "dummy" ""
    Adding new line '"22538" "dummy" "r2" "l8" "notes" "dummy" ""' to old file
    Adding new line '"43104" "dummy" "l0" "l1" "notes" "dummy" ""' to old file
    Adding new line '"56614" "dummy" "l3" "l0" "notes" "dummy" ""' to old file
    Adding new line '"17160" "dummy" "r0" "r2" "notes" "dummy" ""' to old file
    Adding new line '"72753" "dummy" "r3" "r6" "notes" "dummy" ""' to old file
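    For readers who find the tied arrays and the nested substitution hard going, the lookup-table idea in the code above can also be sketched on plain in-memory arrays. This is a simplified illustration and not the code that produced the output shown: `parse_line`, `format_line` and `merge_lines` are hypothetical names, every line is assumed to be chomped and to hold exactly 7 quoted, tab-separated fields, and the part number is appended to the notes with a space (the separator is a guess at what is wanted).

```perl
use strict;
use warnings;

# Hypothetical helper: split one TSV line into unquoted field values.
# The -1 limit keeps trailing empty fields.
sub parse_line {
    my ($line) = @_;
    return map { my $f = $_; $f =~ s/^"(.*)"$/$1/; $f } split /\t/, $line, -1;
}

# Hypothetical helper: quote the values again and join them with tabs.
sub format_line {
    return join "\t", map { qq("$_") } @_;
}

# Merge the lines in @$new into @$old (both array refs of chomped lines).
# Two lines count as duplicates when fields 2, 3 and 6 are all equal.
sub merge_lines {
    my ( $old, $new ) = @_;

    # Lookup table: key fields => index of the first matching line in @$old.
    my %index_of;
    for my $i ( 0 .. $#$old ) {
        my @f   = parse_line( $old->[$i] );
        my $key = join "\0", @f[ 2, 3, 6 ];
        $index_of{$key} = $i unless exists $index_of{$key};
    }

    for my $line (@$new) {
        my @f   = parse_line($line);
        my $key = join "\0", @f[ 2, 3, 6 ];

        if ( exists $index_of{$key} ) {
            # Duplicate: add the new part number to the old line's notes.
            my @o = parse_line( $old->[ $index_of{$key} ] );
            $o[4] = length( $o[4] ) ? "$o[4] $f[0]" : $f[0];
            $old->[ $index_of{$key} ] = format_line(@o);
        }
        else {
            # Not a duplicate: mark it approved and append it to the old data.
            $f[0] = "A$f[0]";
            push @$old, format_line(@f);
            $index_of{$key} = $#$old;
        }
    }
    return $old;
}
```

    Because each new line costs one hash lookup rather than a pass over every old line, each line is handled exactly once and nothing gets printed repeatedly. Registering appended lines in the table means a repeat inside the new file is treated as a dupe of the line just added; whether that is the desired behaviour depends on the requirement.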

    Good luck.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Comparing files and and elements of arrays
by Anonymous Monk on Oct 03, 2003 at 10:51 UTC
    Hi, thanks for the help that's been given so far. I'm in the process of understanding the code that's been given.
    Here's the code I have done so far... probably not the best or most efficient code there is. It probably has a few minor bugs to get it to output where I am at so far. It starts from where I have divided the contents of the input file into approved and newly added (or unapproved). Here it is.

    #!/usr/bin/perl

    my $filename02 = "unapproved.txt";
    my $filename03 = "approved.txt";
    my $filename04 = "compared.txt";

    open(UNAPPROVED,"+<$filename02") || die "Can't open file $filename02"; # open destination text file
    open(APPROVED,"+<$filename03") || die "Can't open file $filename03"; # open destination text file
    open(COMPARED,"+>$filename04") || die "Can't open file $filename04"; # open destination text file

    my @unapprovedline;
    my @approvedline;

    @approvedall = <APPROVED>;
    @unapprovedall = <UNAPPROVED>;

    close APPROVED;
    close UNAPPROVED;

    $numoflines01 = @unapprovedall;
    $numoflines02 = @approvedall;

    for ($x = 0; $x < $numoflines01; $x++) {
        $line01 = @unapprovedall[$x];
        @unapprovedline = split (/\"/, $line01);

        for ($y = 0; $y < $numoflines02; $y++) {
            $line02 = @approvedall[$y];
            @approvedline = split (/\"/, $line02);

            $uleft = @unapprovedline[5];
            $aleft = @approvedline[5];
            $uright = @unapprovedline[7];
            $aright = @approvedline[7];
            $uping = @unapprovedline[13];
            $aping = @approvedline[13];

            if ($uping eq $aping && $uleft eq $aleft && $uright eq $aright) {
                print COMPARED "\"", @approvedline[1], "\"", "\t";
                print COMPARED "\"", @approvedline[3], "\"", "\t";
                print COMPARED "\"", @approvedline[5], "\"", "\t";
                print COMPARED "\"", @approvedline[7], "\"", "\t";
                print COMPARED "\"", "@approvedline[9] @unapprovedline[1],", "\"", "\t";
                print COMPARED "\"", @approvedline[11], "\"", "\t";
                print COMPARED "\"", @approvedline[13], "\"", "\t", "\n";
            }
            else {
                print COMPARED "\"", @unapprovedline[1], "\"", "\t";
                print COMPARED "\"", @unapprovedline[3], "\"", "\t";
                print COMPARED "\"", @unapprovedline[5], "\"", "\t";
                print COMPARED "\"", @unapprovedline[7], "\"", "\t";
                print COMPARED "\"", @unapprovedline[9], "\"", "\t";
                print COMPARED "\"", @unapprovedline[11], "\"", "\t";
                print COMPARED "\"", @unapprovedline[13], "\"", "\t", "\n";
            }
        }
    }

    close COMPARED;
    # delete UNAPPROVED

    Here's the file that is produced from the above code...

    "A1000000" "" "DB50F-LEFT" "DB50F-RIGHT" "" "" + "L1:R1 L2:R2 L3:R3 L4:R4 L5:R5 L6:R6 L7:R7" "M50DFFU" "" "DB50F-LEFT" "DB50F-RIGHT" " 1000000," + "" "L1:R1 L2:R2 L3:R3 L4:R4 L5:R5 L6:R6 L7:R7" "A1000000" "" "DB50F-LEFT" "DB50F-RIGHT" "" "" + "L1:R1 L2:R2 L3:R3 L4:R4 L5:R5 L6:R6 L7:R7" "A1000000" "" "DB50F-LEFT" "DB50F-RIGHT" "" "" + "L1:R1 L2:R2 L3:R3 L4:R4 L5:R5 L6:R6 L7:R7" "A1000000" "" "DB50F-LEFT" "DB50F-RIGHT" "" "" + "L1:R1 L2:R2 L3:R3 L4:R4 L5:R5 L6:R6 L7:R7" "P000001" "" "DB9F-LEFT" "DB9F-RIGHT" " 5000001," " +" "L1:R1 L2:R2 L6:R6" "A5000001" "" "DB9F-LEFT" "DB9F-RIGHT" "" "" + "L1:R1 L2:R2 L6:R6" "A5000001" "" "DB9F-LEFT" "DB9F-RIGHT" "" "" + "L1:R1 L2:R2 L6:R6" "A5000001" "" "DB9F-LEFT" "DB9F-RIGHT" "" "" + "L1:R1 L2:R2 L6:R6" "A5000001" "" "DB9F-LEFT" "DB9F-RIGHT" "" "" + "L1:R1 L2:R2 L6:R6" "A9999999" "" "DB00F-LEFT" "DB00F-RIGHT" "" "" + "L1:R1 L3:R3 L4:R4 L6:R6" "A9999999" "" "DB00F-LEFT" "DB00F-RIGHT" "" "" + "L1:R1 L3:R3 L4:R4 L6:R6" "A9999999" "" "DB00F-LEFT" "DB00F-RIGHT" "" "" + "L1:R1 L3:R3 L4:R4 L6:R6" "A9999999" "" "DB00F-LEFT" "DB00F-RIGHT" "" "" + "L1:R1 L3:R3 L4:R4 L6:R6" "A9999999" "" "DB00F-LEFT" "DB00F-RIGHT" "" "" + "L1:R1 L3:R3 L4:R4 L6:R6"

    And this is what I'm trying to do with the results produced above. Again, this is probably a very long way of solving this problem, but I can't really think of any other way to do it as my Perl knowledge is limited.

    # for each line, get the part no., compare to contents of
    # the notes field on every other line, if it is in the
    # notes field, delete the line otherwise leave and move on
    # to the next line. Finally delete all dupes and merge the
    # list with the approved.txt

    open(COMPARED,"+<$filename04") || die "Can't open file $filename04"; # open destination text file
    @comparedall = <COMPARED>;
    $numoflines03 = @comparedall;

    for ($a = 0; $a < $numoflines03; $a++) {
        $line03 = @comparedall[$a];
        @comparedline = split (/\"/, $line03);
        $part = @comparedline[1];
        $notes = @comparedline[9];
        @notes = split (/\,/, $notes);
        $numofnotes = @notes;

        for ($b = 0; $b <= $numofnotes; $b++) {
            $notescontent = @notes[$b];
            print $notescontent, "\n";
            if ($part eq $notescontent) {
                print "hello";
            }
        }
    }

    Thanks for everyone's help, Steve.