Roguehorse has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm having a problem sorting between two files that I hope someone can help me with as I'm still pretty new to Perl.

I need to separate all values where key 1's third column in the first file is != key 2's third column in the second file, and then print those lines sequentially from the two files into another external file.

The trick is that the file has 90k+ entries, and a name such as 'Scott' may appear more than once, which I would like to keep; I only want to eliminate lines where 'Scott' shows under both key 1 and key 2 with the same second-column value.

A sample of the first file looks like this:

The second file looks like this:

My desired results would be this:

I've been through many pages of @array, %hash and awk examples, trying to modify them to suit my needs, but I have not been able to get a working solution yet, as most of them only try to remove one duplicate line.

A little help PLEASE!

Replies are listed 'Best First'.
Re: Sorting By Column
by AppleFritter (Vicar) on Jul 10, 2014 at 17:00 UTC

    Read the files line-by-line, split each pair of lines along whitespace, compare the relevant field, and output the lines if they differ:

    #!/usr/bin/perl
    use feature qw/say/;
    use warnings;
    use strict;

    open my $file1, "<", "test1.txt" or die "Could not open first file: $!\n";
    open my $file2, "<", "test2.txt" or die "Could not open second file: $!\n";

    while (my $line1 = <$file1>) {
        my $line2 = <$file2>;
        chomp ($line1, $line2);

        my ($key1, $num1, $str1) = split /\s+/, $line1;
        my ($key2, $num2, $str2) = split /\s+/, $line2;

        if ($str1 ne $str2) {
            say $line1;
            say $line2;
        }
    }

    close $file1 or die "Could not close first file: $!\n";
    close $file2 or die "Could not close second file: $!\n";

    There are bound to be shorter, more idiomatic ways of achieving the same thing, but since your user page indicates you're still new to Perl, you may find this the most instructive/useful.

    On an unrelated side note, could you edit your post to use <code> tags for your sample data? See the following nodes for more information on formatting etc.:

      I could be wrong, but I think your while loop is going to have problems if the two specified files do not have the same number of lines. If the first file has fewer lines than the second, your code won't fully process all lines in the second file. If the first file has more lines than the second, your loop will try to read more lines from the second file than it has.
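One way to guard against unequal lengths is to compare the two files' lines as arrays and keep any line that has no counterpart. This is a sketch, not AppleFritter's code: the `mismatched_lines` helper name, working on pre-chomped lines, and the choice to keep unpaired trailing lines are my assumptions, not the OP's spec.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compare two arrays of (chomped) lines pairwise. Lines whose third
# column differs, or that have no counterpart in the other file at all,
# are returned in order.
sub mismatched_lines {
    my ($lines1, $lines2) = @_;
    my @out;
    my $max = @$lines1 > @$lines2 ? scalar @$lines1 : scalar @$lines2;
    for my $i (0 .. $max - 1) {
        my $line1 = $lines1->[$i];
        my $line2 = $lines2->[$i];
        if (!defined $line1 or !defined $line2) {
            # An unpaired line cannot match anything, so keep it.
            push @out, defined $line1 ? $line1 : $line2;
            next;
        }
        my (undef, undef, $str1) = split /\s+/, $line1;
        my (undef, undef, $str2) = split /\s+/, $line2;
        push @out, $line1, $line2 if $str1 ne $str2;
    }
    return @out;
}

print "$_\n" for mismatched_lines(
    ["1 23456 Scott", "1 76543 Miller"],
    ["2 23456 Stapp"],
);
```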

        You're right -- based on the sample data the OP shared, I assumed that both files would have the same number of lines. Thanks for pointing this out, I should have mentioned it explicitly.

      This worked PERFECTLY!

      Thanks a billion!

        *tips hat* You're welcome!

        BTW, I should also point out that this ignores the second field (the numbers) entirely, as with the sample data you supplied, there is never a situation where these don't match. Depending on whether this is also the case for your actual data, you may still want to adjust the above code.

Re: Sorting By Column
by dasgar (Priest) on Jul 10, 2014 at 18:58 UTC

    I'm not sure that I totally understand what you're trying to do. It kind of sounds like you're merging multiple lists (from different files) and then wanting to do multiple column sorting of the data. Assuming that's correct, here's how I personally would approach the problem.

    First, read in each file and put the data into a multidimensional array (AoA). After that, you could use something like Data::Table sort by the second column (numerically ascending) and then by the first column (numerically ascending). Once you have it sorted, just print it out to the new file.
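A minimal sketch of that read-then-sort approach, using a plain Perl sort rather than Data::Table; the `sort_rows` helper and taking input file names from @ARGV are my assumptions, not part of the original suggestion.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort [key, number, name] rows numerically by the second column,
# then by the first column (both ascending).
sub sort_rows {
    return sort { $a->[1] <=> $b->[1] or $a->[0] <=> $b->[0] } @_;
}

# Read every line of every input file into one array of arrays (AoA).
my @rows;
for my $name (@ARGV) {
    open my $fh, "<", $name or die "Could not open $name: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        push @rows, [ split /\s+/, $line ];
    }
    close $fh;
}

print "@$_\n" for sort_rows(@rows);
```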

Re: Sorting By Column
by Anonymous Monk on Jul 10, 2014 at 18:05 UTC
    Your question is pretty hard to understand, to be honest. Maybe try to ask again, in a different way. Give some examples of this:
    The trick is that the file has 90k+ entries, and a name such as 'Scott' may appear more than once, which I would like to keep; I only want to eliminate lines where 'Scott' shows under both key 1 and key 2 with the same second-column value
    Your samples don't have any duplicates as far as I can tell.

      You're right, there are no duplicates in the sample; however, there ARE duplicates in the original file. Only duplicates that share the same column 2 and column 3 values across key 1 and key 2 should be removed.

      The idea is NOT to remove:

        1 23456 Scott
        2 23456 Stapp
        1 56789 Scott
        2 56789 Miller

      But DO REMOVE:

        1 76543 Miller
        2 76543 Miller
        1 33446 Scott
        2 33446 Scott

      Does this clarify the question a bit?
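If I've understood the rule above correctly, it can be sketched with a hash keyed on columns 2 and 3, counting how many distinct keys share each combination. The `filter_pairs` helper is a hypothetical name, and this is only a sketch of the stated rule, not something tested against the real 90k-line data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep a line only if its "column 2 + column 3" combination appears
# under a single key (column 1); combinations shared by key 1 AND
# key 2 are dropped entirely. Duplicates under one key survive.
sub filter_pairs {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        my ($key, $num, $name) = split /\s+/, $line;
        $seen{"$num $name"}{$key} = 1;
    }
    return grep {
        my (undef, $num, $name) = split /\s+/, $_;
        keys %{ $seen{"$num $name"} } == 1;
    } @lines;
}

print "$_\n" for filter_pairs(
    "1 23456 Scott", "2 23456 Stapp",
    "1 76543 Miller", "2 76543 Miller",
);
```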

Re: Sorting By Column
by oiskuu (Hermit) on Jul 10, 2014 at 22:04 UTC

    If file identity in column 1 were omitted, you could use the unix comm utility to eliminate duplicates.

    See if this works to your satisfaction (give input files as arguments):

    #! /usr/bin/perl

    my %X;

    for (<>) {
        my ($i, $k, $n) = split(" ", $_, 3);
        $X{$k.$n}{$k.$i} = $_;
    }

    for (sort { $a->[0] cmp $b->[0] } map [%$_], values %X) {
        print $_->[1] if @$_ == 2;
    }

Re: Sorting By Column
by Laurent_R (Canon) on Jul 10, 2014 at 19:05 UTC
    Sorry, I also don't really understand what you are trying to do, but only have a relatively vague idea. Perhaps it would be clearer if you showed your coding attempts.