Roguehorse has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm having a problem sorting between two files that I hope someone can help me with as I'm still pretty new to Perl.

I need to separate all values where key 1's third column in the first file is != key 2's third column in the second file, and then print those lines sequentially from the two files into another external file.

The trick is that the file has 90k+ entries, and a name such as 'Scott' may appear more than once, which I would like to keep; I only want to eliminate lines where 'Scott' shows under both key 1 and key 2 with the same second-column value.

A sample of the first file looks like this:

The second file looks like this:

My desired results would be this:

I've been through many pages of @array, %hash and awk examples, trying to modify them to suit my needs, but I have not been able to get a working solution yet, as most of them only try to remove one duplicate line.

A little help PLEASE!

Replies are listed 'Best First'.
Re: Sorting By Column
by AppleFritter (Vicar) on Jul 10, 2014 at 17:00 UTC

    Read the files line-by-line, split each pair of lines along whitespace, compare the relevant field, and output the lines if they differ:

    #!/usr/bin/perl
    use feature qw/say/;
    use warnings;
    use strict;

    open my $file1, "<", "test1.txt" or die "Could not open first file: $!\n";
    open my $file2, "<", "test2.txt" or die "Could not open second file: $!\n";

    while (my $line1 = <$file1>) {
        my $line2 = <$file2>;
        chomp ($line1, $line2);

        my ($key1, $num1, $str1) = split /\s+/, $line1;
        my ($key2, $num2, $str2) = split /\s+/, $line2;

        if ($str1 ne $str2) {
            say $line1;
            say $line2;
        }
    }

    close $file1 or die "Could not close first file: $!\n";
    close $file2 or die "Could not close second file: $!\n";

    There are bound to be shorter, more idiomatic ways of achieving the same thing, but since your user page indicates you're still new to Perl, you may find this the most instructive/useful.

    On an unrelated side note, could you edit your post to use <code> tags for your sample data? See the following nodes for more information on formatting etc.:

      I could be wrong, but I think your while loop is going to have problems if the two specified files do not have the same number of lines. If the first file has fewer lines than the second, your code won't fully process all lines in the second file. If the first file has more lines than the second, your loop will try to read more lines from the second file than it has.
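One way to guard against unequal lengths is to compare the two files' lines as arrays and keep any line that has no counterpart. This is a sketch, not AppleFritter's code: the `mismatched_lines` helper name, working on pre-chomped lines, and the choice to keep unpaired trailing lines are my assumptions, not the OP's spec.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compare two arrays of (chomped) lines pairwise. Lines whose third
# column differs, or that have no counterpart in the other file at all,
# are returned in order.
sub mismatched_lines {
    my ($lines1, $lines2) = @_;
    my @out;
    my $max = @$lines1 > @$lines2 ? scalar @$lines1 : scalar @$lines2;
    for my $i (0 .. $max - 1) {
        my $line1 = $lines1->[$i];
        my $line2 = $lines2->[$i];
        if (!defined $line1 or !defined $line2) {
            # An unpaired line cannot match anything, so keep it.
            push @out, defined $line1 ? $line1 : $line2;
            next;
        }
        my (undef, undef, $str1) = split /\s+/, $line1;
        my (undef, undef, $str2) = split /\s+/, $line2;
        push @out, $line1, $line2 if $str1 ne $str2;
    }
    return @out;
}

print "$_\n" for mismatched_lines(
    ["1 23456 Scott", "1 76543 Miller"],
    ["2 23456 Stapp"],
);
```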

        You're right -- based on the sample data the OP shared, I assumed that both files would have the same number of lines. Thanks for pointing this out, I should have mentioned it explicitly.

      This worked PERFECTLY!

      Thanks a billion!

        *tips hat* You're welcome!

        BTW, I should also point out that this ignores the second field (the numbers) entirely, as with the sample data you supplied, there is never a situation where these don't match. Depending on whether this is also the case for your actual data, you may still want to adjust the above code.

Re: Sorting By Column
by dasgar (Priest) on Jul 10, 2014 at 18:58 UTC

    I'm not sure that I totally understand what you're trying to do. It kind of sounds like you're merging multiple lists (from different files) and then wanting to do multiple column sorting of the data. Assuming that's correct, here's how I personally would approach the problem.

    First, read in each file and put the data into a multidimensional array (AoA). After that, you could use something like Data::Table sort by the second column (numerically ascending) and then by the first column (numerically ascending). Once you have it sorted, just print it out to the new file.
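A minimal sketch of that read-then-sort approach, using a plain Perl sort rather than Data::Table; the `sort_rows` helper and taking input file names from @ARGV are my assumptions, not part of the original suggestion.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort [key, number, name] rows numerically by the second column,
# then by the first column (both ascending).
sub sort_rows {
    return sort { $a->[1] <=> $b->[1] or $a->[0] <=> $b->[0] } @_;
}

# Read every line of every input file into one array of arrays (AoA).
my @rows;
for my $name (@ARGV) {
    open my $fh, "<", $name or die "Could not open $name: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        push @rows, [ split /\s+/, $line ];
    }
    close $fh;
}

print "@$_\n" for sort_rows(@rows);
```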

Re: Sorting By Column
by Anonymous Monk on Jul 10, 2014 at 18:05 UTC
    Your question is pretty hard to understand, to be honest. Maybe try to ask again, in a different way. Give some examples of this:
    The trick is that the file has 90k+ entries, and a name such as 'Scott' may appear more than once, which I would like to keep; I only want to eliminate lines where 'Scott' shows under both key 1 and key 2 with the same second-column value
    Your samples don't have any duplicates as far as I can tell.

      You're right, there are no duplicates in the sample; however, there ARE duplicates in the original file. Only duplicates that share the same column 2 and column 3 values across key 1 and key 2 should be removed.

      The idea is NOT to remove:

        1 23456 Scott
        2 23456 Stapp
        1 56789 Scott
        2 56789 Miller

      But DO REMOVE:

        1 76543 Miller
        2 76543 Miller
        1 33446 Scott
        2 33446 Scott

      Does this clarify the question a bit?
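If I've understood the rule above correctly, it can be sketched with a hash keyed on columns 2 and 3, counting how many distinct keys share each combination. The `filter_pairs` helper is a hypothetical name, and this is only a sketch of the stated rule, not something tested against the real 90k-line data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep a line only if its "column 2 + column 3" combination appears
# under a single key (column 1); combinations shared by key 1 AND
# key 2 are dropped entirely. Duplicates under one key survive.
sub filter_pairs {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        my ($key, $num, $name) = split /\s+/, $line;
        $seen{"$num $name"}{$key} = 1;
    }
    return grep {
        my (undef, $num, $name) = split /\s+/, $_;
        keys %{ $seen{"$num $name"} } == 1;
    } @lines;
}

print "$_\n" for filter_pairs(
    "1 23456 Scott", "2 23456 Stapp",
    "1 76543 Miller", "2 76543 Miller",
);
```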

Re: Sorting By Column
by oiskuu (Hermit) on Jul 10, 2014 at 22:04 UTC

    If file identity in column 1 were omitted, you could use the unix comm utility to eliminate duplicates.

    See if this works to your satisfaction (give input files as arguments):

    #! /usr/bin/perl

    my %X;

    for (<>) {
        my ($i, $k, $n) = split(" ", $_, 3);
        $X{$k.$n}{$k.$i} = $_;
    }

    for (sort { $a->[0] cmp $b->[0] } map [%$_], values %X) {
        print $_->[1] if @$_ == 2;
    }

Re: Sorting By Column
by Laurent_R (Canon) on Jul 10, 2014 at 19:05 UTC
    Sorry, I also don't really understand what you are trying to do, but only have a relatively vague idea. Perhaps it would be clearer if you showed your coding attempts.