in reply to Find the row with shortest string for a given input in a csv file.

Read the file using a CPAN module (e.g. Text::CSV), and keep a hash or array that, for each unique ID, records the length of the string in column 2, and the entire corresponding row. (A hash would be more natural, I think, since you could index it by unique ID; an array would allow you to easily preserve the ordering of rows from the original file, in case that's important).

Here's a hash-based solution:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1 })
    or die "Cannot use CSV" . Text::CSV->error_diag();

my %results = ();

while (<DATA>) {
    chomp;
    $csv->parse($_)
        or die "Could not parse string '$_'" . Text::CSV->error_diag();
    my @row      = $csv->fields();
    my $uniqueID = $row[0];
    my $string   = $row[1];
    if (!exists $results{$uniqueID}
        or $results{$uniqueID}->{'length'} > length $string) {
        $results{$uniqueID} = { 'length' => length $string, 'row' => $_ };
    }
}

foreach (sort keys %results) {
    say $results{$_}->{'row'};
}

__DATA__
A, texttexttext, col3, col4,
B, textt, col3, col4,
A, text, col3, col4,
B, texttex, col3, col4,

I'm reading from __DATA__ here; to use an external file, simply use the magic filehandle, <>, instead of <DATA>. This'll allow you to specify files on the command line as well as pipe them into the script:

$ perl script.pl data.csv
...
$ generate_csv | perl script.pl
...
$

Side note -- I see you crossposted your question to StackOverflow. That's fine, of course, but it's generally considered polite to inform people of crossposting to avoid duplicated/unnecessary effort.

Replies are listed 'Best First'.
Re^2: Find the row with shortest string for a given input in a csv file.
by Tux (Canon) on Jul 28, 2014 at 15:54 UTC

    Do not use parse (it'll break your script on fields with newlines). Use getline instead!

    Use auto_diag

    I seriously doubt that all the whitespace should be counted in the length function.

    use 5.12.2;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, allow_whitespace => 1 });

    my %results;
    while (my $row = $csv->getline (*DATA)) {
        my $uniqueID = $row->[0];
        my $string   = $row->[1];
        ($results{$uniqueID}{len} // 9999) <= length $string and next;
        $results{$uniqueID} = {
            len => length $string,
            row => $row,
            };
        }
    $csv->eol ("\n");
    $csv->print (*STDOUT, $results{$_}{row}) for sort keys %results;

    __DATA__
    A, texttexttext, col3, col4,
    B, textt, col3, col4,
    A, text, col3, col4,
    B, texttex, col3, col4,

    Enjoy, Have FUN! H.Merijn

      Do not use parse (it'll break your script on fields with newlines). Use getline instead!

      Ah, good point. Funny, my first iteration of the script actually used ->getline(), but then I reckoned that $csv->getline(*DATA) couldn't be generalized so easily to the magic filehandle. I didn't want to sacrifice the convenience of not having to explicitly open files; the issue with newlines didn't occur to me, but you're right. The devil is in the details...

      Looking at perlop now, it also turns out that <> is actually just a shorthand for <ARGV> (which is just as magic): you can write $csv->getline(*ARGV) and still have everything Just Work™, both piping data into the script and supplying a filename (or several) on the command line.
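      A minimal self-contained sketch of that variant (the demo.csv setup lines and file name are just for illustration here; in real use you'd drop them and pass the CSV file on the command line, or pipe it in):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# Demo setup only (hypothetical file name) -- in real use, delete this
# and invoke the script as:  perl script.pl data.csv
my $demo = 'demo.csv';
open my $out, '>', $demo or die "Cannot write $demo: $!";
print $out "A,text,col3\nB,textt,col3\n";
close $out;
@ARGV = ($demo);

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

# *ARGV is the glob behind the magic <>: getline() reads each file
# named in @ARGV in turn, or STDIN when @ARGV is empty.
my @ids;
while (my $row = $csv->getline(*ARGV)) {
    push @ids, $row->[0];
}
print "@ids\n";
unlink $demo;
```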

      Thanks for enlightening me, brother!

Re^2: Find the row with shortest string for a given input in a csv file.
by Anonymous Monk on Jul 28, 2014 at 12:52 UTC
    Thanks AF, I just deleted that post. I should have noted your point. Thanks

      No worries. As I said, crossposting is fine as long as you let people know.

      Regarding your original question again, if it's important that the order of lines of the original file be preserved, I think it's actually better to augment the hash to hold line numbers instead of using an array, as otherwise you'd have to grep through all previous results in each step to make sure you've not already seen a given unique ID (essentially making the whole loop O(n^2) rather than O(n) with regard to the number of lines in your file).

      The hash-based solution is easily augmented to accomplish this:

      my %results = ();
      my $position = 0;
      while(...) {
          ...
          if(...) {
              $results{$uniqueID} = {
                  ...
                  'position' => $position++,
      ...
      foreach (sort { $results{$a}->{'position'} <=> $results{$b}->{'position'} }
               keys %results) {
          ...
        Or just use Tie::IxHash to have an ordered associative array :-) instead of tracking it on our own.

        Greetings,
        Janek Schleicher
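        A minimal sketch of that suggestion (Tie::IxHash is a CPAN module, not core Perl, so it has to be installed first):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Tie::IxHash;

# Tying the hash makes keys() return insertion order,
# so no manual 'position' bookkeeping is needed.
tie my %results, 'Tie::IxHash';

$results{B} = 'seen first';
$results{A} = 'seen second';

print join(' ', keys %results), "\n";   # insertion order: B A
```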