Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have a CSV file which looks like this:

    A, texttexttext, col3, col4,
    B, textt, col3, col4,
    A, text, col3, col4,
    B, texttex, col3, col4,

I'm concerned with columns 1 and 2 only. As you can see, column 1 contains only two unique IDs, A and B. For each unique ID I want to keep the row that has the shortest string in column 2. So for the above input the output would look like this:

    B, textt, col3, col4,
    A, text, col3, col4,

Replies are listed 'Best First'.
Re: Find the row with shortest string for a given input in a csv file.
by AppleFritter (Vicar) on Jul 28, 2014 at 12:10 UTC

    Read the file using a CPAN module (e.g. Text::CSV), and keep a hash or array that, for each unique ID, records the length of the string in column 2, and the entire corresponding row. (A hash would be more natural, I think, since you could index it by unique ID; an array would allow you to easily preserve the ordering of rows from the original file, in case that's important).

    Here's a hash-based solution:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use feature qw/say/;
        use Text::CSV;

        my $csv = Text::CSV->new( { binary => 1 })
            or die "Cannot use CSV: " . Text::CSV->error_diag();

        my %results = ();
        while(<DATA>) {
            chomp;
            # Parse failures are reported by the object, not the class.
            $csv->parse($_)
                or die "Could not parse string '$_': " . $csv->error_diag();
            my @row = $csv->fields();
            my $uniqueID = $row[0];
            my $string = $row[1];

            # Keep this row if the ID is new or its string beats the stored one.
            if(!exists $results{$uniqueID} or $results{$uniqueID}->{'length'} > length $string) {
                $results{$uniqueID} = {
                    'length' => length $string,
                    'row' => $_
                };
            }
        }

        foreach (sort keys %results) {
            say $results{$_}->{'row'};
        }

        __DATA__
        A, texttexttext, col3, col4,
        B, textt, col3, col4,
        A, text, col3, col4,
        B, texttex, col3, col4,

    I'm reading from __DATA__ here; to use an external file, simply use the magic filehandle, <>, instead of <DATA>. This'll allow you to specify files on the command line as well as pipe them into the script:

        $ perl script.pl data.csv
        ...
        $ generate_csv | perl script.pl
        ...
        $

    Side note -- I see you crossposted your question to StackOverflow. That's fine, of course, but it's generally considered polite to inform people of crossposting to avoid duplicated/unnecessary effort.

      Do not use parse (it'll break your script on fields with newlines). Use getline instead!

      Use auto_diag

      I seriously doubt if all the whitespace should be counted in the length function

          use 5.12.2;
          use warnings;
          use Text::CSV;

          my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, allow_whitespace => 1 });
          my %results;
          while (my $row = $csv->getline (*DATA)) {
              my $uniqueID = $row->[0];
              my $string   = $row->[1];
              # Parentheses are required: // binds more loosely than <=.
              ($results{$uniqueID}{len} // 9999) <= length $string and next;
              $results{$uniqueID} = {
                  len => length $string,
                  row => $row,
                  };
              }
          $csv->eol ("\n");
          $csv->print (*STDOUT, $results{$_}{row}) for sort keys %results;
          __DATA__
          A, texttexttext, col3, col4,
          B, textt, col3, col4,
          A, text, col3, col4,
          B, texttex, col3, col4,

      Enjoy, Have FUN! H.Merijn

        Do not use parse (it'll break your script on fields with newlines). Use getline instead!

        Ah, good point. Funny, my first iteration of the script actually used ->getline(), but then I reckoned that $csv->getline(*DATA) couldn't be generalized so easily to the magic filehandle. I didn't want to sacrifice the convenience of not having to explicitly open files; the issue with newlines didn't occur to me, but you're right. The devil is in the details...

        Looking at perlop now, it also turns out that <> is actually just a shorthand for <ARGV> (which is just as magic): you can write $csv->getline(*ARGV) and still have everything Just Work™, both piping data into the script and supplying a filename (or several) on the command line.
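The getline(*ARGV) idea from the paragraph above can be sketched as follows. This is only a rough illustration, assuming Text::CSV is installed from CPAN; the sub name shortest_rows_from_argv and the temporary demo file are made up for this example and are not part of the scripts in this thread. The demo sets @ARGV by hand to stand in for a filename given on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;                   # CPAN module, assumed to be installed
use File::Temp qw/tempfile/;     # core module, used only for the demo file

# Hypothetical helper: read CSV rows through the magic ARGV handle and keep,
# for each ID in column 1, the row whose column 2 string is shortest.
sub shortest_rows_from_argv {
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, allow_whitespace => 1 })
        or die "Cannot use CSV: " . Text::CSV->error_diag();
    my %results;
    while (my $row = $csv->getline(*ARGV)) {
        my ($id, $string) = @{$row}[0, 1];
        next unless defined $id && defined $string;    # skip short or blank rows
        next if exists $results{$id} && $results{$id}{len} <= length $string;
        $results{$id} = { len => length $string, row => $row };
    }
    return \%results;
}

# Demo only: write the sample data to a temporary file and put its name in
# @ARGV, standing in for `perl script.pl data.csv`.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "A, texttexttext, col3, col4,\n",
            "B, textt, col3, col4,\n",
            "A, text, col3, col4,\n",
            "B, texttex, col3, col4,\n";
close $fh or die "Cannot close $path: $!";
@ARGV = ($path);

my $results = shortest_rows_from_argv();
printf "%s: %s\n", $_, $results->{$_}{row}[1] for sort keys %$results;
```

In a real script the demo lines that create the file and assign @ARGV would be dropped, so that filenames on the command line or piped input are picked up as described above.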

        Thanks for enlightening me, brother!

      Thanks AF, I just deleted that post. I should have noted your point. Thanks

        No worries. As I said, crossposting is fine as long as you let people know.

        Regarding your original question again, if it's important that the order of lines of the original file be preserved, I think it's actually better to augment the hash to hold line numbers instead of using an array, as otherwise you'd have to grep through all previous results in each step to make sure you've not already seen a given unique ID (essentially making the whole loop O(n^2) rather than O(n) with regard to the number of lines in your file).

        The hash-based solution is easily augmented to accomplish this:

            my %results = ();
            my $position = 0;
            while(...) {
                ...
                if(...) {
                    $results{$uniqueID} = {
                        ...
                        'position' => $position++,
                        ...

            foreach (sort { $results{$a}->{'position'} <=> $results{$b}->{'position'} } keys %results) {
                ...
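For reference, here is one way the position-tracking outline above might be filled in. It is a sketch under stated assumptions, not the poster's actual script: the sub name shortest_rows_in_order is invented for illustration, and to stay free of CPAN dependencies it splits on commas by hand, which assumes fields are never quoted and never contain commas. For real data the Text::CSV parsing shown earlier in the thread would be the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: given chomped CSV lines, return the row with the
# shortest column 2 string for each ID, in the order those rows appeared.
sub shortest_rows_in_order {
    my @lines = @_;
    my %results;
    my $position = 0;
    for my $line (@lines) {
        $position++;
        # Assumption: plain comma-separated fields without quoting.
        my ($id, $string) = (split /,/, $line)[0, 1];
        next unless defined $id && defined $string;
        s/^\s+|\s+$//g for $id, $string;    # do not count padding in the length
        next if exists $results{$id} && $results{$id}{len} <= length $string;
        $results{$id} = { len => length $string, position => $position, row => $line };
    }
    # One hash lookup per line keeps the loop O(n); the sort only touches
    # the surviving rows.
    return map  { $results{$_}{row} }
           sort { $results{$a}{position} <=> $results{$b}{position} }
           keys %results;
}

my @input = (
    'A, texttexttext, col3, col4,',
    'B, textt, col3, col4,',
    'A, text, col3, col4,',
    'B, texttex, col3, col4,',
);
print "$_\n" for shortest_rows_in_order(@input);
```

With the sample data this prints the B row before the A row, which is the order shown in the original question, because each ID is ranked by the position of its winning row rather than by its name.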
Re: Find the row with shortest string for a given input in a csv file.
by Laurent_R (Canon) on Jul 28, 2014 at 21:43 UTC
    Hmm, I realize that some monks might object to that, but do we really need to use Text::CSV when the whole thing can be done in 4 lines of actual code?
        use strict;
        use warnings;
        use feature qw/say/;

        my %results;
        while (<DATA>) {
            my ($id, $string) = (split /[,\s]+/)[0,1];
            next if defined $results{$id} and length $string > length $results{$id};
            $results{$id} = $string;
        }
        say "$_ $results{$_}" for sort keys %results;

        __DATA__
        A, texttexttext, col3, col4,
        B, textt, col3, col4,
        A, text, col3, col4,
        B, texttex, col3, col4,
    Or, possibly one more code line if we really want to cache the length in the hash:
        # ...
        while (<DATA>) {
            my ($id, $string) = (split /[,\s]+/)[0,1];
            my $cur_len = length $string;
            next if defined $results{$id} and $cur_len > $results{$id}{len};
            $results{$id} = { str => $string, len => $cur_len };
        }
        say "$_ $results{$_}{str}" for sort keys %results;
        #...
      ...do we really need to use Text::CSV when the whole thing can be done in 4 lines of actual code?

      I appreciate this attitude. While we should be giving options, and educating the OPs to some extent, I don't see the benefit here of installing a module hierarchy for a problem this small. There are additional problems newbies and old hats both face when installing modules, plus the added burden of dependencies.

      I guess I'm a bit surprised that it took this many responses to get to "here it is in 4 lines of code, without a module".

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        I guess I'm a bit surprised that it took this many responses to get to "here it is in 4 lines of code, without a module".

        FAQ fatigue is real :)