Find the row with shortest string for a given input in a csv file.

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Find the row with shortest string for a given input in a csv file. by AppleFritter (Vicar) on Jul 28, 2014 at 12:10 UTC
Read the file using a CPAN module (e.g. Text::CSV), and keep a hash or array that, for each unique ID, records the length of the string in column 2 and the entire corresponding row. (A hash would be more natural, I think, since you could index it by unique ID; an array would allow you to easily preserve the ordering of rows from the original file, in case that's important.)

Here's a hash-based solution:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw/say/;

    use Text::CSV;

    my $csv = Text::CSV->new( { binary => 1 })
        or die "Cannot use CSV" . Text::CSV->error_diag();

    # For each unique ID: the shortest string length seen so far and its row.
    my %results = ();

    while(<DATA>) {
        chomp;
        # Ask the object (not the class) for the diagnostics of a failed parse.
        $csv->parse($_)
            or die "Could not parse string '$_'" . $csv->error_diag();
        my @row = $csv->fields();
        my $uniqueID = $row[0];
        my $string = $row[1];
        if(!exists $results{$uniqueID}
                or $results{$uniqueID}->{'length'} > length $string) {
            $results{$uniqueID} = {
                'length' => length $string,
                'row' => $_
            };
        }
    }

    foreach (sort keys %results) {
        say $results{$_}->{'row'};
    }

    __DATA__
    A, texttexttext, col3, col4,
    B, textt, col3, col4,
    A, text, col3, col4,
    B, texttex, col3, col4,

I'm reading from `__DATA__` here; to use an external file, simply use the magic filehandle, `<>`, instead of `<DATA>`. This'll allow you to specify files on the command line as well as pipe them into the script:

    $ perl script.pl data.csv
    ...
    $ generate_csv | perl script.pl
    ...
    $

Side note -- I see you crossposted your question to StackOverflow. That's fine, of course, but it's generally considered polite to inform people of crossposting to avoid duplicated/unnecessary effort.
Re^2: Find the row with shortest string for a given input in a csv file. by Tux (Canon) on Jul 28, 2014 at 15:54 UTC
- Do not use `parse` (it'll break your script on fields with newlines). Use `getline` instead!
- Use `auto_diag`.
- I seriously doubt if all the whitespace should be counted in the `length` function.

    use 5.12.2;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, allow_whitespace => 1 });

    my %results;
    while (my $row = $csv->getline (*DATA)) {
        my $uniqueID = $row->[0];
        my $string   = $row->[1];
        # Parentheses are required: `//` binds more loosely than `<=`.
        ($results{$uniqueID}{len} // 9999) <= length $string and next;
        $results{$uniqueID} = {
            len => length $string,
            row => $row,
            };
        }

    $csv->eol ("\n");
    $csv->print (*STDOUT, $results{$_}{row}) for sort keys %results;

    __DATA__
    A, texttexttext, col3, col4,
    B, textt, col3, col4,
    A, text, col3, col4,
    B, texttex, col3, col4,

Enjoy, Have FUN! H.Merijn
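A note on the `// 9999` guard: in Perl, `//` binds more loosely than `<=`, so how the comparison is grouped changes the result. Here is a minimal sketch of the comparison in isolation. Both helper names are hypothetical, and 9999 is simply the sentinel used in the script above.

```perl
use 5.010;    # the defined-or operator needs Perl 5.10 or later
use strict;
use warnings;

# Hypothetical helper: true when $candidate is NOT shorter than the
# length recorded so far. An undefined $known_len (ID not seen yet)
# falls back to the sentinel 9999, so the first row is always kept.
sub not_shorter {
    my ($known_len, $candidate) = @_;
    return (($known_len // 9999) <= length $candidate) ? 1 : 0;
}

# The same expression without the inner parentheses. Perl parses it as
# $known_len // (9999 <= length $candidate), which is true for every
# defined, non-zero $known_len regardless of the candidate.
sub not_shorter_ungrouped {
    my ($known_len, $candidate) = @_;
    return ($known_len // 9999 <= length $candidate) ? 1 : 0;
}

printf "grouped: %d, ungrouped: %d\n",
    not_shorter(12, 'text'), not_shorter_ungrouped(12, 'text');
```

With a recorded length of 12 and the candidate `text`, the grouped form reports that the candidate is shorter (so the row would be replaced), while the ungrouped form would skip it.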
Re^3: Find the row with shortest string for a given input in a csv file. by AppleFritter (Vicar) on Jul 28, 2014 at 18:34 UTC
> Do not use parse (it'll break your script on fields with newlines). Use getline instead!

Ah, good point. Funny, my first iteration of the script actually used `->getline()`, but then I reckoned that `$csv->getline(*DATA)` couldn't be generalized so easily to the magic filehandle. I didn't want to sacrifice the convenience of not having to explicitly open files; the issue with newlines didn't occur to me, but you're right. The devil is in the details...

Looking at perlop now, it also turns out that `<>` is actually just a shorthand for `<ARGV>` (which is just as magic): you can write `$csv->getline(*ARGV)` and still have everything Just Work™, both piping data into the script and supplying a filename (or several) on the command line.

Thanks for enlightening me, brother!
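To illustrate the `*ARGV` point without depending on Text::CSV, here is a minimal core-only sketch. The helper `read_all_lines` is hypothetical and merely stands in for a reader such as `getline`; the temporary file and the `local @ARGV` assignment simulate running `perl script.pl data.csv`.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper standing in for a CSV reader: it only needs
# something it can call readline on, so a glob such as *ARGV works.
sub read_all_lines {
    my ($fh) = @_;
    my @lines;
    while (defined(my $line = readline $fh)) {
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

# Write a small sample file so the sketch is self-contained.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "A, text, col3, col4,\nB, textt, col3, col4,\n";
close $out or die "Cannot close $path: $!";

# Simulate a filename on the command line: reading from ARGV opens
# each file named in @ARGV in turn.
my @lines = do {
    local @ARGV = ($path);
    read_all_lines(*ARGV);
};
print scalar(@lines), " lines read via *ARGV\n";
```

With an empty `@ARGV`, the same call would read from standard input instead, which is what makes piping work.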
Re^2: Find the row with shortest string for a given input in a csv file. by Anonymous Monk on Jul 28, 2014 at 12:52 UTC
Thanks AF, I just deleted that post. I should have noted what you pointed out. Thanks.
Re^3: Find the row with shortest string for a given input in a csv file. by AppleFritter (Vicar) on Jul 28, 2014 at 13:06 UTC
No worries. As I said, crossposting is fine as long as you let people know.

Regarding your original question again: if it's important that the order of lines of the original file be preserved, I think it's actually better to augment the hash to hold line numbers instead of using an array, as otherwise you'd have to grep through all previous results in each step to make sure you've not already seen a given unique ID (essentially making the whole loop O(n^2) rather than O(n) with regard to the number of lines in your file).

The hash-based solution is easily augmented to accomplish this:

    my %results = ();
    my $position = 0;

    while(...) {
        ...
        if(...) {
            $results{$uniqueID} = {
                ...
                'position' => $position++,
    ...
    foreach (sort { $results{$a}->{'position'} <=> $results{$b}->{'position'} } keys %results) {
    ...
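The elided pieces might be filled in roughly as follows. This is a minimal sketch, not the original script: it assumes plain comma-separated lines without quoted fields and splits on commas to stay core-only, and the helper name `shortest_rows_in_order` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: for each unique ID, keep the row whose second
# column is shortest, and return the kept rows in the order in which
# they appeared in the input.
# Assumes simple comma-separated lines with no quoted or embedded commas.
sub shortest_rows_in_order {
    my @lines = @_;
    my %results;
    my $position = 0;

    for my $line (@lines) {
        my ($id, $string) = split /\s*,\s*/, $line;
        next unless defined $string;    # skip blank or malformed lines

        if (!exists $results{$id} or $results{$id}{length} > length $string) {
            $results{$id} = {
                length   => length $string,
                row      => $line,
                position => $position++,    # order in which kept rows were stored
            };
        }
    }

    return map  { $results{$_}{row} }
           sort { $results{$a}{position} <=> $results{$b}{position} }
           keys %results;
}

my @rows = shortest_rows_in_order(
    'B, textt, col3, col4,',
    'A, texttexttext, col3, col4,',
    'A, text, col3, col4,',
    'B, texttex, col3, col4,',
);
print "$_\n" for @rows;
```

Because the position is recorded whenever a row is stored, the output follows the order of the kept rows themselves, not the first appearance of each ID.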
Re^4: Find the row with shortest string for a given input in a csv file. by bigj (Monk) on Jul 28, 2014 at 14:22 UTC
Re: Find the row with shortest string for a given input in a csv file. by Laurent_R (Canon) on Jul 28, 2014 at 21:43 UTC
Hmm, I realize that some monks might object to that, but do we really need to use Text::CSV when the whole thing can be done in 4 lines of actual code?

    use strict;
    use warnings;
    use feature qw/say/;

    my %results;
    while (<DATA>) {
        my ($id, $string) = (split /[,\s]+/)[0,1];
        next if defined $results{$id} and length $string > length $results{$id};
        $results{$id} = $string;
    }
    say "$_ $results{$_}" for sort keys %results;

    __DATA__
    A, texttexttext, col3, col4,
    B, textt, col3, col4,
    A, text, col3, col4,
    B, texttex, col3, col4,

Or, possibly one more code line if we really want to cache the length in the hash:

    # ...
    while (<DATA>) {
        my ($id, $string) = (split /[,\s]+/)[0,1];
        my $cur_len = length $string;
        next if defined $results{$id} and $cur_len > $results{$id}{len};
        $results{$id} = { str => $string, len => $cur_len };
    }
    say "$_ $results{$_}{str}" for sort keys %results;
    #...
Re^2: Find the row with shortest string for a given input in a csv file. by QM (Parson) on Jul 29, 2014 at 08:21 UTC
> ...do we really need to use Text::CSV when the whole thing can be done in 4 lines of actual code?

I appreciate this attitude. While we should be giving options, and educating the OPs to some extent, I don't see the benefit here of installing a module hierarchy for a problem this small. There are additional problems newbies and old hats both face when installing modules, plus the added burden of dependencies.

I guess I'm a bit surprised that it took this many responses to get to "here it is in 4 lines of code, without a module".

-QM
--
Quantum Mechanics: The dreams stuff is made of
Re^3: Find the row with shortest string for a given input in a csv file. by Anonymous Monk on Jul 29, 2014 at 08:32 UTC
> I guess I'm a bit surprised that it took this many responses to get to "here it is in 4 lines of code, without a module".

FAQ fatigue is real :)