temporal has asked for the wisdom of the Perl Monks concerning the following question:

Sometimes I want to find the longest string in a particular column of a rather large, unsorted CSV file (several GB).

This is useful to know when specifying sane database column sizes, among other things. I've also found myself doing variations on this theme: quick-and-dirty comparisons or operations based on some attribute of a particular column (or columns) in a delimited file.

Typically I'll do something like this:

perl -F, -lane 'print $t = length $F[0] <= $t ? next LINE : length $F[0]' file.csv
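
Spelled out as a plain script (roughly; the LINE label is the one -n puts on the implicit loop, which is why next LINE works), that's:

#!/usr/bin/perl
# Rough expansion of the one-liner above: prints each new maximum,
# so the last number printed is the longest first field.
my $t = 0;
LINE: while (<>) {
    chomp;                             # what -l does on input
    my @F = split /,/, $_;             # what -F, -a does
    next LINE if length $F[0] <= $t;   # not a new maximum
    $t = length $F[0];
    print "$t\n";
}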

Just wondering if anyone has a cleaner one-liner to accomplish this. There's got to be a sexier way. Or maybe just a more efficient way. Any ideas?

Strange things are afoot at the Circle-K.

Re: Delimited File Analysis One-Liner?
by BrowserUk (Patriarch) on May 01, 2012 at 20:51 UTC

    My guess is that this would be fastest for your specific example:

    perl -nE"$l=index$_,',';$m<$l and $m=$l}{ say $m"
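
    Unpacked, that is roughly:

    use feature 'say';
    my $m = 0;
    while (<>) {
        my $l = index $_, ',';  # position of the first comma == length of field one
        $m = $l if $m < $l;     # keep the running maximum
    }
    say $m;                     # the }{ pushes this after the loop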

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Thanks BrowserUk, that's a clever way to do it!

      Didn't know that Perl allows you to close the brackets like that, either.

      Strange things are afoot at the Circle-K.
        Didn't know that Perl allows you to close the brackets like that, either.

        Do a super search for "secret operators" and "eskimo greeting".
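
        The short version: -n wraps your code in LINE: while (<>) { ... }, so the }{ closes the loop body early and leaves the rest in a bare block that runs once at EOF:

        use feature 'say';       # what -E turns on
        # perl -nE"...}{ say $m" becomes, after -n does its wrapping:
        LINE: while (<>) {
            $l = index $_, ',';
            $m < $l and $m = $l;
        }           # the } from }{ closes the loop...
        {           # ...and the { opens a bare block
            say $m; # runs once, after all input is read
        }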

      Thanks for the tip, BrowserUk. Always fun to learn a new trick.

I've generalized your code to work on any column, where i is the zero-based column index:

      perl -F, -anE '$m<($x=length $F[i]) and $m=$x}{say $m' file.csv
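
      So for, say, the third column (i = 2):

      perl -F, -anE '$m<($x=length $F[2]) and $m=$x}{say $m' file.csv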

I'd still use yours if I'm only looking at the first column. I wonder if there's a way to continue with that same idea (using index): count delimiters out to a particular column, then take the distance between the last two delimiters. Probably wouldn't be a one-liner at that point, though.
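
      Maybe something like this untested sketch, where $n picks the (zero-based) column; I doubt it would beat split once you're past the first column:

      # Sketch: longest field $n via repeated index() calls, no split.
      my $n = 2;
      my $m = 0;
      while (<>) {
          chomp;
          my $start = 0;
          for (1 .. $n) {                   # skip the first $n commas
              $start = 1 + index $_, ',', $start;
              last if $start == 0;          # index returned -1: too few fields
          }
          next if $n && $start == 0;        # line didn't have enough fields
          my $end = index $_, ',', $start;  # comma that closes the field...
          $end = length $_ if $end < 0;     # ...or end of line for the last field
          my $l = $end - $start;
          $m = $l if $l > $m;
      }
      print "$m\n";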

Also, I'm curious about your use of $1. Is it a shell variable? Executing your command as written (in bash) doesn't give me any output; I have to swap the single and double quotes. Then I have to use a different variable, since Perl won't let me assign to $1.

      Strange things are afoot at the Circle-K.
        I wonder if there's a way to continue on that same idea (using index) and count delimiters out to a particular column

        No. Beyond the first column, -aF, is about as efficient as it gets.

        Also, curious about your use of $1.

You need to get a better font! It isn't $1 (one), but rather $l (a lowercase L, for length) and $m (for max).

With any reasonable font they should be distinct, but I see it was a bad choice for posting here.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Re: Delimited File Analysis One-Liner?
by Anonymous Monk on May 01, 2012 at 20:33 UTC
    perl -lne 'print $t = do{ /,/; $l = $-[0] } <= $t ? next : $l'
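
    Here $-[0] is the offset where the match from /,/ starts, i.e. the position of the first comma, which equals the length of the first field. Roughly:

    my $t = 0;
    while (<>) {
        /,/;              # find the first comma
        my $l = $-[0];    # @- holds match start offsets
                          # (note: if a line has no comma, $-[0] keeps its old value)
        next if $l <= $t; # not a new maximum
        $t = $l;
        print "$t\n";
    }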

    Efficient? - Yes.
    Sexier? - Not really.
      A faster way:

      perl -lne 'print $t = $l if ($l=index $_,q{,}) > $t' file.csv
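
      All of these print the running maximum, so the last line of output is the answer. If you only want the final number, one variation is to defer the print to an END block:

      perl -lne '$l = index $_, q{,}; $t = $l if $l > $t; END { print $t }' file.csv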