in reply to Re^3: Sort Large Files
in thread Sort Large Files

file=""
file="data1 data2"
file="-s squeeze-my-blanks"
Did you actually try those? I suspect not, because you'd need an "eval" in your script to get the shell to do another round of whitespace parsing after the variable is interpolated, and that's a Very Good Thing. As it is, you'd be trying to cat the current directory (garbage in garbage out), a file named "data1 data2" and probably get some switch violation because it'd be a single weird switch with a lot of odd chars in it.
In this particular pipeline, $file could have been placed after the perl command (if you can assume $file doesn't have a switch for cat). However, that would place the data to act on somewhere in the middle of the pipeline. Which I find harder to understand. Flow should go from right to left, left to right, top to bottom, or bottom to top. But not middle, left, right. cat is short, just three letters, which places the data nearly at the beginning.
Nobody is saying violate the order. Write it like this if you want it left to right:
< $file \
perl ... |
other ... |
nextthing ... |
and_so_on ...
I hand out the Useless Use of Cat Award precisely because of code like yours, where a cat is indeed completely useless.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Re^5: Sort Large Files
by Anonymous Monk on Jan 06, 2005 at 14:40 UTC

    Did you actually try those?

    Yes I did. Did you?

    $ echo "hello" > data1
    $ echo "world" > data2
    $ file="data1 data2"
    $ cat $file | wc -l
    2
    $ < $file | wc -l
    bash: $file: ambiguous redirect
    0

    I suspect not, because you'd need an "eval" in your script to get the shell to do another round of whitespace parsing after the variable is interpolated, and that's a Very Good Thing.

    Really? I've been writing constructs of the form:

    FILES="file1 file2 file3 file4 file5"
    for file in $FILES
    do
        ... something with $file ...
    done
    for a couple of decades. And now you're telling me it never worked???? Now, you're free not to believe me, but may I quote from the beginning of perl's Configure?
    paths='/bin /usr/bin /usr/local/bin /usr/ucb /usr/local /usr/lbin'
    paths="$paths /opt/bin /opt/local/bin /opt/local /opt/lbin"
    paths="$paths /usr/5bin /etc /usr/gnu/bin /usr/new /usr/new/bin /usr/nbin"
    paths="$paths /opt/gnu/bin /opt/new /opt/new/bin /opt/nbin"
    paths="$paths /sys5.3/bin /sys5.3/usr/bin /bsd4.3/bin /bsd4.3/usr/ucb"
    paths="$paths /bsd4.3/usr/bin /usr/bsd /bsd43/bin /usr/ccs/bin"
    paths="$paths /etc /usr/lib /usr/ucblib /lib /usr/ccs/lib"
    paths="$paths /sbin /usr/sbin /usr/libexec"
    paths="$paths /system/gnu_library/bin"

    for p in $paths
    do
        case "$p_$PATH$p_" in
        *$p_$p$p_*) ;;
        *) test -d $p && PATH=$PATH$p_$p ;;
        esac
    done
    No extra eval happening here.

    I hand out the Useless Use of Cat Award precisely because of code like yours, where a cat is indeed completely useless.

    Well, my code works and your suggested alternative does not work. So I get two things: an award, and working code. Good for me.

      If this makes two files named "x" and "y", and not one file named "x space y":
      f="x y"
      touch $f
      then your shell is not /bin/sh compatible. Whitespace parsing happens before variable parsing in every bourne-ish shell I've used since the late 70s.

      As for "my" syntax:

      < $file | wc -l
      you erroneously put an extra pipe in there. Remove it, try again, and give yourself minus 1 point for bad copying.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.


      Update: Yeah, I apparently got the first part wrong. My memory of shell programming has really crept away over the years. However, I just noticed that zsh does it differently than bash, which is probably why I now misremember. zsh does work the way I stated.

      However, the second part does work on real shells, just not on bash or csh.

        Whitespace parsing happens before variable parsing in every bourne-ish shell I've used since the late 70s.

        I guess the intersection of the sets of shells we have used is empty then. Anyway, here's the relevant portion of IEEE Std 1003.1. From section 2.6:

        The order of word expansion shall be as follows:
        1. Tilde expansion (see Tilde Expansion), parameter expansion (see Parameter Expansion), command substitution (see Command Substitution), and arithmetic expansion (see Arithmetic Expansion) shall be performed, beginning to end. See item 5 in Token Recognition.
        2. Field splitting (see Field Splitting) shall be performed on the portions of the fields generated by step 1, unless IFS is null.
        3. Pathname expansion (see Pathname Expansion) shall be performed, unless set -f is in effect.
        4. Quote removal (see Quote Removal) shall always be performed last.
        As you see, parameter expansion happens before word splitting.
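
The ordering quoted above can be checked with a small sketch. It assumes a POSIX-style sh or bash (zsh does not split unquoted parameters by default), and the variable names are only illustrative, not taken from the thread.

```shell
#!/bin/sh
# Illustrative variable; any unquoted parameter is treated the same way.
files="data1 data2"

# Unquoted: the parameter is expanded first, then the result is split on $IFS.
set -- $files
unquoted_count=$#

# Quoted: the expansion is protected from field splitting.
set -- "$files"
quoted_count=$#

echo "unquoted fields: $unquoted_count, quoted fields: $quoted_count"
```

If the unquoted count comes out as 2 and the quoted count as 1, the shell expanded the parameter before splitting it into words.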

        Here's the relevant section from the bash manual:

        The order of expansions is: brace expansion, tilde expansion, parameter, variable and arithmetic expansion and command substitution (done in a left-to-right fashion), word splitting, and pathname expansion.
        
        Of course you say "New fangled things! GNU, POSIX, who needs them! V7, that's what real men use." So be it. From the Unix V7 manual:
        Blank interpretation
        After parameter and command substitution, any results of substitution are scanned for internal field separator characters (those found in $IFS) and split into distinct arguments where such characters are found. Explicit null arguments ("" or '') are retained. Implicit null arguments (those resulting from parameters that have no values) are removed.
        Now, I don't want to claim you are wrong, but if you have never programmed in the Unix V7 shell, GNU bash, or a POSIX compliant shell, which shells have you used since the 70s?

        As for "my" syntax:

        < $file | wc -l
        you erroneously put an extra pipe in there. Remove it, try again, and give yourself minus 1 point for bad copying.

        You're right. Think it will help, removing that pipe? Let's find out!

        $ echo "hello" > data1
        $ echo "world" > data2
        $ file="data1 data2"
        $ < $file wc -l
        bash: $file: ambiguous redirect
        Nope. Guess my "useless cat" is still very very useful.
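
The same comparison can be scripted as a rough sketch. It assumes sh or bash and a scratch directory from mktemp; the names cat_count and redirect_status are made up for illustration. The redirect is only checked for failure, since the exact error message differs between shells.

```shell
#!/bin/sh
# Scratch directory so the sketch does not touch real files.
tmpdir=$(mktemp -d)
echo "hello" > "$tmpdir/data1"
echo "world" > "$tmpdir/data2"
file="data1 data2"

# Unquoted $file is split into two arguments, so cat reads both files.
cat_count=$(cd "$tmpdir" && cat $file | wc -l | tr -d ' ')

# A redirect takes exactly one target, so the two-name value cannot be opened.
if (cd "$tmpdir" && < $file wc -l) >/dev/null 2>&1; then
    redirect_status=ok
else
    redirect_status=failed
fi

echo "cat counted $cat_count lines; redirect $redirect_status"
rm -rf "$tmpdir"
```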

        Forgetting to write "$x" instead of $x is a classic shell programming mistake which results in things breaking for strings that contain whitespace. And it has been a classic mistake since the '70s.

        If I were hiring for a job that required shell programming, that'd be one of the questions I'd ask.
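
A minimal sketch of that mistake, assuming sh or bash and a temporary directory whose own path contains no whitespace; the file name "my file" is hypothetical.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
x="$tmpdir/my file"          # a file name containing a space
printf 'one\ntwo\n' > "$x"

# Quoted: cat receives the single argument ".../my file".
quoted_lines=$(cat "$x" | wc -l | tr -d ' ')

# Unquoted: cat receives ".../my" and "file", neither of which exists.
if cat $x >/dev/null 2>&1; then
    unquoted_result=ok
else
    unquoted_result=failed
fi

echo "quoted: $quoted_lines lines; unquoted: $unquoted_result"
rm -rf "$tmpdir"
```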

        - tye        

        However, the second part does work on real shells, just not on bash or csh.
        I presume that by "real shells" you mean your current favourite shell, "zsh". You are only partially right. You are right that the syntax works, but not the semantics. In
        file="data1 data2"
        <$file wc -l
        zsh does not give you the number of lines in the files "data1" and "data2". Instead, it gives you the number of lines of the file (singular) "data1 data2". The use of cat isn't going to save the day though,
        file="data1 data2"
        cat $file | wc -l
        also gives a count of the number of lines in the file "data1 data2".

        No doubt zsh has a way of getting the count of lines from both files, after all, zsh is supposed to have every feature under the sun and then some, but it's not <$file.

Re^5: Sort Large Files
by jdporter (Paladin) on Mar 21, 2005 at 04:19 UTC
    Did you actually try those?
    I thought the point was that some calls to cat can't be eliminated:
    cat data1 data2 | wc -l # How would you do this with redirects?
    cat -s sq.my.bl | wc -l # cat itself can do some processing.
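
Both cases can be tried in a scratch directory. This is only a sketch: it assumes a cat that supports -s (GNU and BSD cat do, but the option is not in POSIX), and the file contents are invented for the example.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
echo "hello" > "$tmpdir/data1"
echo "world" > "$tmpdir/data2"
# Two text lines separated by three empty lines: five lines in total.
printf 'a\n\n\n\nb\n' > "$tmpdir/sq.my.bl"

# cat merges both files into one stream, so wc sees a single input.
merged_lines=$(cat "$tmpdir/data1" "$tmpdir/data2" | wc -l | tr -d ' ')

# A plain redirect passes the file through unchanged.
raw_lines=$(wc -l < "$tmpdir/sq.my.bl" | tr -d ' ')

# cat -s squeezes each run of empty lines down to one before wc counts.
squeezed_lines=$(cat -s "$tmpdir/sq.my.bl" | wc -l | tr -d ' ')

echo "merged: $merged_lines, raw: $raw_lines, squeezed: $squeezed_lines"
rm -rf "$tmpdir"
```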