RMGir has asked for the wisdom of the Perl Monks concerning the following question:

I have an idiom I use a lot to do quick searches through log files or data streams, or to do a "uniq" on a file without having to sort it.

I'm wondering if someone can suggest a better way to do it, out of curiosity. If no one has a better way, then here's my way, you're welcome to use it :)

Simple "uniq" (print unique lines):

perl -ne'print unless $seen{$_}++' filesToUniq
(Note that I'm not using strict, not using -w. These are meant to be one-liners)

Print out all the seen values for a field (in this example, all the HTTP protocol versions used to access my web server)

perl -ne'm[HTTP/([^"]+)] or next; print unless $seen{$1}++' access_log
That version prints out the first line on which a given protocol is encountered. If you prefer just enumerating the protocols, use:
perl -ne'm[HTTP/([^"]+)] or next; print "$1\n" unless $seen{$1}++' access_log
This is hardly "perl rocket science", obviously. But I figured it might be useful, since I wind up typing "$seen{$_}++" a lot every day...

Hmmm, maybe I should drop the 'een' and improve my golf score. Oops, wrong thread! :)
--
Mike

Re: Better "uniq" idiom?
by Juerd (Abbot) on Mar 20, 2002 at 17:16 UTC

    I have an idiom I use a lot to do quick searches through log files or data streams, or to do a "uniq" on a file without having to sort it.

    Using a hash to keep track of what has already gone by is quite common. Even the name you used is seen everywhere: %seen. You can however make it a bit faster. You're now increasing $seen{$1} every time, while you don't really care if it's 1 or 2 or 3, as long as it's not 0 (undef).

    perl -ne'$seen{$_}++, print unless $seen{$_}'
    It's not great for your golf score, but it can be a lot faster!

    Speaking of golf, you could try this one to improve your golf score:
    #        123456789_12345
    perl -pe'$_=""if$s{$_}++'
    Or, using evil symbolic references:
    #        123456789_12
    perl -pe'$_=""if$$_++'
    (Can break when the last line has no trailing linefeed, and equals the name of a special (scalar) variable ;))

    U28geW91IGNhbiBhbGwgcm90MTMgY
    W5kIHBhY2soKS4gQnV0IGRvIHlvdS
    ByZWNvZ25pc2UgQmFzZTY0IHdoZW4
    geW91IHNlZSBpdD8gIC0tIEp1ZXJk
    

      I have been aware that you can make it faster that way, but I usually don't. All that extra typing hurts my fingers :-)

      No, not really, but I like to keep one-liners short -- they have a nasty habit of hitting the right margin once what I do gets complicated -- and if I ever care about the performance of a one-liner, chances are it is getting complicated.

      Having said that, it occurs to me that you might get the best of both worlds with a slightly different idiom. Compare:

      perl -ne'print unless $seen{$_}++'
      perl -ne'$seen{$_}++, print unless $seen{$_}'
      perl -ne'$seen{$_}||=(print,1)'

      Hey! It's even shorter than the original! We may be onto something here ... or maybe I have just been golfing too much lately ... :-)

      Update: Juerd writes, "for one-liners, I think it's safe to assume print will print successfully." Well ... I guess it is a good thing you have not seen my one-liners then. But okay, safe or not, it certainly is reasonable. It's not your fault that I am neither :-)

      The Sidhekin
      print "Just another Perl ${\(trickster and hacker)},"

        perl -ne'$seen{$_}||=(print,1)'

        Assuming the print will not fail (and for one-liners, I think it's safe to assume print will print successfully):

        perl -ne'$seen{$_}||=print'
        See? print, like many other commands, returns true on success, which allows you to shorten the shortened shortening by another 4 characters!
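
        A quick way to check that the variants discussed so far behave the same is to run them on one input and compare. This is only a sketch: the sample data and the names `sample` and `r1` through `r4` are made up for illustration.

        ```shell
        # Hypothetical sample input with repeated lines.
        sample() { printf 'x\ny\nx\nz\ny\n'; }

        r1=$(sample | perl -ne 'print unless $seen{$_}++')              # original idiom
        r2=$(sample | perl -ne '$seen{$_}++, print unless $seen{$_}')   # increment only on first sight
        r3=$(sample | perl -ne '$seen{$_}||=(print,1)')                 # comma expression yields 1
        r4=$(sample | perl -ne '$seen{$_}||=print')                     # relies on print returning true

        # All four should print each distinct line once, in first-seen order.
        [ "$r1" = "$r2" ] && [ "$r2" = "$r3" ] && [ "$r3" = "$r4" ] && echo "all four agree"
        ```

        In the `||=` forms the right-hand side is only evaluated when the hash entry is still false, so the print runs once per distinct line.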

        Implementing s/seen/s/:
        perl -ne'$s{$_}||=print'
        Using symbolic references, assuming no special variable names will be used:
        perl -ne'$$_||=print'
        I think Perl 6 should have an alias for print that is a single \W character ;) Would be fun for golfing, and could compensate for the needed whitespace with string concats :)

        U28geW91IGNhbiBhbGwgcm90MTMgY
        W5kIHBhY2soKS4gQnV0IGRvIHlvdS
        ByZWNvZ25pc2UgQmFzZTY0IHdoZW4
        geW91IHNlZSBpdD8gIC0tIEp1ZXJk
        

      Ah, cool! Thanks!

      I was about to post "I don't get it" until I realized that the , expression is all subject to the unless clause. Nice!
      --
      Mike

Re: Better "uniq" idiom?
by perrin (Chancellor) on Mar 20, 2002 at 16:59 UTC
    If you just want a one-liner to unique a list, I'd use sort -u filesToUniq instead. It will use much less memory than your one-liner, and it's faster to type.

      True, but it has the downside of doing a sort first, which could kill you on a long file which has many repeated lines...

      If the file is mostly unique, your way wins hands down, no question.

      But I usually wind up "uniq"ing on a field, so I usually see only about 10-100 lines per file.

      It would be interesting to benchmark.
      --
      Mike
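
      Apart from speed and memory, the two approaches also differ in output order, which may matter when choosing between them. The sketch below uses made-up sample input (the helper name `sample` is hypothetical) and sets LC_ALL=C so the sort order is predictable.

      ```shell
      # Hypothetical sample input: five lines, two of them repeats.
      sample() { printf 'b\na\nb\nc\na\n'; }

      sorted=$(sample | LC_ALL=C sort -u)                          # sorted, duplicates removed
      first_seen=$(sample | perl -ne 'print unless $seen{$_}++')   # input order, duplicates removed

      printf 'sort -u:\n%s\n' "$sorted"
      printf 'hash idiom:\n%s\n' "$first_seen"
      ```

      The hash idiom keeps one key per distinct value in memory, so its footprint depends on how many distinct values there are rather than on the length of the file.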