Better "uniq" idiom?

RMGir has asked for the wisdom of the Perl Monks concerning the following question:

I have an idiom I use a lot to do quick searches through log files or data streams, or to do a "uniq" on a file without having to sort it.

I'm wondering if someone can suggest a better way to do it, out of curiosity. If no one has a better way, then here's my way, you're welcome to use it :)

Simple "uniq" (print unique lines):

perl -ne'print unless $seen{$_}++' filesToUniq

(Note that I'm using neither strict nor -w. These are meant to be one-liners.)
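
For anyone who wants the same logic in a script with strict and warnings on, here is a minimal sketch. The sub name `uniq_lines` and the sample input are made up for illustration; they are not part of the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the input lines with later duplicates dropped,
# keeping first-seen order.
sub uniq_lines {
    my %seen;
    # $seen{$_}++ returns the old count (undef, i.e. false, the first time),
    # so grep keeps a line only on its first appearance.
    return grep { !$seen{$_}++ } @_;
}

# Made-up sample input; a real script would more likely pass in <>.
print uniq_lines("b\n", "a\n", "b\n", "c\n", "a\n");
```

The post-increment is what makes this a one-pass filter: the test and the bookkeeping happen in the same expression.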

Print out all the seen values for a field (in this example, all the HTTP protocol versions used to access my web server):

perl -ne'm[HTTP/([^"]+)] or next; print unless $seen{$1}++' access_log

That version prints out the first line on which a given protocol is encountered. If you prefer just enumerating the protocols, use:

perl -ne'm[HTTP/([^"]+)] or next; print "$1\n" unless $seen{$1}++' access_log
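
The field-based version can also be written out as a small script. This is only a sketch: the sub name `distinct_fields` and the sample log lines are invented, and it assumes the pattern passed in has exactly one capture group.

```perl
use strict;
use warnings;

# Hypothetical helper: return each distinct captured field once,
# in the order it is first seen.
sub distinct_fields {
    my ($pattern, @lines) = @_;
    my %seen;
    my @fields;
    for my $line (@lines) {
        # Skip lines that do not match; otherwise $1 holds the capture.
        $line =~ $pattern or next;
        push @fields, $1 unless $seen{$1}++;
    }
    return @fields;
}

# Made-up sample lines in a common access-log style.
my @log = (
    '127.0.0.1 - - [20/Mar/2002:17:00:00 +0000] "GET / HTTP/1.0" 200 512',
    '127.0.0.1 - - [20/Mar/2002:17:00:01 +0000] "GET /a HTTP/1.1" 200 1024',
    '127.0.0.1 - - [20/Mar/2002:17:00:02 +0000] "GET /b HTTP/1.0" 304 0',
);

print "$_\n" for distinct_fields(qr{HTTP/([^"]+)}, @log);
```

Keying the hash on `$1` rather than `$_` is the only change from the plain "uniq": lines that differ elsewhere but share the field count as duplicates.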

This is hardly "perl rocket science", obviously. But I figured it might be useful, since I wind up typing "$seen{$_}++" a lot every day...

Hmmm, maybe I should drop the 'een' and improve my golf score. Oops, wrong thread! :)
--
Mike

Replies are listed 'Best First'.
Re: Better "uniq" idiom? by Juerd (Abbot) on Mar 20, 2002 at 17:16 UTC
> I have an idiom I use a lot to do quick searches through log files or data streams, or to do a "uniq" on a file without having to sort it.

Using a hash to keep track of what has already gone by is quite common. Even the name you used is seen everywhere: `%seen`.

You can however make it a bit faster. You're now increasing `$seen{$1}` every time, while you don't really care if it's 1 or 2 or 3, as long as it's not 0 (undef).

perl -ne'$seen{$_}++, print unless $seen{$_}';

It's not great for your golf score, but it can be a lot faster!

Speaking of golf, you could try this one to improve your golf score:

#        123456789_12345
perl -pe'$_=""if$s{$_}++'

Or, using evil symbolic references:

#        123456789_12
perl -pe'$_=""if$$_++'

(Can break when the last line has no trailing linefeed, and equals the name of a special (scalar) variable ;))

U28geW91IGNhbiBhbGwgcm90MTMgY W5kIHBhY2soKS4gQnV0IGRvIHlvdS ByZWNvZ25pc2UgQmFzZTY0IHdoZW4 geW91IHNlZSBpdD8gIC0tIEp1ZXJk

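
Whether skipping the increment pays off is easy to measure with the core Benchmark module. The sketch below is an assumption-laden test harness, not a result: the sub names and the workload (20,000 lines with only 50 distinct values) are invented, and printing is replaced by a counter so that only the hash handling is timed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up workload with heavy repetition: 20,000 lines, 50 distinct values.
my @lines = map { "line " . ($_ % 50) . "\n" } 1 .. 20_000;

# Original idiom: post-increment on every line.
sub count_always_increment {
    my %seen;
    my $kept = 0;
    for (@_) { $kept++ unless $seen{$_}++ }
    return $kept;
}

# Suggested variant: increment only the first time a line is seen.
sub count_increment_once {
    my %seen;
    my $kept = 0;
    for (@_) { $seen{$_}++, $kept++ unless $seen{$_} }
    return $kept;
}

# Run each variant 50 times and print a comparison table.
cmpthese(50, {
    always_increment => sub { count_always_increment(@lines) },
    increment_once   => sub { count_increment_once(@lines) },
});
```

The numbers will depend on the Perl version and on how repetitive the input is, so it is worth rerunning with data that resembles the real log files.
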
Re: Re: Better "uniq" idiom? by Sidhekin (Priest) on Mar 21, 2002 at 15:14 UTC
I have been aware that you can make it faster that way, but I usually don't. All that extra typing hurts my fingers :-)

No, not really, but I like to keep oneliners short -- they have a nasty habit of hitting the right margin once what I do gets complicated -- and if I ever care about the performance of a oneliner, chances are it is getting complicated.

Having said that, it occurs to me that you might get the best of both worlds with a slightly different idiom. Compare:

perl -ne'print unless $seen{$_}++'
perl -ne'$seen{$_}++, print unless $seen{$_}'
perl -ne'$seen{$_}||=(print,1)'

Hey! It's even shorter than the original! We may be onto something here ... or maybe I have just been golfing too much lately ... :-)

Update: Juerd writes

> for oneliners, I think it's safe to assume print will print successfully.

Well ... I guess it is a good thing you have not seen my one-liners then. But okay, safe or not, it certainly is reasonable. It's not your fault that I am neither :-)

The Sidhekin
print "Just another Perl ${\(trickster and hacker)},"

Re: Re: Re: Better "uniq" idiom? by Juerd (Abbot) on Mar 21, 2002 at 22:09 UTC
> perl -ne'$seen{$_}||=(print,1)'

Assuming the print will not fail (and for oneliners, I think it's safe to assume print will print successfully):

perl -ne'$seen{$_}||=print'

See? print, like many other commands, returns true on success, which allows you to shorten the shortened shortening by another 4 characters!

Implementing s/seen/s/:

perl -ne'$s{$_}||=print'

Using symbolic references, assuming no special variable names will be used:

perl -ne'$$_||=print'

I think Perl 6 should have an alias for print that is a single \W character ;) Would be fun for golfing, and could compensate for the needed whitespace with string concats :)

U28geW91IGNhbiBhbGwgcm90MTMgY W5kIHBhY2soKS4gQnV0IGRvIHlvdS ByZWNvZ25pc2UgQmFzZTY0IHdoZW4 geW91IHNlZSBpdD8gIC0tIEp1ZXJk

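
To see why `||=print` prints each line only once, it may help to spell the idiom out. The following is a sketch under assumptions: the sub name `emit_once` is hypothetical, and output goes to an in-memory filehandle purely so the result can be inspected.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring $seen{$_} ||= print: the right-hand side
# runs only while the hash slot is still false, and print's true return
# value is stored, so the slot is true from then on.
sub emit_once {
    my ($seen, $out, @lines) = @_;
    for my $line (@lines) {
        $seen->{$line} ||= print {$out} $line;
    }
    return;
}

# Print to an in-memory filehandle so the output can be examined.
my $buffer = '';
open my $out, '>', \$buffer or die "cannot open in-memory handle: $!";
my %seen;
emit_once(\%seen, $out, "a\n", "b\n", "a\n", "b\n", "c\n");
close $out;

print $buffer;
```

If print ever failed it would return false, the slot would stay false, and the same line would be attempted again on its next appearance; that is the case the "assuming the print will not fail" caveat covers.
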
Re: Re: Better "uniq" idiom? by RMGir (Prior) on Mar 20, 2002 at 17:21 UTC
Ah, cool! Thanks!

I was about to post "I don't get it" until I realized that the whole comma expression is subject to the unless clause. Nice!
--
Mike

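
The parsing point can be checked with a short sketch that writes the comma form next to an explicit block. Both sub names are invented for illustration, and `push` stands in for `print` so the result can be compared.

```perl
use strict;
use warnings;

# Hypothetical helper using the comma form with a statement modifier.
sub first_occurrences_comma {
    my %seen;
    my @kept;
    for my $line (@_) {
        # The modifier governs the whole comma expression, i.e. this reads as
        # ($seen{$line}++, push(@kept, $line)) unless $seen{$line};
        $seen{$line}++, push(@kept, $line) unless $seen{$line};
    }
    return @kept;
}

# The same logic written with an explicit unless block.
sub first_occurrences_block {
    my %seen;
    my @kept;
    for my $line (@_) {
        unless ($seen{$line}) {
            $seen{$line}++;
            push @kept, $line;
        }
    }
    return @kept;
}

my @input = qw(a b a c b);
print join(' ', first_occurrences_comma(@input)), "\n";
print join(' ', first_occurrences_block(@input)), "\n";
```

Both calls print the same list, which is what one would expect if the increment and the output are skipped together once a line has been seen.
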
Re: Better "uniq" idiom? by perrin (Chancellor) on Mar 20, 2002 at 16:59 UTC
If you just want a one-liner to unique a list, I'd use `sort -u filesToUniq` instead. It will use much less memory than your one-liner, and it's faster to type.
Re: Re: Better "uniq" idiom? by RMGir (Prior) on Mar 20, 2002 at 17:04 UTC
True, but it has the downside of doing a sort first, which could kill you on a long file which has many repeated lines...

If the file is mostly unique, your way wins hands down, no question. But I usually wind up "uniq"ing on a field, so I usually see only about 10-100 lines per file.

It would be interesting to benchmark.
--
Mike
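
Apart from speed and memory, the two approaches also differ in output order. The sketch below contrasts them in-process; the sub names are hypothetical, and `uniq_sorted` only imitates the sort-then-dedupe idea, it does not reproduce how the `sort` utility manages memory.

```perl
use strict;
use warnings;

# Hash approach: one pass, first-seen order, memory grows with the
# number of distinct lines.
sub uniq_first_seen {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Sort-then-dedupe, in the spirit of sort -u: output is sorted, and since
# duplicates end up adjacent only the previous line has to be compared.
sub uniq_sorted {
    my @out;
    for my $line (sort @_) {
        push @out, $line if !@out || $out[-1] ne $line;
    }
    return @out;
}

# Made-up sample input.
my @input = ("b\n", "a\n", "b\n", "c\n", "a\n");
print "first-seen order:\n", uniq_first_seen(@input);
print "sorted order:\n", uniq_sorted(@input);
```

So if the order in which values first appear in the log matters, the hash idiom keeps it and a sort-based approach does not.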