abitkin has asked for the wisdom of the Perl Monks concerning the following question:
Okay, I have a regular expression for removing duplicate words. The requirements are such that words can be split up by one or more literal spaces " " or newlines "\n". Any word that has been seen before, regardless of which line it was on, should not be printed. Here's what I have:

while (<stdin>){
    print map((s;$; ; && $_), grep(!$a{$_}++, split m;\s;,$_));
    print "\n";
}
Now I know that there may be better ways, so I'd love to see how people can improve on this (mainly so that I learn the secrets of Perl, as I always do when I ask a question). Also note, it would be nice if there were a way to be just as fast but less memory intensive, as the hash may end up holding as many as 4 million items, with keys of length 1 to 32.
Re: Removing repeated words
by Aristotle (Chancellor) on Sep 20, 2002 at 14:27 UTC
I have to comment that your use of semicolons as pattern delimiters rather obfuscates this - I had to look twice before I saw exactly what was happening. Rewritten in a more traditional fashion, and using chomp rather than a mapped substitution to get rid of newlines, it immediately becomes a lot more legible:
while (<STDIN>){
    chomp;
    print grep !$a{$_}++, split m/\s/;
    print "\n";
}
I don't see a whole lot of room for improvement there. There is not really any way to avoid the hash growing that large, as far as I can see, as you can't simply expire any of its items.
Makeshifts last the longest.
Re: Removing repeated words
by blakem (Monsignor) on Sep 20, 2002 at 14:32 UTC
Update: How 'bout a 1 liner?
perl -pale '$_="@{[grep !$s{$_}++, @F]}"'
original attempt below...
my %seen;
print join(' ', grep !$seen{$_}++, split), "\n" while <STDIN>;
-Blake
++, this is more a -n than -p task though.
perl -lane 'print grep !$s{$_}++, @F'
Looks quite a bit less daunting, now don't it? :-)
Update: good catch, blakem++. The devil is in the details, as they say in German.
Makeshifts last the longest.
The words run together if you just print the list, though $, can take care of that....
% perl -lane 'print grep !$s{$_}++, @F'
one two two three
onetwothree
% perl -lane '$,=$"; print grep !$s{$_}++, @F'
one two two three
one two three
-Blake
Re: Removing repeated words
by abitkin (Monk) on Sep 20, 2002 at 14:28 UTC
while (<stdin>){
    print map((s/$/ / && $_), grep(!$a{$_}++, split m/\s+/,$_));
    print "\n";
}
Update: Fri Sep 20 10:25:01 2002
Whoops, forgot the \s+. And switched the regex delims by request.
Re: Removing repeated words
by sauoq (Abbot) on Sep 20, 2002 at 16:33 UTC
The way you are doing it is pretty clean. At least, it would be if you cleaned it up like Aristotle suggested. You missed one detail though.
Your requirements stated that the words can be split up by one or more literal spaces " " or newlines "\n". That's not what you are doing. You are splitting on a single \s. Never mind that \s is a character class that contains more than literal spaces and newlines; it's probably what you want. (If not, use a custom character class.) You do, however, need to add a + and make that /\s+/ to split on "one or more" of them.
Update: As you are splitting, it is probably better to split ' ', ... than to split on /\s+/ as the latter will result in an initial null field in the case that there are leading spaces. Splitting on a literal space is a special case. See perldoc -f split for details.
-sauoq
"My two cents aren't worth a dime.";
Re: Removing repeated words
by Thelonius (Priest) on Sep 20, 2002 at 18:40 UTC
Also note, it would be nice if there's a way to be just as fast, but less memory intensive, as the hash may end up holding as many as 4 million items, with keys of length 1 to 32.
There is a two-pass solution to this that uses very little memory. Parse the words from each line and output them to a pipe "|sort|uniq -d". Input the results of that pipe and you'll have a list of duplicate words to save in a hash.
The second time through your file you compare the words to that hash, something like:
if (!exists $dup{$_} || $dup{$_}++ == 0) {
    # not a duplicate at all, or the first occurrence of a duplicate
    print "$_ ";
}
If you know that STDIN is seekable (i.e. a disk file, not pipe or socket or terminal), you can seek STDIN, 0, 0 to rewind. Otherwise you'll have to write a copy of the data somewhere for your second pass.
If what you are really after is a list of the unique words in a file and you don't care about the order or line breaks, you can just parse the words out to "|sort -u".
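That last suggestion, with the word-parsing done by tr (one of several possible parsers), looks like:

```shell
# Unique words, original order and line breaks not preserved:
# squeeze runs of spaces into newlines, then let sort -u dedupe.
printf 'one two two three\ntwo four\n' | tr -s ' ' '\n' | sort -u
# prints:
# four
# one
# three
# two
```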
That won't fly.
For one, sort reads its entire input before outputting a single character, so the sort process will not only grow comparably to the hash inside the perl process, it will in fact grow larger than the entire input file. It does considerably more work, too: the single-pass approach doesn't need to sort the data, since it uses a hash to keep words unique.
Secondly, the hash you're creating in the second pass is exactly as large as the hash would be at the end of the single-pass script - they both contain all unique words in the file. But you create that hash before you start processing, so your second pass will start out with as much memory consumed as the single-pass scripts reach only by the end of their processing.
Makeshifts last the longest.