Remove duplicate words from a dictionary

1Nf3 has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I'm trying to remove duplicate entries in a text file. It's a simple English-Polish dictionary. Basically, I have a file like this:

anatomy=anatomia
ancestor=poprzednik
ancestor=przodek
ancestral=dziedziczny
ancestral=rodowy
ancestry=pochodzenie
ancestry=przodkowie
anchor=kotwica

when what I want is this:

anatomy=anatomia
ancestor=poprzednik, przodek
ancestral=dziedziczny, rodowy
ancestry=pochodzenie, przodkowie
anchor=kotwica

My problem is - I don't know my way around Perl the way I'd like to, so I'm not sure how to approach this. Right now, I'm thinking about regular expressions and substitutions, something like this:

  #!/usr/bin/perl
  while (<>) {
      s{
        (^[^=]+)        #should match the duplicated word
        [=]
        (.+)            #should be the translation after the "="
        \n
        \1
        [=]
        (.+)
        \n
     }{$1=$2, $3}xig;
    print;
  }

But I think that, even if this works, it only handles doubled entries, not tripled ones. Could anyone tell me how to merge any number of repetitions in a file like mine? For example, my input file is:

ancient=starozytny
ancillary=pomocniczy
ancillary=sluzebny
ancillary=wspomagajacy
and=a, coraz, i
and=oraz
anecdote=anegdota
anemone=zawilec

and the output should be:

ancient=starozytny
ancillary=pomocniczy, sluzebny, wspomagajacy
and=a, coraz, i, oraz
anecdote=anegdota
anemone=zawilec

Thanks for any suggestions.

Replies are listed 'Best First'.
Re: Remove duplicate words from a dictionary by merlyn (Sage) on Dec 27, 2006 at 22:56 UTC
  my %mapping;
  while (<>) {
      # split each line at the first "=" into the word and its translation list
      my ($key, $value) = /^(.*?)=(.*)$/ or die "Cannot parse $_";
      # hash of hashes: the inner keys collect the unique translations
      $mapping{$key}{$_} = 1 for split /, /, $value;
  }
  for my $word (sort keys %mapping) {
      my @aliases = sort keys %{$mapping{$word}};
      print "$word=", join(", ", @aliases), "\n";
  }

-- Randal L. Schwartz, Perl hacker

update: Yeah, I wanted it to be a HoH, fixed the code, sorry.
Re^2: Remove duplicate words from a dictionary by 1Nf3 (Pilgrim) on Dec 27, 2006 at 23:14 UTC
Thank you. Tomorrow I will test it against my 500k data file, but I'm sure it's just the thing I needed. Thank you for the solution, and thank you for the books. I would say "the books that taught me everything I know about Perl", but as you see, I don't know much, so I'll write this: Thanks for the great books you wrote, for if I don't know much about Perl, it's not their fault.
Re^2: Remove duplicate words from a dictionary by polettix (Vicar) on Dec 28, 2006 at 10:16 UTC
`%mapping` seems to be a HoH in the first cycle, and a HoA in the second. I'd probably modify to (untested):

  my %mapping;
  while (<>) {
      my ($key, $value) = /^(.*?)=(.*)$/ or die "Cannot parse $_";
      # hash of arrays: append every translation, repeats included
      push @{$mapping{$key}}, split /, /, $value;
  }
  for my $word (sort keys %mapping) {
      my @aliases = sort @{$mapping{$word}};
      print "$word=", join(", ", @aliases), "\n";
  }

or, if scared by possible repetitions (untested, again):

  my %mapping;
  while (<>) {
      my ($key, $value) = /^(.*?)=(.*)$/ or die "Cannot parse $_";
      # hash of hashes: a repeated translation just overwrites its own key
      $mapping{$key}{$_} = 1 for split /, /, $value;
  }
  for my $word (sort keys %mapping) {
      my @aliases = sort keys %{$mapping{$word}};
      print "$word=", join(", ", @aliases), "\n";
  }

Flavio
perl -ple'$_=reverse' <<<ti.xittelop@oivalf
Don't fool yourself.
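To make the difference between the two variants concrete, here is a minimal sketch that runs both on the same in-memory lines instead of reading a file. The sub names `merge_hoa` and `merge_hoh` and the sample data (with "i" deliberately repeated under "and") are assumptions made up for illustration; they are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: the translation "i" appears twice for "and".
my @lines = ("and=a, i", "and=i", "and=oraz", "anchor=kotwica");

# Hash of arrays: every translation is pushed, so repeats are kept.
sub merge_hoa {
    my %mapping;
    for my $line (@_) {
        my ($key, $value) = $line =~ /^(.*?)=(.*)$/ or die "Cannot parse $line";
        push @{ $mapping{$key} }, split /, /, $value;
    }
    my @out;
    for my $word (sort keys %mapping) {
        push @out, "$word=" . join(", ", sort @{ $mapping{$word} });
    }
    return @out;
}

# Hash of hashes: translations are inner keys, so repeats collapse.
sub merge_hoh {
    my %mapping;
    for my $line (@_) {
        my ($key, $value) = $line =~ /^(.*?)=(.*)$/ or die "Cannot parse $line";
        $mapping{$key}{$_} = 1 for split /, /, $value;
    }
    my @out;
    for my $word (sort keys %mapping) {
        push @out, "$word=" . join(", ", sort keys %{ $mapping{$word} });
    }
    return @out;
}

print "HoA: $_\n" for merge_hoa(@lines);   # and=a, i, i, oraz
print "HoH: $_\n" for merge_hoh(@lines);   # and=a, i, oraz
```

With this input the array version keeps the repeated "i" while the hash version prints it once, which is the case the second variant guards against.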
Re: Remove duplicate words from a dictionary by ambrus (Abbot) on Dec 27, 2006 at 23:09 UTC
As a quick fix,

  #!/usr/bin/perl
  local $/ = undef;   # slurp mode: read the whole input at once
  $_ = <>;
  # repeat the substitution until no adjacent duplicate keys remain
  1 while s{
      (^[^=]+)        #should match the duplicated word
      [=]
      (.+)            #should be the translation after the "="
      \n
      \1
      [=]
      (.+ \n)
     }{$1=$2, $3}xmig;
  print;
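The `1 while s{...}{...}` idiom is what lets this handle tripled (or longer) runs: each pass merges adjacent pairs, and the loop reruns the substitution until it stops matching. A minimal sketch of that idea on an in-memory string follows; the sub name `merge_adjacent` is hypothetical, and as an assumption of this sketch the pattern uses `[^=\n]+` for the key and drops the `/i` flag, so it is a variation on the code above rather than the same regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge adjacent lines that share a key, repeating until nothing changes.
sub merge_adjacent {
    my ($text) = @_;
    1 while $text =~ s{
        ^([^=\n]+)   # key at the start of a line
        =(.+)\n      # translations on that line
        \1=(.+\n)    # same key on the next line, keeping its newline
    }{$1=$2, $3}xmg;
    return $text;
}

# Sample taken from the question: a tripled entry and a doubled one.
my $input = "ancillary=pomocniczy\n"
          . "ancillary=sluzebny\n"
          . "ancillary=wspomagajacy\n"
          . "and=a, coraz, i\n"
          . "and=oraz\n";

print merge_adjacent($input);
# ancillary=pomocniczy, sluzebny, wspomagajacy
# and=a, coraz, i, oraz
```

The first pass joins the first two "ancillary" lines and both "and" lines; the second pass folds in the third "ancillary" line; the third pass matches nothing and ends the loop. Like the regex approach in general, this only merges entries that sit on consecutive lines.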
Re: Remove duplicate words from a dictionary by siva kumar (Pilgrim) on Dec 28, 2006 at 07:05 UTC
You can simply try this:

  my %hashMap;
  open(my $fh, '<', "test.txt") or die("Can't open file: $!");
  while (<$fh>) {
      next if /^\s*$/;    # skip blank lines
      my ($key, $value) = /^(.*?)=(.*)$/ or die "Cannot parse $_";
      if (exists $hashMap{$key}) {
          $hashMap{$key} .= ", " . $value;    # append to the existing entry
      } else {
          $hashMap{$key} = $value;
      }
  }
  close($fh);
  for my $word (sort keys %hashMap) {
      print "$word=" . $hashMap{$word} . "\n";
  }