Hi. I'm trying to remove duplicate entries in a text file. It's a simple English-Polish dictionary. Basically, I have a file like this:

anatomy=anatomia
ancestor=poprzednik
ancestor=przodek
ancestral=dziedziczny
ancestral=rodowy
ancestry=pochodzenie
ancestry=przodkowie
anchor=kotwica

when what I want is this:

anatomy=anatomia
ancestor=poprzednik, przodek
ancestral=dziedziczny, rodowy
ancestry=pochodzenie, przodkowie
anchor=kotwica

My problem is that I don't know my way around Perl as well as I'd like, so I'm not sure how to approach this. Right now I'm thinking about regular expressions and substitutions, something like this:

  #!/usr/bin/perl
  while (<>) {
      s{
        (^[^=]+)        #should match the duplicated word
        [=]
        (.+)            #should be the translation after the "="
        \n
        \1
        [=]
        (.+)
        \n
      }{$1=$2, $3}xig;
      print;
  }

But I think that, even if this works, it only handles doubled entries, not tripled ones. Could anyone tell me how to merge any number of repetitions in a file like mine? For example, my input file is:

ancient=starozytny
ancillary=pomocniczy
ancillary=sluzebny
ancillary=wspomagajacy
and=a, coraz, i
and=oraz
anecdote=anegdota
anemone=zawilec

and the output should be:

ancient=starozytny
ancillary=pomocniczy, sluzebny, wspomagajacy
and=a, coraz, i, oraz
anecdote=anegdota
anemone=zawilec

Thanks for any suggestions.
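
For reference, one way the desired merge could be sketched is with a hash of arrays instead of a multi-line regex. A `while (<>)` loop sees only one line at a time, so a pattern that spans two lines has nothing to match against. Collecting translations per word avoids that and handles any number of repetitions. This is only a sketch under some assumptions: the helper name `merge_entries` is made up for illustration, each line is split on its first "=", lines without an "=" are skipped, and identical translations are not de-duplicated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: merge "word=translation" lines so that each word
# appears once, with its translations joined by ", ".
# Words are emitted in the order they were first seen.
sub merge_entries {
    my @lines = @_;
    my (%translations, @order);

    for my $line (@lines) {
        chomp $line;
        # Split on the first "=" only, so the translation text is kept whole.
        my ($word, $translation) = split /=/, $line, 2;
        next unless defined $translation;   # assumption: skip lines without "="

        push @order, $word unless exists $translations{$word};
        push @{ $translations{$word} }, $translation;
    }

    my @merged;
    for my $word (@order) {
        push @merged, "$word=" . join(', ', @{ $translations{$word} });
    }
    return @merged;
}

# Sample input taken from the question above.
my @sample = (
    'ancient=starozytny',
    'ancillary=pomocniczy',
    'ancillary=sluzebny',
    'ancillary=wspomagajacy',
    'and=a, coraz, i',
    'and=oraz',
    'anecdote=anegdota',
    'anemone=zawilec',
);

print "$_\n" for merge_entries(@sample);
```

To process a real file, the sample list could be replaced with `merge_entries(<>)` and the script run as `perl script.pl dictionary.txt`. Because entries are grouped by key rather than by adjacency, the input would not need to be sorted.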

In reply to Remove duplicate words from a dictionary by 1Nf3
