Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks

I am trying to filter out archaic entries from a dictionary. They are marked with "(arch)". This is my best regexp so far (sorry, I never quite got used to using /x):

s#(?<=;)(?:\(\S+\) )*\(\d+\) (?:\(\S+\) )*\(arch\).*?;($| \(\d+\))#$1#g

And here is a selection of lines I am trying to filter:

(n,vs) (1) look; glimpse; glance; (vs) (2) to glance; to glimpse; (3) (arch) first meeting; (adv) (4) apparently; seemingly;
(n-t,n-adv) (1) moment; a (short) time; a while; (2) former times; (3) (arch) two-hour period;
(v5s,vt) (1) to pass (time); to spend; (2) to overdo (esp. of one's alcohol consumption); to drink (alcohol); (3) (arch) to take care of; to support; (suf,v5s) (4) to overdo; to do too much; (5) to ... without acting on it;
(pn,adj-no) (1) we; us; (2) (arch) I; me; (3) (arch) you (referring to a group of one's equals or inferiors);
(n) (1) eye; eyeball; (2) (arch) pupil and (dark) iris of the eye; (3) (arch) insight; perceptivity; power of observation; (4) (arch) look; field of vision; (5) (arch) core; center; centre; essence;
(v5m,vt) (1) to step on; to tread on; (2) to experience; to undergo; (3) to estimate; to value; to appraise; (4) to rhyme; (5) (arch) to inherit (the throne, etc.); (6) to follow (rules, morals, principles, etc.);
(v5s,vt) (1) to build up; to establish; (2) to form; to become (a state); (3) to accomplish; to achieve; to succeed in; (4) to change into; (5) to do; to perform; (aux-v) (6) (arch) to intend to; to attempt; to try; (7) (arch) to have a child;
(adv) (1) (uk) that is to say; that is; in other words; I mean; (2) (uk) in short; in brief; to sum up; ultimately; in the end; in the long run; when all is said and done; what it all comes down to; when you get right down to it; (n) (3) (uk) clogging; obstruction; stuffing; (degree of) blockage; (4) (uk) shrinkage; (5) (uk) end; conclusion; (6) (uk) (arch) dead end; corner; (7) (uk) (arch) distress; being at the end of one's rope;
(n,adj-no) (1) inside; within; (2) while; (3) among; amongst; between; (pn,adj-no) (4) we (referring to one's in-group, i.e. company, etc.); our; (5) my spouse; (n) (6) (arch) imperial palace grounds;
(v5r,vi) (1) to rot; to go bad; to decay; to spoil; to fester; to decompose; to turn sour (e.g. milk); (2) to corrode; to weather; to crumble; (3) to become useless; to blunt; to weaken (from lack of practice); (4) to become depraved; to be degenerate; to be morally bankrupt; to be corrupt; (5) to be depressed; to be dispirited; to feel discouraged; to feel down; (suf,v5r) (6) (uk) (ksb:) indicates scorn or disdain for another's action; (v5r,vi) (7) (arch) to lose a bet; (8) (arch) to be drenched; to become sopping wet;
(v5s,vt) (1) to build up; to establish; (2) to form; to become (a state); (3) to accomplish; to achieve; to succeed in; (4) to change into; (5) to do; to perform; (aux-v) (6) (arch) to intend to; to attempt; to try; (7) (arch) to have a child;

The format seems to be (part-of-speech) (number) (tags) followed by semicolon-separated definitions, where the tags may include the (arch) tag, and the part-of-speech is given only once rather than repeated for every numbered sense.

Replies are listed 'Best First'.
Re: Removing with regexps
by choroba (Cardinal) on Apr 21, 2012 at 16:03 UTC
    A regex alone (or rather, a substitution alone) is probably not the right tool here. After removing "(5) (arch) to inherit (the throne, etc.);", your dictionary will contain (4) followed directly by (6).

    I would rather parse each line into an array (e.g. using a regex), then parse each entry and delete it from the array if archaic. Finally, I would serialize the array back to string.
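The parse, filter, serialize idea above could be sketched roughly as follows. This is only an illustration, not choroba's code: the helper name strip_archaic and the $POS pattern are made up for the example, and it assumes that every sense is numbered, that no definition contains a parenthesised number of its own, and that a part-of-speech group only ever follows a semicolon.

```perl
use strict;
use warnings;

# Assumed shape of a part-of-speech group, e.g. (n,vs), (v5s,vt), (adj-no).
my $POS = qr/\([a-z][-0-9a-z,]*\)/;

# Hypothetical helper: parse one entry into senses, drop the archaic ones,
# renumber the survivors and serialize the entry back to a string.
sub strip_archaic {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;

    # Split at the sense numbers; the first field is the leading
    # part-of-speech group, the rest are the sense bodies.
    my ($pos, @bodies) = split /\s*\(\d+\)\s*/, $line;
    $pos = '' unless defined $pos;

    my @senses;
    for my $body (@bodies) {
        # A part-of-speech group left at the end of a chunk belongs to the
        # NEXT sense, so peel it off and remember it.
        my $next_pos = $pos;
        $next_pos = $1 if $body =~ s/;\s*($POS)$/;/;
        push @senses, { pos_tag => $pos, body => $body };
        $pos = $next_pos;
    }

    # A sense counts as archaic when its leading run of tags contains (arch).
    my @kept = grep { $_->{body} !~ /^(?:\([^()\s]+\)\s*)*\(arch\)/ } @senses;

    # Serialize: renumber from 1 and restate the part of speech only
    # when it differs from the one printed last.
    my ($n, $last_pos, @out) = (0, '');
    for my $sense (@kept) {
        if ($sense->{pos_tag} ne $last_pos) {
            push @out, $sense->{pos_tag};
            $last_pos = $sense->{pos_tag};
        }
        push @out, '(' . ++$n . ')', $sense->{body};
    }
    return join ' ', @out;
}

print strip_archaic('(n,vs) (1) look; glimpse; glance; (vs) (2) to glance; to glimpse; (3) (arch) first meeting; (adv) (4) apparently; seemingly;'), "\n";
# prints: (n,vs) (1) look; glimpse; glance; (vs) (2) to glance; to glimpse; (adv) (3) apparently; seemingly;
```

Tracking the part of speech per sense, instead of leaving it glued to the preceding chunk, means a group such as (adv) survives even when the sense just before it is the one being removed.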

Re: Removing with regexps
by roboticus (Chancellor) on Apr 21, 2012 at 17:58 UTC

    Looking at your data, it looks like you have a word "type" followed by a list of numbered definitions. I think I'd split each record apart at the numbers, discarding the numbers (since otherwise you'll get gaps in numbering anyway). Then I'd filter out the ones I don't want. It goes like this:

    use strict;
    use warnings;

    while (<DATA>) {
        s/\s+$//;
        my ($type, @defs) = split /\(\d+\)/, $_;
        print "\n------ Word type: $type\n";
        @defs = grep { ! m{\(arch\)} } @defs;
        print "\t$_\n" for @defs;
    }

    __DATA__
    (n,vs) (1) look; glimpse; glance; (vs) (2) to glance; to glimpse; (3) (arch) first meeting; (adv) (4) apparently; seemingly;
    (n-t,n-adv) (1) moment; a (short) time; a while; (2) former times; (3) (arch) two-hour period;

    When I run it, I get this:

    $ perl 966375.pl

    ------ Word type: (n,vs)
         look; glimpse; glance; (vs)
         to glance; to glimpse;
         apparently; seemingly;

    ------ Word type: (n-t,n-adv)
         moment; a (short) time; a while;
         former times;

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Removing with regexps
by JavaFan (Canon) on Apr 21, 2012 at 16:17 UTC
    my $lemma = qr {
        (?(DEFINE)
            (?<pretag> \( [a-z] [-0-9a-z,]* \) \s*)
            (?<tag>    \( [a-z]+ \) \s*)
            (?<arch>   \( arch \) \s*)
            (?<count>  \( [0-9]+ \) \s*)
        )
        (?&pretag)* (?&count) (?&tag)* (?&arch) (?&tag)*
        [^;]*
        (?:; (?! \s* (?:$ | (?&pretag)* (?&count))) [^;]*)*
        ; \ *
    }x;
    s/$lemma//g;
Re: Removing with regexps
by Kenosis (Priest) on Apr 21, 2012 at 18:11 UTC

    Here's another option:

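    for (split /\n/, $old_entries) {
        my ($count, $num, $new_entry) = qw(0 0);
        # split on the sense numbers (\d+ so that (10) and up also split),
        # drop chunks containing (arch) while counting the survivors in $num,
        # then rejoin the survivors with fresh numbers
        map { $new_entry .= $_ . ($count++ < $num - 1 ? '(' . $count . ')' : "\n") }
            grep { !/\(arch\)/ && ++$num }
            split / ?\(\d+\)/;
        print $new_entry;
    }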

    Hope this helps!