Removing with regexps

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks

I am trying to filter out archaic entries from a dictionary. They are marked with "(arch)". This is my best regexp so far (sorry, I never quite got used to using /x):

s#(?<=;)(?:\(\S+\) )*\(\d+\) (?:\(\S+\) )*\(arch\).*?;($| \(\d+\))#$1#g

And here is a selection of lines I am trying to filter:

(n,vs) (1) look; glimpse; glance; (vs) (2) to glance; to glimpse; (3) (arch) first meeting; (adv) (4) apparently; seemingly;
(n-t,n-adv) (1) moment; a (short) time; a while; (2) former times; (3) (arch) two-hour period;
(v5s,vt) (1) to pass (time); to spend; (2) to overdo (esp. of one's alcohol consumption); to drink (alcohol); (3) (arch) to take care of; to support; (suf,v5s) (4) to overdo; to do too much; (5) to ... without acting on it;
(pn,adj-no) (1) we; us; (2) (arch) I; me; (3) (arch) you (referring to a group of one's equals or inferiors);
(n) (1) eye; eyeball; (2) (arch) pupil and (dark) iris of the eye; (3) (arch) insight; perceptivity; power of observation; (4) (arch) look; field of vision; (5) (arch) core; center; centre; essence;
(v5m,vt) (1) to step on; to tread on; (2) to experience; to undergo; (3) to estimate; to value; to appraise; (4) to rhyme; (5) (arch) to inherit (the throne, etc.); (6) to follow (rules, morals, principles, etc.);
(v5s,vt) (1) to build up; to establish; (2) to form; to become (a state); (3) to accomplish; to achieve; to succeed in; (4) to change into; (5) to do; to perform; (aux-v) (6) (arch) to intend to; to attempt; to try; (7) (arch) to have a child;
(adv) (1) (uk) that is to say; that is; in other words; I mean; (2) (uk) in short; in brief; to sum up; ultimately; in the end; in the long run; when all is said and done; what it all comes down to; when you get right down to it; (n) (3) (uk) clogging; obstruction; stuffing; (degree of) blockage; (4) (uk) shrinkage; (5) (uk) end; conclusion; (6) (uk) (arch) dead end; corner; (7) (uk) (arch) distress; being at the end of one's rope;
(n,adj-no) (1) inside; within; (2) while; (3) among; amongst; between; (pn,adj-no) (4) we (referring to one's in-group, i.e. company, etc.); our; (5) my spouse; (n) (6) (arch) imperial palace grounds;
(v5r,vi) (1) to rot; to go bad; to decay; to spoil; to fester; to decompose; to turn sour (e.g. milk); (2) to corrode; to weather; to crumble; (3) to become useless; to blunt; to weaken (from lack of practice); (4) to become depraved; to be degenerate; to be morally bankrupt; to be corrupt; (5) to be depressed; to be dispirited; to feel discouraged; to feel down; (suf,v5r) (6) (uk) (ksb:) indicates scorn or disdain for another's action; (v5r,vi) (7) (arch) to lose a bet; (8) (arch) to be drenched; to become sopping wet;
(v5s,vt) (1) to build up; to establish; (2) to form; to become (a state); (3) to accomplish; to achieve; to succeed in; (4) to change into; (5) to do; to perform; (aux-v) (6) (arch) to intend to; to attempt; to try; (7) (arch) to have a child;

The format seems to be (part-of-speech) (number) (tags) semicolon-separated definitions, where the tags may include the (arch) tag, and each part-of-speech is given only once, before the first numbered sense it applies to.
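
To make the guessed format concrete, here is a minimal sketch that breaks one entry into numbered senses with their tags and definitions. The helper name `parse_senses` is hypothetical, and the code assumes that tags are single lower-case words in parentheses directly after the number (so a tag such as "(ksb:)" would be left in the definition text) and that part-of-speech groups can be ignored for this purpose.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: break one entry into senses.
# Assumes tags are single lower-case words in parentheses, e.g. "(uk) (arch)".
sub parse_senses {
    my ($line) = @_;
    my @senses;
    while ($line =~ m{
            \( (\d+) \) \s+                # sense number
            ( (?: \( [a-z]+ \) \s+ )* )    # optional tags after the number
            ( .*? )                        # definition text, as short as possible
            (?= \s* (?: \( [^()\s]+ \) \s+ )? \( \d+ \)   # stop before the next sense
              | \s* \z )                   # or at the end of the entry
        }gx) {
        my ($num, $tags, $text) = ($1, $2, $3);
        push @senses, {
            num  => $num,
            tags => [ $tags =~ /\(([a-z]+)\)/g ],
            defs => [ grep { length } split /;\s*/, $text ],
        };
    }
    return @senses;
}

my $entry = "(n) (1) eye; eyeball; (2) (arch) pupil and (dark) iris of the eye;";
for my $s (parse_senses($entry)) {
    my $archaic = grep { $_ eq 'arch' } @{ $s->{tags} };
    printf "(%d) %s: %s\n", $s->{num}, ($archaic ? 'archaic' : 'current'),
        join ' | ', @{ $s->{defs} };
}
```

For the sample entry this reports sense 1 as current and sense 2 as archaic. Deciding per sense from a parsed tag list, rather than searching the raw text, avoids matching "(arch)" inside a definition.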


Replies are listed 'Best First'.
Re: Removing with regexps by choroba (Cardinal) on Apr 21, 2012 at 16:03 UTC
A regex alone (or rather, a substitution alone) is probably not the right tool here. After removing (5) (arch) to inherit (the throne, etc.);, your dictionary will contain (4) followed by (6). I would rather parse each line into an array (e.g. using a regex), then check each entry and delete it from the array if it is archaic. Finally, I would serialize the array back to a string.
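
The parse, filter, and serialize steps described above could be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: the helper name `drop_archaic` is made up, a parenthesized group after the last semicolon of a sense is taken to be the part of speech of the next sense, and "(N)" is assumed to appear only as a sense number.

```perl
use strict;
use warnings;

# Hypothetical helper: remove archaic senses from one entry and renumber the rest.
sub drop_archaic {
    my ($line) = @_;
    $line =~ s/\s+\z//;

    # Parse: the first field is the leading part of speech, the rest are senses.
    my ($pos, @chunks) = split /\s*\(\d+\)\s*/, $line;
    my @senses;
    for my $chunk (@chunks) {
        # A trailing "(...)" after the last ";" belongs to the NEXT sense.
        my $next_pos = $chunk =~ s/;\s*(\([^()]*\))\z/;/ ? $1 : $pos;
        push @senses, { pos => $pos, text => $chunk };
        $pos = $next_pos;
    }

    # Filter: drop every sense tagged (arch).
    my @kept = grep { $_->{text} !~ /\(arch\)/ } @senses;

    # Serialize: renumber from 1 and print a part of speech only when it changes.
    my ($out, $last_pos, $n) = ('', '', 0);
    for my $s (@kept) {
        $out .= ' ' if length $out;
        if ($s->{pos} ne $last_pos) {
            $out .= "$s->{pos} ";
            $last_pos = $s->{pos};
        }
        $out .= '(' . ++$n . ") $s->{text}";
    }
    return $out;
}

print drop_archaic("(v5m,vt) (1) to step on; to tread on; (2) to experience; "
    . "(5) (arch) to inherit (the throne, etc.); (6) to follow (rules, morals, etc.);"), "\n";
```

Because each sense remembers its own part of speech, removing an archaic sense that introduced a new part of speech does not lose that label for the senses that follow it.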
Re: Removing with regexps by roboticus (Chancellor) on Apr 21, 2012 at 17:58 UTC
Looking at your data, it looks like you have a word "type" followed by a list of numbered definitions. I think I'd split each record apart at the numbers, discarding the numbers (since otherwise you'll get gaps in numbering anyway). Then I'd filter out the ones I don't want. It goes like this:

use strict;
use warnings;

while (<DATA>) {
    s/\s+$//;
    my ($type, @defs) = split /\(\d+\)/, $_;
    print "\n------ Word type: $type\n";
    @defs = grep { ! m{\(arch\)} } @defs;
    print "\t$_\n" for @defs;
}
__DATA__
(n,vs) (1) look; glimpse; glance; (vs) (2) to glance; to glimpse; (3) (arch) first meeting; (adv) (4) apparently; seemingly;
(n-t,n-adv) (1) moment; a (short) time; a while; (2) former times; (3) (arch) two-hour period;

When I run it, I get this:

$ perl 966375.pl

------ Word type: (n,vs)
	 look; glimpse; glance; (vs)
	 to glance; to glimpse;
	 apparently; seemingly;

------ Word type: (n-t,n-adv)
	 moment; a (short) time; a while;
	 former times;

...roboticus
When your only tool is a hammer, all problems look like your thumb.
Re: Removing with regexps by JavaFan (Canon) on Apr 21, 2012 at 16:17 UTC
my $lemma = qr {
    (?(DEFINE)
        (?<pretag> \( [a-z] [-0-9a-z,]* \) \s)
        (?<tag>    \( [a-z]+ \) \s)
        (?<arch>   \( arch \) \s)
        (?<count>  \( [0-9]+ \) \s))
    (?&pretag)* (?&count) (?&tag)* (?&arch) (?&tag)*
    [^;]* (?:; (?! \s* (?:$ | (?&pretag)* (?&count))) [^;]*)* ; \ *
}x;
s/$lemma//g;
Re: Removing with regexps by Kenosis (Priest) on Apr 21, 2012 at 18:11 UTC
Here's another option:

for (split /\n/, $old_entries) {
    my ($count, $num, $new_entry) = qw(0 0);
    map { $new_entry .= $_ . ($count++ < $num - 1 ? '(' . $count . ')' : "\n") }
        grep { !/\(arch\)/ && ++$num }
        split / ?\(\d\)/;
    print $new_entry;
}

Hope this helps!