Linicks has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I have been thinking about this for the last few hours, running logic in my head, but can't even seem to get it... (so no code yet!).

I have a dictionary text file, one word on each line. What I need to do is extract only the words that have exactly one 'S' and exactly one 'H', in that order; so 'school' would match but 'schools' or 'hosepipes' wouldn't, while 'clash' would, etc.

Any help appreciated, thanks, Nick

Re: Dictionary filter regex
by Corion (Patriarch) on Nov 26, 2016 at 17:04 UTC

    The easy way is to split this up into three checks:

    1. Keep anything that matches /s.*h/i
    2. Reject anything that matches /s.*s/i
    3. Reject anything that matches /h.*h/i

    Mushing this into a single regular expression is possible, by using [^sh] instead of dot, but I would stay with the three checks.
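    The three checks above can be sketched as a small predicate. This is only an illustration: the helper name is_single_s_then_h and the sample words are assumptions made for the example, not something from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $word has exactly one 's' and exactly
# one 'h', with the 's' coming first (case-insensitive).
sub is_single_s_then_h {
    my ($word) = @_;
    return 0 unless $word =~ /s.*h/i;   # 1. need an 's' followed later by an 'h'
    return 0 if     $word =~ /s.*s/i;   # 2. reject a second 's'
    return 0 if     $word =~ /h.*h/i;   # 3. reject a second 'h'
    return 1;
}

for my $word (qw(school clash schools hosepipes)) {
    printf "%-10s %s\n", $word, is_single_s_then_h($word) ? "keep" : "reject";
}
```

    Once a word has passed check 1 and has only one 's' and one 'h', the order is already guaranteed, so no separate ordering test is needed.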

      I second your approach to break up the logic into 3 regexes, but

      > Mushing this into a single regular expression is possible, by using [^sh] instead of dot

      Do you mean   /s[^sh]*h/i ?

      I doubt this; you would also need to check all characters before and after the match:

        /^ [^sh]* s [^sh]* h [^sh]* $/xi *

      Otherwise something like "h--<s--h>--s" would match in the middle. (Untested)

      I think this demonstrates well why stuffing all logic into one regex is not always a good idea; inversion in particular isn't trivial.
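      A quick way to see the difference is to run both patterns over the same inputs. This is a minimal sketch; the variable names and the test strings (including "h--s--h--s", the example above without the angle-bracket markers) are chosen for illustration.

```perl
use strict;
use warnings;

# Unanchored: only constrains the characters between the 's' and the 'h'.
my $unanchored = qr/s[^sh]*h/i;
# Anchored: also constrains everything before the 's' and after the 'h'.
my $anchored   = qr/^ [^sh]* s [^sh]* h [^sh]* $/xi;

for my $word ("h--s--h--s", "school", "schools", "clash") {
    printf "%-12s unanchored=%d anchored=%d\n",
        $word,
        ($word =~ $unanchored ? 1 : 0),
        ($word =~ $anchored   ? 1 : 0);
}
```

      The unanchored pattern accepts every one of these strings, because it is free to find an s...h stretch anywhere inside the word; only the anchored pattern rejects "h--s--h--s" and "schools".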

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

      footnotes

      *) I just saw that tybalt89++ already posted this regex in this thread.

        Yes - my post would have been much clearer had I also included the anchors and the (then explicit) .* before and after the matches.

        Thank you for pointing this out and making it explicit!

      Thanks to all that replied - Corion, great answer ~ I guess I am tired after a long week at work

      #!/usr/bin/perl -w
      my @words;
      my $line;
      open (DICT, "< final.txt");
      @words= <DICT>;
      close (DICT);
      foreach $line(@words) {
          if ($line =~ /s.*h/i) {
              if ( ($line =~ /s.*s/i) || ($line =~ /h.*h/i) ) {
                  next;
              }
              print $line;
          }
      }
      exit;

      Produces some great words ha ha!

      asthmatic asthmatical asthmatically asthmatoid asthmogenic asthore asthorin astrachan astrakhan astraphobia astraphobic astrapophobia astrochronological ... crystallographically crystallography crystallophyllian crystograph ctesiphon cubbish cubbishly cuemanship cuish cultish cultishly culture shock cumshaw cunctatorship cuneoscaphoid cuproscheelite curateship curatorship curiosity killed the cat

      Thanks! Nick

        Hi Linicks,

        A couple of comments to improve your code.

        open (DICT, "< final.txt");
        Current good practice is to use lexical filehandles and the three-argument form of the open built-in (and also to check that open succeeded):
        open my $DICT, "<", "final.txt" or die "cannot open final.txt: $!";
        Second, if your file is large, it is a waste of resources (memory, CPU cycles and time) to store its contents in an array and then process the array, when you could just process the lines directly as they are read from the file (unless you want to run several other searches on the same data):
        open my $DICT, "<", "final.txt" or die "cannot open final.txt: $!";
        while (my $word = <$DICT>) {
            next unless $word =~ /s.*h/i;
            next if $word =~ /s.*s/i or $word =~ /h.*h/i;
            print $word;
        }
        You could also use a series of greps to filter your data:
        open my $DICT, "<", "final.txt" or die "cannot open final.txt: $!";
        print for grep { not /h.*h/i } grep { not /s.*s/i } grep /s.*h/i, <$DICT>;
        or possibly only one grep with a composite condition.
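        The single grep with a composite condition might look something like the sketch below. To keep it self-contained it filters a small in-memory list (an assumption for the example) instead of reading final.txt; with a real filehandle, the list would simply be replaced by the lines read from it.

```perl
use strict;
use warnings;

# Stand-in for the lines read from the dictionary file.
my @lines = ("school\n", "schools\n", "hosepipes\n", "clash\n");

# One pass: require s...h, and reject a second 's' or a second 'h'.
my @kept = grep { /s.*h/i && !/s.*s/i && !/h.*h/i } @lines;

print @kept;
```

        This does the same filtering as the chained greps, but each line is tested once and no intermediate lists are built.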

        Update: fixed the typo mentioned by Linicks: s/~=/=~/;.

Re: Dictionary filter regex
by tybalt89 (Monsignor) on Nov 26, 2016 at 17:07 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1176603
    use strict;
    use warnings;

    my @extract = grep /^[^sh]*s[^sh]*h[^sh]*$/i, map tr/\n//dr, <DATA>;
    print "@extract\n";

    __DATA__
    school
    schools
    hosepipes
Re: Dictionary filter regex
by Anonymous Monk on Nov 26, 2016 at 21:21 UTC
    When all you have is a hammer...
    my @words = qw( clash school schools hosepipes crystallography );
    print $_, "\n" for grep { tr/sh//cdr eq "sh" } @words;
      P.S. Looks like others decided that it should be case-insensitive; you can use lc or fc for that:
      { fc =~ tr/sh//cdr eq "sh" }
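      Spelled out, the case-insensitive tr version could be written roughly as follows. The helper name only_sh_left and the mixed-case sample words are made up for this sketch. It assumes Perl 5.16 or later, since fc has to be enabled with the fc feature (lc would work on older versions), and the /r flag of tr needs 5.14.

```perl
use strict;
use warnings;
use feature 'fc';    # fc is only available when this feature is enabled (Perl 5.16+)

# Hypothetical helper: case-fold the word, delete every character that is
# not 's' or 'h', and check that exactly "sh" is left over.
sub only_sh_left {
    my ($word) = @_;
    my $folded = fc $word;
    return ( $folded =~ tr/sh//cdr ) eq "sh";
}

print "$_\n" for grep { only_sh_left($_) } qw( Clash SCHOOL schools hosepipes );
```

      Because the leftover string has to equal "sh" exactly, this one comparison covers the count of each letter and their order at the same time.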
Re: Dictionary filter regex
by pryrt (Abbot) on Nov 26, 2016 at 17:07 UTC

    So you want zero or more non-sh characters at the start of the string, then s, followed by zero or more non-sh characters, followed by h, followed by zero or more non-sh characters to the end of the string. Phrased that way, is there a solution that comes to mind?

    update: remove brackets