in reply to Dictionary filter regex

The easy way is to split this up into three checks:

  1. keep anything that matches /s.*h/i
  2. Reject anything that matches /s.*s/i
  3. Reject anything that matches /h.*h/i

Mushing this into a single regular expression is possible, by using [^sh] instead of dot, but I would stay with the three checks.
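
A minimal sketch of the three checks as a single Perl subroutine. The name keep_word and the sample words are made up for illustration and do not come from the original post:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): true when an "s" is
# followed somewhere by an "h" and neither letter occurs a second time.
sub keep_word {
    my ($word) = @_;
    return 0 unless $word =~ /s.*h/i;   # check 1: keep only s ... h
    return 0 if     $word =~ /s.*s/i;   # check 2: reject a second s
    return 0 if     $word =~ /h.*h/i;   # check 3: reject a second h
    return 1;
}

# Made-up sample words, for illustration only.
for my $word (qw(cultish ship shush hats)) {
    printf "%-8s %s\n", $word, keep_word($word) ? "keep" : "reject";
}
```

With these samples, cultish and ship are kept, shush is rejected for its second s, and hats is rejected because its h comes before the s.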

Re^2: Dictionary filter regex
by LanX (Saint) on Nov 26, 2016 at 18:39 UTC
    I second your approach to break up the logic into 3 regexes, but

    > Mushing this into a single regular expression is possible, by using [^sh] instead of dot

    Do you mean   /s[^sh]*h/i ?

    I doubt that is enough; you would also need to check all characters before and after the match:

      /^ [^sh]* s [^sh]* h [^sh]* $/xi *

    Otherwise something like "h--<s--h>--s" would still match in the middle (the bracketed part). (Untested)

    I think this demonstrates well why stuffing all the logic into one regex is not always a good idea; in particular, negating a condition inside a regex isn't trivial.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    footnotes

    *) I just saw that tybalt89++ already posted this regex in this thread.

      Yes - my post would have been much clearer had I also included the anchors and the (then explicit) .* before and after the matches.

      Thank you for pointing this out and making it explicit!
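
Since the anchored regex above was posted as untested, here is a small sketch that compares it with the unanchored version and with the three separate checks. The variable names, the sub three_checks and the sample strings are assumptions chosen for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The two candidate single-regex versions discussed in the thread.
my $unanchored = qr/s[^sh]*h/i;
my $anchored   = qr/^ [^sh]* s [^sh]* h [^sh]* $/xi;

# The three separate checks, combined into one boolean (hypothetical helper).
sub three_checks {
    my ($word) = @_;
    return ( $word =~ /s.*h/i && $word !~ /s.*s/i && $word !~ /h.*h/i ) ? 1 : 0;
}

# Made-up inputs; "h--s--h--s" is the counterexample from the post.
for my $word ( "h--s--h--s", "cultish", "abandon ship", "shush" ) {
    printf "%-14s unanchored=%d anchored=%d three_checks=%d\n",
        $word,
        ( $word =~ $unanchored ? 1 : 0 ),
        ( $word =~ $anchored   ? 1 : 0 ),
        three_checks($word);
}
```

On these inputs the unanchored pattern accepts everything, including "h--s--h--s" and "shush", while the anchored pattern agrees with the three checks.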

Re^2: Dictionary filter regex
by Linicks (Scribe) on Nov 26, 2016 at 17:30 UTC

    Thanks to all that replied - Corion, great answer ~ I guess I am tired after a long week at work

    #!/usr/bin/perl -w
    my @words;
    my $line;
    open (DICT, "< final.txt");
    @words= <DICT>;
    close (DICT);
    foreach $line(@words) {
        if ($line =~ /s.*h/i) {
            if ( ($line =~ /s.*s/i) || ($line =~ /h.*h/i) ) {
                next;
            }
            print $line;
        }
    }
    exit;

    Produces some great words ha ha!

    asthmatic asthmatical asthmatically asthmatoid asthmogenic asthore asthorin astrachan astrakhan astraphobia astraphobic astrapophobia astrochronological ... crystallographically crystallography crystallophyllian crystograph ctesiphon cubbish cubbishly cuemanship cuish cultish cultishly culture shock cumshaw cunctatorship cuneoscaphoid cuproscheelite curateship curatorship curiosity killed the cat

    Thanks! Nick

      Hi Linicks,

      A couple of comments to improve your code.

      open (DICT, "< final.txt");
      Good practice nowadays is to use lexical file handles and the three-argument form of the open built-in function (and also to check that open succeeded):
      open my $DICT, "<", "final.txt" or die "cannot open final.txt: $!";
      Second, if your file is large, it is a waste of resources (memory, CPU cycles and time) to store its contents in an array and then process the array, when you could process the lines directly as they are read from the file (unless you want to run several other searches on the same data):
      open my $DICT, "<", "final.txt" or die "cannot open final.txt: $!";
      while (my $word = <$DICT>) {
          next unless $word =~ /s.*h/i;
          next if $word =~ /s.*s/i or $word =~ /h.*h/i;
          print $word;
      }
      You could also use a series of greps to filter your data:
      open my $DICT, "<", "final.txt" or die "cannot open final.txt: $!";
      print for grep { not /h.*h/i } grep { not /s.*s/i } grep /s.*h/i, <$DICT>;
      or possibly only one grep with a composite condition.

      Update: fixed the typo mentioned by Linicks: s/~=/=~/;.
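
The "only one grep with a composite condition" variant mentioned above is not spelled out in the thread; one possible sketch follows. The in-memory @lines list is a made-up stand-in for the lines read from final.txt:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample lines standing in for the dictionary file contents.
my @lines = ( "cultish\n", "shush\n", "hats\n", "abandon ship\n", "hashish\n" );

# One grep, three conditions: an s followed by an h, no second s, no second h.
my @kept = grep { /s.*h/i && !/s.*s/i && !/h.*h/i } @lines;

print @kept;
```

For a real run, @lines would be replaced by the <$DICT> file handle read in list context, as in the chained-grep version.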

        Thanks, interesting. I see you also have the issue I have - _differnet_ typos ~

        next if $word ~=

        Heh.

        My original code produces:

        time perl sh.pl > sh.txt
        real 0m3.770s
        user 0m3.692s
        sys  0m0.074s

        ...and using the great (sort of) 4-liner while loop:

        time perl sh.pl > sh.txt
        real 0m4.192s
        user 0m4.170s
        sys  0m0.015s

        seems slower.

        Also, as for checking whether open succeeded: I never bother when working in a terminal on a local machine, as I know the file exists. In other circumstances I would, of course.

        Thanks for your input!

        Nick

        P.S. The first word my dictionary file pulls up is abandon ship