http://qs1969.pair.com?node_id=457171


in reply to Help with a Regex

First, to report back on which Monk's proposed solutions worked best with my de-dyslexified output.

Animator's code worked flawlessly with my revised desired output. I used it as inspiration in my final solution, below.

tlm, demerphq, and insaniac posted solutions that came very close indeed to what I needed, but were not quite right for the problem. kwaping's was right on the money for its sole test case... Nevertheless, I have learned and will continue to learn from what each posted.

I thank all for their contributions, whether such contained code or not.

Now, to the problem and its final solution...

Please remember, the Roman Numerals were a "dumbed down" version of the real data and regex, to make for shorter code and test cases. As I mentioned, some characters that occur in an earlier part of my 'real' regex could also occur later. Note the comments in the code below to find out how I accomplished this.

I decided to split my problem into two steps, one step per script (and the posted regex is still a simplified version, though closer to actual). I wanted first to discard an earlier, one or two character match in favor of a longer, later match. I did this with LongestMatch.pl, below, which as noted in the script is pretty much a verbatim solution from Jeffrey Friedl's book, Mastering Regular Expressions .

The second script below, Align.pl, borrows heavily from Animator's offering. It takes the longest match and "pads" it for "missing" characters.

Also keep in mind, I'm still pretty new at this. My code may well still be ugly. Thanks to PM and the wisdom of its kind denizens, that's getting better. Thanks to the Monks listed above, the code now does what I what it to, too. That's more than I had when I started. Thanks again!

(BTW, for anyone wondering, it's a Linguistics thing. Morphology.)

#! /usr/local/bin/perl -w #LongestMatch.pl #from Friedl: Mastering Regular Expressions, pp. 334-335 use strict; my $longest_match = undef; # We'll keep track of the longest match her +e my $RecordPossibleMatch = qr{ (?{ # Check to see if the current match ($&) is the longest so far if (not defined($longest_match) or length($&) > length($longest_match)) { $longest_match = $&; } }) (?!) # Force failure so we'll backtrack to find further "matches" }x; while (<>) { chomp; $_ =~ m{ [abcdef]{0,1} [gh]{0,1} [ij]{0,1} [klmn]{0,1} [op]{0,1} [qr]{0,1} [astuv]{0,1} [gh]{0,1} [wxyz]{0,1} $RecordPossibleMatch }x; # Now report the accumulated result, if any if (defined($longest_match)) { print "$longest_match\n"; } else { print ".\n"; } $longest_match = undef; }


#! /usr/local/bin/perl -w #Align.pl use strict; my %Lookup = (a => 0, b => 0, c => 0, d => 0, e => 0, f => 0, g => 1, h => 1, i => 2, j => 2, k => 3, l => 3, m => 3, n => 3, o => 4, p => 4, q => 5, r => 5, s => 6, t => 6, u => 6, v => 6, w => 8, x => 8, y => 8, z => 8); while (<>) { chomp; my @letters = split(//); my $final_string = '.' x 10; my $count = 0; foreach my $letter (@letters) { if (exists $Lookup{$letter}) { substr ($final_string, $Lookup{$letter}, 1, $letter); $count++; if ($count = 1) { $Lookup{"a"} = 6; #Allow for a later occurrence } if ($count = 2) { $Lookup{"g"} = 7; $Lookup{"h"} = 7; } } } print "$final_string\n"; $Lookup{"a"} = 0; #Reset for next word $Lookup{"g"} = 1; $Lookup{"h"} = 1; }