Regex help/condensation

smgfc has asked for the wisdom of the Perl Monks concerning the following question:

So, while finishing my parser for geometric proofs, I decided to make the word "triangle" become "?", etc., so I have a hash, and when a line comes up, it is split into words and each word is checked against the hash. This works great, except that sometimes I have "triangles" instead of "triangle", so I made my regex compensate, but it doesn't seem to be working:

 s/(^| )$i( |s|$)/"$1$def{$i}$2"/gie;

(The full code is below, and $i is the word, i.e. "triangle".) Also, I have a number of regexes that I think could be combined, all of which deal with whitespace:

s/^\s+//; #trim leading whitespace

s/\s+$//; #trim trailing whitespace

s/(\S+)\s{2,}/"$1 "/gie; #replace more than one space with one space

s/\s{2,}(\S+)/" $1"/gie; #replace more than one space with one space

I would also like to make a regex that finds the names of symbols in the proof and makes sure they are capitalized, but I can't see any distinct way to identify them; I usually end up capitalizing something like "angle BISECTOR" instead of "angle ABC". The code and some sample data are below. Any other tips are more than welcome as well. (Since the code and data have very long lines, they will be easier to understand if you turn off word wrap :) )
sample data:

T:
Problem #3
A:
WIlliam Meyer
G:
triangle ABC with angle bisectors segment AZ, segment BY, segment CX
triangle DEF with angle bisectors segment DJ, segment EI, segment FH
P:
angle bisectors in similar triangles same ratio as corresponding sides
Pr:
triangle ABC with altitudes segment AZ, segment BY, segment CX => G
triangle DEF with altitudes segment DJ, segment EI, segment FH => G
( m segment AB / m segment DE ) = ( m segment BC / m segment EF ) = ( m segment AC / m segment DF )
angle BAC congruent angle EDF => def similar triangles
angle ABC congruent angle DEF => def similar triangles
m angle ABY * 2 = m angle ABC => def angle bisector
m angle DEI * 2 = m angle DEF => def angle bisector
m angle DEI * 2 = m angle ABC => sub #-1, #-3
m angle DEI * 2 = m angle ABY * 2 => sub #-1, #-3
angle DEI congruent angle ABY => division & doca
triangle AYB similar triangle DIE => AA #-1, #-7
( m segment AB / m segment DE ) = ( m segment BY / m segment EI ) => similar triangles
( m segment BY / m segment EF ) = ( m segment BC / m segment EF ) = ( m segment AC / m segment DF ) => sub #-1, #-5
( m segment BC / m segment EF ) = (    m segment CX / m segment FH ) => similar reasoning #-2
( m segment AB / m segment DE ) = ( m segment CX / m segment FH ) = ( m segment AC / m segment DF ) => similar reasoning #-2
( m segment AC / m segment DF ) = ( m segment AZ / m segment DJ ) => similar reasoning #-4
( m segment AB / m segment DE ) = ( m segment BC / m segment EF ) = ( m segment AZ / m segment DJ ) => similar reasoning #-4

actual parser:

#!/usr/bin/perl -w
use strict;

my ($file_parse, $file_output, %def, $seen_pr, $tab, @stmts, @reasons, $len_stmts, $len_reasons, @pad, $i);

$file_parse="untitled:Desktop Folder:test2";         # default to be read
$file_output="untitled:Desktop Folder:test2output";  # default to be written
$tab=4;                                              # default tab length
$seen_pr=0;                                          # whether "Pr:" has been seen in while loop W
%def = (                                             # used for parsing the symbols in $file_parse
    "T:" => "Title",
    "A:" => "Author(s)",
    "G:" => "Given(s)",
    "P:" => "Proof Statement(s)",
    "T1" => "Statements",
    "T2" => "Reasons",
    "point" => "·",
    "triangle" => "?",
    "angle" => "‹",
    "w" => "with",
    "m" => "measure",
    "s" => "segment"
);


#####
# Some code establishing what file you are working on and what to output to
#####

print "Will parse $file_parse (y/n):";
$_=<>;
if (!/y/i) {
    print "Which file: ";
    $file_parse=<>;
    chomp $file_parse;
}

print "Will write $file_output (y/n):";
$_=<>;
if (!/y/i) {
    print "Which file: ";
    $file_output=<>;
    chomp $file_output;
}


#####
# Start a loop to go through the file to be parsed
#####

open (PARSE, "<$file_parse") or die ("Can't open $file_parse: $!\n");

open (OUTPUT, ">$file_output") or die ("Can't open $file_output: $!\n");

W: while (<PARSE>) {
    chomp;

    s/^\s+//; #trim leading whitespace
    s/\s+$//; #trim trailing whitespace
    s/(\S+)\s{2,}/"$1 "/gie; #replace more than one space with one space
    s/\s{2,}(\S+)/" $1"/gie; #replace more than one space with one space

    foreach $i (split " ", $_) {
        if (exists $def{$i} and !/:/) {
            s/(^| )$i( |s|$)/"$1$def{$i}$2"/gie; #replace symbols defined in %def with their meaning
        }
    }

    if (!/pr:/i and $seen_pr==0) {                      
        if (/:/) {                                 # if $_ contains a ":" then regard as token and print out
            $_ = uc();
            print OUTPUT "\n" if !/t:/i;                   
            print OUTPUT $def{$_} . ":\n";
            next W;
        } else {                                   # Print the items under the token
            print OUTPUT " " x $tab . $_ . "\n"; 
            next W;
        }
    } elsif ($seen_pr==0) {
        $seen_pr=1;
        print OUTPUT "\n";
        next W;
    }


    m/(.+) => (.+)/; #format is statement => reason
    push @stmts, $1; #construct an array of statements
    push @reasons, $2; #construct an array of reasons
}
close (PARSE);


#####
# Find the longest statement and reason for formatting
#####

$len_stmts   = (sort {$b <=> $a} map {length} @stmts  )[0];
$len_reasons = (sort {$b <=> $a} map {length} @reasons)[0];


#####
# Really difficult to understand way of formatting output
#####

push @pad, ( ( length($#stmts+1) + $len_stmts + 4) / 2) - int( length( $def{"T1"} ) / 2 )  - length( $def{"T1"} ) % 2;
push @pad, ( ( length($#stmts+1) + $len_stmts + 4) / 2) - int( length( $def{"T1"} ) / 2 ) +1;
push @pad, ( ( length($#reasons+1) + $len_reasons + 2) / 2) - int( length( $def{"T2"} ) ) / 2 - length( $def{"T2"} ) % 2;
push @pad, ( ( length($#reasons+1) + $len_reasons + 2) / 2) - int( length( $def{"T2"} ) ) / 2;

print OUTPUT "_" x $pad[0] . $def{"T1"} . "_" x $pad[1] . "|" . "_" x $pad[2] . $def{"T2"} . "_" x $pad[3] . "\n";

for ($i=0; $i<$#stmts+1; $i++) {
    $reasons[$i] =~ s/#-(\d+?)/($i+1)-$1/ge; #replace #-? with the line # you are on minus ?
    print OUTPUT $i+1 . "." . " " x (length($#stmts+1) - length($i+1)+1) . $stmts[$i] . " " x ($len_stmts - length($stmts[$i])) . "  |  " . $reasons[$i] . "\n"; #print the statements and reasons
}
close (OUTPUT);
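On the capitalization question, one possible heuristic is to upper-case only a short token that directly follows a shape keyword, so that "angle abc" changes but "angle bisectors" does not. The sketch below is an assumption-laden illustration, not part of the posted parser: the keyword list, the 1 to 3 letter limit, and the name `capitalize_names` are all hypothetical choices.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed list of keywords that are followed by a symbol name.
my $shape = qr/(?:triangle|angle|segment|point)/i;

# Hypothetical helper: upper-case a 1-3 letter token that directly
# follows one of the keywords above. Longer words such as "bisectors"
# fail the trailing \b check and are left alone.
sub capitalize_names {
    my ($line) = @_;
    $line =~ s/\b($shape\s+)([a-z]{1,3})\b/$1\U$2/gi;
    return $line;
}

print capitalize_names('angle abc congruent angle def'), "\n";
print capitalize_names('angle bisectors segment az'), "\n";
```

The length limit is the weak point: a short ordinary word after a keyword (for example "angle of") would also be upper-cased, so the limit or the keyword list may need adjusting for real data.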


Replies are listed 'Best First'.
Re: Regex help/condensation by gav^ (Curate) on Feb 15, 2002 at 00:02 UTC
If you store regexps in your hash rather than just the string you might have an easier time:

my %def = ('triangles?' => '?');
s/\s+$//;    # trailing
s/^\s+//;    # leading
s/\s+/ /g;   # multiple spaces to a single space
s/\b$i\b/$def{$i}/gi;

The \b will make it match on word boundaries.

Updated: fixed stupid mistake...

gav^
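With regex fragments as hash keys, the lookup probably has to change as well: in the posted loop each word from `split` is tested with `exists $def{$i}`, and a word such as "triangles" is not itself a key, so the substitution never runs for it. A minimal sketch of looping over the patterns instead, assuming placeholder replacements `TRI` and `ANG` and a hypothetical helper name `substitute_symbols` (neither is from the original code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pattern => replacement table; the keys are regex fragments.
# "TRI" and "ANG" are placeholders, not the symbols from the original %def.
my %def = (
    'triangles?' => 'TRI',
    'angles?'    => 'ANG',
);

# Apply every pattern to the whole line instead of looking up each word.
sub substitute_symbols {
    my ($line) = @_;
    # Longer patterns first, so a more specific key is tried before a shorter one.
    for my $pat (sort { length($b) <=> length($a) } keys %def) {
        $line =~ s/\b$pat\b/$def{$pat}/gi;
    }
    return $line;
}

print substitute_symbols('triangles ABC and DEF share angle A'), "\n";
# prints "TRI ABC and DEF share ANG A"
```

Because `\b` requires a word boundary, "angle" does not match inside "triangle" or "rectangle". One caveat of this design: a replacement string that itself matches a later pattern would be substituted again.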
Re: Re: Regex help/condensation by Kanji (Parson) on Feb 15, 2002 at 04:43 UTC
I have to admit I'd prolly have used `s/\s+/ /g;`, too, but it's something tr is more suited to and ~~reputedly~~ faster for.

tr/ / /s;

Update: I had the time, so figured 'what the heck' ...

Benchmark: running s, tr, each for at least 5 CPU seconds ...
         s:  5 wallclock secs ( 5.15 usr +  0.01 sys =  5.16 CPU) @ 54047.31/s (n=278722)
        tr:  5 wallclock secs ( 5.05 usr +  0.00 sys =  5.05 CPU) @ 130668.19/s (n=659613)
       Rate    s   tr
s   54047/s   -- -59%
tr 130668/s 142%   --

--k.
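The benchmark script itself was not posted. A comparison along these lines could be set up with the core Benchmark module roughly as follows; the sample string and the helper names `squeeze_s` and `squeeze_tr` are assumptions, and the timings will differ from the figures above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed sample line, modelled on the proof data in the question.
my $sample = "( m segment BC / m segment EF ) = (    m segment CX / m segment FH )";

# Hypothetical helpers: each works on a copy so every run sees the same input.
sub squeeze_s  { my $t = shift; $t =~ s/\s+/ /g; return $t }   # any whitespace run -> one space
sub squeeze_tr { my $t = shift; $t =~ tr/ / /s;  return $t }   # runs of spaces only

# A negative count means "run each for at least that many CPU seconds".
cmpthese(-1, {
    s  => sub { squeeze_s($sample) },
    tr => sub { squeeze_tr($sample) },
});
```

The two are not strictly equivalent: `tr/ / /s` squeezes only literal spaces, while `s/\s+/ /g` also collapses tabs and other whitespace. If tabs can appear in the input, `tr/ \t/ /s` maps them to spaces and squeezes the result.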