So, while finishing my parser for geometric proofs, I decided to make the word "triangle" become "?", and so on. I have a hash, and when a line comes up it is split into words, and each word is checked against the hash. This works great, except that sometimes I have "triangles" instead of "triangle". I made my regex compensate for that, but it doesn't seem to be working:

 s/(^| )$i( |s|$)/"$1$def{$i}$2"/gie;

(The full code is below, and $i is the word, e.g. "triangle".) Also, I have a number of regexes that I think could be combined, all of which deal with whitespace:

s/^\s+//; #trim leading whitespace

s/\s+$//; #trim trailing whitespace

s/(\S+)\s{2,}/"$1 "/gie; #replace more than one space with one space

s/\s{2,}(\S+)/" $1"/gie; #replace more than one space with one space
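As far as I can tell, the four substitutions above boil down to two steps: trim both ends, then squeeze the internal runs. A minimal sketch, assuming the line arrives as a plain scalar; `normalize_space` is a made-up name for illustration, not something from the parser:

```perl
use strict;
use warnings;

# Hypothetical helper: trim both ends, then squeeze internal whitespace runs.
sub normalize_space {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace in one pass
    $line =~ s/\s{2,}/ /g;     # collapse each run of 2+ whitespace chars to one space
    return $line;
}

print normalize_space("  ( m segment BC ) = (    m segment CX )  "), "\n";
# prints "( m segment BC ) = ( m segment CX )"
```

For data like this, `join ' ', split ' ', $line` should give a similar result, since `split ' '` skips leading whitespace and splits on runs of it.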

I would also like to make a regex that finds the names of symbols in the proof and makes sure they are capitalized, but I can't see any distinct way to identify them; I usually end up capitalizing something like "angle BISECTOR" instead of "angle ABC". The code and some sample data are below. Any other tips are more than welcome as well. (Since the code and data have very long lines, it will be easier to understand if you turn off word wrap :) )
sample data:

T:
Problem #3
A:
WIlliam Meyer
G:
triangle ABC with angle bisectors segment AZ, segment BY, segment CX
triangle DEF with angle bisectors segment DJ, segment EI, segment FH
P:
angle bisectors in similar triangles same ratio as corresponding sides
Pr:
triangle ABC with altitudes segment AZ, segment BY, segment CX => G
triangle DEF with altitudes segment DJ, segment EI, segment FH => G
( m segment AB / m segment DE ) = ( m segment BC / m segment EF ) = ( m segment AC / m segment DF )
angle BAC congruent angle EDF => def similar triangles
angle ABC congruent angle DEF => def similar triangles
m angle ABY * 2 = m angle ABC => def angle bisector
m angle DEI * 2 = m angle DEF => def angle bisector
m angle DEI * 2 = m angle ABC => sub #-1, #-3
m angle DEI * 2 = m angle ABY * 2 => sub #-1, #-3
angle DEI congruent angle ABY => division & doca
triangle AYB similar triangle DIE => AA #-1, #-7
( m segment AB / m segment DE ) = ( m segment BY / m segment EI ) => similar triangles
( m segment BY / m segment EF ) = ( m segment BC / m segment EF ) = ( m segment AC / m segment DF ) => sub #-1, #-5
( m segment BC / m segment EF ) = (    m segment CX / m segment FH ) => similar reasoning #-2
( m segment AB / m segment DE ) = ( m segment CX / m segment FH ) = ( m segment AC / m segment DF ) => similar reasoning #-2
( m segment AC / m segment DF ) = ( m segment AZ / m segment DJ ) => similar reasoning #-4
( m segment AB / m segment DE ) = ( m segment BC / m segment EF ) = ( m segment AZ / m segment DJ ) => similar reasoning #-4
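In sample lines like these, a symbol name always seems to follow a keyword (point, angle, triangle, segment) and is only a few letters long, which might be enough to tell "angle ABC" from "angle bisector". A rough sketch under that assumption; `capitalize_names` is a hypothetical name, and the three-letter limit is a guess that would also uppercase a short ordinary word such as "of" if it followed a keyword:

```perl
use strict;
use warnings;

# Hypothetical helper: uppercase a 1-3 letter name directly after a shape keyword.
# Longer words such as "bisector" fail the \b after {1,3} and are left alone.
sub capitalize_names {
    my ($line) = @_;
    $line =~ s/\b(point|angle|triangle|segment)\s+([A-Za-z]{1,3})\b/$1 \U$2/gi;
    return $line;
}

print capitalize_names("angle bisector of angle abc meets segment Ac"), "\n";
# prints "angle bisector of angle ABC meets segment AC"
```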

actual parser:

#!/usr/bin/perl -w
use strict;

my ($file_parse, $file_output, %def, $seen_pr, $tab, @stmts, @reasons, $len_stmts, $len_reasons, @pad, $i);

$file_parse="untitled:Desktop Folder:test2";         # default to be read
$file_output="untitled:Desktop Folder:test2output";  # default to be written
$tab=4;                                              # default tab length
$seen_pr=0;                                          # whether "Pr:" has been seen in while loop W
%def = (                                             # used for parsing the symbols in $file_parse
    "T:" => "Title",
    "A:" => "Author(s)",
    "G:" => "Given(s)",
    "P:" => "Proof Statement(s)",
    "T1" => "Statements",
    "T2" => "Reasons",
    "point" => "·",
    "triangle" => "?",
    "angle" => "‹",
    "w" => "with",
    "m" => "measure",
    "s" => "segment"
);


#####
# Some code establishing what file you are working on and what to output to
#####

print "Will parse $file_parse (y/n):";
$_=<>;
if (!/y/i) {
    print "Which file: ";
    $file_parse=<>;
    chomp $file_parse;
}

print "Will write $file_output (y/n):";
$_=<>;
if (!/y/i) {
    print "Which file: ";
    $file_output=<>;
    chomp $file_output;
}


#####
# Start a loop to go through the file to be parsed
#####

open (PARSE, "<$file_parse") or die ("Can't open $file_parse: $!\n");

open (OUTPUT, ">$file_output") or die ("Can't open $file_output: $!\n");

W: while (<PARSE>) {
    chomp;

    s/^\s+//; #trim leading whitespace
    s/\s+$//; #trim trailing whitespace
    s/(\S+)\s{2,}/"$1 "/gie; #replace more than one space with one space
    s/\s{2,}(\S+)/" $1"/gie; #replace more than one space with one space

    foreach $i (split " ", $_) {
        if (exists $def{$i} and !/:/) {
            s/(^| )$i( |s|$)/"$1$def{$i}$2"/gie; #replace symbols defined in %def with their meaning
        }
    }

    if (!/pr:/i and $seen_pr==0) {                      
        if (/:/) {                                 # if $_ contains a ":" then regard as token and print out
            $_ = uc();
            print OUTPUT "\n" if !/t:/i;
            print OUTPUT $def{$_} . ":\n";
            next W;
        } else {                                   # Print the items under the token
            print OUTPUT " " x $tab . $_ . "\n"; 
            next W;
        }
    } elsif ($seen_pr==0) {
        $seen_pr=1;
        print OUTPUT "\n";
        next W;
    }


    m/(.+) => (.+)/; #format is statement => reason
    push @stmts, $1; #construct an array of statements
    push @reasons, $2; #construct an array of reasons
}
close (PARSE);


#####
# Find the longest statement and reason for formatting
#####

$len_stmts   = (sort {$b <=> $a} map {length} @stmts  )[0];
$len_reasons = (sort {$b <=> $a} map {length} @reasons)[0];


#####
# Really difficult to understand way of formatting output
#####

push @pad, ( ( length($#stmts+1) + $len_stmts + 4) / 2) - int( length( $def{"T1"} ) / 2 )  - length( $def{"T1"} ) % 2;
push @pad, ( ( length($#stmts+1) + $len_stmts + 4) / 2) - int( length( $def{"T1"} ) / 2 ) +1;
push @pad, ( ( length($#reasons+1) + $len_reasons + 2) / 2) - int( length( $def{"T2"} ) ) / 2 - length( $def{"T2"} ) % 2;
push @pad, ( ( length($#reasons+1) + $len_reasons + 2) / 2) - int( length( $def{"T2"} ) ) / 2;

print OUTPUT "_" x $pad[0] . $def{"T1"} . "_" x $pad[1] . "|" . "_" x $pad[2] . $def{"T2"} . "_" x $pad[3] . "\n";

for ($i=0; $i<$#stmts+1; $i++) {
    $reasons[$i] =~ s/#-(\d+?)/($i+1)-$1/ge; #replace #-? with the line # you are on minus ?
    print OUTPUT $i+1 . "." . " " x (length($#stmts+1) - length($i+1)+1) . $stmts[$i] . " " x ($len_stmts - length($stmts[$i])) . "  |  " . $reasons[$i] . "\n"; #print the statements and reasons
}
close (OUTPUT);
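On the plural problem with `s/(^| )$i( |s|$)/` in the loop above: as far as I can tell, a split word like "triangles" is not itself a key in `%def`, so unless the singular also appears on the same line the substitution never runs. A minimal sketch of one possible alternative, assuming a cut-down `%def` with ASCII stand-ins (`TRI`, `ANG`) in place of the real symbols; `expand_symbols` is a made-up name. It matches on the hash keys instead of the split words and keeps an optional trailing "s":

```perl
use strict;
use warnings;

# Illustrative subset of the table; TRI and ANG are stand-ins for the real symbols.
my %def = ( triangle => 'TRI', angle => 'ANG', m => 'measure' );

# One alternation of all keys, longest first, each escaped for use in a regex.
my $words = join '|',
            map  { quotemeta($_) }
            sort { length($b) <=> length($a) } keys %def;

# Hypothetical helper: replace every key, keeping a plural "s" if present.
sub expand_symbols {
    my ($line) = @_;
    # \b matches at word edges without consuming the surrounding spaces,
    # so adjacent words can both be replaced in the same pass.
    $line =~ s/\b($words)(s?)\b/$def{lc $1} . $2/gie;
    return $line;
}

print expand_symbols("similar triangles have angle ABC"), "\n";
# prints "similar TRIs have ANG ABC"
```

Because `\b` needs a word edge on both sides, "angle" inside "triangle" and "m" inside "similar" are left alone.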

In reply to Regex help/condensation by smgfc
