Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Need help with regex

by BastardOperator (Monk)
on Sep 21, 2000 at 23:42 UTC ( [id://33551]=perlquestion: print w/replies, xml ) Need Help??

BastardOperator has asked for the wisdom of the Perl Monks concerning the following question:

The phone directory at work is sent out as a Word document (why? I have _no idea_!?). I wrote this little perl script to grab phone numbers for me, it uses Text::Soundex. Due to the fact that I just export the word document to ASCII format, it's got a few quirks in it, that I've had a hard time representing with regex's. So, I ended up using a few .*'s in there and was wondering if some regex guru (Ovid? :^)) could show me the light. That and anyone that would like to take a poke at it, please feel free.

Here's the file format of the phone list
NAME              EXT  RM#    ORG       NAME              EXT  RM#    ORG     
 -------------------------------------   --------------------------------------
 - A -                                   BASILE, YYYY      5555 1H08   IAMG    
 ABEND, YYYYYY     5555 2014   CE        BATES, YYYY       5555 4832   BT      
 ABRAMS, YYYYY     5555 C-07             BATHERSFIELD, YY  5555 B-39   CE      
 ADAMS, YYYY       5555 255    OTC       BAXTER, YYYY      5555 A-43           
 ADAMS, YYYY       5555 149    BT        BEAR, YYYYYY      5555 H42    ATO     
 ADAMS, YYYYYYY    5555 A-16             BEASLEY, YYY      5555 D-79           
 ADUAKA, YYYYYYYY  5555 A-52             BEATTY, YY        5555 4832   TAG     
 AHMED, YYYYYY     5555 C-63             BECHTLE, YYYY     5555 D-26           
 AHMED, C. YYYYYY  5555 D-69   SOMEU     BEDOYA, YYYYYYYY  5555        CE
As you can see, they use things like "- A -" to denote the start of an alphabetical section, which I've had a hard time with. Also, some folks have middle initials, some don't, some have an organization, some don't.
Here's the code:
#!/usr/bin/env perl use strict; use Text::Soundex; my $PHONE_LIST = "$ENV{HOME}/phone_list"; my %seen; my $cntr; if(@ARGV == 0 || $ARGV[0] eq '-?' || $ARGV[0] eq '-h') { print "\nUsage: $0 <last name> ...\n\n"; exit 1; } open(PL, "$PHONE_LIST") || die "Can't open $PHONE_LIST: $!"; while(<PL>) { foreach my $name (@ARGV) { $name =~ /\U$name/; my $code = soundex($name); my $column_one = my $column_two = $_; my $flag; $column_one =~ s/^\s-*\s*(\w+)\s*-*,*.*$/$1/g; $column_two =~ s/^\s\w+,\s.* (\w+),.*/$1/g; if($column_one && soundex($column_one) eq $code) { $flag = 1; } elsif($column_two && soundex($column_two) eq $code) { $flag = 2; } if($flag) { $seen{$name}++; if(++$cntr == 1) { printf("\n%-12s %-15s %-8s %-5s\n", "First", "Last", "Ext", "Rm#"); print "=" x 43, "\n"; } if($flag == 1) { s/^\s(\w+), (\w+)\s+(\w?)\s?(\d{4,10})\s+(\S+)\s*.*$/$ +2 $1 $3 $4 $5/g; } elsif($flag == 2) { s/^\s.*\s(\w+), (\w+)\s+(\w?)\s?(\d{4,10})\s+(\S*)\s+. +*$/$2 $1 $3 $4 $5/g; } my ($fname, $lname, $m, $num, $room) = split(/ /); chomp($room); printf("%-10s %1s %-15s %-8d %-5s\n", $fname, $m, $lname, $num, $room); } } } foreach my $name (@ARGV) { $name =~ /\U$name/; if(! exists($seen{$name})) { print "\nNot found: $name"; } } print "\n"; close(PL);
This is really a quick hack, so it's not as clean as I'd like.

Replies are listed 'Best First'.
Re: Need help with regex
by tye (Sage) on Sep 22, 2000 at 01:18 UTC

    Since this seems to be fixed-width columns, I'd probably go with tilly's suggestion of unpack. Here is a stab at a regex solution that I'd probably not use:

    #!/usr/bin/perl -w use strict; # First, we will allow no digits anywhere in a name; # this will allow us to detect the extension after the # name. Second, we only allow single spaces in a name # (and can't start or end with a space). Third, names # must contain a comma (but not in front). my $name= qr{ ( [^\s\d] (?: \s?[^\d\s]+ )* ) , ( (?: \s?[^\d\s]+ )* ) }x; # Org can only have single spaces and no commas, but is optional: my $org= qr{ (?: [^\s,](?:\s?[^\s,]+)*[^\s,] )? }x; # Heading must have "-" on each end and just one word between: my $head= qr{ -\s+[a-z]+\s+- }ix; my $entry= qr{ \s* (?: $head | $name \s+ (\d+) \s+ (\S*) \s+ ($org) ) }x; #print "$entry\n"; while( <DATA> ) { my @matches= m/^$entry$entry\s*$/; #print "$_"; for( [0..4], [5..9] ) { my( $last, $first, $ext, $room, $org )= map { defined $_ ? $_ : "" } @matches[@$_]; if( "" ne $last ) { print "($last), ($first) ($ext) ($room) ($org)\n"; } } } __END__ NAME EXT RM# ORG NAME EXT RM# +ORG ------------------------------------- ----------------------------- +--------- - A - BASILE, YYYY 5555 1H08 + IAMG ABEND, YYYYYY 5555 2014 CE BATES, YYYY 5555 4832 + BT ABRAMS, YYYYY 5555 C-07 BATHERSFIELD, YY 5555 B-39 + CE ADAMS, YYYY 5555 255 OTC BAXTER, YYYY 5555 A-43 + ADAMS, YYYY 5555 149 BT BEAR, YYYYYY 5555 H42 + ATO ADAMS, YYYYYYY 5555 A-16 BEASLEY, YYY 5555 D-79 + ADUAKA, YYYYYYYY 5555 A-52 BEATTY, YY 5555 4832 + TAG AHMED, YYYYYY 5555 C-63 BECHTLE, YYYY 5555 D-26 + AHMED, C. YYYYYY 5555 D-69 SOMEU BEDOYA, YYYYYYYY 5555 + CE

    Which prints the following:

    (BASILE), ( YYYY) (5555) (1H08) (IAMG) (ABEND), ( YYYYYY) (5555) (2014) (CE) (BATES), ( YYYY) (5555) (4832) (BT) (ABRAMS), ( YYYYY) (5555) (C-07) (BATHERSFIEL) (D), ( YY) (5555) (B-39) (CE) (ADAMS), ( YYYY) (5555) (255) (OTC) (BAXTER), ( YYYY) (5555) (A-43) () (ADAMS), ( YYYY) (5555) (149) (BT) (BEAR), ( YYYYYY) (5555) (H42) (ATO) (ADAMS), ( YYYYYYY) (5555) (A-16) (BEASLE) (Y), ( YYY) (5555) (D-79) () (ADUAKA), ( YYYYYYYY) (5555) (A-52) (BEATT) (Y), ( YY) (5555) (4832) (TAG) (AHMED), ( YYYYYY) (5555) (C-63) (BECHTL) (E), ( YYYY) (5555) (D-26) () (AHMED), ( C. YYYYYY) (5555) (D-69) (SOMEU) (BEDOYA), ( YYYYYYYY) (5555) (CE) ()

    Finally, $name =~ /\U$name/; doesn't do anything useful. You want $name= uc $name;.

            - tye (but my friends call me "Tye")
Re (tilly) 1: Need help with regex
by tilly (Archbishop) on Sep 22, 2000 at 00:16 UTC
    Is that space delimited or tab? If space delimited then use unpack to grab columns, if tab delimited I would first expand spaces then use unpack again.
Re: Need help with regex
by Fastolfe (Vicar) on Sep 22, 2000 at 01:04 UTC
    It seems to me that it might be easier to pre-parse this, then examine your data structure to see what matches. Treating it like a single string might make it easier also:
    local($/); my $data = <PL>; while ($data =~ / \s*(?:-\s.\s-)? # eat up "- A -" \s*(\w+), # eats up spaces, last name in $1 \s(?:(\w+\.)\s)? # MI. in $2 (if any) (\w+) # first name in $3 \s*(\d+) # phone number in $4 \s+(\S+) # room number in $5 \s{1,6}(\S*) # grabs the next "close" string, ORG in $6 /sxg) { # store and/or check to see if this is our guy }
    The while loop in conjunction with the g regex modifier will keep the test going through the entire string, which is effectively all of your data. I haven't tested that regexp, but it "seems" correct.. Fix it up if you need to.
Re: Need help with regex
by BastardOperator (Monk) on Sep 22, 2000 at 18:34 UTC
    Update: I went with tilly's suggestion to use unpack, however I'm not sure that I did this efficiently. I first did an unpack into 2 columns, then broke it up from there, unpacking each column. This actually runs slower than my regex version, but it is a little nicer on the eyes. Suggestions are very welcome!
    #!/usr/bin/env perl use strict; use Text::Soundex; my $PHONE_LIST = "$ENV{HOME}/phone_list"; my %seen; my $cntr; if(@ARGV == 0 || $ARGV[0] eq '-?' || $ARGV[0] eq '-h') { print "\nUsage: $0 <last name> ...\n\n"; exit 1; } open(PL, "$PHONE_LIST") || die "Can't open $PHONE_LIST: $!"; while(<PL>) { foreach my $name (@ARGV) { my $code = soundex($name); my @name_list; undef @name_list; my ($column_one, $column_two) = unpack("a40a40", $_); if($code eq soundex(split(/,/, $column_one))) { @name_list = unpack("a18a5a7", $column_one); } elsif($code eq soundex(split(/,/, $column_two))) { @name_list = unpack("a18a5a7", $column_two); } if(defined(@name_list)) { $seen{$name}++; if(++$cntr == 1) { printf("\n %-24s %-9s %-5s\n ", "Last, First M", "Ext", "Rm#"); print "=" x 40, "\n"; } printf("%-25s %-8d %-5s\n", @name_list); } } } foreach my $name (@ARGV) { if(! exists($seen{$name})) { print "\nNot found: $name"; } } print "\n"; close(PL);

      I think the above code has one problem in that it won't report two matches on the same line. You might prefer:

      for my $column ( unpack("a40a40", $_) ) { if( $code eq soundex( (split(/,/, $column))[0] ) ) { @name_list = unpack("a18a5a7", $column); $seen{$name}++; if(++$cntr == 1) { printf("\n %-24s %-9s %-5s\n ", "Last, First M", "Ext", "Rm#"); print "=" x 40, "\n"; } printf("%-25s %-8d %-5s\n", @name_list); } }

      Note also that defined @name_list is not a good idea. It might work in this specific example with your version of perl, but using defined on aggregate types (arrays and hashes) is documented as not a good idea. If you don't use my code above, change the if(defined(@name_list)) to if(0 < @name_list) or just if(@named_list)

              - tye (but my friends call me "Tye")
Re: Need help with regex
by Fastolfe (Vicar) on Sep 21, 2000 at 23:45 UTC
    If they're using Word, then obviously they have MS Office.. Why in the WORLD aren't they using Excel??? Heh. That would make this task infinitely easier.
Re: Need help with regex
by Anonymous Monk on Sep 22, 2000 at 02:16 UTC
    I've had a similar thing I just used substr to get my info.
      Trust me, unpack is nicer if you need to get a lot of fields. For this kind of text your format can simply just use "a" and "x". For instance to get 5, skip 5, then get 5 characters use "a5x5a5". Very handy for turning columns into lists... Cheers, Ben

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://33551]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (5)
As of 2024-03-29 13:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found