BastardOperator has asked for the wisdom of the Perl Monks concerning the following question:
The phone directory at work is sent out as a Word document (why? I have _no idea_!?). I wrote this little perl script to grab phone numbers for me, it uses Text::Soundex. Due to the fact that I just export the word document to ASCII format, it's got a few quirks in it, that I've had a hard time representing with regex's. So, I ended up using a few .*'s in there and was wondering if some regex guru ( Ovid? :^)) could show me the light. That and anyone that would like to take a poke at it, please feel free.
Here's the file format of the phone list
NAME EXT RM# ORG NAME EXT RM# ORG
------------------------------------- --------------------------------------
- A - BASILE, YYYY 5555 1H08 IAMG
ABEND, YYYYYY 5555 2014 CE BATES, YYYY 5555 4832 BT
ABRAMS, YYYYY 5555 C-07 BATHERSFIELD, YY 5555 B-39 CE
ADAMS, YYYY 5555 255 OTC BAXTER, YYYY 5555 A-43
ADAMS, YYYY 5555 149 BT BEAR, YYYYYY 5555 H42 ATO
ADAMS, YYYYYYY 5555 A-16 BEASLEY, YYY 5555 D-79
ADUAKA, YYYYYYYY 5555 A-52 BEATTY, YY 5555 4832 TAG
AHMED, YYYYYY 5555 C-63 BECHTLE, YYYY 5555 D-26
AHMED, C. YYYYYY 5555 D-69 SOMEU BEDOYA, YYYYYYYY 5555 CE
As you can see, they use things like "- A -" to denote the start of an alphabetical section, which I've had a hard time with. Also, some folks have middle initials, some don't, some have an organization, some don't.
Here's the code:
#!/usr/bin/env perl
use strict;
use Text::Soundex;
my $PHONE_LIST = "$ENV{HOME}/phone_list";
my %seen;
my $cntr;
if(@ARGV == 0 || $ARGV[0] eq '-?' || $ARGV[0] eq '-h') {
print "\nUsage: $0 <last name> ...\n\n";
exit 1;
}
open(PL, "$PHONE_LIST") || die "Can't open $PHONE_LIST: $!";
while(<PL>) {
foreach my $name (@ARGV) {
$name =~ /\U$name/;
my $code = soundex($name);
my $column_one = my $column_two = $_;
my $flag;
$column_one =~ s/^\s-*\s*(\w+)\s*-*,*.*$/$1/g;
$column_two =~ s/^\s\w+,\s.* (\w+),.*/$1/g;
if($column_one && soundex($column_one) eq $code) {
$flag = 1;
} elsif($column_two && soundex($column_two) eq $code) {
$flag = 2;
}
if($flag) {
$seen{$name}++;
if(++$cntr == 1) {
printf("\n%-12s %-15s %-8s %-5s\n",
"First", "Last", "Ext", "Rm#");
print "=" x 43, "\n";
}
if($flag == 1) {
s/^\s(\w+), (\w+)\s+(\w?)\s?(\d{4,10})\s+(\S+)\s*.*$/$
+2 $1 $3 $4 $5/g;
} elsif($flag == 2) {
s/^\s.*\s(\w+), (\w+)\s+(\w?)\s?(\d{4,10})\s+(\S*)\s+.
+*$/$2 $1 $3 $4 $5/g;
}
my ($fname, $lname, $m, $num, $room) = split(/ /);
chomp($room);
printf("%-10s %1s %-15s %-8d %-5s\n",
$fname, $m, $lname, $num, $room);
}
}
}
foreach my $name (@ARGV) {
$name =~ /\U$name/;
if(! exists($seen{$name})) {
print "\nNot found: $name";
}
}
print "\n";
close(PL);
This is really a quick hack, so it's not as clean as I'd like.
Re: Need help with regex
by tye (Sage) on Sep 22, 2000 at 01:18 UTC
|
Since this seems to be fixed-width columns, I'd probably go with
tilly's suggestion of unpack. Here is a stab at a regex solution
that I'd probably not use:
#!/usr/bin/perl -w
use strict;
# First, we will allow no digits anywhere in a name;
# this will allow us to detect the extension after the
# name. Second, we only allow single spaces in a name
# (and can't start or end with a space). Third, names
# must contain a comma (but not in front).
my $name= qr{
(
[^\s\d]
(?: \s?[^\d\s]+ )*
) , (
(?: \s?[^\d\s]+ )*
)
}x;
# Org can only have single spaces and no commas, but is optional:
my $org= qr{ (?: [^\s,](?:\s?[^\s,]+)*[^\s,] )? }x;
# Heading must have "-" on each end and just one word between:
my $head= qr{ -\s+[a-z]+\s+- }ix;
my $entry= qr{
\s*
(?:
$head
|
$name \s+ (\d+) \s+ (\S*) \s+ ($org)
)
}x;
#print "$entry\n";
while( <DATA> ) {
my @matches= m/^$entry$entry\s*$/;
#print "$_";
for( [0..4], [5..9] ) {
my( $last, $first, $ext, $room, $org )=
map { defined $_ ? $_ : "" } @matches[@$_];
if( "" ne $last ) {
print "($last), ($first) ($ext) ($room) ($org)\n";
}
}
}
__END__
NAME EXT RM# ORG NAME EXT RM#
+ORG
------------------------------------- -----------------------------
+---------
- A - BASILE, YYYY 5555 1H08
+ IAMG
ABEND, YYYYYY 5555 2014 CE BATES, YYYY 5555 4832
+ BT
ABRAMS, YYYYY 5555 C-07 BATHERSFIELD, YY 5555 B-39
+ CE
ADAMS, YYYY 5555 255 OTC BAXTER, YYYY 5555 A-43
+
ADAMS, YYYY 5555 149 BT BEAR, YYYYYY 5555 H42
+ ATO
ADAMS, YYYYYYY 5555 A-16 BEASLEY, YYY 5555 D-79
+
ADUAKA, YYYYYYYY 5555 A-52 BEATTY, YY 5555 4832
+ TAG
AHMED, YYYYYY 5555 C-63 BECHTLE, YYYY 5555 D-26
+
AHMED, C. YYYYYY 5555 D-69 SOMEU BEDOYA, YYYYYYYY 5555
+ CE
Which prints the following:
(BASILE), ( YYYY) (5555) (1H08) (IAMG)
(ABEND), ( YYYYYY) (5555) (2014) (CE)
(BATES), ( YYYY) (5555) (4832) (BT)
(ABRAMS), ( YYYYY) (5555) (C-07) (BATHERSFIEL)
(D), ( YY) (5555) (B-39) (CE)
(ADAMS), ( YYYY) (5555) (255) (OTC)
(BAXTER), ( YYYY) (5555) (A-43) ()
(ADAMS), ( YYYY) (5555) (149) (BT)
(BEAR), ( YYYYYY) (5555) (H42) (ATO)
(ADAMS), ( YYYYYYY) (5555) (A-16) (BEASLE)
(Y), ( YYY) (5555) (D-79) ()
(ADUAKA), ( YYYYYYYY) (5555) (A-52) (BEATT)
(Y), ( YY) (5555) (4832) (TAG)
(AHMED), ( YYYYYY) (5555) (C-63) (BECHTL)
(E), ( YYYY) (5555) (D-26) ()
(AHMED), ( C. YYYYYY) (5555) (D-69) (SOMEU)
(BEDOYA), ( YYYYYYYY) (5555) (CE) ()
Finally, $name =~ /\U$name/; doesn't do anything useful. You
want $name= uc $name;.
-
tye
(but my friends call me "Tye")
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re (tilly) 1: Need help with regex
by tilly (Archbishop) on Sep 22, 2000 at 00:16 UTC
|
Is that space delimited or tab? If space delimited then use unpack to grab columns, if tab delimited I would first expand spaces then use unpack again. | [reply] [Watch: Dir/Any] |
Re: Need help with regex
by Fastolfe (Vicar) on Sep 22, 2000 at 01:04 UTC
|
It seems to me that it might be easier to pre-parse this, then examine your data structure to see what matches. Treating it like a single string might make it easier also:
local($/);
my $data = <PL>;
while ($data =~ /
\s*(?:-\s.\s-)? # eat up "- A -"
\s*(\w+), # eats up spaces, last name in $1
\s(?:(\w+\.)\s)? # MI. in $2 (if any)
(\w+) # first name in $3
\s*(\d+) # phone number in $4
\s+(\S+) # room number in $5
\s{1,6}(\S*) # grabs the next "close" string, ORG in $6
/sxg) {
# store and/or check to see if this is our guy
}
The while loop in conjunction with the g regex modifier will keep the test going through the entire string, which is effectively all of your data. I haven't tested that regexp, but it "seems" correct.. Fix it up if you need to. | [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Need help with regex
by BastardOperator (Monk) on Sep 22, 2000 at 18:34 UTC
|
Update: I went with tilly's suggestion to use unpack, however I'm not sure that I did this efficiently. I first did an unpack into 2 columns, then broke it up from there, unpacking each column. This actually runs slower than my regex version, but it is a little nicer on the eyes. Suggestions are very welcome!
#!/usr/bin/env perl
use strict;
use Text::Soundex;
my $PHONE_LIST = "$ENV{HOME}/phone_list";
my %seen;
my $cntr;
if(@ARGV == 0 || $ARGV[0] eq '-?' || $ARGV[0] eq '-h') {
print "\nUsage: $0 <last name> ...\n\n";
exit 1;
}
open(PL, "$PHONE_LIST") || die "Can't open $PHONE_LIST: $!";
while(<PL>) {
foreach my $name (@ARGV) {
my $code = soundex($name);
my @name_list;
undef @name_list;
my ($column_one, $column_two) = unpack("a40a40", $_);
if($code eq soundex(split(/,/, $column_one))) {
@name_list = unpack("a18a5a7", $column_one);
} elsif($code eq soundex(split(/,/, $column_two))) {
@name_list = unpack("a18a5a7", $column_two);
}
if(defined(@name_list)) {
$seen{$name}++;
if(++$cntr == 1) {
printf("\n %-24s %-9s %-5s\n ",
"Last, First M", "Ext", "Rm#");
print "=" x 40, "\n";
}
printf("%-25s %-8d %-5s\n", @name_list);
}
}
}
foreach my $name (@ARGV) {
if(! exists($seen{$name})) {
print "\nNot found: $name";
}
}
print "\n";
close(PL);
| [reply] [Watch: Dir/Any] [d/l] |
|
for my $column ( unpack("a40a40", $_) ) {
if( $code eq soundex( (split(/,/, $column))[0] ) ) {
@name_list = unpack("a18a5a7", $column);
$seen{$name}++;
if(++$cntr == 1) {
printf("\n %-24s %-9s %-5s\n ",
"Last, First M", "Ext", "Rm#");
print "=" x 40, "\n";
}
printf("%-25s %-8d %-5s\n", @name_list);
}
}
Note also that defined @name_list is not a good idea. It might work in this specific example with your version of perl, but using defined on aggregate types (arrays and hashes) is documented as not a good idea. If you don't use my code above, change the
if(defined(@name_list)) to
if(0 < @name_list) or just
if(@named_list)
-
tye
(but my friends call me "Tye") | [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Need help with regex
by Fastolfe (Vicar) on Sep 21, 2000 at 23:45 UTC
|
If they're using Word, then obviously they have MS Office.. Why in the WORLD aren't they using Excel??? Heh. That would make this task infinitely easier. | [reply] [Watch: Dir/Any] |
Re: Need help with regex
by Anonymous Monk on Sep 22, 2000 at 02:16 UTC
|
I've had a similar thing I just used substr to get my info. | [reply] [Watch: Dir/Any] |
|
Trust me, unpack is nicer if you need to get a lot of
fields. For this kind of text your format can simply just use "a" and "x". For instance to get 5, skip 5, then get 5 characters use "a5x5a5". Very handy for turning columns into lists...
Cheers,
Ben
| [reply] [Watch: Dir/Any] |
|
|