furry_marmot has asked for the wisdom of the Perl Monks concerning the following question:

Update: And that's why I hang out here. :-) Thanks to GrandFather and ikegami for pointing out the trees in my forest. After the street number, an address may contain a unit/apt number, a direction (SW/NE/etc), and a multi-part street name separate from the street suffix (Blvd/Drive/etc). I have an address parsing function I use for another application, but the code below processes potentially several thousand addresses to aggregate market information, and benchmarking showed that the parser was much slower here, where I'm just prettying up addresses for reports.

Somewhere along the line, I lost sight of the fact that the street suffix, if there is one, will always be at the end, obviating any need for context -- as it should be.

Hello Wise and Noble Monks.

I'm using map to break apart addresses from a variety of sources, standardize casing, abbreviations, and numbering, and then stitch them back together, like so:

# Rewritten
$prop->{'address'} =                        # reassemble the address
    join ' ',
    map {
        s/^(#.+)/\U$1/;                     # for units, such as #A or #215-C
        s/^(#?)0+([1-9]\S+)/$1$2/;          # correct numbering, like 0000048th -> 48th
        s/^(mc|o')(.+$)/\u\L$1\E\u\L$2/i;   # correct casing of O'Brien, McDonald, etc.
        s/\.+$//g;                          # remove literal dots at the end of elements.
        $_                                  # required at end of map when used like this
    }
    map { ucfirst lc $_ }                   # proper case words, simple
    split ' ', $prop->{'address'};          # split on spaces

# Normalize street suffixes to standard postal abbreviations
# This is easier than creating a temp array.
$prop->{'address'} =~ s/ (\w+)$ /
    if (defined $street_suf_lkup{lc $1}) {
        $street_suf_lkup{lc $1}
    } else {
        $1
    }
/ex;

This has worked fine for a few years. However, the other day, I came across the address 123 Circle Way. My program obediently changed it to 123 Cir Way, which then choked up the works further down the line.

My question is whether I can tell if there are any more elements coming through the map pipe, as it were. In other words, if Circle were the last element of the split address, it should be abbreviated, but if it's any other element (as in Circle Way), it should be left alone.

I fully realize there are other ways to do this by rewriting the code, but I like using map this way.

Any thoughts on this?

Thanks. Marmot.

Re: Counting elements being mapped via map{}
by GrandFather (Saint) on Sep 22, 2009 at 01:20 UTC

    The problem is that you have work that needs to be done for each element of the address and different work that should be done for specific elements. The solution is to separate the two sets of processing:

    use strict;
    use warnings;

    print mungeAddress($_) . "\n" for '123 Circle Way', '123 Way Circle';

    sub mungeAddress {
        my %street_suf_lkup = (circle => 'Cir', boulevard => 'Blvd');
        my @parts = map {
            # Apply the global corrections

            # proper case words, simple
            $_ = ucfirst lc $_;

            # if apartments or units, such as #A or #215-C, capitalize unit
            # numbers. Don't strip off pound signs.
            s/^(#.+)/\U$1/;

            # correct numbering, like 0000048th -> 48th
            s/^(#?)0+([1-9]\S+)/$1$2/;

            # correct casing of O'Brien, McDonald, etc.
            s/^(mc|o')(.+$)/\u\L$1\E\u\L$2/i;

            # Remove literal dots at the end of elements.
            s/\.+$//g;
            $_;
        } split ' ', $_[0];

        # Normalize street suffixes to standard postal abbreviations, for
        # example Circle -> Cir, Boulevard -> Blvd
        $parts[-1] = $street_suf_lkup{lc $parts[-1]}
            if exists $street_suf_lkup{lc $parts[-1]};

        return "@parts";
    }

    Prints:

    123 Circle Way
    123 Way Cir

    True laziness is hard work
Re: Counting elements being mapped via map{}
by ELISHEVA (Prior) on Sep 22, 2009 at 01:40 UTC

    Can you do it? Yes. Should you do it? No. map is appropriate for situations where the changes to each element in an array happen independently of the next. Furthermore, map would essentially be making you do redundant work.
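    For reference, one way the "yes" could look is to map over indices rather than over the words themselves, so the block can compare its position against $#words. This is only a minimal sketch: abbreviate_last is a hypothetical helper name, and the two-entry %street_suf_lkup stands in for the real table, which is not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real suffix table.
my %street_suf_lkup = (circle => 'Cir', boulevard => 'Blvd');

# Hypothetical helper: abbreviate a word only when it is the last one.
sub abbreviate_last {
    my @words = split ' ', shift;
    return join ' ', map {
        my $word = $words[$_];
        # $#words is the index of the final element, so this test is
        # true only for the last word of the address.
        ($_ == $#words && exists $street_suf_lkup{lc $word})
            ? $street_suf_lkup{lc $word}
            : $word;
    } 0 .. $#words;
}

print abbreviate_last($_), "\n" for '123 Circle Way', '123 Way Circle';
# prints:
# 123 Circle Way
# 123 Way Cir
```

    Mapping over 0 .. $#words keeps the pipeline style, but every element still pays for a position test that can only succeed once.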

    Using map for processing that requires look-ahead within the token array is stretching the presumed meaning of map and will make your code appear convoluted to future maintainers. Unless you plan on holding this job until your script is ready for the dust bins, I would advise against trying to cram your entire algorithm within map. I would recommend doing your processing in a foreach loop.

    As you are discovering, there is no way to process addresses without an awareness of context. Whether you do it with map or not, you are still going to have to make significant changes to your code to (a) check whether a token can be normalized without knowing what comes next, (b) keep a list of pending tokens that haven't yet been normalized, and (c) determine when you have enough context to normalize any pending tokens.

    If you were to do this with map, then you also would need to modify map {...} to return a list of pending tokens that were normalized rather than a single scalar as it does now:

    my @aDone = map {
        push @aPending, $_;
        my @aNormalized;
        while (scalar(@aPending)) {
            my $sPending = shift @aPending;
            my $sNormalized = normalize($sPending, [@aPending]);
            if (!defined($sNormalized)) {
                #not enough information to normalize yet
                #so put the token back into the pending list
                unshift @aPending, $sPending;
                last;
            }
            push @aNormalized, $sNormalized;
        }
        @aNormalized;  #return all normalized tokens
    } @aUnNormalized;

    If you look carefully at the above code, you will note that you are doing an extra copy. First you are pushing normalized elements into @aNormalized. Then you are appending @aNormalized to @aDone. Had you used a foreach loop instead of map you could eliminate one of the copies:

    my @aDone;
    foreach (@aUnNormalized) {
        push @aPending, $_;
        my @aNormalized;
        while (scalar(@aPending)) {
            my $sPending = shift @aPending;
            my $sNormalized = normalize($sPending, [@aPending]);
            if (!defined($sNormalized)) {
                #not enough information to normalize yet
                unshift @aPending, $sPending;
                last;
            }
            push @aDone, $sNormalized;
        }
    }
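    The snippets above leave normalize() undefined, so here is one hypothetical way to fill it in for the Circle/Way case, wrapped so it runs on its own. The body of normalize(), the normalize_address wrapper, the two-entry lookup table, and the final flush step are all assumptions made for this sketch and do not come from the thread. The flush is needed because a token that is still pending when the input runs out never gets any look-ahead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real suffix table.
my %street_suf_lkup = (circle => 'Cir', boulevard => 'Blvd');

# Hypothetical normalize(): returns the finished token, or undef when the
# token could be a street suffix and no later token has been seen yet.
sub normalize {
    my ($token, $lookahead) = @_;
    return $token unless exists $street_suf_lkup{lc $token};  # never a suffix
    return undef  unless @$lookahead;  # cannot tell yet whether it is last
    return $token;                     # something follows, so leave it alone
}

# Hypothetical wrapper around the foreach version of the pending loop.
sub normalize_address {
    my @aUnNormalized = split ' ', shift;
    my (@aPending, @aDone);
    foreach (@aUnNormalized) {
        push @aPending, $_;
        while (@aPending) {
            my $sPending    = shift @aPending;
            my $sNormalized = normalize($sPending, [@aPending]);
            if (!defined $sNormalized) {
                # not enough information yet, so put the token back
                unshift @aPending, $sPending;
                last;
            }
            push @aDone, $sNormalized;
        }
    }
    # Flush: a token still pending at end of input is the final token,
    # and it is only pending because it is in the lookup table.
    push @aDone, map { $street_suf_lkup{lc $_} } @aPending;
    return join ' ', @aDone;
}

print normalize_address($_), "\n" for '123 Circle Way', '123 Way Circle';
# prints:
# 123 Circle Way
# 123 Way Cir
```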

    Best, beth

Re: Counting elements being mapped via map{}
by ikegami (Patriarch) on Sep 22, 2009 at 01:21 UTC
    If a rule only applies to the last element, there's no reason to apply it to every element. Take it out of the map.