songahji has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I am wondering whether there is any module that collects a list of countries from a paragraph or stream of data.

Ex: I want Mexico but not New Mexico.
I suspect there are other superset cases like this.

Hmm, I could generate the list of countries manually and loop over it with a regex.

    sub get_country {
        my $data = shift;   # let $data be one line of data, thus
        open IN, ">country.lis";
        @country_list = <IN>;   # <IN> is file handle reading the country list;
        close IN;
        for $c (@country_list) {
            if ($data =~ m/(\w+)\s+$c\s+/) {
                next if (($c eq 'Mexico') && (lc($1) eq 'new'));
                push(@my_bucket, $c);
            }
        }
        return \@my_bucket;
    }

Cheers,
HJ

Replies are listed 'Best First'.
Re: Collecting Country in a paragraph
by ikegami (Patriarch) on Apr 25, 2006 at 21:08 UTC

    I'll address the basics, since there's a lot of work/learning to be done there.

    • I opened IN for reading. You had it open for writing.
    • I initialized @my_bucket (by scoping it with my) to allow get_country to be called more than once.
    • I added error checking.
    • I switched to the safer 3 arg open.
    • I localized IN.
    • Now that IN is localized, close IN; is optional. I removed it.
    • I used while (<IN>) instead of loading the entire file into memory. That was needlessly inefficient.
    sub get_country {
        my $data = shift;
        open(local *IN, '<', 'country.lis')
            or die("Unable to open list of countries country.lis: $!\n");
        my @my_bucket;
        while (defined(my $c = <IN>)) {
            ...
        }
        return \@my_bucket;
    }
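The loop body is elided above. One possible way to fill it in is sketched below; it is not necessarily what was intended. The $list_file parameter and the lexical filehandle $in are assumptions made for illustration (the code above hard-codes country.lis and localizes IN), and the word-boundary match is a swapped-in technique rather than the original pattern. It does not yet exclude names such as New Mexico.

```perl
use strict;
use warnings;

# Hypothetical variant of get_country: the list file name is passed in
# (an assumption; the original hard-codes country.lis).
sub get_country {
    my ($data, $list_file) = @_;

    open(my $in, '<', $list_file)
        or die("Unable to open list of countries $list_file: $!\n");

    my @my_bucket;
    while (defined(my $c = <$in>)) {
        chomp $c;            # drop the newline so the name can match
        next if $c eq '';    # skip blank lines in the list

        # \Q...\E quotes regex metacharacters in the name;
        # \b anchors the match at word boundaries.
        push @my_bucket, $c if $data =~ /\b\Q$c\E\b/;
    }
    return \@my_bucket;      # $in is closed when it goes out of scope
}
```

Without the chomp, each name would still carry its trailing newline and would only match at the end of a line of $data.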
      Well spotted.
      Thanks for the correction! ++
Re: Collecting Country in a paragraph
by davidrw (Prior) on Apr 25, 2006 at 20:52 UTC
    m/(\w+)\s+$c\s+/ won't work because of cases like this:
    Mexico is my favorite country.
    I like Mexico.
    Welcome to Mexico--a fun place.
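To see why, here is a small check of the original pattern against those three sentences, next to a plain word-boundary match for comparison. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $c = 'Mexico';
my @sentences = (
    'Mexico is my favorite country.',    # no word before the name
    'I like Mexico.',                    # punctuation, not whitespace, after it
    'Welcome to Mexico--a fun place.',   # dashes directly after it
);

# grep in scalar context returns the number of sentences that match.
my $original = grep { /(\w+)\s+$c\s+/ } @sentences;
my $bounded  = grep { /\b$c\b/ } @sentences;

print "original pattern: $original of 3 matched\n";
print "word boundaries:  $bounded of 3 matched\n";
```

The original pattern demands a preceding word plus whitespace and trailing whitespace, so it matches none of the three; \b only asks for a word edge on each side and matches all of them.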
    You need a negative look-behind to exclude 'New Mexico' (see perlre), and in general I suggest just having a list of regexes to check against:
    my @countries = (   # make a list of regex's
        qr/(?<!New )Mexico/,
        'France',
        'Germany',
        qr/(?:(?:U\.S\.(?:A\.)?)|USA?|(?:The )?United States(?: of America)?)/,
    );
    my $s = do { local $/ = undef; <DATA> };
    my %found;
    foreach my $re ( @countries ) {
        $found{$1}++ while $s =~ s/\b($re)\b/====/s;   # NOTE: this is destructive to $s, but does get us the counts.
    }
    use Data::Dumper;
    print Dumper \%found;
    __DATA__
    Mexico is my favorite country.
    I like Mexico.
    Welcome to Mexico--a fun place.
    New Mexico
    US
    France
    Germany
    U.S.A.
    USA
    U.S.
    US
    United States
    United States of America
    The United States
    The United States of America
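If modifying $s is a concern, a /g match in scalar context can do the counting without touching the string. A minimal sketch, using a shortened list and a made-up sample string:

```perl
use strict;
use warnings;

# Illustrative patterns only; a real list would be much longer.
my @countries = (
    qr/(?<!New )Mexico/,   # Mexico, unless preceded by "New "
    qr/France/,
    qr/Germany/,
);

my $s = "I like Mexico. New Mexico is a state. France, Germany and Mexico.";

my %found;
for my $re (@countries) {
    # With /g in scalar context each iteration resumes after the
    # previous match, so $s itself is left unchanged.
    $found{$1}++ while $s =~ /\b($re)\b/g;
}

print "$_: $found{$_}\n" for sort keys %found;
```

Here Mexico is counted twice and the occurrence inside New Mexico is skipped by the look-behind.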
    Hmm.. /me thinks i should add making a Regexp::Common::Country module to my project list
      Yep, I like the way you do it! ++
Re: Collecting Country in a paragraph
by Schuk (Pilgrim) on Apr 25, 2006 at 20:33 UTC
    Hi,

    On CPAN I found Country. This might help. It has a country object which you could use. If you succeed in creating the object, you have a valid country.
    I don't know how your data is arranged, so getting a single country out of a sentence might be a problem. However, you could try checking only words that begin with an uppercase letter (assuming your countries are correctly spelled).

    Haven't tried that though.

    Cheers
    Schuk