Collecting Country in a paragraph

songahji has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I am wondering whether there is a module that collects a list of countries from a paragraph or a stream of data.

Example: I want to match Mexico but not New Mexico.
I suspect there are other cases like this, where a country name is part of a longer name.

Hmm, I could generate the list of countries manually and loop over it with a regex.

sub get_country {
   my $data = shift;
   # let $data be one line of data, thus
   
   open IN, ">country.lis";
   @country_list = <IN>;  # <IN> is file handle reading the country list;
   close IN;

for $c (@country_list) {
  if ($data =~ m/(\w+)\s+$c\s+/) {
      next if (($c eq 'Mexico') && (lc($1) eq 'new' ) );         
      push(@my_bucket, $c);
   }
}
return \@my_bucket;
}

Cheers,
HJ


Re: Collecting Country in a paragraph by ikegami (Patriarch) on Apr 25, 2006 at 21:08 UTC
I'll address the basics, since there's a lot of work/learning to be done there.

- I opened `IN` for reading. You had it open for writing.
- I initialized (by localizing) `@my_bucket` to allow `get_country` to be called more than once.
- I added error checking.
- I switched to the safer 3-arg `open`.
- I localized `IN`. Now that `IN` is localized, `close IN;` is optional, so I removed it.
- I used `while (<IN>)` instead of loading the entire file into memory, which was needlessly inefficient.

sub get_country {
   my $data = shift;

   open(local *IN, '<', 'country.lis')
      or die("Unable to open list of countries country.lis: $!\n");

   my @my_bucket;
   while (defined(my $c = <IN>)) {
      ...
   }

   return \@my_bucket;
}
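A minimal sketch of how the elided loop body might be filled in, assuming the list file holds one country name per line. It uses a lexical filehandle rather than `local *IN` and takes the file name as a second argument. Both choices are assumptions made for illustration and are not part of the reply above. The only exception it handles is the 'New Mexico' case from the question.

```perl
use strict;
use warnings;

# Sketch: return an array ref of the names from $list_file that occur in $data.
# The $list_file argument and the one-name-per-line format are assumptions.
sub get_country {
    my ($data, $list_file) = @_;

    open(my $in, '<', $list_file)
        or die("Unable to open list of countries $list_file: $!\n");

    my @my_bucket;
    while (defined(my $c = <$in>)) {
        chomp $c;              # drop the newline so the name can match
        next if $c eq '';      # skip blank lines

        # Special case from the question: "Mexico" must not follow "New ".
        my $guard = $c eq 'Mexico' ? '(?<!New )' : '';

        # \Q...\E quotes regex metacharacters in the name; the lookarounds
        # keep the name from matching inside a longer word.
        next unless $data =~ /(?<!\w)$guard\Q$c\E(?!\w)/;
        push @my_bucket, $c;
    }
    close $in;

    return \@my_bucket;
}
```

Called as `get_country($line, 'country.lis')`, it returns a reference to the matching names in file order. Unlike the original pattern, it does not require whitespace around the name, so a name at the start of the line or followed by punctuation is still found.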
Re^2: Collecting Country in a paragraph by songahji (Friar) on Apr 26, 2006 at 14:40 UTC
Well spotted. Thanks for the correction! ++
Re: Collecting Country in a paragraph by davidrw (Prior) on Apr 25, 2006 at 20:52 UTC
`m/(\w+)\s+$c\s+/` won't work because it requires a word before the name and whitespace after it, so it misses cases like this:

Mexico is my favorite country.
I like Mexico.
Welcome to Mexico--a fun place.

You need a negative look-behind to exclude 'New Mexico' (see perlre), and in general I suggest just having a list of regexes to check against:

my @countries = (  # make a list of regexes
  qr/(?<!New )Mexico/,
  'France',
  'Germany',
  qr/(?:(?:U\.S\.(?:A\.)?)|USA?|(?:The )?United States(?: of America)?)/,
);
my $s = do {local $/=undef; <DATA>};
my %found;
foreach my $re ( @countries ){
  $found{$1}++ while $s =~ s/\b($re)\b/====/s;  # NOTE: this is destructive to $s, but does get us the counts.
}
use Data::Dumper;
print Dumper \%found;
__DATA__
Mexico is my favorite country.
I like Mexico.
Welcome to Mexico--a fun place.
New Mexico
US
France
Germany
U.S.A.
USA
U.S.
US
United States
United States of America
The United States
The United States of America

Hmm... /me thinks I should add making a Regexp::Common::Country module to my project list.
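For comparison, a sketch of a non-destructive variant that counts with `m//g` instead of substituting. The helper name `count_countries` is hypothetical, and the patterns are adapted from the reply above rather than copied. The word-boundary checks are written as `(?<!\w)` and `(?!\w)` because `\b` cannot match between a trailing "." and whitespace, which matters for abbreviations such as "U.S.A.".

```perl
use strict;
use warnings;

# Patterns adapted from the reply above; longer abbreviations come first
# so that "U.S.A." is preferred over "U.S.".
my @countries = (
    qr/(?<!New )Mexico/,
    qr/France/,
    qr/Germany/,
    qr/U\.S\.A\.|U\.S\.|USA?|(?:The )?United States(?: of America)?/,
);

# Hypothetical helper: count each matched spelling without modifying $text.
sub count_countries {
    my ($text) = @_;
    my %found;
    for my $re (@countries) {
        # (?<!\w) and (?!\w) stand in for \b on either side of the match.
        $found{$1}++ while $text =~ /(?<!\w)($re)(?!\w)/g;
    }
    return \%found;
}
```

The returned hash is keyed by the spelling that matched, so "USA" and "U.S.A." are counted separately, as in the reply's version. Mapping all spellings to one canonical name would need an extra lookup table.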
Re^2: Collecting Country in a paragraph by songahji (Friar) on Apr 26, 2006 at 14:41 UTC
Yep, I like the way you do it! ++
Re: Collecting Country in a paragraph by Schuk (Pilgrim) on Apr 25, 2006 at 20:33 UTC
Hi, on CPAN I found Country. This might help. It has a country object which you could use: if you succeed in creating the object, you have a valid country. I don't know how your data is arranged, so getting a single country out of a sentence might be a problem. However, you could try checking only words that begin with an uppercase letter (assuming your countries are spelled correctly). I haven't tried that, though.

Cheers,
Schuk
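The uppercase-word idea could be sketched roughly as follows. The `%is_country` hash is a hypothetical stand-in for a real country list, and `capitalized_countries` is an illustrative name. Nothing here uses the CPAN module mentioned above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a full country list (e.g. loaded from a file).
my %is_country = map { $_ => 1 } ('Mexico', 'France', 'Germany', 'United States');

# Collect runs of capitalized words and keep those that are known countries.
sub capitalized_countries {
    my ($text) = @_;
    my @hits;
    # A run is one or more capitalized words separated by single spaces,
    # e.g. "New Mexico" or "United States".
    while ($text =~ /\b([A-Z][a-z]+(?: [A-Z][a-z]+)*)\b/g) {
        push @hits, $1 if $is_country{$1};
    }
    return \@hits;
}
```

Because "New Mexico" is captured as one run, it is looked up as a whole and rejected, with no special case for Mexico. The trade-off is that a capitalized word directly before a country, such as a sentence-initial "In Mexico", also forms a run and causes a miss, so this is only a rough filter.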