If there were enough data to warrant an external application then I would be looking to use a database.
I agree. It's easy enough. It's Friday, i got my lunchbreak, so why not give the OP an example (using PostgreSQL, since that's what i have installed). Since OP didn't give us example files, let's just create some data out of thin air (zipcodes and census data) for our lovely tropical island. The censusdata is missing the zipcodes (because our friend Penultimo doesn't understand them) but it has valid city names.
#!/usr/bin/env perl
use strict;
use warnings;
use Crypt::Digest::SHA256 qw[sha256_hex];
use Carp;
# Make a fake zipcode list
print "Writing ZIP codes\n";
open(my $zipfh, '>', 'zipcodes.txt') or croak($!);
for(my $i = 10_000; $i <= 99_999; $i++) {
# Fake a city name ;-)
my $cityname = sha256_hex($i);
print $zipfh "$i;$cityname\n";
}
close $zipfh;
# Make a fake data file
print "Writing data\n";
open(my $datafh, '>', 'census.txt') or croak($!);
for(my $i = 0; $i < 1_000_000; $i++) {
my $zip = int(rand(89_999)) + 10_000;
my $cityname = sha256_hex($zip);
my $population = int(rand(1_000_000));
print $datafh "$cityname;$population\n";
}
close $datafh;
print "Done\n";
Next step is to log in to our database, create some tables and insert the data from the files. The last step is to add the missing zipcodes into the census data table using the city names.
#!/usr/bin/env perl
use strict;
use warnings;
use Carp;
use DBI;
my $dbh = DBI->connect("dbi:Pg:dbname=TropicoDB;host=/var/run/postgres
+ql", 'ElPresidente', 'WorldDomination', {AutoCommit => 0})
or croak($!);
my @stmts = (
"CREATE TABLE IF NOT EXISTS zipcodes (
zipcode text NOT NULL,
cityname text NOT NULL,
CONSTRAINT zipcodes_pk PRIMARY KEY(zipcode)
)",
"CREATE TABLE IF NOT EXISTS censusdata (
census_id bigserial,
zipcode text,
cityname text NOT NULL,
peoplecount bigint NOT NULL,
CONSTRAINT census_pk PRIMARY KEY(census_id)
)",
# Empty tables if they already have data
"TRUNCATE zipcodes",
"TRUNCATE censusdata",
);
foreach my $stmt (@stmts) {
print "Running $stmt...\n";
$dbh->do($stmt) or croak($dbh->errstr);
}
print "Loading zipcodes into database...\n";
my $zipsth = $dbh->prepare("INSERT INTO zipcodes (zipcode, cityname) V
+ALUES (?, ?)")
or croak($dbh->errstr);
open(my $zipfh, '<', 'zipcodes.txt') or croak($!);
while((my $line = <$zipfh>)) {
chomp $line;
my ($zipcode, $cityname) = split/\;/, $line;
if(!$zipsth->execute($zipcode, $cityname)) {
croak($dbh->errstr);
}
}
close $zipfh;
print "Loading census data into database...\n";
my $censussth = $dbh->prepare("INSERT INTO censusdata (cityname, peopl
+ecount) VALUES (?, ?)")
or croak($dbh->errstr);
open(my $censusfh, '<', 'census.txt') or croak($!);
while((my $line = <$censusfh>)) {
chomp $line;
my ($cityname, $population) = split/\;/, $line;
if(!$censussth->execute($cityname, $population)) {
croak($dbh->errstr);
}
}
close $censusfh;
print "Adding missing zipcodes to census data...\n";
$dbh->do("UPDATE censusdata c SET zipcode = z.zipcode FROM zipcodes z
+WHERE c.cityname = z.cityname")
or croak($dbh->errstr);
$dbh->commit;
print "Done\n";
Now that we have the data in the database, it's very simple to run some reports. El Presidente needs to know which areas of his island are overpopulated and where to move those people. He also wants to know if any areas are still missing census reports.
#!/usr/bin/env perl
use strict;
use warnings;
use Carp;
use DBI;
my $dbh = DBI->connect("dbi:Pg:dbname=TropicoDB;host=/var/run/postgres
+ql", 'ElPresidente', 'WorldDomination', {AutoCommit => 0})
or croak($!);
generateReport($dbh, "Most populated zip codes", "SELECT zipcode, sum(
+peoplecount) FROM censusdata
GROUP BY zipcode
ORDER BY sum(peoplec
+ount) desc
LIMIT 10");
generateReport($dbh, "Least populated zip codes", "SELECT zipcode, sum
+(peoplecount) FROM censusdata
GROUP BY zipcode
ORDER BY sum(peoplec
+ount)
LIMIT 10");
generateReport($dbh, "Zip codes without census", "SELECT z.zipcode, z.
+cityname FROM zipcodes z
WHERE NOT EXISTS (
SELECT 1 FROM cens
+usdata c
WHERE c.zipcode =
+z.zipcode
)");
$dbh->commit;
sub generateReport {
my ($dbh, $title, $stmt) = @_;
print $title, "\n";
print "=" x length($title), "\n";
my $sth = $dbh->prepare($stmt) or croak($dbh->errstr);
$sth->execute or croak($dbh->errstr);
while((my @line = $sth->fetchrow_array)) {
print join(';', @line), "\n";
}
$sth->finish;
print "\n\n";
}
Results:
Most populated zip codes
========================
15864;16121159
50274;15581665
90418;15213432
47146;14784133
79867;14738692
66935;14603419
25969;14572689
29868;14531580
63780;14452431
95050;14350914
Least populated zip codes
=========================
26940;110605
10822;158808
11776;170321
43384;187501
61853;211522
34028;227782
21339;228705
61363;319850
22065;322939
42693;329595
Zip codes without census
========================
99999;fd5f56b40a79a385708428e7b32ab996a681080a166a2206e750eb4819186145