Valentine Sinemani:
All of the advice above is good if you are using a MySQL database engine. LOAD DATA will do mass imports. My issue with it is that (I believe) a single error will cause an entire import to fail. I work with lots of voter files on a regular basis, and I use Perl, not PHP as your friend does.
You mention a couple of issues:
Timeouts: get the web browser out of the way between your import script and the database engine. Try to get direct shell level access to the database. It will save you days of frustration.
pdf-ocr-csv conversion: Consider whether pdftotext (a utility that is part of the xpdf package on Linux) might save you the trouble of doing the optical character recognition work. You can handle the PDFs as simple electronic files without the need to print them and then rescan them. This may not work, depending on the nature of the PDFs you were given: a PDF made up of scanned page images has no text to extract. You will then need to parse the resulting text into your CSV format. You might find that awk and sed are at least as useful for that purpose as Perl.
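Here is a minimal sketch of that parsing step in Perl. It assumes the PDF was first converted with `pdftotext -layout` and that the columns in the resulting text are separated by runs of two or more spaces. The helper name `layout_line_to_csv` and the sample lines are made up for illustration; a real voter roll will probably need its own rules.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn one line of "pdftotext -layout" output into one CSV line.
# Assumption: columns are separated by two or more spaces.
sub layout_line_to_csv {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;                  # trim both ends
    return undef if $line eq '';              # nothing to emit for blank lines
    my @cols = split /\s{2,}/, $line;
    foreach my $col (@cols) {
        $col =~ s/"/""/g;                     # CSV escapes a quote by doubling it
        $col = qq{"$col"} if $col =~ /[",]/;  # quote only fields that need it
    }
    return join ',', @cols;
}

# Made-up sample lines standing in for the converted text.
my @sample = (
    "DOE       JANE     F    12 EXAMPLE ROAD, HARARE",
    "",
    "ROE       JOHN     M    7 SAMPLE AVE",
);
foreach my $line (@sample) {
    my $csv = layout_line_to_csv($line);
    print "$csv\n" if defined $csv;
}
```

To run it over a real file, replace the sample list with a loop over STDIN and pipe the output of pdftotext into the script.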
Working with the imports: My _build_insert() function below escapes the single quotes as they occur. With the caveat that I am working in Perl against a PostgreSQL database engine, while you are working in PHP on a MySQL database engine, here is a sample script I use for voter file imports:
#!/usr/bin/perl -w
use strict;
use warnings;
use Data::Dumper;
use DBI;

my $file = '/home/hesco/lists/voters/ca/psr00693.txt';
my $db   = 'voters';
my $host = '127.0.0.1';
my $user = 'vtr_user';
my $pw   = 'secret';
my $dsn  = "dbi:Pg:database=$db;host=$host";
my $dbh  = DBI->connect($dsn, $user, $pw,
    { RaiseError => 0, PrintError => 0, AutoCommit => 0 })
    or die $DBI::errstr;

my $precinct_sql =<<EOQ;
SELECT precinct_ndx
FROM precinct
WHERE precinct = ? AND precinct_part = ?;
EOQ
# the precinct lookup never changes, so prepare it once outside the loop
my $precinct_sth = $dbh->prepare($precinct_sql) or die $dbh->errstr;

my @fields  = ();
my $counter = 0;
my $field_names;
my ($insert_reg, $sth, $prec_id);

open(my $voters, '<', $file) or die "Cannot open $file: $!";
while (<$voters>) {
    chomp;
    $counter++;
    @fields = split ",", $_;      # for csv
    # @fields = split "\t", $_;   # for tab delimited

    # the first line holds the column names
    if ($counter == 1) {
        foreach my $ndx (0 .. $#fields) {
            $fields[$ndx] =~ s/\s*$//;
        }
        foreach my $i (0 .. $#fields) {
            $field_names->{'ndx'}->{$i} = $fields[$i];
            $field_names->{'field_name'}->{$fields[$i]} = $i;
        }
        # print "\$field_names is: \n", Dumper($field_names);
        foreach my $i (0 .. $#fields) {
            print "$i -- $fields[$i]\n";
        }
        next;
    }

    # registrants table
    my @reg_indexes  = (0,1,25,2,3,4,5,5,26,27,28,29,30,31,34,35,36,39);
    my @hy_indexes   = (1,58,59,60,61,62,63,64,65);
    my @addr_indexes = (1,6,7,8,9,10,11,12,13,14,15,16);
    # define the indexes for any other tables
    # into which you want to normalize this data

    $precinct_sth->execute($fields[32], $fields[33]);
    ($prec_id) = $precinct_sth->fetchrow_array();
    if (!defined($prec_id)) { $prec_id = -1; }

    $insert_reg = "INSERT INTO registrant VALUES(";
    $insert_reg = _build_insert(\@fields, $insert_reg, @reg_indexes);
    $insert_reg =~ s/,$/,$prec_id);/;
    # print "\$insert_reg is: " . $insert_reg . "\n";

    $sth = $dbh->prepare($insert_reg);
    $sth->execute()
        or warn "\$sql is: $insert_reg, error is: " . $dbh->errstr;
    print "\$registrant_id is: " . $fields[1] . "\n";

    # commit one record at a time so that a bad row only loses itself
    $dbh->commit;
}
close($voters);

sub _build_insert {
    my ($fields, $insert, @indexes) = @_;
    foreach my $field (@indexes) {
        unless (defined($fields->[$field])) {
            $insert .= "'',";
            next;
        }
        # work on a copy, so a column used twice is not escaped twice
        my $value = $fields->[$field];
        $value =~ s/\s*$//;    # trim trailing whitespace
        $value =~ s/'/''/g;    # double every single quote (standard SQL escape)
        $insert .= "'$value',";
    }
    return $insert;
}
These are the basics. It's simple and proceeds by brute force, one record at a time. It's willing to let an occasional error fall by the wayside while it continues to process records. By adding a condition early on to check for a particular precinct or political party, I am able to import only the records I need. I used a variation on this script last week and imported 50,000 records in about an hour.
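As an alternative to escaping quotes by hand, DBI placeholders let the database driver do the quoting. The sketch below only builds the statement text and the list of values to bind; `build_placeholder_insert` is a hypothetical helper, not part of the script above, and the sample record is made up.

```perl
use strict;
use warnings;

# Build "INSERT INTO <table> VALUES (?,?,...)" plus the values to bind.
# $fields  : array ref holding one parsed record
# $indexes : array ref saying which columns to pick, in order
# @extra   : values appended after the picked columns (e.g. a precinct id)
sub build_placeholder_insert {
    my ($table, $fields, $indexes, @extra) = @_;
    my @binds;
    foreach my $i (@$indexes) {
        my $value = defined $fields->[$i] ? $fields->[$i] : '';
        $value =~ s/\s+$//;    # trim trailing whitespace
        push @binds, $value;
    }
    push @binds, @extra;
    my $sql = "INSERT INTO $table VALUES (" . join(',', ('?') x @binds) . ")";
    return ($sql, @binds);
}

# Made-up record; no quote escaping is needed for the apostrophe.
my @record = ("O'GXXX ", 'MELANIE', 'F');
my ($sql, @binds) = build_placeholder_insert('registrant', \@record, [0, 2], -1);
print "$sql\n";
print join('|', @binds), "\n";

# With a connected handle this would run as:
#   my $sth = $dbh->prepare($sql);
#   $sth->execute(@binds) or warn $dbh->errstr;
```

Because the statement text is the same for every record, it could be prepared once before the loop and executed once per row.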
Thank you for being engaged in the process, against such incredibly violent opposition.
Yours for a just and sustainable future,
-- Hugh Esco
CampaignFoundations.com
if( $lal && $lol ) { $life++; }
In the unlikely event that you continue to have trouble with this problem I would note that it may be possible to have trusted people outside your country work on it and send back the results in formats that should be relatively easy to load into MySQL. MySQL has backup facilities, including a program called mysqldump, that dump records from a table or database in text formats that should load fairly reliably. Some formats are included below. Be aware that different countries may have different date, time and currency formats which are controlled by MySQL locale variables.
Of course I would expect that there would be concerns about the trustworthiness of those doing the imports and forwarding them in such a case. Other than the issue of trust, I would be happy to help as, I think, would others.
My expertise in this kind of approach is somewhat limited and I welcome commentary on this suggestion.
Some sample mysqldump formats:
(As recommended by another poster, data has been cleansed to attempt to mask voter identities)
INSERT INTO `voters` VALUES ('GXXX','MELANIE','XXX','F','1982-05-22',22,5,1982,'XX- XXXXXXX XX','202 XXXXXXXXXX XXXX COURT HARARE','640463','7','MT PLEASANT');
INSERT INTO `voters` VALUES ('O\'GXXX','MELANIE','XXX','F','1982-05-22',22,5,1982,'XX- XXXXXXX XX','19 XXXXX STRATHAVEN HARARE','640463','7','MT PLEASANT');
INSERT INTO `voters` VALUES ('WXXXXX','HOPE','','F','1985-07-23',23,7,1985,'XX- XXXXXXX XX','17 XXXXXXX DR STRATHAVEN HARARE','640463','7','MT PLEASANT');
INSERT INTO `voters` VALUES ('MXXXXXXXXXX','KEVIN','XXXXXXXX','M','1981-11-18',18,11,1981,'XX- XXXXXXX XX','24A XXXXXXXXXX XXXX AVONDALE HARARE','640463','7','MT PLEASANT');
INSERT INTO `voters` VALUES ('DXXXX','NELSON','','M','1986-03-30',30,3,1986,'XX- XXXXXXX XX','10A XXXXXXXXXXXXX ROAD STRATHAVEN HARARE','640463','7','MT PLEASANT');
INSERT INTO `voters` VALUES ('LXXXXXXXX','TAARIQ','','M','1987-08-14',14,8,1987,'XX- XXXXXXX XX','24A XXXXXXXXXX XXXX AVONDALE HARARE','640463','7','MT PLEASANT');
INSERT INTO `voters` VALUES ('RXXXXXX','SUVRO','XXXX','M','1988-07-22',22,7,1988,'XX- XXXXXXX XX','18 XXXXXXXX AVE STRATHAVEN HARARE','640463','7','MT PLEASANT');
INSERT INTO `voters` VALUES ('GXXX','MELANIE','XXX','F','1982-05-22',22,5,1982,'XX- XXXXXXX XX','202 XXXXXXXXXX XXXX COURT HARARE','640463','7','MT PLEASANT'),('O\'GXXX','MELANIE','XXX','F','1982-05-22',22,5,1982,'XX- XXXXXXX XX','19 XXXXX STRATHAVEN HARARE','640463','7','MT PLEASANT'),('WXXXXX','HOPE','','F','1985-07-23',23,7,1985,'XX- XXXXXXX XX','17 XXXXXXX DR STRATHAVEN HARARE','640463','7','MT PLEASANT'),('MXXXXXXXXXX','KEVIN','XXXXXXXX','M','1981-11-18',18,11,1981,'XX- XXXXXXX XX','24A XXXXXXXXXX XXXX AVONDALE HARARE','640463','7','MT PLEASANT'),('DXXXX','NELSON','','M','1986-03-30',30,3,1986,'XX- XXXXXXX XX','10A XXXXXXXXXXXXX ROAD STRATHAVEN HARARE','640463','7','MT PLEASANT'),('LXXXXXXXX','TAARIQ','','M','1987-08-14',14,8,1987,'XX- XXXXXXX XX','24A XXXXXXXXXX XXXX AVONDALE HARARE','640463','7','MT PLEASANT'),('RXXXXXX','SUVRO','XXXX','M','1988-07-22',22,7,1988,'XX- XXXXXXX XX','18 XXXXXXXX AVE STRATHAVEN HARARE','640463','7','MT PLEASANT');
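For anyone preparing such a file by hand rather than with mysqldump, the multi-row form in the last sample can be produced with a few lines of Perl. This is a rough sketch only: `mysql_quote` and `multi_row_insert` are hypothetical helpers, the rows are cut-down placeholders, and it assumes MySQL's default backslash escaping (that is, the NO_BACKSLASH_ESCAPES SQL mode is not set).

```perl
use strict;
use warnings;

# Quote one value as a MySQL string literal. Every value is quoted;
# MySQL converts quoted digits when the target column is numeric.
sub mysql_quote {
    my ($value) = @_;
    return 'NULL' unless defined $value;
    $value =~ s/\\/\\\\/g;    # escape backslashes first
    $value =~ s/'/\\'/g;      # then single quotes
    return "'$value'";
}

# Build one INSERT statement carrying all rows, each row an array ref.
sub multi_row_insert {
    my ($table, @rows) = @_;
    my @tuples;
    foreach my $row (@rows) {
        push @tuples, '(' . join(',', map { mysql_quote($_) } @$row) . ')';
    }
    return "INSERT INTO `$table` VALUES " . join(',', @tuples) . ";";
}

# Placeholder rows in the spirit of the masked samples above.
my @rows = (
    ['GXXX',   'MELANIE', 'F'],
    ["O'GXXX", 'MELANIE', 'F'],
);
print multi_row_insert('voters', @rows), "\n";
```

Passing one row at a time to `multi_row_insert` gives the one-statement-per-record form instead, which is slower to load but easier to inspect when a single record is rejected.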