Valentine Sinemani:

All of the advice above is good, assuming a MySQL database engine. LOAD DATA will do mass imports; my issue with it is that (I believe) a single error will cause the entire import to fail. I work with lots of voter files on a regular basis, and I use Perl, not PHP, as does your friend.
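(If memory serves, LOAD DATA can in fact be told to keep going past bad rows: the IGNORE keyword discards rows that would violate a unique key rather than aborting, and with LOCAL, data-interpretation errors are demoted to warnings. A sketch only; the file path, table name and delimiters are hypothetical, so check them against your own schema and the MySQL manual:)

```sql
-- hypothetical file, table, and field layout
LOAD DATA LOCAL INFILE '/tmp/voters.csv'
IGNORE INTO TABLE registrant
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row
```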

You mention a couple of issues:

Timeouts: get the web browser out from between your import script and the database engine. Try to get direct shell-level access to the database; it will save you days of frustration.

pdf-ocr-csv conversion: Consider whether pdftotext (a utility that is part of the xpdf package on Linux) might save you the trouble of the optical character recognition work. You can handle the PDFs as simple electronic files, without needing to print them and rescan them back into electronic form. This may not work, depending on the nature of the PDFs you were given. You will then need to parse the resulting data into your csv format; you might find awk and sed at least as useful for that purpose as Perl.
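(As a sketch of that pipeline: the real first step would be `pdftotext -layout voters.pdf voters.txt`, which tends to emit fixed-width columns. Here that step is simulated with printf so the sed pass can be shown end to end; the sample lines and column widths are hypothetical, so yours will differ:)

```shell
# Real first step: pdftotext -layout voters.pdf voters.txt  (xpdf).
# Simulate its fixed-width output with printf, then let sed collapse
# runs of two or more spaces into commas for a rough CSV.
printf 'SMITH  JOHN   123 MAIN ST\nDOE  JANE    45 OAK AVE\n' \
  | sed -e 's/   */,/g' -e 's/^ *//'
```

Single spaces inside a field (like the street address) survive; only multi-space column gaps become delimiters.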

Working with the imports: my _build_insert() function below escapes single quotes as they occur. With the caveat that I am working in Perl against a PostgreSQL database engine, while you are working in PHP against a MySQL engine, here is a sample script I use for voter file imports:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use DBI;

my $file = '/home/hesco/lists/voters/ca/psr00693.txt';
my $db   = 'voters';
my $host = '127.0.0.1';
my $user = 'vtr_user';
my $pw   = 'secret';

my $dsn = "dbi:Pg:database=$db;host=$host";
my $dbh = DBI->connect( $dsn, $user, $pw,
    { RaiseError => 0, PrintError => 0, AutoCommit => 0 } )
  or die $DBI::errstr;

my $precinct_sql =<<EOQ;
SELECT precinct_ndx FROM precinct WHERE precinct = ? AND precinct_part = ?;
EOQ

my @fields = ();
my $counter = 0;
my $field_names;
my ($insert_reg, $sth, $prec_id);
my ($insert_address, $phone, $insert_hy, $t);   # used in the fuller version
my ($m, $d, $y);
my $restart = 0;

open( VOTERS, '<', $file ) or die "Cannot open $file: $!";
while (<VOTERS>) {
  $counter++;
  @fields = split ",", $_;      # for csv
  # @fields = split "\t", $_;   # for tab delimited

  if ( $counter == 1 ) {        # header row: map column names to indexes
    foreach my $ndx ( 0 .. $#fields ) {
      $fields[$ndx] =~ s/\s*$//;
      $fields[$ndx] =~ s/'/\\'/g;
    }
    foreach my $i ( 0 .. $#fields ) {
      $field_names->{'ndx'}->{$i} = $fields[$i];
      $field_names->{'field_name'}->{ $fields[$i] } = $i;
    }
    # print "\$field_names is: \n", Dumper($field_names);
    foreach my $i ( 0 .. $#fields ) {
      print "$i -- $fields[$i]\n";
    }
    next;
  }

  # registrants table
  my @reg_indexes  = (0,1,25,2,3,4,5,5,26,27,28,29,30,31,34,35,36,39);
  my @hy_indexes   = (1,58,59,60,61,62,63,64,65);
  my @addr_indexes = (1,6,7,8,9,10,11,12,13,14,15,16);
  # define the indexes for any other tables
  # into which you want to normalize this data

  $sth = $dbh->prepare($precinct_sql);
  $sth->execute( $fields[32], $fields[33] );
  ($prec_id) = $sth->fetchrow_array();
  $prec_id = -1 unless defined $prec_id;

  $insert_reg = "INSERT INTO registrant VALUES(";
  $insert_reg = _build_insert( \@fields, $insert_reg, @reg_indexes );
  $insert_reg =~ s/,$/,$prec_id);/;
  # print "\$insert_reg is: $insert_reg\n";

  $sth = $dbh->prepare($insert_reg);
  $sth->execute()
    or warn "\$sql is: $insert_reg, error is: " . $dbh->errstr;
  print "\$registrant_id is: " . $fields[1] . "\n";
  $dbh->commit;
}
close(VOTERS);

1;

sub _build_insert {
  my ( $fields, $insert, @indexes ) = @_;
  foreach my $field (@indexes) {
    unless ( defined( $fields->[$field] ) ) {
      $insert .= "'',";
      next;
    }
    $fields->[$field] =~ s/\s*$//;
    $fields->[$field] =~ s/'/\\'/g;   # escape embedded single quotes
    $insert .= "'$fields->[$field]',";
  }
  return $insert;
}
This is the basics. It's simple and proceeds by brute force, one record at a time. It's willing to let an occasional error fall by the wayside while it continues to process records. By adding a condition early on to check for a particular precinct or political party, I am able to import only the records I need. I used a variation on this script last week and imported 50,000 records in about an hour.

Thank you for being engaged in the process, against such incredibly violent opposition.

Yours for a just and sustainable future,
-- Hugh Esco
CampaignFoundations.com

if( $lal && $lol ) { $life++; }

In reply to Re: loading csv files into MySQL databse by hesco
in thread loading csv files into MySQL databse by zimvoter
