Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight

Skip problematic lines while populating database

by Anonymous Monk
on Aug 17, 2020 at 12:43 UTC ( #11120825=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I am trying to populate a table with a rather large (~60GB) file. I would like to ask if there is a way to populate the table, even if, say 1% of the lines are erroneous, without the script dying.
My code so far is:
#!/usr/bin/perl use strict; use warnings; use DBI; my $numArgs = $#ARGV + 1; die "Usage: ./ <CSV_FILE>" if ($numArgs != 1); my $infile = $ARGV[0]; my $logfile = $infile.'.log'; my $driver = "Pg"; my $database = "testdb"; my $dsn = "DBI:$driver:dbname = $database;port = 123456"; my $userid = "xxx"; my $password = "yyy"; my $dbh = DBI->connect($dsn, $userid, $password, { RaiseError => 1 }) or die $DBI::errstr; open LOG, ">$logfile"; print LOG "Opened database successfully\n"; my $count=0; open (DB_IN, "<$infile" ) || die ("File not found: '$infile'\n"); readline DB_IN; while(<DB_IN>) { my $input_line=$_; chomp $input_line; my @array_of_input = split(/\t/, $input_line); $count++; my $pid_field = $array_of_input[0]; my $quantityid_field = $array_of_input[1]; my $labid_field = $array_of_input[2]; my $system_field = $array_of_input[3]; my $component_field = $array_of_input[4]; my $numericvalue_field = $array_of_input[6]; my $ref_lower_field = $array_of_input[7]; my $ref_upper_field = $array_of_input[8]; my $shown_value_field = $array_of_input[9]; my $date_field = $array_of_input[10]; my $time_field = $array_of_input[11]; my $database_field = $array_of_input[12]; my $componentdb_field = $array_of_input[13]; my $componenttranslationlookup_field = $array_of_input[14]; my $testtype_field = $array_of_input[15]; my $unitclean_field = $array_of_input[16]; my $shownclean_field = $array_of_input[17]; my $reflowerclean_field = $array_of_input[18]; my $refupperclean_field = $array_of_input[19]; my $intervaltype_field = $array_of_input[20]; my $flag_field = $array_of_input[21]; my $abo_field = $array_of_input[22]; my $rhesus_field = $array_of_input[23]; my $dob_field = $array_of_input[24]; my $sex_field = $array_of_input[25]; my $insert_stmt = " INSERT INTO biochemical VALUES (".$count.", '".$pid_field."', '".$quantityid_field."', '".$labid_field."', '".$system_field."', '".$component_field."', '".$numericvalue_field."', '".$ref_lower_field."', '".$ref_upper_field."', '".$shown_value_field."', '".$date_field."', '".$time_field."', '".$database_field."', '".$componentdb_field."', '".$componenttranslationlookup_fiel +d."', '".$testtype_field."', '".$unitclean_field."', '".$shownclean_field."', '".$reflowerclean_field."', '".$refupperclean_field."', '".$intervaltype_field."', '".$flag_field."', '".$abo_field."', '".$rhesus_field."', '".$dob_field."', '".$sex_field."')"; my $rv = $dbh->do($insert_stmt); if(!$rv) { print "Problem inserting: $input_line\n"; } } close DB_IN; close LOG;
This code does work, but it dies as soon as an erroneous line is found. Ideally, I would like this line to just be thrown into my LOG and then the script continues to execute. Is there a way? In that way, I could leave the script run unattended and then check the lines that were not able to be inserted.

Replies are listed 'Best First'.
Re: Skip problematic lines while populating database
by Corion (Patriarch) on Aug 17, 2020 at 12:55 UTC

    The ideal first step would be to use DBI placeholders instead of creating a string-SQL statement. The second step would be to use Text::CSV_XS for parsing your delimited input file.

    my $sth_insert = $dbi->prepare(<<'SQL'); INSERT INTO biochemical VALUES (?,?,?,?,?,?,?,?,?,?,...,?) SQL for my $row (@rows) { $sth_insert->execute( @$row ); };

    By using DBI placeholders, you don't need to care about special characters like double quotes in strings.

    Then, when inserting, switch off the RaiseError flag:

    local $sth_insert->{RaiseError} = 0; for my $row (@rows) { $sth_insert->execute( @$row ); };

    That way, the script does not die anymore on insert errors.

      The last section of Text::CSV_XS' docs for dumping databases notes how to load special exports, e.g. where \N denotes NULL.

      You do not need to read the entire input datafile into memory, you can do it streaming:

      use Text::CSV_XS qw( csv ); # Make sure the field order matches the TSV order my $sth = $dbh->prepare (...); # See Corions post csv ( in => "file.tsv", out => undef, sep => "\t", on_in => sub { $sth->execute (@{$_[1]}) }, );

      Enjoy, Have FUN! H.Merijn
      $sth_insert->execute( @$row );

      The only thing to look out for here is that there is a gap in the sequence of elements to be inserted. One might write instead (untested):
          $sth_insert->execute(@{ $row }[ 0 .. 4, 6 .. 25 ]);

      Give a man a fish:  <%-{-{-{-<

    A reply falls below the community's threshold of quality. You may see it by logging in.
Re: Skip problematic lines while populating database
by LanX (Sage) on Aug 17, 2020 at 12:49 UTC

    You can catch fatal exceptions within block evals

    There is also the cleaner module Try::Tiny

    Another option is to overwrite the __DIE__ handler in %SIG


    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    PS: Code ... TL;DR

      And this has the useful advantage of letting you know which, and how many, INSERT statements actually had problems. An exception is thrown, and you "eat" it.
Re: Skip problematic lines while populating database
by erix (Prior) on Aug 17, 2020 at 14:18 UTC

    I'd use the first part of Corion's answer and combine it with LanX's answer (see above).

    In other words, use Text::CSV_XS and use placeholders, but keep the RaiseError flag on and use it to jump out of an eval block. There, write away the failing incoming input-line into a separate .reject file, so that you can later review the rejected lines. Also count the total number of input-lines, good lines and bad lines so that at the end you can output a status line, so you'll know what has happened.

    (COPY will be much faster but it is fickle and will fail outright when you have ANY bad input lines, even flagged CSV will fail. You could try it one time: if it does not fail, you have the speed; if it does fail, you redo by using the insert line-per-line, as explained above)

Re: Skip problematic lines while populating database
by dave_the_m (Monsignor) on Aug 17, 2020 at 17:33 UTC
    You haven't given any indication of what sort of errors you're getting. Are these genuinely bad lines in the input data, or is it merely that by not using Text:CSV and placeholders, your code isn't correctly handling and quoting data fields and thus converting good data into bad SQL?


      Hi all and thank you for the suggestions. I am not so experienced with Perl + DBs, so I need to look a bit more in what you suggest me to do.
      The errors I am referring to are SQL insert error, due to bad input, e.g. a string is being inserted into a field that only accepts dates. So then what I get is an error and the code stops to execute, whereas I would like to just throw away this line to a log file and continue with as many lines are good to be inserted.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11120825]
Front-paged by Corion
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (2)
As of 2022-05-21 08:50 GMT
Find Nodes?
    Voting Booth?
    Do you prefer to work remotely?

    Results (76 votes). Check out past polls.