You may want to try grabbing the full HTML for the page and running a parser module on it (HTML::Parser or HTML::TokeParser), in case the markup in the web page provides some structural information that you can use (like record boundaries and field labels).
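To give a rough idea of what that could look like, here is a minimal sketch using HTML::TokeParser. It assumes the page puts each record in a table row and each field in a cell, which is only a guess about the real markup; the sample HTML and the rows_from_html name are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup: one table row per record, one cell per field.
my $html = <<'END_HTML';
<table>
<tr><td>Sunny Acres</td><td>12 Elm St</td><td>555-0100</td></tr>
<tr><td>Oak Manor</td><td>9 Main St</td><td>555-0199</td></tr>
</table>
END_HTML

# Return one array ref of field strings per <tr> found in the HTML.
sub rows_from_html {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot set up parser: $!";
    my @rows;
    while ( $parser->get_tag('tr') ) {           # start of a record
        my @fields;
        while ( my $tag = $parser->get_tag( 'td', '/tr' ) ) {
            last if $tag->[0] eq '/tr';          # end of this record
            push @fields, $parser->get_trimmed_text('/td');
        }
        push @rows, \@fields;
    }
    return @rows;
}

print join( ' | ', @$_ ), "\n" for rows_from_html($html);
```

If the real page uses div or dl elements instead of a table, the same loop should work with different tag names.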

On the other hand, if the blank lines that you are throwing away happen to represent boundaries between records, you should be using them as record separators rather than throwing them away. Look up the section in the perlvar documentation about $INPUT_RECORD_SEPARATOR ($/) -- if blank lines are used only at record boundaries, then setting $/ = ""; (empty string) puts perl in "paragraph mode", so that it reads a complete, multi-line record on each iteration of while(<>){...}.
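A small stand-alone sketch of paragraph mode, in case it helps to see it in isolation. The sample data and the read_records name are made up, and it reads from an in-memory string instead of a file so that it can be run as-is:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data: two records separated by blank lines.
my $data = "Sunny Acres\n12 Elm St\n555-0100\n\n\nOak Manor\n9 Main St\n555-0199\n";

# Return one array ref of lines per blank-line-delimited record.
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "";                 # paragraph mode, restored when the sub returns
    my @records;
    while ( my $record = <$fh> ) {
        chomp $record;             # in paragraph mode this strips all trailing newlines
        push @records, [ split /\n/, $record ];
    }
    close $fh;
    return @records;
}

for my $record ( read_records($data) ) {
    print scalar(@$record), " lines: ", join( ' / ', @$record ), "\n";
}
```

Note that a run of several blank lines still counts as a single separator, so you don't get empty records.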

Apart from that, you should be using placeholders in your insert statement -- prepare it once (before the loop) and execute it repeatedly (in the loop); this makes the "quote()"-ing of values unnecessary.

In case it's true that blank lines in the data represent record boundaries, here's an example of how it could work:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( " ...whatever... " );
my @insert_fields = qw{ name address1 address2 phone overall inspections
                        staffing quality programs beds ownership };
my $insert_sql = 'insert into nursing homes ('.
                 join( ', ', @insert_fields ).
                 ') values ('.
                 join( ', ', ('?') x @insert_fields ). ')';
my $insert_sth = $dbh->prepare( $insert_sql );

$/ = "";  # set input_record_separator to empty string (paragraph mode)

# just put the input file name on the command line when running the script
# (or pipe the data to the script's STDIN)

while (<>)  # each iteration reads up to a blank line
{
    my @lines = grep !/ Councils?$|^Mapping|^Continuing/, split( /[\r\n]+/ );
    if ( @lines != @insert_fields ) {  # skip records that won't work
        print "Record # $. has wrong number of fields:\n$_\n";
        next;  # if you redirect STDOUT to a file, you can deal with these later
    }
    $insert_sth->execute( @lines );
}
$insert_sth->finish;
$dbh->disconnect;
(not tested, but it compiles, and the sql statement comes out right)

If the copy/pasted text contains "extra" blank lines within records, the simple paragraph-mode approach above won't work. Try to find some other reliable indicator of record boundaries and use that instead, then remove the blank lines by just altering that grep statement a bit:

@lines = grep !/^\s*$| Councils?$|^Mapping|^Continuing/, split( /[\r\n]+/ );
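As a rough illustration of using some other boundary indicator, here is a sketch that collects lines until it sees an end-of-record marker. The marker is purely an assumption (that every record ends with a line starting "Ownership:"); the sample data and the split_records name are made up, so substitute whatever pattern your data really has.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: blank lines occur inside records, so they cannot serve
# as separators. Assumption: each record ends with an "Ownership:" line.
my $data = <<'END_DATA';
Sunny Acres

12 Elm St
Ownership: For profit
Oak Manor
9 Main St

Ownership: Non profit
END_DATA

# Return one array ref of non-blank lines per record.
sub split_records {
    my ($text) = @_;
    my ( @records, @current );
    for my $line ( split /[\r\n]+/, $text ) {
        next if $line =~ /^\s*$/;            # drop whitespace-only lines
        push @current, $line;
        if ( $line =~ /^Ownership:/ ) {      # hypothetical end-of-record marker
            push @records, [@current];
            @current = ();
        }
    }
    warn "Incomplete trailing record ignored\n" if @current;
    return @records;
}

for my $record ( split_records($data) ) {
    print join( ' / ', @$record ), "\n";
}
```

Each array ref returned could then be checked against the number of insert fields and passed to execute(), the same as in paragraph mode.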

In reply to Re: Process Text File and Write to Database by graff
in thread Process Text File and Write to Database by spickles
