bcnagle has asked for the wisdom of the Perl Monks concerning the following question:

Hello oh great monks. I am in serious need of assistance. I have been researching for 2 weeks now and have narrowed my search to XML::Twig but still not understanding this for some reason. I have been coding in C# and PHP for the rest of my project however for this one instance, I must use a perl script on a linux box to parse a very large (almost 1gig) xml document to get all attributes to load into a database once per month. The xml file contains this bit of information:

<interface> <index Generated="20110714102016"> <file path="required/path/to/other.doc" Product_ID="111" Updated= +"20110713144032" Quality="qualitylevel" Supplier_id="x" Prod_ID="varc +har id name" Catid="0011" On_Market="0" Model_Name="varchar model nam +e" Product_View="11223" HighPic="photo/location.jpg" HighPicSize="112 +2" HighPicWidth="123" HighPicHeight="123" Date_Added="20050910000000" +/> <EAN_UPCS> <EAN_UPC Value="0088698005668"/> <EAN_UPC Value="0886980056684"/> </EAN_UPCS> <Country_Markets> <Country_Market Value="NL"/> <Country_Market Value="BE"/> </Country_Markets>

##repeat file with different attribute strings a million times.

Here is the thing. I only need the attributes parsed, single <file> at a time then flushed, and only in the <file> section skipping the rest.

Can somebody help me with this or at minimum have any hashed guideline for this particular request, or link to code where somebody has done something similar and explain it to me?

Thank you very much for all of your help.

Update

Thank you for the tutorial, I think I'm almost there now and understand the majority of it, one thing is I'm getting an error code and believe something isn't installed. A little more research and I'll post another update. Thanks again.

Update II

code is here and error follows. Trying to fix the error right now but so far no luck.
# !/usr/local/bin/perl -w BEGIN { my $base_module_dir = (-d '/home/bcnagle/perl' ? '/home/bcnagle/pe +rl' : ( getpwuid($>) )[7] . '/perl/'); unshift @INC, map { $base_module_dir . $_ } @INC; } use strict; use XML::Twig; use DBI; my $path; my $Product_ID; my $Updated; my $Quality; my $Supplier_id; my $Prod_ID; my $Catid; my $On_Market; my $Model_Name; my $Product_View; my $HighPic; my $HighPicSize; my $HighPicWidth; my $HighPicHeight; my $Date_Added; my $dbh= connect_to_db(); my $insert= $dbh->prepare( "INSERT INTO files (path, Product_ID, Upda +ted, Quality, Supplier_id, Prod_ID, Catid, On_Market, Model_Name, Pro +duct_View, HighPic, HighPicSize, HighPicWidth, HighPicHeight, Date_Ad +ded) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);"); my $twig = new XML::Twig( twig_handlers => {indexfile => \$file} ); $twig->parsefile( "/file.xml" ); $twig->flush; # flush the end of t +he twig $dbh->disconnect(); exit; sub connect_to_db { my $driver = "mysql"; my $dsn = "DBI:$driver:database=database;"; my $dbh = DBI->connect($dsn, 'uname', 'pword', {AutoCommit=>1}); my $drh = DBI->install_driver($driver); return( $dbh); } sub indexfile { my($twig, $file) = @_; $path = $file->att( 'path'); $Product_ID = $file->att( 'Product_ID'); $Updated = $file->att( 'Updated'); $Quality = $file->att( 'Quality'); $Supplier_id = $file->att( 'Supplier_id'); $Prod_ID = $file->att( 'Prod_ID'); $Catid = $file->att( 'Catid'); $On_Market = $file->att( 'On_Market'); $Model_Name = $file->att( 'Model_Name'); $Product_View = $file->att( 'Product_View'); $HighPic = $file->att( 'HighPic'); $HighPicSize = $file->att( 'HighPicSize'); $HighPicWidth = $file->att( 'HighPicWidth'); $HighPicHeight = $file->att( 'HighPicHeight'); $Date_Added = $file->att( 'Date_Added'); $insert->bind_param( 1, $path); $insert->bind_param( 2, $Product_ID); $insert->bind_param( 3, $Updated); $insert->bind_param( 4, $Quality); $insert->bind_param( 5, $Supplier_id); $insert->bind_param( 6, $Prod_ID); $insert->bind_param( 7, $Catid); $insert->bind_param( 8, $On_Market); $insert->bind_param( 9, $Model_Name); $insert->bind_param( 10, $Product_View); $insert->bind_param( 11, $HighPic); $insert->bind_param( 12, $HighPicSize); $insert->bind_param( 13, $HighPicWidth); $insert->bind_param( 14, $HighPicHeight); $insert->bind_param( 15, $Date_Added); $insert->execute(); $twig->purge; }

errors out with:

weaken is only available with the XS version of Scalar::Util at /home/bcnagle/perl/usr/lib/perl5/site_perl/5.8.8/XML/Twig.pm line 117 BEGIN failed--compilation aborted at /home/bcnagle/perl/usr/lib/perl5/site_perl/5.8.8/XML/Twig.pm line 172. Compilation failed in require at /home/bcnagle/public_html/xmlparse.pl line 9. BEGIN failed--compilation aborted at /home/bcnagle/public_html/xmlparse.pl line 9.

Update III

Starting new thread since the XML::Twig isn't working on my server, I decided to go to normal XML::Parser with SAX. Thank you all for your help, you have been very informative.

Replies are listed 'Best First'.
Re: XML::Twig to mysql totally lost
by Tanktalus (Canon) on Jul 15, 2011 at 03:17 UTC

    What have you tried? Without looking at the XML::Twig documentation, I would set up a handler for the close/end of the outer most element that I cared about, at which point all the elements/attributes inside would be available. I'd use the info there, insert it to the db, flush it, and return from the callback. This should keep memory usage at a minimum, speed at a maximum, and keep things simple...ish.

      I added the current code I am trying to use below however not positive I did it correctly. From what I was reading in the tutorial, I am grabbing the events where the <file> gets parsed, uploading it to the sql, then trashing the parse and moving to the next one. I am getting an error that I'm trying to get resolved atm but any feedback would be much appreciated.
Re: XML::Twig to mysql totally lost
by Khen1950fx (Canon) on Jul 15, 2011 at 04:45 UTC
    The xml was a little messy. I cleaned it up. This is what I was able to verify as well-formed:
    <?xml version="1.0" encoding="utf-8"?> <file Catid="0011" Date_added="20050910000000" HighPic="photo/location +.jpg" HighPicHeight="123" HighPicSize="1122" HighPicWidth="123" Model +="varchar model name" On_Market="0" Prod_ID="varchar id name" Product +_I="111" Product_View="11223" Quality="qualitylevel" Supplier="x" Upd +ated="20110713144032" path="required/path/to/other.doc"></file>
    The code that I used to read it:
    #!/usr/bin/perl use strict; use Data::Dumper; use XML::Simple qw(:strict); my $xs = XML::Simple->new(); my $xml = $xs->XMLin('/root/Desktop/914477.xml', ForceArray => 1, KeyAttr => 1); binmode STDOUT, ':encoding(utf8)'; print Dumper($xml);
    Is that what you were thinking of?

      Thanks for the code but this is what I have so far

      # !/usr/local/bin/perl -w BEGIN { my $base_module_dir = (-d '/home/bcnagle/perl' ? '/home/bcnagle/pe +rl' : ( getpwuid($>) )[7] . '/perl/'); unshift @INC, map { $base_module_dir . $_ } @INC; } use strict; use XML::Twig; use DBI; my $path; my $Product_ID; my $Updated; my $Quality; my $Supplier_id; my $Prod_ID; my $Catid; my $On_Market; my $Model_Name; my $Product_View; my $HighPic; my $HighPicSize; my $HighPicWidth; my $HighPicHeight; my $Date_Added; my $dbh= connect_to_db(); my $insert= $dbh->prepare( "INSERT INTO files (path, Product_ID, Upda +ted, Quality, Supplier_id, Prod_ID, Catid, On_Market, Model_Name, Pro +duct_View, HighPic, HighPicSize, HighPicWidth, HighPicHeight, Date_Ad +ded) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);"); my $twig = new XML::Twig( twig_handlers => {indexfile => \$file} ); $twig->parsefile( "/file.xml" ); $twig->flush; # flush the end of t +he twig $dbh->disconnect(); exit; sub connect_to_db { my $driver = "mysql"; my $dsn = "DBI:$driver:database=database;"; my $dbh = DBI->connect($dsn, 'uname', 'pword', {AutoCommit=>1}); my $drh = DBI->install_driver($driver); return( $dbh); } sub indexfile { my($twig, $file) = @_; $path = $file->att( 'path'); $Product_ID = $file->att( 'Product_ID'); $Updated = $file->att( 'Updated'); $Quality = $file->att( 'Quality'); $Supplier_id = $file->att( 'Supplier_id'); $Prod_ID = $file->att( 'Prod_ID'); $Catid = $file->att( 'Catid'); $On_Market = $file->att( 'On_Market'); $Model_Name = $file->att( 'Model_Name'); $Product_View = $file->att( 'Product_View'); $HighPic = $file->att( 'HighPic'); $HighPicSize = $file->att( 'HighPicSize'); $HighPicWidth = $file->att( 'HighPicWidth'); $HighPicHeight = $file->att( 'HighPicHeight'); $Date_Added = $file->att( 'Date_Added'); $insert->bind_param( 1, $path); $insert->bind_param( 2, $Product_ID); $insert->bind_param( 3, $Updated); $insert->bind_param( 4, $Quality); $insert->bind_param( 5, $Supplier_id); $insert->bind_param( 6, $Prod_ID); $insert->bind_param( 7, $Catid); $insert->bind_param( 8, $On_Market); $insert->bind_param( 9, $Model_Name); $insert->bind_param( 10, $Product_View); $insert->bind_param( 11, $HighPic); $insert->bind_param( 12, $HighPicSize); $insert->bind_param( 13, $HighPicWidth); $insert->bind_param( 14, $HighPicHeight); $insert->bind_param( 15, $Date_Added); $insert->execute(); $twig->purge; }

      I am getting a shell error when running the script stating that

      weaken is only available with the XS version of Scalar::Util at /home/bcnagle/perl/usr/lib/perl5/site_perl/5.8.8/XML/Twig.pm line 117 BEGIN failed--compilation aborted at /home/bcnagle/perl/usr/lib/perl5/site_perl/5.8.8/XML/Twig.pm line 172. Compilation failed in require at /home/bcnagle/public_html/xmlparse.pl line 9. BEGIN failed--compilation aborted at /home/bcnagle/public_html/xmlparse.pl line 9.

        You are probably using a RedHat-based distribution, which includes Scalar::Util without its XS part, and so doesn't provide weaken (see Task::Weaken). The solution would be to re-install Scalar::Util, from CPAN.

        Also, I am not sure why you do a $twig->flush; at the end of your code, since you are not outputting the XML at all.

      I also can't clean up the xml for it's a fixed xml from a manufacturer database for distribution. The file is extremely big so must parse line by line or else the server may cancel the script. looking into switching hosting company for many issues, this being one of them.
Re: XML::Twig to mysql totally lost
by Anonymous Monk on Jul 15, 2011 at 01:49 UTC
Re: XML::Twig to mysql totally lost
by Anonymous Monk on Jul 15, 2011 at 08:32 UTC
    Move all the  $insert->bind_param( out of the callback (sub indexfile), you only need to do it once
      so $insert->bind_param( 1, $path, 2, etc, 3, etc, 4 etc, 5, etc, etc, etc); ??
        Um, I said move, not rewrite :) Move as in
        my $insert = ... ... sub indexfile { ... $insert->bind_param( ... ... $insert->execute();
        becomes
        my $insert = ... $insert->bind_param( ... ... ... sub indexfile { ... $insert->execute();
        you prepare once, you bind once, you execute many many many many times, like each time you update your bound params :)