bcnagle has asked for the wisdom of the Perl Monks concerning the following question:

Great Monks of Perl, I beseech your guidance. I tried to use XML::Twig, but when I executed the script my server turned out not to be equipped for it, and I can't debug the XS version of Scalar::Util, so I went to XML::Parser SAX instead. It seems to be doing OK, but I am getting an error on execution that is probably trivial. I am a total noob with Linux, and I apologize for that. The code follows, along with the error message. Thank you for your assistance.

# !/usr/local/bin/perl -w
BEGIN {
    my $base_module_dir = (-d '/home/bcnagle/perl' ? '/home/bcnagle/perl' : ( getpwuid($>) )[7] . '/perl/');
    unshift @INC, map { $base_module_dir . $_ } @INC;
}

use strict;
use DBI;
use XML::Parser::PerlSAX;

my $path;
my $Product_ID;
my $Updated;
my $Quality;
my $Supplier_id;
my $Prod_ID;
my $Catid;
my $On_Market;
my $Model_Name;
my $Product_View;
my $HighPic;
my $HighPicSize;
my $HighPicWidth;
my $HighPicHeight;
my $Date_Added;

my $dbh= connect_to_db();
my $insert= $dbh->prepare( "INSERT INTO files (path, Product_ID, Updated, Quality, Supplier_id, Prod_ID, Catid, On_Market, Model_Name, Product_View, HighPic, HighPicSize, HighPicWidth, HighPicHeight, Date_Added) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);");
$insert->bind_param( 1, $path);
$insert->bind_param( 2, $Product_ID);
$insert->bind_param( 3, $Updated);
$insert->bind_param( 4, $Quality);
$insert->bind_param( 5, $Supplier_id);
$insert->bind_param( 6, $Prod_ID);
$insert->bind_param( 7, $Catid);
$insert->bind_param( 8, $On_Market);
$insert->bind_param( 9, $Model_Name);
$insert->bind_param( 10, $Product_View);
$insert->bind_param( 11, $HighPic);
$insert->bind_param( 12, $HighPicSize);
$insert->bind_param( 13, $HighPicWidth);
$insert->bind_param( 14, $HighPicHeight);
$insert->bind_param( 15, $Date_Added);

my $twig = new XML::Parser();
$twig->parse( "/home/bcnagle/public_html/files.index.xml" );
$dbh->disconnect();
exit;

sub connect_to_db {
    my $driver = "mysql";
    my $dsn = "DBI:$driver:database=database;";
    my $dbh = DBI->connect($dsn, 'uname', 'pword', {AutoCommit=>1});
    my $drh = DBI->install_driver($driver);
    return( $dbh);
}

sub start_element {
    my($twig, $file) = @_;
    if ($file->{Name} eq 'file') {
        $path = $file->{Attributes}->{'path'};
        $Product_ID = $file->{Attributes}->{'Product_ID'};
        $Updated = $file->{Attributes}->{'Updated'};
        $Quality = $file->{Attributes}->{'Quality'};
        $Supplier_id = $file->{Attributes}->{'Supplier_id'};
        $Prod_ID = $file->{Attributes}->{'Prod_ID'};
        $Catid = $file->{Attributes}->{'Catid'};
        $On_Market = $file->{Attributes}->{'On_Market'};
        $Model_Name = $file->{Attributes}->{'Model_Name'};
        $Product_View = $file->{Attributes}->{'Product_View'};
        $HighPic = $file->{Attributes}->{'HighPic'};
        $HighPicSize = $file->{Attributes}->{'HighPicSize'};
        $HighPicWidth = $file->{Attributes}->{'HighPicWidth'};
        $HighPicHeight = $file->{Attributes}->{'HighPicHeight'};
        $Date_Added = $file->{Attributes}->{'Date_Added'};
        $insert->execute();
        $twig->purge;
        exit; #debug and testing purpose, write 1 line then exit.
    } else {
        #do nothing
    }
}

sub end_element { }
sub start_document { }
sub end_document { exit; }

I am connecting to the Database just fine but when it starts the handlers it gives this error:

not well-formed (invalid token) at line 1, column 0, byte 0 at /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/XML/Parser.pm line 187

I think it has to do with the first line of the xml but am not positive. The XML file, which I cannot edit because it comes straight from the supplier, is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ICECAT-interface SYSTEM "http://data.icecat.biz/dtd/files.index.dtd">
<!-- source: ICEcat.biz 2011 -->
<ICECAT-interface xsi:noNamespaceSchemaLocation="http://data.icecat.biz/xsd/files.index.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <files.index Generated="20110712102037">
    <file Date_Added="20050627000000" HighPicHeight="200" HighPicWidth="320" HighPicSize="13817" HighPic="http://images.icecat.biz/img/norm/high/19311-1470.jpg" Product_View="15591" Model_Name="016166400" On_Market="1" Catid="978" Prod_ID="016166400" Supplier_id="30" Quality="ICECAT" Updated="20110711025126" Product_ID="19311" path="export/freexml.int/EN/19311.xml">
      <EAN_UPCS>
        <EAN_UPC Value="0042215447881"/>
        <EAN_UPC Value="0422154478816"/>
      </EAN_UPCS>
      <Country_Markets>
        <Country_Market Value="FR"/>
        <Country_Market Value="UK"/>
        <Country_Market Value="DE"/>
        <Country_Market Value="DK"/>
        <Country_Market Value="PL"/>
        <Country_Market Value="CH"/>
        <Country_Market Value="CZ"/>
        <Country_Market Value="AT"/>
      </Country_Markets>
    </file>
    ...repeat with new items and so on to a total of approximately a 1gb file
  </files.index>
</ICECAT-interface>

I would be most grateful for any assistance. Thank you for your time, Brian

Replies are listed 'Best First'.
Re: XML::Parser SAX error
by graff (Chancellor) on Jul 16, 2011 at 05:10 UTC
    This error message: not well-formed (invalid token) at line 1, column 0, byte 0

    is telling you that your xml data has a problem, and can't be parsed as xml. Since you haven't posted exactly what you used as input, I have to ask: does the xml data file that you actually use have a hyphen in front of the first open tag? I ask because the OP data sample has:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE ICECAT-interface SYSTEM "http://data.icecat.biz/dtd/files.index.dtd">
    <!-- source: ICEcat.biz 2011 -->
    -<ICECAT-interface ...
    ^
    |___ the hyphen there is a problem
    When I copied your data sample (and added a couple missing end-tags) and put it into an old xml parsing tool, I always got an error. Removing the hyphen got rid of the error.

      I just created a simple is-well-formed script, and the file passed as well-formed. I also tried this snippet as an XML file:

      <?xml version="1.0" encoding="utf-8"?>
      <file Catid="0011" Date_added="20050910000000" HighPic="photo/location.jpg" HighPicHeight="123" HighPicSize="1122" HighPicWidth="123" Model="varchar model name" On_Market="0" Prod_ID="varchar id name" Product_I="111" Product_View="11223" Quality="qualitylevel" Supplier="x" Updated="20110713144032" path="required/path/to/other.doc"></file>

      It too gives me a not well-formed error.

        So, I think the problem is you need to check the man page for XML::Parser. You're passing a file name to the "parse()" method, but you should be passing the file name to the "parsefile()" method. ("parse()" is expecting its argument to be an xml string or a file handle that you've already opened, not a file name.)
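The difference between the two methods can be seen in a minimal sketch like the one below. It assumes XML::Parser is installed; the temporary file and its one-element document are made up for illustration and stand in for the supplier file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use XML::Parser;

# Hypothetical sample document, written to a temporary file for the demo.
my ($fh, $filename) = tempfile(SUFFIX => '.xml', UNLINK => 1);
print {$fh} qq{<?xml version="1.0" encoding="UTF-8"?>\n<files><file path="a.xml"/></files>\n};
close $fh or die "close failed: $!";

my $parser = XML::Parser->new;

# parse() treats a plain string as the XML text itself, so a file name
# is rejected as a malformed document and the call dies.
my $parse_ok    = eval { $parser->parse($filename); 1 } ? 1 : 0;
my $parse_error = $parse_ok ? '' : $@;

# parsefile() opens the named file and parses its contents.
my $parsefile_ok = eval { $parser->parsefile($filename); 1 } ? 1 : 0;

print "parse(\$filename):     ", ($parse_ok     ? "ok" : "failed"), "\n";
print "parsefile(\$filename): ", ($parsefile_ok ? "ok" : "failed"), "\n";
```

With an absolute Unix path like the one in the question, the first character handed to parse() is "/", which is consistent with the error being reported at line 1, column 0, byte 0.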

        (update: apart from the lesson you've learned now about posting valid sample data, let me also suggest that you trim your code down to the minimal snippet needed to demonstrate the problem. I had to filter out all that database crud to see the error for myself, and only then did I realize that you were using the wrong method call.)

        Oh, I just realized those dashes are created by the tree view (expand) and are not actually in the file. Sorry for the confusion.
Re: XML::Parser SAX error
by Anonymous Monk on Jul 16, 2011 at 05:55 UTC

    I think it has to do with the first line of the xml but am not positive

    Yes, not-well-formed means it is a problem with your xml (it's not xml) and not with XML::Parser.

    The xml snippet you posted seems to bear that out, but that is most likely a cut-and-paste error, so that is all I can add about that.

    If you don't register your handlers, XML::Parser::PerlSAX won't call them. See the examples:
    http://cpansearch.perl.org/src/KMACLEOD/libxml-perl-0.08/t/xp_sax.t
    http://cpansearch.perl.org/src/KMACLEOD/libxml-perl-0.08/examples/perlsax-test.pl
    http://cpansearch.perl.org/src/KMACLEOD/libxml-perl-0.08/examples/myhandler.pl
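The same point applies to plain XML::Parser, which is what the posted script actually instantiates with `new XML::Parser()`. As a minimal sketch, assuming XML::Parser is installed and using its own Handlers option rather than the PerlSAX interface shown in the links above: the two-record document, the second record's values, and the `on_start` name are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical two-record sample in the shape of the supplier file.
my $xml = <<'XML';
<files.index>
  <file Product_ID="19311" path="export/freexml.int/EN/19311.xml"/>
  <file Product_ID="19312" path="export/freexml.int/EN/19312.xml"/>
</files.index>
XML

my @rows;

# Start handlers receive the Expat object, the element name, and then
# the attributes as a flat list of name/value pairs.
sub on_start {
    my ($expat, $element, %attr) = @_;
    return unless $element eq 'file';
    push @rows, { Product_ID => $attr{Product_ID}, path => $attr{path} };
}

# The handler is only called because it is registered here.
my $parser = XML::Parser->new(Handlers => { Start => \&on_start });
$parser->parse($xml);    # parse() is correct here: $xml holds XML text

printf "%s => %s\n", $_->{Product_ID}, $_->{path} for @rows;
```

Defining a sub named start_element in the main script, as in the question, does nothing on its own; the parser has to be told about it.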

    I'm not sure why you switched to XML::Parser. Your XML::Twig approach was practically working, so here it is:

    #!/usr/bin/perl --
    #~ 2011-07-15-22:39:54 by Anonymous Monk
    #~ perltidy -csc -otr -opr -ce -nibc -i=4
    # use lib grep -d, '/home/bcnagle/perl', ( getpwuid $> )[7].'/perl/';
    use strict;
    use warnings;
    use XML::Twig;
    use DBI;

    Main( @ARGV );
    exit( 0 );

    sub Main {
        #~ DoTheDo( connect_to_db() , '/home/bcnagle/file.xml');
        DoTheDo( connect_to_db(), DemoData() );
    }

    BEGIN {
        our @Atts = (
            'path',          'Product_ID',   'Updated',
            'Quality',       'Supplier_id',  'Prod_ID',
            'Catid',         'On_Market',    'Model_Name',
            'Product_View',  'HighPic',      'HighPicSize',
            'HighPicWidth',  'HighPicHeight', 'Date_Added'
        );

        sub DoTheDo {
            my( $dbh, $xmlFile ) = @_;
            my $insert = $dbh->prepare(
                sprintf 'INSERT INTO files (%s) VALUES (%s)',
                join( ',', map { $dbh->quote_identifier($_) } @Atts ),
                join( ',', map { '?' } 1 .. @Atts ),
            );
            my $callback = sub {
                my( $twig, $file ) = @_;
                $insert->execute( map { '' . $file->att($_) } @Atts );
                $twig->purge;
            };
            my $twig = XML::Twig->new(
                #~ twig_handlers => {
                start_tag_handlers => {    # if you only want <file> without children
                    'filesindex/file' => $callback,
                }
            );
            #~ $twig->parsefile( $xmlFile );
            $twig->xparse( $xmlFile );
            $dbh->disconnect();
        } ## end sub DoTheDo

        sub connect_to_db {
            my $dbh = DBI->connect(
                'dbi:SQLite:dbname=short.test.sqlite',
                undef, undef,
                { RaiseError => 1, PrintError => 1, },
            );
            my $sql = sprintf 'CREATE TABLE files ( %s )',
              join( ",\n", map { $dbh->quote_identifier($_)." TEXT " } @Atts );
            print "\n$sql\n";
            eval { $dbh->do( $sql ) } or warn $@;
            return $dbh;
        } ## end sub connect_to_db
    } ## end BEGIN

    sub DemoData {
        ## xml_pp -s record
        ## http://perlmonks.com/?abspart=1;displaytype=displaycode;node_id=914742;part=2
        my $xml = <<'__XML__';
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE ICECAT-interface SYSTEM "http://data.icecat.biz/dtd/files.index.dtd">
    <!-- source: ICEcat.biz 2011 -->
    <ICECAT-interface xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://data.icecat.biz/xsd/files.index.xsd">
      <filesindex Generated="20110712102037">
        <file Catid="978" Date_Added="20050627000000" HighPic="http://images.icecat.biz/img/norm/high/19311-1470.jpg" HighPicHeight="200" HighPicSize="13817" HighPicWidth="320" Model_Name="016166400" On_Market="1" Prod_ID="016166400" Product_ID="19311" Product_View="15591" Quality="ICECAT" Supplier_id="30" Updated="20110711025126" path="export/freexml.int/EN/19311.xml">
          <EAN_UPCS>
            <EAN_UPC Value="0042215447881"/>
            <EAN_UPC Value="0422154478816"/>
          </EAN_UPCS>
          <Country_Markets>
            <Country_Market Value="FR"/>
            <Country_Market Value="UK"/>
            <Country_Market Value="DE"/>
            <Country_Market Value="DK"/>
            <Country_Market Value="PL"/>
            <Country_Market Value="CH"/>
            <Country_Market Value="CZ"/>
            <Country_Market Value="AT"/>
          </Country_Markets>
        </file>
      </filesindex>
    </ICECAT-interface>
    __XML__
        return $xml
    } ## end sub DemoData
    __END__
    I run the program and I get
    CREATE TABLE files ( "path" TEXT , "Product_ID" TEXT , "Updated" TEXT , "Quality" TEXT , "Supplier_id" TEXT , "Prod_ID" TEXT , "Catid" TEXT , "On_Market" TEXT , "Model_Name" TEXT , "Product_View" TEXT , "HighPic" TEXT , "HighPicSize" TEXT , "HighPicWidth" TEXT , "HighPicHeight" TEXT , "Date_Added" TEXT )
    great, no death, so I confirm the database got populated
    $ dbish dbi:SQLite:dbname=short.test.sqlite
    Useless localization of scalar assignment at c:/perl/site/5.12.2/lib/DBI/Format.pm line 377.
    DBI::Shell 11.95 using DBI 1.616

    WARNING: The DBI::Shell interface and functionality are
    =======  very likely to change in subsequent versions!

    Connecting to 'dbi:SQLite:dbname=short.test.sqlite' as ''...

    @dbi:SQLite:dbname=short.test.sqlite> select * from files;
    path,Product_ID,Updated,Quality,Supplier_id,Prod_ID,Catid,On_Market,Model_Name,Product_View,HighPic,HighPicSize,HighPicWidth,HighPicHeight,Date_Added
    'export/freexml.int/EN/19311.xml','19311','20110711025126','ICECAT','30','016166400','978','1','016166400','15591','http://images.icecat.biz/img/norm/high/19311-1470.jpg','13817','320','200','20050627000000'
    [1 rows of 15 fields returned]
    @dbi:SQLite:dbname=short.test.sqlite> ;quit
    Disconnecting from dbi:SQLite:dbname=short.test.sqlite.
    program finished
      Thank you very much. I will save this code as well, and hopefully get the XS version of Scalar::Util fixed soon too. I was just trying to bypass it, since my host provider (webhostingpad.com) really sucks ass and has crap for support. I will get back to figuring out why it isn't installing. Thanks again, Brian