in reply to Re: How to Parse Huge XML Files ?
in thread How to Parse Huge XML Files ?

Thanks for all these answers. Here is my code:
=head1 MySAXHandler

MySAXHandler is a small package that describes what to do with the events
generated by the SAX parser. All events are handled by the methods described
here. MySAXHandler is instantiated in the parser declaration.

The variable %entryArgs keeps data until the end_element = 'entry' event, and
then the data is written out. Its keys are:

=over 3

=item 'Accession'
which is also a hash structure, as there can be more than one accession
number for each entry. The accession number is held in the key; the value
carries no information.

=item 'Organism'
which stores the organism name

=back

Other variables used in this package:

=over 3

=item $currentElement : stores the name of the current element

=item $currentAttributes : stores a hash reference to the current element's attributes

=item $entryCounter : counts <entry> tags seen

=item $entryDone : counts </entry> tags seen

=item $dbReferenceOn : set to 1 if we are in a <dbReference> ... </dbReference> block

=item $commentOn : as above for <comment> ... </comment>

=back

=head1 Methods

=cut

package MySAXHandler;

use strict;
use warnings;
use base qw(XML::SAX::Base);

$| = 1;

my (%entryArgs);
# $organismAttrs was used below but never declared ($currentAttrs was
# declared instead and never used), so the declaration is renamed here.
my (
    $currentElement, $currentAttributes, $entryCounter,
    $entryDone,      $accessionNumber,   $organismAttrs
);
my ( $dbReferenceOn, $commentOn, $organismOn ) = ( 0, 0, 0 );

=head2 start_document

For now this method only prints a start message.

=cut

sub start_document {
    my ( $self, $doc ) = @_;

    # What to do with the document start
    print "Here we go ..............\n\n";
}

=head2 start_element

The 'start_element' event calls this function. Depending on the tag name,
it does various things.
The tag name is stored in $currentElement, its attributes in $currentAttributes.

=cut

sub start_element {
    my ( $self, $el ) = @_;

    # Clear %entryArgs when start_element = <entry>
    if ( $el->{Name} eq 'entry' ) {
        %entryArgs = ();
        $entryCounter++;
    }

    # Set the current element name and attributes
    $currentElement = $el->{Name};
    #print $el->{Name}."\n";
    $currentAttributes = $el->{Attributes};
    # foreach my $key (keys %{$currentAttributes}) {
    #     print "===============$key => ".$currentAttributes->{$key}{Value}."================\n";
    # }

    # Set some flags to 1 for interesting blocks
    $dbReferenceOn = 1 if $el->{Name} eq 'dbReference';
    $commentOn     = 1 if $el->{Name} eq 'comment';
    $organismOn    = 1 if $el->{Name} eq 'organism';
    $organismAttrs = $currentAttributes if $el->{Name} eq 'name';
}

=head2 end_element

Stores data into the DB, and prints it out if tag name = entry.
Sets the flags back to 0.

=cut

sub end_element {
    my ( $self, $el ) = @_;

    # What to do now ... print out or store in DB
    if ( $el->{Name} eq 'entry' ) {
        $entryDone++;    # documented above as counting </entry> tags
        print "Accession number(s) :\n";
        foreach my $key ( sort keys %{ $entryArgs{'Accession'} } ) {
            print "\t=>$key\n";
        }
        print "\n\tOrganism Name = " . $entryArgs{'Organism'} . "\n";
    }

    # Store accession number
    if ( $el->{Name} eq 'accession' ) {
        $entryArgs{'Accession'}{$accessionNumber}++;
        undef($accessionNumber);
    }

    # Set the flags back
    $dbReferenceOn = 0 if $el->{Name} eq 'dbReference';
    $commentOn     = 0 if $el->{Name} eq 'comment';
    $organismOn    = 0 if $el->{Name} eq 'organism';
}

=head2 characters

Stores data into the %entryArgs hash.

=cut

sub characters {
    my ( $self, $char ) = @_;
    my $text = $char->{Data};

    # Store accession number
    if ( $currentElement eq 'accession' ) {
        $accessionNumber .= $text;
        $accessionNumber =~ s/\s+//g;
    }

    # Store organism name
    if ( $organismOn == 1 and $currentElement eq 'name' ) {
        if ( $organismAttrs->{'{}type'}->{Value} eq 'scientific' ) {
            $entryArgs{'Organism'} .= $text;
        }
    }
}

sub end_document {
    my ( $self, $doc ) = @_;
    print "\n\n ...................... et voilà ;-)\n";
}

1;    # a module loaded with "use" should end with a true value
And you were right! I forgot to undef the hash structure that holds the data! But it still doesn't work. When I launch it, it fills up my memory and swap (6 GB in total).
For example, it doesn't even print the first message, "Here we go ..............", which should be printed in response to the start_document event.
The main code is here:
#!/usr/bin/perl -w

# POD documentation - main docs before the code

=head1 NAME

parseUNIPROT4MySQL

=head1 DESCRIPTION

This script parses the XML file containing the UniProt knowledgebase, fetches
information and inserts it into a MySQL db.

=head1 CONTACT

TEXTORIS Julien, <julien.textoris@gmail.com>

=head1 USAGE

  ./parseUNIPROT4MySQL.pl <XML_filename> <function>

  function :
    - 'coordinate'
    -

=cut

use strict;
use warnings;
use DBI;
use XML::SAX;
use lib './';
use MySAXHandler;

$| = 1;

=head1 Variables Declaration

=over 3

=item $filename : XML file to parse

=item $function : implemented function to use

=item $dbd : MySQL connection

=item $dbname : DB name

=item $host : computer which hosts the database

=item $user / $pass : user and password to connect to the DB

=item $insert_ensembl = insert EnsEMBL_ID into the ensEMBL_ID table if it doesn't exist

=item $insert_uniprot = insert Uniprot_ID into the Uniprot_ID table if it doesn't exist

=item $insert_relEnsemblUniprot = insert values into the relational table (n:m relation)

=item $parser = new XML parser object

=item $doc = XML object loaded from $filename; the strip-whitespace option makes parsing 30% faster

=item $rootNode = XML root tag

=item $AccNumber = stores the Uniprot accession number of the entry being processed

=item $type = type argument of the <comment> tag

=item $value = value of the <text> tag of a given comment tag

=back

=cut

my $filename = $ARGV[0];
my $function = $ARGV[1];
my $dbname   = "AnnotationDB";
my $host     = "localhost";
my $user     = "jtextori";
my $pass     = "marcel";

#my $dbd = DBI->connect( "DBI:mysql:$dbname;host=$host",
#    $user, $pass, { RaiseError => 1 } )
#    or die
#    "can't connect : \nerror1 ==> $DBI::errstr\nerror1 ==> $@\nerror1 ==> $!\n\n";
#
#my $insert_ensembl = $dbd->prepare('INSERT INTO ensEMBL_ID (`ensEMBL_ID`) VALUES (?)');
#my $insert_uniprot = $dbd->prepare('INSERT INTO Uniprot_ID (`Uniprot_ID`) VALUES (?)');
#my $insert_relEnsemblUniprot = $dbd->prepare('INSERT INTO ensEMBL_ID_has_Uniprot_ID
#    (`ensEMBL_ID_ensEMBL_ID`,`Uniprot_ID_Uniprot_ID`) VALUES(?,?)');

my $parser = XML::SAX::ParserFactory->parser( Handler => MySAXHandler->new );
$parser->parse_uri($filename);

exit(0);
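
Since memory fills up before the start_document message ever appears, one thing that might be worth checking is which parser class XML::SAX::ParserFactory actually hands back on this machine. The following is only a diagnostic sketch, not a fix: the helper name `chosen_sax_parser` is made up for illustration, and it simply reports the class of the object the factory returns (or says so if XML::SAX is not installed).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the class name of the parser that
# XML::SAX::ParserFactory selects, or nothing if XML::SAX is not
# installed or no parser could be created.
sub chosen_sax_parser {
    return unless eval { require XML::SAX; require XML::SAX::ParserFactory; 1 };
    my $parser = eval { XML::SAX::ParserFactory->parser() };
    return defined $parser ? ref($parser) : ();
}

my $class = chosen_sax_parser();
print defined $class
    ? "ParserFactory selected: $class\n"
    : "XML::SAX (or a registered parser) is not available here\n";
```

If the class reported is not the one you expected, the XML::SAX::ParserFactory documentation describes how to request a specific installed parser through `$XML::SAX::ParserPackage`. Whether a different parser changes the memory behaviour here is an assumption that would need to be tested on the real file.
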
Thanks for the advice, I'll have a look at XML::Twig as well.
Yours sincerely,
Julien

Edited by planetscape - added readmore tags


Re^3: How to Parse Huge XML Files ?
by jsegal (Friar) on Jun 01, 2006 at 16:05 UTC
    Hmm. If your initial status message isn't getting printed out, I'd double-check that you are running what you think you are running. (I find the debugger invaluable in cases like this -- I happen to like running it from within (x)emacs). Sometimes a module doesn't do what you think it is going to do, and sometimes you aren't even running the code you think you are running!

    I know I've been burned by editing a file in one directory, but actually running a version in another directory -- when putting in debugging print statements, I've learned to vary what I output, so I instantly have a positive control that I am running the version of the file I should be -- if the output is "foo" but I just added "baz", I instantly know something is amiss, and don't try to debug the wrong thing...
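
The "positive control" idea above can be sketched in a few lines of Perl: print the absolute path of the running script and the file a module was really loaded from, using Perl's built-in `%INC` hash. The helper name `loaded_from` is made up for this example, and `File::Spec` stands in for `MySAXHandler` only so that the sketch runs anywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the file a module was actually loaded from,
# or undef if it has not been loaded. %INC maps 'Foo/Bar.pm' to the path
# that require() found.
sub loaded_from {
    my ($module) = @_;
    ( my $key = "$module.pm" ) =~ s{::}{/}g;
    return $INC{$key};
}

# Positive control: report which script is running and where a module
# came from.
my $path = loaded_from('File::Spec');
printf "script: %s\n", File::Spec->rel2abs($0);
printf "module: %s\n", defined $path ? $path : 'not loaded';
```

In the original script the call would be `loaded_from('MySAXHandler')`, placed after the `use MySAXHandler;` line; if the path printed is not the file being edited, the wrong copy is being run.
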

    All that being said, this may not be your problem, but it might give you some clues as to what is going on....

    Good luck,


    --JAS