kungfoo,monkee has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

So, I admit I am a little new to Perl, but I think I can communicate what I need done just fine, so please be a little patient with me. And I am sorry it's a little long.

What I am trying to do is successfully parse an XML file. I figured that I can do it with the XML::Simple module, and together a friend and I have successfully put together something that does just that, but it's a little messy. So here's where we'd like to go next.

1) So, the XML file is 2 gigs. And to grab information from it, it needs to be gone through line by line. I know XML::Simple puts everything into a hash, but it's behaving very poorly. (I'll show why below.) What I want to do is to be able to jump to a specific line in a file. So, for example, I get input A, and I need somehow to know that more information about input A is located at some line in the file, which I will call B. So, what I want to know is the byte location of line B. I know how to find the line that I want using XML::Parser and handlers, but I don't know how to get this byte location and later how to jump to it.

2) If that's not possible, then here's what I mean by the code being messy. This is an excerpt.

# read XML file
$data = $xml->XMLin($contents, keyattr => {property => 'type'});

# finding protein names
@names = ();
$names_ref = $data->{entry}->{protein}->{name};
if (ref($names_ref) eq 'ARRAY')    ## more than one name
{
    @nameArray = @$names_ref;      ## so dereference to array and step through
    foreach $nameA_ref (@nameArray)
    {
        if (ref($nameA_ref) eq 'HASH')    ## it shouldn't be a hash, but sometimes it is
        {
            %nameTable = %$nameA_ref;
            push (@names, $nameTable{"content"});
        }
        else                              ## it is a friendly scalar
        {
            push (@names, $nameA_ref);
        }
    }
}
else    ## only one name, so $names_ref is probably a scalar
{
    if (ref($names_ref) eq 'HASH')    ## it shouldn't be a hash, but sometimes it is
    {
        %namesTable = %$names_ref;
        push (@names, $namesTable{"content"});
    }
    else                              ## it is a friendly scalar
    {
        push (@names, $names_ref);
    }
}

This is how the data is being processed in the file. I am not sure why a 'HASH' or scalar suddenly comes up. I've been trying to figure out if ForceArray does anything, and kinda how to use it. So far it's only given errors, even though I think I've been using it right.

Anyway, the above method does seem to work, but it's just not very nice. I can't change the XML in any way, so maybe it's not supposed to be very nice to grab info out, and maybe our method is right. I appreciate any help. If curious, a sample of the XML format is here: http://beta.uniprot.org/uniprot/P15455.xml
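
On the ForceArray question, a minimal sketch of how the XML::Simple options are usually combined so that the name list always has the same shape. ForceArray, ForceContent and KeyAttr are documented XML::Simple options; the sample documents and the helper name protein_names are made up for illustration, the element layout is assumed from the excerpt above, and it is only a guess that this addresses the errors mentioned.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical helper: return every protein name from one small XML string.
sub protein_names {
    my ($xml) = @_;
    my $data = XMLin(
        $xml,
        KeyAttr      => [],          # do not fold arrays into hashes
        ForceArray   => ['name'],    # <name> is always an array ref, even when single
        ForceContent => 1,           # element text always sits under the 'content' key
    );
    return map { $_->{content} } @{ $data->{entry}{protein}{name} };
}

# Made-up samples: one name, then two names where one carries an attribute.
my $one = '<uniprot><entry><protein><name>Placeholder A</name></protein></entry></uniprot>';
my $two = '<uniprot><entry><protein>'
        . '<name evidence="x">Placeholder B</name><name>Placeholder C</name>'
        . '</protein></entry></uniprot>';

my @single = protein_names($one);
my @pair   = protein_names($two);
print join(', ', @single), "\n";
print join(', ', @pair), "\n";
```

With both options set, the ARRAY/HASH/scalar checks collapse into a single map over hash refs. Note that ForceContent applies to every element in the document, so other fields would also have to be read through their 'content' key.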

Thanks!

Re: Jumping to a location in a file
by Limbic~Region (Chancellor) on May 12, 2008 at 23:16 UTC
    kungfoo,monkee,
    I honestly have not read beyond your 3rd paragraph and so the following may not be applicable.

    XML::Twig is designed for processing huge XML documents. Also, the built-in perl function seek allows you to "jump" to any point in a file.
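
    As a rough sketch of the seek approach, using only core Perl: tell reports the byte offset of the line about to be read, and seek jumps back to it later. The helper names build_index and line_at, the accession pattern, and the tiny sample file are all assumptions made for illustration. (Inside an XML::Parser handler, the Expat object's current_byte method can supply the offset instead.)

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scan the file once and remember the byte offset at which each matching
# line starts. $pattern must capture the lookup key in $1.
sub build_index {
    my ($path, $pattern) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my %offset_of;
    while (1) {
        my $pos  = tell $fh;            # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        $offset_of{$1} = $pos if $line =~ $pattern;
    }
    close $fh;
    return \%offset_of;
}

# Jump straight to a stored offset and return the line found there.
sub line_at {
    my ($path, $pos) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    seek $fh, $pos, 0 or die "Cannot seek in $path: $!";   # 0 means SEEK_SET
    my $line = <$fh>;
    close $fh;
    return $line;
}

# Usage with a small throwaway file standing in for the 2 GB one.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "<entry>\n<accession>A1</accession>\n</entry>\n",
             "<entry>\n<accession>B2</accession>\n</entry>\n";
close $out;

my $index = build_index($path, qr{<accession>([^<]+)</accession>});
print line_at($path, $index->{B2});
```

    The index only has to be built once; it could be saved to disk so that later runs seek directly without rescanning the file.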

    Cheers - L~R

    Update: Welcome to perl and PerlMonks. I hope you stick around, it's a great language and a great place.

Re: Jumping to a location in a file
by dragonchild (Archbishop) on May 13, 2008 at 00:48 UTC
    Given that XML has the ability to source in other files, why on earth do you have a 2G XML file?! Personally, I'd pre-process the file and chunk it out so that you have the ability to work with it. Either that or pre-process it into something a little more amenable to easy use, like a nice binary file.

    A third option would be to have XML::Simple work with a DBM::Deep-backed hash.


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      Thanks to you both! I got it to work nicely.

      Yes, it is mysterious working with a 2 gig XML file. But that is what I have to work with. We did think about preprocessing the data, but we decided to give this way a try and see how it works.

Re: Jumping to a location in a file
by scorpio17 (Canon) on May 13, 2008 at 13:31 UTC
    XML gets misused/abused a great deal. Be on guard against using it as a database. If you're going to be doing lots of searching, and selectively extracting bits of data, I'd suggest writing a script to parse the original XML and load the data into a MySQL database, then use Perl DBI for all your data mining operations.
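
    A minimal sketch of that load-then-query idea, under several assumptions: an in-memory SQLite database (via DBD::SQLite) stands in for MySQL so the example runs without a server, XML::Twig does the streaming parse, and the table name, element layout and sample records are made up for illustration.

```perl
use strict;
use warnings;
use DBI;
use XML::Twig;

# SQLite in memory is a stand-in here; for MySQL only the DSN and
# credentials passed to connect would change.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE protein_name (accession TEXT NOT NULL, name TEXT NOT NULL)');
my $insert = $dbh->prepare(
    'INSERT INTO protein_name (accession, name) VALUES (?, ?)');

# Called once per complete <entry>; element names are assumed from the
# code in the question, not taken from the real schema.
sub load_entry {
    my ($twig, $entry) = @_;
    my $accession = $entry->first_child_text('accession');
    if (my $protein = $entry->first_child('protein')) {
        $insert->execute($accession, $_->text) for $protein->children('name');
    }
    $twig->purge;    # release memory held by entries already handled
}

my $twig = XML::Twig->new(twig_handlers => { entry => \&load_entry });

# Tiny made-up sample; for the real file, call $twig->parsefile($path).
$twig->parse(<<'XML');
<uniprot>
  <entry>
    <accession>X00001</accession>
    <protein>
      <name>Placeholder name one</name>
      <name evidence="x">Placeholder name two</name>
    </protein>
  </entry>
  <entry>
    <accession>X00002</accession>
    <protein><name>Placeholder name three</name></protein>
  </entry>
</uniprot>
XML

my $names = $dbh->selectcol_arrayref(
    'SELECT name FROM protein_name WHERE accession = ? ORDER BY rowid',
    undef, 'X00001');
print "$_\n" for @$names;
```

    Because the handler sees each name as an element object, the scalar-versus-hash inconsistency from the question does not arise here, and purge keeps memory use flat regardless of file size.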