Jumping to a location in a file

kungfoo,monkee has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

So, I admit I am a little new to Perl, but I think I can explain what I need clearly enough, so please be a little patient with me. And I am sorry this is a little long.

What I am trying to do is parse an XML file. I figured I could do it with the XML::Simple module, and together a friend and I have put together something that does just that, but it's a little messy. So here's where we'd like to go next.

1) So, the XML file is 2 gigs, and to grab information from it, it needs to be read through line by line. I know XML::Simple puts everything into a hash, but it's behaving very poorly (I'll show why below). What I want is to be able to jump to a specific line in the file. So, for example, I get input A, and I need somehow to know that more information about input A is located at some line in the file, which I will call B. What I want to know is the byte location of line B. I know how to find the line that I want using XML::Parser and handlers, but I don't know how to get this byte location or, later, how to jump to it.

2) If that's not possible, then here's what I mean by the code being messy. This is an excerpt.

 
# read XML file
    $data = $xml->XMLin($contents, keyattr => {property => 'type'});


    
    # finding protein names
    @names = ();
    
    $names_ref = $data->{entry}->{protein}->{name};
    
    if (ref($names_ref) eq 'ARRAY')        ## more than one name
    {
        @nameArray = @$names_ref;        ## so dereference to array and step through
        
        foreach $nameA_ref (@nameArray)
        {
            if (ref($nameA_ref) eq 'HASH')    ## it shouldn't be a hash, but sometimes it is
            {
                %nameTable = %$nameA_ref;
                push (@names, $nameTable{"content"});
            }
            else    ## it is a friendly scalar
            {
                push (@names, $nameA_ref);
            }
        }
    }
    else        ## only one name, so $names_ref is probably a scalar
    {
        if (ref($names_ref) eq 'HASH')    ## it shouldn't be a hash, but sometimes it is
        {
            %namesTable = %$names_ref;
            push (@names, $namesTable{"content"});
        }
        else    ## it is a friendly scalar
        {
            push (@names, $names_ref);
        }
    }

This is how the data is being processed in the file. I am not sure why a 'HASH' or a scalar suddenly comes up. I've been trying to figure out if ForceArray does anything, and how to use it. So far it's only given errors, even though I think I've been using it right.

Anyway, the above method does seem to work, but it's just not very nice. I can't change the XML in any way, so maybe it's not supposed to be very nice to grab info out, and maybe our method is right. I appreciate any help. If curious, a sample of the XML format is here: http://beta.uniprot.org/uniprot/P15455.xml
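
For what it's worth, the two branches in the excerpt do the same job, so one possible tidy-up is to normalize the value first and then handle each item once. XML::Simple folds an element that carries attributes into a hash and stores its text under the 'content' key, while an element without attributes stays a plain string, which would explain why both shapes turn up. The sketch below is only an illustration: name_list is a hypothetical helper and the sample values are made up. Passing ForceArray => ['name'] to XMLin should make the outer value always an array reference, but the per-item hash check would probably still be needed.

```perl
use strict;
use warnings;

# Hypothetical helper: turn whatever XML::Simple produced for <name>
# (a plain string, a hash with a 'content' key, or an array of either)
# into a flat list of name strings.
sub name_list {
    my ($names_ref) = @_;
    return () unless defined $names_ref;

    # One name arrives as a scalar or a hash; several arrive as an array ref.
    my @items = ref($names_ref) eq 'ARRAY' ? @$names_ref : ($names_ref);

    # An element with attributes is a hash whose text sits under 'content';
    # an element without attributes is already a string.
    return map { ref($_) eq 'HASH' ? $_->{content} : $_ } @items;
}

# Stand-in for $data->{entry}->{protein}->{name}; these values are invented.
my $sample = [ 'Name one', { content => 'Name two', evidence => 'EC1' } ];
my @names  = name_list($sample);
print "$_\n" for @names;
```

With a helper like this, the whole if/else block would reduce to a single call: @names = name_list($data->{entry}->{protein}->{name});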

Thanks!


Replies are listed 'Best First'.
Re: Jumping to a location in a file by Limbic~Region (Chancellor) on May 12, 2008 at 23:16 UTC
kungfoo,monkee, I honestly have not read beyond your 3rd paragraph and so the following may not be applicable. XML::Twig is designed for processing huge XML documents. Also, the built-in Perl function seek allows you to "jump" to any point in a file. Cheers - L~R Update: Welcome to Perl and PerlMonks. I hope you stick around, it's a great language and a great place.
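
To make the seek suggestion concrete, here is a minimal sketch using only built-ins: tell reports the current byte offset of a filehandle and seek moves to one. The helper names build_index and line_at are hypothetical, and the pattern assumes each <accession> tag sits on its own line, as it appears to in the UniProt sample. If the line is already being located with XML::Parser handlers, the Expat object passed to each handler also offers a current_byte method that could supply the offset instead.

```perl
use strict;
use warnings;

# Scan the file once and remember the byte offset at which each line
# containing <accession>...</accession> begins.
# Assumption: one such tag per line; adjust the pattern for other data.
sub build_index {
    my ($path) = @_;
    my %offset_of;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    while (1) {
        my $pos  = tell($fh);    # byte offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        if ($line =~ m{<accession>([^<]+)</accession>}) {
            # Keep the first offset seen for each accession.
            $offset_of{$1} = $pos unless exists $offset_of{$1};
        }
    }
    close $fh;
    return \%offset_of;
}

# Jump straight to a remembered offset and return the line found there.
sub line_at {
    my ($path, $pos) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    seek($fh, $pos, 0) or die "Cannot seek in $path: $!";  # 0 = from file start
    my $line = <$fh>;
    close $fh;
    return $line;
}
```

The index only has to be built once per file; after that, something like line_at($file, $index->{P15455}) goes directly to the stored position without rereading the 2 gigs before it.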
Re: Jumping to a location in a file by dragonchild (Archbishop) on May 13, 2008 at 00:48 UTC
Given that XML has the ability to source in other files, why on earth do you have a 2G XML file?! Personally, I'd pre-process the file and chunk it out so that you have the ability to work with it. Either that or pre-process it into something a little more amenable to easy use, like a nice binary file. A third option would be to have XML::Simple work with a DBM::Deep-backed hash. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
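
One rough way to do the chunking with nothing but core Perl is to set the input record separator to the closing tag, so each read returns a single record. This is a sketch under assumptions, not a real XML parser: each_entry is a hypothetical helper, the record element is assumed to be <entry> as in the UniProt sample, and the text "</entry>" is assumed never to occur inside CDATA or comments.

```perl
use strict;
use warnings;

# Read one <entry>...</entry> record at a time by using the closing tag as
# the input record separator, and pass each record's text to $callback.
# Returns the number of records handed to the callback.
sub each_entry {
    my ($path, $callback) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/ = '</entry>';
    my $count = 0;
    while (my $chunk = <$fh>) {
        # Drop anything before the opening tag (XML declaration, root tag)
        # and skip the tail after the last record (closing root tag).
        next unless $chunk =~ m{(<entry\b.*</entry>)}s;
        $callback->($1);
        $count++;
    }
    close $fh;
    return $count;
}
```

Only one record is held in memory at a time, so the callback could write records out to smaller files or parse each one on its own.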
Re^2: Jumping to a location in a file by Anonymous Monk on May 13, 2008 at 01:41 UTC
Thanks to you both! I got it to work nicely. Yes, it is mysterious working with a 2 gig XML file, but that is what I have to work with. We did think about preprocessing the data, but we decided to give this way a try and see how it works.
Re: Jumping to a location in a file by scorpio17 (Canon) on May 13, 2008 at 13:31 UTC
XML gets misused/abused a great deal. Be on guard against using it as a database. If you're going to be doing lots of searching and selectively extracting bits of data, I'd suggest writing a script to parse the original XML and load the data into a MySQL database, then using Perl DBI for all your data mining operations.