Hi!

So, I admit I am a little new to Perl, but I think I can communicate what I need done just fine, so please be a little patient with me. And I am sorry it's a little long.

What I am trying to do is successfully parse an XML file. I figured I could do it with the XML::Simple module, and a friend and I have put together something that does just that, but it's a little messy. So here's where we'd like to go next.

1) So, the XML file is 2 gigs, and to grab information from it I have to go through it line by line. I know XML::Simple puts everything into a hash, but it's behaving very poorly (I'll show why below). What I want is to be able to jump to a specific line in the file. So, for example, I get input A, and I need to somehow know that more information about input A is located at some line in the file, which I will call B. So what I want to know is the byte location of line B. I know how to find the line that I want using XML::Parser and handlers, but I don't know how to get this byte location, or how to jump to it later.
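To make the idea concrete, here is a minimal sketch of the "remember a byte location, then jump to it" pattern using only Perl's built-in tell and seek. Everything in it is an assumption for illustration: the tiny sample file, the id attribute used as the lookup key, and the %offset_of hash are hypothetical stand-ins, not the real UniProt layout.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data standing in for the real 2 GB file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} qq{<uniprot>\n},
             qq{<entry id="A1">\n<name>first protein</name>\n</entry>\n},
             qq{<entry id="A2">\n<name>second protein</name>\n</entry>\n},
             qq{</uniprot>\n};
close $out or die "close failed: $!";

# Pass 1: remember the byte offset at which each <entry ...> line starts.
my %offset_of;
open my $in, '<:raw', $path or die "cannot open $path: $!";
while (1) {
    my $pos  = tell $in;          # offset of the line about to be read
    my $line = <$in>;
    last unless defined $line;
    $offset_of{$1} = $pos if $line =~ /<entry id="([^"]+)"/;
}

# Later: jump straight to entry A2 without rereading what comes before it.
seek $in, $offset_of{A2}, 0 or die "seek failed: $!";
my $first  = <$in>;    # the <entry id="A2"> line
my $second = <$in>;    # the line after it
close $in;

print "A2 starts at byte $offset_of{A2}\n";
print $second;
```

If the line is located from inside an XML::Parser handler instead, the first argument passed to the handler is an XML::Parser::Expat object, and its current_byte method should report the byte position of the current event. That number could be stored in a hash and handed to seek in the same way. For a 2 GB file the index could also be saved to disk so the first pass only has to run once.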

2) If that's not possible, then here's what I mean by the code being messy. This is an excerpt.

    # read XML file
    $data = $xml->XMLin($contents, keyattr => {property => 'type'});

    # finding protein names
    @names = ();
    $names_ref = $data->{entry}->{protein}->{name};
    if (ref($names_ref) eq 'ARRAY')   ## more than one name
    {
        @nameArray = @$names_ref;     ## so dereference to array and step through
        foreach $nameA_ref (@nameArray)
        {
            if (ref($nameA_ref) eq 'HASH')   ## it shouldn't be a hash, but sometimes it is
            {
                %nameTable = %$nameA_ref;
                push (@names, $nameTable{"content"});
            }
            else   ## it is a friendly scalar
            {
                push (@names, $nameA_ref);
            }
        }
    }
    else   ## only one name, so $names_ref is probably a scalar
    {
        if (ref($names_ref) eq 'HASH')   ## it shouldn't be a hash, but sometimes it is
        {
            %namesTable = %$names_ref;
            push (@names, $namesTable{"content"});
        }
        else   ## it is a friendly scalar
        {
            push (@names, $names_ref);
        }
    }

This is how the data is being processed in the file. I am not sure why a 'HASH' or a scalar suddenly comes up. I've been trying to figure out if ForceArray does anything, and kind of how to use it. So far it's only given errors, even though I think I've been using it right.
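For reference, a small sketch of how ForceArray might be combined with ForceContent so that every name comes back in the same shape. As documented, XML::Simple turns an element with only text into a plain scalar, but an element with attributes plus text into a hash with the text under 'content', which would explain the mix seen above. The sample XML, the evidence attribute, and the placeholder names are made up for illustration, and KeyAttr => [] is used here instead of the keyattr setting from the excerpt just to keep the example small.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical miniature of the real file:
# one <name> carries an attribute, the other does not.
my $xml = <<'XML';
<uniprot>
  <entry>
    <protein>
      <name evidence="EC1">Name one</name>
      <name>Name two</name>
    </protein>
  </entry>
</uniprot>
XML

my $data = XMLin(
    $xml,
    KeyAttr      => [],          # do not fold lists into hashes by attribute
    ForceArray   => ['name'],    # <name> is always an array ref, even when single
    ForceContent => 1,           # text always lands under a 'content' key
);

# Every name is now a hash ref inside an array ref, so one loop covers all cases.
my @names = map { $_->{content} } @{ $data->{entry}{protein}{name} };
print "$_\n" for @names;
```

With both options set, the ARRAY / HASH / scalar checks should collapse into the single map line. Note that ForceArray takes either a true value (force every element) or an array ref of element names, as used here.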

Anyway, the above method does seem to work, but it's just not very nice. I can't change the XML in any way, so maybe it's not supposed to be very nice to grab info out of it, and maybe our method is right. I appreciate any help. If you're curious, a sample of the XML format is here: http://beta.uniprot.org/uniprot/P15455.xml

Thanks!


In reply to Jumping to a location in a file by kungfoo,monkee
