For very large files, XML::Simple is probably not a good route, because it has to load the entire XML document into memory as a data structure.

If you see performance issues, take a look at XML::LibXML, which is much more powerful. It offers an interface to DOM and SAX parsers. In particular, SAX-based parsing may be the best choice if memory becomes an issue, as it is event based rather than data-structure based.

SAX will give you more control over memory use, while DOM will offer more speed at the price of a larger memory footprint.

If you want to stick with an XML::Simple-style interface but just gain some speed, take a look at XML::Bare, which is written in XS and is among the fastest in terms of runtime. It has fewer niceties than XML::Simple, but it offers an option to create the same style of data structures.

Re^4: How do I create an array of hashes from an input text file?
by MrSnrub (Beadle) on Nov 12, 2011 at 00:44 UTC

    Yeah, I guess if XML::Simple must load the entire data structure into memory, I don't think it's quite what I'm looking for. It just takes too long.

    Maybe an example might help. Suppose I have the following XML input file:

    <?xml version="1.0"?>
    <library>
      <book>
        <title>Dreamcatcher</title>
        <author>Stephen King</author>
        <genre>Horror</genre>
        <pages>899</pages>
        <price>23.99</price>
        <rating>5</rating>
        <publication_date>11/27/2001</publication_date>
      </book>
      <book>
        <title>Mystic River</title>
        <author>Dennis Lehane</author>
        <genre>Thriller</genre>
        <pages>390</pages>
        <price>17.49</price>
        <rating>4</rating>
        <publication_date>07/22/2003</publication_date>
      </book>
      <book>
        <title>The Lord Of The Rings</title>
        <author>J. R. R. Tolkien</author>
        <genre>Fantasy</genre>
        <pages>3489</pages>
        <price>10.99</price>
        <rating>5</rating>
        <publication_date>10/12/2005</publication_date>
      </book>
    </library>

    Suppose I only want to import books that were published after January 1, 2002. If I apply such a filter when I do my initial import, the result should look like this:

    $VAR1 = {
      'book' => [
        {
          'publication_date' => '07/22/2003',
          'price' => '17.49',
          'author' => 'Dennis Lehane',
          'title' => 'Mystic River',
          'rating' => '4',
          'pages' => '390',
          'genre' => 'Thriller'
        },
        {
          'publication_date' => '10/12/2005',
          'price' => '10.99',
          'author' => 'J. R. R. Tolkien',
          'title' => 'The Lord Of The Rings',
          'rating' => '5',
          'pages' => '3489',
          'genre' => 'Fantasy'
        }
      ]
    };

    The import will completely ignore entries that don't meet the specified criteria (in this case, publication_date must be >= '1/1/2002'). Can DOM or SAX-based parsing do this?

      Any real gain in speed would have to come from not parsing the entire XML document, and I'm not sure you can avoid that.

      Given that parsing has to take place in either case, you can make trade-offs of speed versus memory after that point, but you will still have that initial time investment in parsing.
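      To illustrate the memory side of that trade-off with the library example above, here is a minimal sketch using XML::LibXML::Reader, the pull-parser interface that ships with XML::LibXML (a streaming parser, not SAX itself). The helper name books_since and the YYYYMMDD cutoff format are assumptions made for this illustration. It still reads the whole document, but only the records that pass the filter are kept in memory.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper (not part of any module): stream through $xml and keep
# only the <book> records whose publication_date (MM/DD/YYYY) falls on or
# after $cutoff, given as a YYYYMMDD number.
sub books_since {
    my ($xml, $cutoff) = @_;
    my $reader = XML::LibXML::Reader->new(string => $xml)
        or die "cannot create reader\n";
    my @books;

    # nextElement returns 1 each time it lands on another <book> element.
    while ($reader->nextElement('book') == 1) {
        # Copy just this one <book> subtree into a small DOM fragment.
        my $book = $reader->copyCurrentNode(1);

        # One hash entry per child element: tag name => text content.
        my %record = map { $_->nodeName => $_->textContent } $book->findnodes('*');

        # Turn MM/DD/YYYY into a sortable YYYYMMDD key; skip unparsable dates.
        my ($m, $d, $y) =
            ($record{publication_date} // '') =~ m{^(\d{1,2})/(\d{1,2})/(\d{4})$}
            or next;
        push @books, \%record
            if sprintf('%04d%02d%02d', $y, $m, $d) >= $cutoff;
    }
    return { book => \@books };
}

# Trimmed-down version of the sample data from the question.
my $sample = q{<library>
  <book><title>Dreamcatcher</title><publication_date>11/27/2001</publication_date></book>
  <book><title>Mystic River</title><publication_date>07/22/2003</publication_date></book>
  <book><title>The Lord Of The Rings</title><publication_date>10/12/2005</publication_date></book>
</library>};

my $filtered = books_since($sample, 20020101);
print "$_->{title}\n" for @{ $filtered->{book} };
```

      For a real file you would presumably build the reader with location => 'library.xml' (a hypothetical file name) in place of string => $xml, so that the document is never held in memory as one string.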

      This is where something like XML::Bare would help you out, as it will parse at least an order of magnitude faster than XML::Simple.

      I have gone down the road of trying to find clever solutions to process very large XML files quickly, but I have generally ended up settling on a file management solution instead.

      Simply breaking your XML documents into smaller logical pieces will give you more speed gains than a strictly XML-parsing approach. Note, however, that I did this in a case where I needed to process every record in the XML and just wanted faster parsing.

      This may mean something as simple as keeping books with titles beginning with a certain letter in individual files. This is not really applicable to your case if you want to be able to filter by any field.

      My real advice would be to look at a database solution instead. This is really the only way to return matching records with reliable speed. If you are more comfortable with text files, at least to start, you can look at DBD::CSV as a way to get your foot in the DBI door.
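      As a rough sketch of what the DBD::CSV route might look like, the example below writes a small books.csv into a temporary directory and queries it through DBI. The file name, the column names, and the choice to store dates as YYYY-MM-DD (so that a plain string comparison sorts them correctly) are all assumptions made for this illustration.

```perl
use strict;
use warnings;
use DBI;
use File::Temp qw(tempdir);

# Write a tiny CSV file; the first line supplies the column names.
my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/books.csv" or die "cannot write books.csv: $!";
print {$fh} "title,author,publication_date\n";
print {$fh} "Dreamcatcher,Stephen King,2001-11-27\n";
print {$fh} "Mystic River,Dennis Lehane,2003-07-22\n";
print {$fh} "The Lord Of The Rings,J. R. R. Tolkien,2005-10-12\n";
close $fh or die "cannot close books.csv: $!";

# f_dir is the directory holding the tables; f_ext => '.csv/r' maps the
# table name "books" to the file books.csv.
my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
    f_dir      => $dir,
    f_ext      => '.csv/r',
    RaiseError => 1,
    PrintError => 0,
});

# Only the matching rows come back; the dates are compared as strings.
my @titles = sort @{ $dbh->selectcol_arrayref(
    'SELECT title FROM books WHERE publication_date >= ?',
    undef, '2002-01-01',
) };
$dbh->disconnect;

print "$_\n" for @titles;
```

      The same DBI calls should carry over largely unchanged if you later move to a real database driver, which is the point of starting here.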

      If you go with XML::Bare, be aware that in my experience the ForceArray parameter does not produce the expected results. I wrote my own workaround that post-processes the data structure into what XML::Simple with ForceArray would produce. I can pass it along to you if you go this route.
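      The workaround mentioned above is not shown in this thread, so the following is only a generic sketch of the idea: walk a parsed structure and wrap the values of chosen keys in array references whenever they are not arrays already. The helper name force_array is hypothetical, and the code assumes nothing about the exact layout XML::Bare produces.

```perl
use strict;
use warnings;

# Hypothetical helper: recursively visit $node and make sure every key named
# in %force holds an array reference, wrapping single values as needed.
sub force_array {
    my ($node, %force) = @_;
    if (ref $node eq 'HASH') {
        for my $key (keys %$node) {
            # Wrap a lone value so callers can always iterate over it.
            $node->{$key} = [ $node->{$key} ]
                if $force{$key} && ref $node->{$key} ne 'ARRAY';
            force_array($node->{$key}, %force);
        }
    }
    elsif (ref $node eq 'ARRAY') {
        force_array($_, %force) for @$node;
    }
    return $node;
}

# A single <book> typically parses to a hash, several to an array of hashes.
my $one  = force_array({ book => { title => 'Dreamcatcher' } }, book => 1);
my $many = force_array(
    { book => [ { title => 'Mystic River' }, { title => 'The Lord Of The Rings' } ] },
    book => 1,
);

# After normalizing, both cases can be handled by the same loop.
print scalar @{ $_->{book} }, "\n" for $one, $many;
```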