in reply to Re^3: How do I create an array of hashes from an input text file?
in thread How do I create an array of hashes from an input text file?

Yeah, if XML::Simple must load the entire data structure into memory, then it's not quite what I'm looking for. It just takes too long.

Maybe an example will help. Suppose I have the following XML input file:

<?xml version="1.0"?>
<library>
  <book>
    <title>Dreamcatcher</title>
    <author>Stephen King</author>
    <genre>Horror</genre>
    <pages>899</pages>
    <price>23.99</price>
    <rating>5</rating>
    <publication_date>11/27/2001</publication_date>
  </book>
  <book>
    <title>Mystic River</title>
    <author>Dennis Lehane</author>
    <genre>Thriller</genre>
    <pages>390</pages>
    <price>17.49</price>
    <rating>4</rating>
    <publication_date>07/22/2003</publication_date>
  </book>
  <book>
    <title>The Lord Of The Rings</title>
    <author>J. R. R. Tolkien</author>
    <genre>Fantasy</genre>
    <pages>3489</pages>
    <price>10.99</price>
    <rating>5</rating>
    <publication_date>10/12/2005</publication_date>
  </book>
</library>

Suppose I only want to import books published on or after January 1, 2002. If I apply such a filter when I do my initial import, the result should look like this:

$VAR1 = {
          'book' => [
                      {
                        'publication_date' => '07/22/2003',
                        'price' => '17.49',
                        'author' => 'Dennis Lehane',
                        'title' => 'Mystic River',
                        'rating' => '4',
                        'pages' => '390',
                        'genre' => 'Thriller'
                      },
                      {
                        'publication_date' => '10/12/2005',
                        'price' => '10.99',
                        'author' => 'J. R. R. Tolkien',
                        'title' => 'The Lord Of The Rings',
                        'rating' => '5',
                        'pages' => '3489',
                        'genre' => 'Fantasy'
                      }
                    ]
        };

The import would completely ignore entries that don't meet the specified criteria (in this case, a publication_date on or after 01/01/2002). Can DOM- or SAX-based parsing do this?

Replies are listed 'Best First'.
Re^5: How do I create an array of hashes from an input text file?
by Kc12349 (Monk) on Nov 16, 2011 at 17:46 UTC

    Any real speed gain would come from avoiding parsing the entire XML document, and I'm not sure you can avoid that.

    Given that parsing has to take place either way, you can trade speed against memory after that point, but you will still have that initial time investment in parsing.
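
    That said, a stream-oriented parser lets you apply the filter while parsing and discard everything that does not match, so memory stays flat even though the whole file is still read once. A minimal sketch with XML::Twig, assuming the library.xml layout above and the mm/dd/yyyy dates:

        use XML::Twig;

        my @books;
        my $twig = XML::Twig->new(
            twig_handlers => {
                book => sub {
                    my ( $t, $book ) = @_;
                    my %h = map { $_->tag => $_->text } $book->children;

                    # mm/dd/yyyy -> yyyymmdd so a plain string compare works
                    my ( $m, $d, $y ) = split m{/}, $h{publication_date};
                    push @books, \%h
                        if sprintf( '%04d%02d%02d', $y, $m, $d ) ge '20020101';

                    $t->purge;    # free everything parsed so far
                },
            },
        );
        $twig->parsefile('library.xml');

    The purge call is what keeps memory bounded: matching books survive in @books, and everything else is thrown away as soon as it has been parsed.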

    This is where something like XML::Bare would help you out, as it parses at least an order of magnitude faster than XML::Simple.
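
    A minimal sketch of that, assuming the library.xml above (XML::Bare hangs each element's text off a 'value' key, so the access pattern differs from XML::Simple's):

        use XML::Bare;

        my $ob   = XML::Bare->new( file => 'library.xml' );
        my $root = $ob->parse();

        # element text lives under the 'value' key; note a lone <book>
        # would come back as a hash, not an arrayref -- see the
        # ForceArray caveat at the end of this node
        for my $book ( @{ $root->{library}{book} } ) {
            printf "%s (%s)\n", $book->{title}{value}, $book->{author}{value};
        }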

    I have gone down the road of trying to find clever solutions to process very large XML files quickly, but I have generally settled on a file-management solution instead.

    Simply breaking your XML documents into smaller logical pieces will gain you more speed than any strictly parser-level approach. However, I did this in a case where I needed to process all records in the XML and just wanted faster parsing.

    This may mean something as simple as keeping books whose titles begin with a certain letter in individual files (a rough sketch of such a splitter follows). This is not really applicable to your case, though, if you want to be able to filter by any field.
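
    The splitter itself can be short; here is one way, again with XML::Twig (the output files hold bare <book> elements, so you would wrap them in a root element before re-parsing):

        use XML::Twig;

        my %out;
        my $twig = XML::Twig->new(
            twig_handlers => {
                book => sub {
                    my ( $t, $book ) = @_;
                    my $letter = uc substr $book->first_child_text('title'), 0, 1;
                    unless ( $out{$letter} ) {
                        open $out{$letter}, '>>', "books_$letter.xml"
                            or die "books_$letter.xml: $!";
                    }
                    print { $out{$letter} } $book->sprint, "\n";
                    $t->purge;    # keep memory flat while splitting
                },
            },
        );
        $twig->parsefile('library.xml');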

    My real advice would be to look at a database solution instead. This is really the only way to return matching records with reliable speed. If you are more comfortable with text files, at least to start, you can look at DBD::CSV as a way to get your foot in the DBI door.
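
    A bare-bones DBD::CSV sketch (the books.csv file and its column names here are made up, and it assumes dates stored ISO-style as yyyy-mm-dd so that plain string comparison orders them correctly):

        use DBI;

        my $dbh = DBI->connect( 'dbi:CSV:', undef, undef,
            { f_dir => '.', RaiseError => 1 } );

        # the table name maps to ./books.csv
        my $sth = $dbh->prepare(
            'SELECT title, author FROM books WHERE publication_date >= ?'
        );
        $sth->execute('2002-01-01');

        while ( my ( $title, $author ) = $sth->fetchrow_array ) {
            print "$title by $author\n";
        }

    Once your data outgrows CSV, the same DBI code moves to a real database engine with little more than a change to the connect string.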

    If you go with XML::Bare, be aware that in my experience the ForceArray parameter does not produce the expected results. I wrote my own workaround that post-processes the data structure into what XML::Simple with ForceArray would produce, and I can pass it along if you go this route.
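
    The idea is roughly this (a simplified sketch, not the actual workaround): coerce the node to an arrayref so single and repeated elements look the same.

        # make a node always an arrayref, mimicking what XML::Simple
        # with ForceArray would hand back
        sub force_array {
            my ($node) = @_;
            return []    unless defined $node;
            return $node if ref $node eq 'ARRAY';
            return [$node];
        }

        my $books = force_array( $root->{library}{book} );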