in reply to How do I create an array of hashes from an input text file?

XML::Simple or similar is undoubtedly the laziest path but if you do actually want to get your hands dirty you could do something like:
use feature qw(say);
use Data::Dumper;

my $data = <<'XML-type';
<item><key1>someValue</key1><key2>someValue</key2><key3>someValue</key3>
<key4>someValue</key4></item><item><key1>someValue</key1><key2>someValue
</key2><key3>someValue</key3><key4>someValue</key4></item><item>
<key1>someValue</key1><key2>someValue</key2><key3>someValue</key3><key4>someValue</key4></item>
XML-type

# Read the string one record at a time, using the closing tag as the record separator.
local $/ = '</item>';
open STRH, '<', \($data);
my @items = <STRH>;
close STRH;

# Build one hash per record; map returns each hash reference directly.
my @processed = map {
    my $item = $_;
    $item =~ s/<.?item>//g;
    +{ $item =~ m{ <([^>]+)>([^<]+)< }gx };
} @items;

print Dumper \@processed;
Prints:
$VAR1 = [
          {
            'key2' => 'someValue',
            'key4' => 'someValue',
            'key1' => 'someValue',
            'key3' => 'someValue'
          },
          {
            'key2' => 'someValue
',
            'key4' => 'someValue',
            'key1' => 'someValue',
            'key3' => 'someValue'
          },
          {
            'key2' => 'someValue',
            'key4' => 'someValue',
            'key1' => 'someValue',
            'key3' => 'someValue'
          },
          {}
        ];
For bonus points, work out why there is an empty hash at the end. (I haven't got time just now!)
Have fun

Re^2: How do I create an array of hashes from an input text file?
by MrSnrub (Beadle) on Nov 11, 2011 at 19:14 UTC
    I tried calling XMLin from XML::Simple, and that does almost everything I want. Thanks for your help. One question: suppose my XML file is really big, and I only want to add data that meets a certain criterion (say, where the value of key2 is "Joe"). How do I filter that input?

      For very large files XML::Simple is probably not a good route. It will require you to load the entire XML data structure into memory.

      Should you see performance issues, take a look at XML::LibXML, which is much more powerful. It offers an interface to DOM and SAX parsers. In particular, SAX-based parsing may be the best choice if memory becomes an issue, as it is event based rather than data-structure based.

      SAX will offer more in the way of memory management while DOM will offer more speed at the price of a larger footprint.

      If you want to stick with an XML::Simple-style interface but just gain some speed, take a look at XML::Bare, which is written in XS and is among the fastest in terms of runtime. It has fewer niceties than XML::Simple, but offers an option to create the same style of data structures.
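As a rough illustration of the event-based approach described above, here is a minimal sketch using XML::LibXML::SAX with a handler derived from XML::SAX::Base. The handler name ItemFilter, the <items> root wrapper, and the sample values are assumptions made for the example; only the criterion (key2 is "Joe") comes from the question. Only one item is held in memory at a time, and it is kept only if it passes the test.

```perl
use strict;
use warnings;
use XML::LibXML::SAX;
use Data::Dumper;

{
    package ItemFilter;    # hypothetical handler name
    use parent 'XML::SAX::Base';

    # Start collecting when an <item> opens; note which child field we are in.
    sub start_element {
        my ($self, $el) = @_;
        if ($el->{Name} eq 'item') {
            $self->{current} = {};
        }
        elsif ($self->{current}) {
            $self->{field} = $el->{Name};
            $self->{current}{ $el->{Name} } = '';
        }
    }

    # Text may arrive in several chunks, so append rather than assign.
    sub characters {
        my ($self, $chars) = @_;
        $self->{current}{ $self->{field} } .= $chars->{Data}
            if $self->{current} && defined $self->{field};
    }

    # When an <item> closes, keep it only if it meets the criterion.
    sub end_element {
        my ($self, $el) = @_;
        if ($el->{Name} eq 'item') {
            my $item = delete $self->{current};
            push @{ $self->{kept} }, $item if ($item->{key2} // '') eq 'Joe';
        }
        else {
            undef $self->{field};
        }
    }
}

# Assumed sample data: a well-formed document needs a single root element.
my $xml = <<'XML';
<items>
  <item><key1>a</key1><key2>Joe</key2></item>
  <item><key1>b</key1><key2>Ann</key2></item>
  <item><key1>c</key1><key2>Bob</key2></item>
</items>
XML

my $handler = ItemFilter->new;
my $parser  = XML::LibXML::SAX->new(Handler => $handler);
$parser->parse_string($xml);

print Dumper $handler->{kept};
```

For a real file, parse_uri with a file name could replace parse_string, so the document is never held in memory as a string.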

        Yeah, I guess if XML::Simple must load the entire data structure into memory, I don't think it's quite what I'm looking for. It just takes too long.

        Maybe an example might help. Suppose I have the following XML input file:

        <?xml version="1.0"?>
        <library>
          <book>
            <title>Dreamcatcher</title>
            <author>Stephen King</author>
            <genre>Horror</genre>
            <pages>899</pages>
            <price>23.99</price>
            <rating>5</rating>
            <publication_date>11/27/2001</publication_date>
          </book>
          <book>
            <title>Mystic River</title>
            <author>Dennis Lehane</author>
            <genre>Thriller</genre>
            <pages>390</pages>
            <price>17.49</price>
            <rating>4</rating>
            <publication_date>07/22/2003</publication_date>
          </book>
          <book>
            <title>The Lord Of The Rings</title>
            <author>J. R. R. Tolkien</author>
            <genre>Fantasy</genre>
            <pages>3489</pages>
            <price>10.99</price>
            <rating>5</rating>
            <publication_date>10/12/2005</publication_date>
          </book>
        </library>

        Suppose I only want to import books that were published after January 1, 2002. If I apply such a filter when I do my initial import, the result should look like this:

        $VAR1 = {
                  'book' => [
                              {
                                'publication_date' => '07/22/2003',
                                'price' => '17.49',
                                'author' => 'Dennis Lehane',
                                'title' => 'Mystic River',
                                'rating' => '4',
                                'pages' => '390',
                                'genre' => 'Thriller'
                              },
                              {
                                'publication_date' => '10/12/2005',
                                'price' => '10.99',
                                'author' => 'J. R. R. Tolkien',
                                'title' => 'The Lord Of The Rings',
                                'rating' => '5',
                                'pages' => '3489',
                                'genre' => 'Fantasy'
                              }
                            ]
                  };

        The import will completely ignore entries that don't meet the specified criteria (in this case, publication_date must be >= '1/1/2002'). Can DOM or SAX-based parsing do this?
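One way this kind of filtered import might look is sketched below with XML::LibXML::Reader, the pull parser that ships with XML::LibXML. It is neither pure DOM nor pure SAX: it streams through the document and builds a small DOM fragment for one <book> at a time. The helper sortable_date and the trimmed two-book sample are assumptions made for the example, and the dates are assumed to be in MM/DD/YYYY form as in the question.

```perl
use strict;
use warnings;
use XML::LibXML;            # exports XML_ELEMENT_NODE
use XML::LibXML::Reader;
use Data::Dumper;

# Hypothetical helper: turn MM/DD/YYYY into YYYYMMDD so that plain
# string comparison orders dates correctly.
sub sortable_date {
    my ($mdy) = @_;
    my ($m, $d, $y) = $mdy =~ m{^(\d{1,2})/(\d{1,2})/(\d{4})$} or return;
    return sprintf '%04d%02d%02d', $y, $m, $d;
}

# Assumed sample data, shortened from the library example.
my $xml = <<'XML';
<?xml version="1.0"?>
<library>
  <book>
    <title>Dreamcatcher</title>
    <publication_date>11/27/2001</publication_date>
  </book>
  <book>
    <title>Mystic River</title>
    <publication_date>07/22/2003</publication_date>
  </book>
</library>
XML

my $cutoff = sortable_date('1/1/2002');
my $reader = XML::LibXML::Reader->new(string => $xml);

my @books;
# Jump from one <book> element to the next without loading the whole tree.
while ($reader->nextElement('book') == 1) {
    my $node = $reader->copyCurrentNode(1);    # deep copy of this book only

    # Flatten the child elements into a hash of tag name => text.
    my %book = map  { $_->nodeName => $_->textContent }
               grep { $_->nodeType == XML_ELEMENT_NODE } $node->childNodes;

    my $date = sortable_date($book{publication_date} // '');
    push @books, \%book if defined $date && $date ge $cutoff;
}

print Dumper \@books;
```

With this sample only the Mystic River entry should survive the filter. For a file on disk, new(location => $filename) could replace the string argument. Note that the result here is a plain array of hashes rather than the { book => [...] } wrapper that XML::Simple produces.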