andrewheiss has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on parsing a large XML file of a blog backup and reordering and reformatting the text to be used in InDesign. The XML file is formatted somewhat like this:
    ...
    <entry>
      <type>Post</type>
      <url>URL</url>
      <title>The title</title>
      <content>The blog post goes here</content>
    </entry>
    <entry>
      <type>Comment</type>
      <in-reference-to>URL</in-reference-to>
      <title>The title of the comment</title>
      <content>The content itself</content>
    </entry>
    ...
Right now I have my code go through the XML file once, find all the comments, and put them in a temporary file using File::Temp. Here it is in pseudocode:
    my $tmpComments = File::Temp->new( SUFFIX => '.txt' )
        or die "File::Temp: $!\n";

    foreach ( ... ) {    # go through the whole XML file
        if ( $type eq 'Comment' ) {
            # Parse the XML and save the parts as variables;
            # $posturl holds the <in-reference-to> URL
            print $tmpComments "$posturl~~~$date~~~$author~~~$content\n";
        }
    }
I then loop through the XML file again to find and print out all the actual blog posts. On each pass I search the entire temporary file for any comments that start with the post URL and add those comments to the post, like so (again in pseudocode):
    foreach ( ... ) {    # loop through the XML file again
        print "$title\n";
        print "$content\n";
        ...
        # Rewind the temporary comments file
        seek $tmpComments, 0, 0 or die "Seek $tmpComments failed: $!\n";

        my $comments = '';    # reset for each post
        while ( my $line = <$tmpComments> ) {
            # Split each line; the first field is the post URL
            my @process_comment = split /~~~/, $line;
            my $commentID = $process_comment[0];

            # If the URLs match, add the comment to this post
            if ( $commentID eq $posturl ) {
                my $commentDate   = $process_comment[1];
                my $commentAuthor = $process_comment[2];
                my $commentBody   = $process_comment[3];
                $comments .= "$commentDate | $commentAuthor | $commentBody";
            }
        }
        print $comments;
    }

It works just fine, but it takes forever with longer XML files, probably because it rescans the entire temporary file once for every post.

Is there a more efficient way of doing this?

Is there a way to use a cache to store all the comments indexed by the URL so that when I loop through the blog posts I can look at the cache and pull out the appropriate comments?

Replies are listed 'Best First'.
Re: Use temporary file or cache
by moritz (Cardinal) on May 28, 2009 at 08:53 UTC

    I then loop through the xml file again to find and print out all the actual blog posts. On each pass I search the entire temporary file for any comments that start with the post url and add those comments to the post

    ...

    Is there a more efficient way of doing this?

    Yes. Store the comments in a hash of arrays, keyed by the post URL, i.e.

    my %comments = (
        'http://url.to/post' => [ $comment1, $comment2, ... ],
        ...
    );

    And on the second pass simply look it up.

    If the data is too large to fit into memory all at once, use something like DBM::Deep to store it in a file while preserving fast lookups.
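    As a rough sketch of what the DBM::Deep variant might look like (the file name and sample values are invented here, and DBM::Deep must be installed from CPAN):

```perl
use strict;
use warnings;
use DBM::Deep;    # CPAN module, not in core Perl

# Tie a hash of arrays to a file on disk. The structure can grow
# beyond available RAM while lookups by key stay fast.
my $db = DBM::Deep->new('comments.db');

# First pass: append each comment under its post URL
# (DBM::Deep autovivifies the nested array).
push @{ $db->{'http://url.to/post'} }, 'date | author | body';

# Second pass: fetch all comments for a post directly by URL.
for my $c ( @{ $db->{'http://url.to/post'} || [] } ) {
    print "$c\n";
}
```

    The in-memory code stays almost identical; only the `DBM::Deep->new` line changes whether the hash lives in RAM or on disk.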

      How big is too large? My largest XML file is about 20 MB... I'm a little unclear on how to programmatically save everything to the hash and then look it up... I haven't worked with hashes much in Perl yet. So, the first time I loop through the XML to find the comments, I store each comment as a new entry in the %comments hash, right? Something like this?
      ...
      my $commentText = $date . $author . $content;
      my %comments = ( $posturl => [$commentText] );
      ...

      I can see that working for one comment--would a second comment then overwrite the first one in the array, or would it be appended automatically? Would I need to use push or something?

      Then, during the second pass, how exactly would I reference the keyed array?

      Sorry for all the newbie questions... Thanks!

        Try reading perlintro or perldata for more information on hashes, and perlreftut for more involved data structures. Suppose you have a comment stored in $comment and want to record that it's related to the URL $post_url; you'd write:
        my %comments;
        ...
        push @{ $comments{$post_url} }, $comment;

        And you can retrieve and iterate over the list of comments for a URL:

        for my $c ( @{ $comments{$post_url} } ) {
            print "$c\n";
        }

        20 MB, even with a few copies in hashes and arrays here and there, will probably fit in memory without any problem. There is some overhead in Perl variables, and more in the more complex structures (hashes and arrays), but even so you should be OK unless you have very limited virtual memory available. I would keep everything in memory (from Perl's perspective) and let the swapper deal with the disk if necessary, unless that proved problematic. You might check how much free RAM is available on your system.
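        Putting the pieces together, a minimal in-memory sketch of the whole approach might look like this (the sample comment data here is invented for illustration; in the real script it would come from the XML parser):

```perl
use strict;
use warnings;

my %comments;

# Pass 1: index every comment by the URL of the post it refers to.
# @parsed_comments stands in for whatever the XML parsing produces.
my @parsed_comments = (
    { posturl => 'http://url.to/post', date => '2009-05-28',
      author  => 'anon', content => 'Nice post!' },
);
for my $c (@parsed_comments) {
    push @{ $comments{ $c->{posturl} } },
        "$c->{date} | $c->{author} | $c->{content}";
}

# Pass 2: for each post, pull its comments straight out of the hash
# instead of rescanning a temporary file on every iteration.
my $posturl = 'http://url.to/post';
for my $line ( @{ $comments{$posturl} || [] } ) {
    print "$line\n";
}
```

        The `|| []` guard just avoids autovivifying an empty array for posts that have no comments.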