shravnk has asked for the wisdom of the Perl Monks concerning the following question:

Hello-

I have written the following script for parsing XML, and it works for small files. For larger files, however, it runs out of memory. I have inserted the Tie::File module, to no avail. Is there anything I could do to improve I/O? Thanks in advance.

#!/usr/bin/perl
$[ = 1; # set array base to 1
print "Enter modifier (e.g ISIN), input file, output file, each separated by a space\n";
$enter = <STDIN>;
@fields = split(/ /,$enter, 99999); #save array of modifiers
my @input;
use Tie::File; #use Tie:File to get data from xml doc
tie @input, 'Tie::File', $fields[2];
@all = join("", @input); #join all of the lines of input together in one string
$all = "@all";
@stories = split(/<\?xml/, $all, 99999); #split up the string by story, using the <?xml tag
foreach (@stories) {
    $_ = $_;
    $match = $_ if m#$fields[1]#; #go through the stories, matching a modifier
    push (@matches, $match) #if it matches, add to a new array
}
open(STDOUT, ">$fields[3]"); #print that new array to a new file
print "@matches";

Replies are listed 'Best First'.
Re: Parsing Large XML
by marto (Cardinal) on Jul 02, 2010 at 17:28 UTC

    Use a proper XML parser for working with XML files, such as XML::Twig ("XML::Twig - A perl module for processing huge XML documents in tree mode").
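
For illustration only, here is a minimal sketch of what the XML::Twig approach might look like. It is not marto's code, and it rests on assumptions the thread does not confirm: that the input is a single well-formed XML document and that each record is a `story` element (a hypothetical name). The original script splits on `<?xml`, which suggests several documents concatenated in one file, and a parser would reject that as it stands. `print_matching_stories` is likewise a made-up helper name.

```perl
use strict;
use warnings;
use XML::Twig;

# Print every <story> element whose serialized XML matches $pattern.
# 'story' is a hypothetical element name; substitute the real one.
sub print_matching_stories {
    my ($xml_file, $pattern, $out_fh) = @_;

    my $twig = XML::Twig->new(
        twig_handlers => {
            story => sub {
                my ($t, $story) = @_;
                my $xml = $story->sprint;    # the element as an XML string
                print {$out_fh} $xml, "\n" if $xml =~ /$pattern/;
                $t->purge;                   # free what has been parsed so far
            },
        },
    );
    $twig->parsefile($xml_file);
    return;
}
```

The handler runs once per completed `story` element, and the call to `purge` discards the parsed part of the tree, so memory use should stay roughly proportional to one story rather than to the whole file.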

Re: Parsing Large XML
by ikegami (Patriarch) on Jul 02, 2010 at 19:09 UTC
    Using Tie::File takes up *more* memory than not.

    Your memory problems would be avoided by printing the results as you find them instead of saving them in an array and printing them all at once.

    There are other issues.

    • Don't use $[. It causes maintenance issues. It's deprecated. It's scheduled to be removed from the language.
    • Don't reuse STDOUT instead of using a new handle. This also causes confusion.
    • Get rid of the needless third arg to split, or set it to the more appropriate 4.
    • Get rid of the useless $_ = $_;.
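
The points in the list above could be applied roughly as follows. This is a sketch, not ikegami's code: `parse_request` and `open_output` are hypothetical helper names, and the three-argument `open` with a lexical handle is simply one common way to avoid reusing STDOUT.

```perl
use strict;
use warnings;

# Split one line of user input into modifier, input file and output file.
# A list assignment avoids array indices altogether, so $[ is not needed.
sub parse_request {
    my ($line) = @_;
    chomp $line;    # drop the trailing newline from <STDIN>
    # split ' ' splits on runs of whitespace; no limit argument is given
    my ($modifier, $in_file, $out_file) = split ' ', $line;
    return ($modifier, $in_file, $out_file);
}

# Open the output file on its own lexical handle, leaving STDOUT alone.
sub open_output {
    my ($path) = @_;
    open my $fh, '>', $path or die "Cannot open '$path' for writing: $!";
    return $fh;
}
```

A caller would then write something like `my ($modifier, $in_file, $out_file) = parse_request(scalar <STDIN>);` followed by `my $fh = open_output($out_file);`, and print matches to `$fh`.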

      ikegami-

      Sorry, I'm new to Perl, and programming in general, and was somewhat confused by your comments. A couple of questions:

      - What is an acceptable alternative to $[ ?

      - Could you suggest a way to print each result as it is found?

      I greatly appreciate your help.

        What is an acceptable alternative to $[ ?

        Not using it. It adds needless complexity.

        If you insist,

        $[ = 1;
        $a[$i]
        can be written as
        use constant BASE => 1;
        $a[$i-BASE]

        At least there's no hidden effect at a distance this way.

        Could you suggest a way to print each result as it is found?

        To print the results instead of saving them in an array, replace the code that places them in the array with code that prints each match.

        for (@stories) { print $fh $_ if /$fields[1]/; }
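
The loop above assumes `$fh` is already open and that `@stories` still holds the whole file. A fuller sketch of printing as you go, assuming the stories really are separated by `<?xml` as in the original script, might look like the following. Reading one story at a time by setting `$/` is a substitution for the slurp-and-split approach, not something proposed in this thread, and `filter_stories` is a hypothetical name. Unlike the original, it writes the `<?xml` prefix back in front of each match.

```perl
use strict;
use warnings;

# Copy every story matching $pattern from $in_file to $out_file,
# holding only one story in memory at a time. Returns the match count.
sub filter_stories {
    my ($pattern, $in_file, $out_file) = @_;

    open my $in, '<', $in_file  or die "Cannot read '$in_file': $!";
    open my $fh, '>', $out_file or die "Cannot write '$out_file': $!";

    local $/ = '<?xml';    # each read now ends at the next '<?xml'
    my $count = 0;
    while (my $story = <$in>) {
        chomp $story;                  # strips the trailing '<?xml', if present
        next unless length $story;     # nothing precedes the first declaration
        next unless $story =~ /$pattern/;
        print {$fh} '<?xml', $story;   # restore the prefix removed by the read
        $count++;
    }
    close $in;
    close $fh or die "Cannot close '$out_file': $!";
    return $count;
}
```

Because neither the input nor the matches are accumulated in an array, memory use should depend on the size of the largest single story rather than on the size of the file.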