http://qs1969.pair.com?node_id=478722

rogue90 has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a pretty simple script using XML::Parser to parse data into an array. The test data I was using was not well formed, but it was close enough that all I had to do was add a root node, and it worked. Then I got a bigger chunk of test data and, what do you know, there are unescaped ampersands all over it. Very much not well formed. Is there a way to process the data on the fly as it is fed into the parser, so I can replace the ampersands with entities? I would like to avoid creating a temp file, as the data is rather large. My code, minus the subs:
#!/usr/bin/perl
use strict;
use XML::Parser;

my $xmlfile = shift;
die "Cannot find file \"$xmlfile\"" unless -f $xmlfile;

my $count = 0;
my $tag = "";
my $encode;

my $parser = new XML::Parser;
$parser->setHandlers(
    Start   => \&startElement,
    End     => \&endElement,
    Char    => \&characterData,
    Default => \&default);
$parser->parsefile($xmlfile);

Replies are listed 'Best First'.
Re: XML::Parser sigh...
by runrig (Abbot) on Jul 27, 2005 at 21:35 UTC
    Use parse instead of parsefile, and pass it an IO handle. Filter the data before it gets to the parser. There are several ways to do that. You could write a separate program and pipe the output to this one (and pass the parser STDIN), fork a child process which reads/filters/outputs the file and the parent process reads from it (easy with pipe), or create a tied filehandle which filters input (which may work if XML::Parser will accept a tied filehandle).
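A minimal sketch of the fork-and-pipe variant described above, not a tested drop-in. The helper names `escape_bare_ampersands` and `open_filtered` are made up for illustration, and the regex is an assumption about what counts as a "bare" ampersand: any `&` that does not already start a named or numeric reference. It works line by line, so it would also rewrite ampersands inside CDATA sections, which may or may not be what you want.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape every '&' that does not already begin an
# entity or character reference such as &amp; &#38; or &#x26;
sub escape_bare_ampersands {
    my ($line) = @_;
    $line =~ s/&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $line;
}

# Hypothetical helper: fork a child that reads $path, filters it line by
# line and writes to a pipe. The parent gets the read end of that pipe.
sub open_filtered {
    my ($path) = @_;
    my $pid = open(my $reader, '-|');    # implicit fork; child's STDOUT is the pipe
    die "Cannot fork: $!" unless defined $pid;
    if ($pid == 0) {                     # child process
        open(my $in, '<', $path) or die "Cannot open $path: $!";
        while (my $line = <$in>) {
            print STDOUT escape_bare_ampersands($line);
        }
        close $in;
        exit 0;
    }
    return $reader;                      # parent process
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    require XML::Parser;
    # The Char handler here just echoes text; substitute your own handlers.
    my $parser = XML::Parser->new(Handlers => { Char => sub { print $_[1] } });
    my $fh = open_filtered($ARGV[0]);
    $parser->parse($fh);                 # parse() takes a handle; parsefile() takes a name
    close $fh;
}
```

Nothing is written to disk and only one line at a time is held by the filter, so the size of the input should not matter much.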
      Brilliant, thanks. I have been looking at this for so long I didn't even notice parse in the docs.
Re: XML::Parser sigh...
by GrandFather (Saint) on Jul 27, 2005 at 21:19 UTC

    Can you provide a small sample of your input data that demonstrates the problem, an example of the result you are getting and the result you expect?


    Perl is Huffman encoded by design.
      I see a couple of options here, one of which I don't know is possible. The easy one would be to create a new file with the ampersands escaped. That's pretty clunky, though. I am hoping I can replace them on the fly before the parser sees them. I am new to XML::Parser, though, and I am not sure if it's possible.
      The error for the ampersand is
      not well-formed (invalid token) at line...
      sure:
      <data>
        <RECORD>
          <id>1381</id>
          <title>Water & Fluids</title>
          <year>1986</year>
        </RECORD>
        ...
      </data>

        The parsers that I've looked at require the whole XML document. Internally they may swallow a bite at a time, but you've got to give them the whole thing as either a string or a file handle.

        I guess the options are either to preprocess the whole document, or to pass the parser a tied variable that filters the file as the parser reads it.


Re: XML::Parser sigh...
by BaldPenguin (Friar) on Jul 27, 2005 at 22:49 UTC
    You could read the text into a scalar and regex the &'s out. See Re^4: Why XML not well formed? for a regex that would work. Granted, if this is a large document, this would probably not be the answer.
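For reference, a rough sketch of the slurp-and-substitute idea. The regex below is my own guess at a workable pattern and is not the one from the linked node; `slurp_and_escape` is a made-up helper name. It holds the whole document in memory, so it only suits inputs that fit comfortably in RAM.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read the whole file into one scalar, escape every
# '&' that does not already start an entity or character reference, and
# return the repaired text.
sub slurp_and_escape {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my $xml = do { local $/; <$in> };    # undefined $/ reads the file in one go
    close $in;
    $xml =~ s/&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    require XML::Parser;
    # The Char handler here just echoes text; substitute your own handlers.
    my $parser = XML::Parser->new(Handlers => { Char => sub { print $_[1] } });
    $parser->parse(slurp_and_escape($ARGV[0]));   # parse() also accepts a string
}
```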

    Don
    WHITEPAGES.COM | INC
    Everything I've learned in life can be summed up in a small perl script!