brpsss has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that I need to parse using XML::Parser, but about halfway through the file the parsing stops with an error.

I have looked at the file and I find that there are weird characters embedded in the text.

What can I do to make this file parse properly? Is there any way to remove these characters from the file? Thanks.

Replies are listed 'Best First'.
Re: XML file won't parse properly
by mirod (Canon) on Apr 13, 2001 at 15:17 UTC

    Ah! Encoding problems! Don't you love them?

    If the faulty characters look like French characters, then chances are that the document you are parsing is in ISO-8859-1 (all numbers and options should be verified; I am on a slow connection today and can't check them myself). Such a document should have an XML declaration specifying that encoding (the default encoding is UTF-8; UTF-16 is accepted too). Without that declaration the parser reads the file as UTF-8, so the document is not well-formed XML. You can still tell the parser to use another encoding, though. Look for the ProtocolEncoding option to XML::Parser's new method. Adding ProtocolEncoding => 'ISO-8859-1' (there is an example of this in the doc, I think) should solve part of your problem.
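
    A minimal sketch of the ProtocolEncoding suggestion, assuming XML::Parser is installed. The sample document and the parse_text helper are made up for illustration; the idea is simply to parse the same Latin-1 bytes with and without the hint and see which attempt survives.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample: ISO-8859-1 bytes with no XML declaration.
# \xE9 is "e acute" in ISO-8859-1 and is not valid UTF-8 on its own.
my $latin1_xml = "<name>caf\xE9</name>";

# Parse $xml and return all character data seen by the Char handler.
# Extra options (such as ProtocolEncoding) are passed through to new().
sub parse_text {
    my ($xml, %parser_opts) = @_;
    my $text   = '';
    my $parser = XML::Parser->new(
        %parser_opts,
        Handlers => { Char => sub { $text .= $_[1] } },
    );
    $parser->parse($xml);    # dies if the input is not well-formed
    return $text;
}

# No hint: the parser assumes UTF-8.
my $default_ok = eval { parse_text($latin1_xml); 1 } ? 1 : 0;

# With the hint: the same bytes are read as ISO-8859-1.
my $hinted_ok =
    eval { parse_text($latin1_xml, ProtocolEncoding => 'ISO-8859-1'); 1 } ? 1 : 0;

print "parsed with default encoding: $default_ok\n";
print "parsed with ISO-8859-1 hint:  $hinted_ok\n";
```

    For a file on disk the same option applies; only the call changes from parse to parsefile.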

    I say only part because, no matter which encoding the original file is in, XML::Parser turns it into UTF-8, so when you output the data you get it in UTF-8, which you probably don't want.

    Actually YOU probably don't care, but unless you are extremely lucky, chances are that the rest of the software you use to display/store/whatever your output does not like UTF-8.

    What are your options there? First, you can convert everything back into ISO-8859-1 (which I will call latin 1 from now on because I am tired of typing that stoopid number). You can do that with a substitution that I don't remember right now (darn slow connection!) but that you can find in the code of XML::TiePYX, or with the Unicode::String module, or with the Perl interface to iconv, whose name I can't remember either. You can also use the original_string method in the handlers to get the string in the original encoding. The only problem then is that if you have accented characters in attribute values you can't use the values from %atts; you have to hand-parse the opening tag string to extract those values, which is quite error-prone. Perl 5.6.0 also has tr///U and tr///C that supposedly do this, but these options are already deprecated and won't be supported in future versions due to changes in the Unicode interface. I'd say your best bet is to get the relevant code from XML::TiePYX; it's a small module, look for 'latin1' in it.
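
    Two hedged sketches of the "convert back to latin 1" step. The first uses the Encode module, which ships with Perl 5.8 and later (so it is newer than the options listed above). The second is a byte-level substitution in the same spirit as the one mentioned; it is not copied from XML::TiePYX, and both function names are made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Option 1: Encode. Takes UTF-8 bytes, returns ISO-8859-1 bytes.
# Characters with no latin 1 equivalent come out as '?'.
sub utf8_to_latin1 {
    my ($utf8_bytes) = @_;
    return encode('ISO-8859-1', decode('UTF-8', $utf8_bytes));
}

# Option 2: no modules. Only two-byte sequences that start with \xC2 or
# \xC3 fall inside latin 1; everything else is left untouched.
sub utf8_to_latin1_by_hand {
    my ($bytes) = @_;
    $bytes =~ s/([\xC2\xC3])([\x80-\xBF])/chr(((ord($1) & 0x03) << 6) | (ord($2) & 0x3F))/eg;
    return $bytes;
}

# "cafe" with an acute accent, as UTF-8 bytes (C3 A9 is the accented e).
my $utf8 = "caf\xC3\xA9";

# Show the converted bytes in hex so the terminal encoding does not matter.
for my $converted (utf8_to_latin1($utf8), utf8_to_latin1_by_hand($utf8)) {
    print join(' ', map { sprintf '%02X', ord } split //, $converted), "\n";
}
# prints "63 61 66 E9" twice
```

    Option 2 silently passes through anything outside latin 1, so it only suits data you already know is limited to that range.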

    I wrote an article about this a while ago for XML.com; you can find a copy here. It has examples, but does not include the "XML::TiePYX manoeuvre" as I came across it later.

    Good luck, encoding problems are among the worst problems with XML.

Re: XML file won't parse properly
by stefan k (Curate) on Apr 12, 2001 at 21:08 UTC
    It would be very helpful if you'd post the error message and (at least part of) your XML document... there could be tons of possible errors.

    One that comes to mind quickly: is there something wrong with your DTD and/or charsets? Do you declare the document as ISO-8859-1 by using

    <?xml version="1.0" encoding="ISO-8859-1"?>
    as the first line of your XML document? Do you really define all the entities you use? Either way, please give more info.
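
    A quick way to check what the file itself claims, as a rough sketch: the declared_encoding helper is hypothetical, and it assumes an ASCII-compatible encoding with no byte order mark in front of the declaration.

```perl
use strict;
use warnings;

# Return the encoding named in the XML declaration, or nothing if the
# declaration is missing or does not mention an encoding.
# $head is the first chunk of the document.
sub declared_encoding {
    my ($head) = @_;
    return $1
        if $head =~ /\A<\?xml\s[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/;
    return;
}

# Usage: perl declared.pl file.xml
if (@ARGV) {
    open(my $fh, '<:raw', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my $head = '';
    read($fh, $head, 200);    # the declaration has to be at the very start
    close $fh;

    my $enc = declared_encoding($head);
    print defined $enc
        ? "declared encoding: $enc\n"
        : "no encoding declared, so the parser will assume UTF-8\n";
}
```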

    Regards Stefan K

      I'm sorry... I didn't give complete information in my previous post.

      The error message is:
      not well-formed at line 18208, column 71, byte 511770:
      I can't do anything about the encoding of the file because it's rather large and it comes from a third party.

      I can't paste the characters here, maybe I can describe them.. they look like French characters, with diacritics... there are also some ASCII-like characters, like arrows, and so on.. not normal text...

      I rather suspect that you're right, and it's an encoding issue.. is there any way to sanitize the text before I put it into the XML::Parser module? Thanks..
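
      The byte offset in that error message (511770) is enough to look at the offending bytes directly, which can be easier than describing them. A small sketch; the bytes_around name, the script name and the default radius of 16 are all made up.

```perl
use strict;
use warnings;

# Return a hex dump of the bytes surrounding $offset in $file.
# $radius is how many bytes to show on each side of the offset.
sub bytes_around {
    my ($file, $offset, $radius) = @_;
    $radius = 16 unless defined $radius;
    my $start = $offset > $radius ? $offset - $radius : 0;

    open(my $fh, '<:raw', $file) or die "Cannot open $file: $!";
    seek($fh, $start, 0) or die "Cannot seek in $file: $!";
    my $chunk = '';
    read($fh, $chunk, ($offset - $start) + $radius + 1);
    close $fh;

    return join ' ', map { sprintf '%02X', ord } split //, $chunk;
}

# Usage: perl peek.pl file.xml 511770
if (@ARGV >= 2) {
    print bytes_around($ARGV[0], $ARGV[1]), "\n";
}
```

      Values from 80 to FF in the dump point to an 8-bit encoding such as ISO-8859-1; values below 20 other than 09, 0A and 0D are control characters, which XML 1.0 does not allow at all.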

        OK, sorry, we must have cross-posted, because this wasn't listed when I initially replied.

        It does indeed appear that you have a possibly malformed XML file.

        I should point out that those are probably not ASCII characters. The encoding is whatever the document declares in its initial string, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>; AFAIK, without a declaration it's taken to be UTF-8...

        I would most certainly check with the source of your data, since it's possible the file is corrupt... Also, if this is common, they should be having problems with whoever else they're sending these files to.

        With regards to pre-filtering, you want to be VERY careful with this. Isolate ONLY those characters that are causing the parser to barf, then 1) try escaping them, 2) try commenting them out, and only if that doesn't work 3) try replacing them with whitespace.

        But since this is seemingly a question of malformedness, none of those approaches is sure to work...
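
        A sketch of the "escaping" option only, under a loud assumption: the file really is ISO-8859-1, so every high byte equals its Unicode code point and can be written as a numeric character reference. The escape_high_bytes name is made up, and control characters are not handled here.

```perl
use strict;
use warnings;

# Rewrite each byte in the range 0x80-0xFF as a numeric character
# reference, leaving plain ASCII (markup included) untouched.
# Assumes $line holds raw ISO-8859-1 bytes.
sub escape_high_bytes {
    my ($line) = @_;
    $line =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/ge;
    return $line;
}

# Usage as a filter: perl escape.pl < input.xml > output.xml
if (!-t STDIN) {
    binmode STDIN;
    binmode STDOUT;
    while (my $line = <STDIN>) {
        print escape_high_bytes($line);
    }
}
```

        If the data turns out to be in some other 8-bit encoding (Windows-1252, say), the references for bytes 0x80 to 0x9F will name the wrong characters, so it is worth checking a few of them by eye.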



        Wait! This isn't a Parachute, this is a Backpack!
Re: XML file won't parse properly
by tinman (Curate) on Apr 12, 2001 at 21:57 UTC

    Coincidentally, I was talking about this sometime back on the chatterbox..

    I got this snippet from tye, I think, who in turn said that jcwren had told him :o)

    here goes.. in Perl, you can use regular expressions to match and strip characters..

    $line =~ tr/\x80-\xff//d;
    $line =~ tr/\x00-\x1f//d;
    $line =~ s/[\r\n\t]//g;

    The first line strips the high-bit characters and the second strips the control characters. The third line removes tabs, carriage returns and newlines, which the second line has already caught, so it is only a safety net.

    for more information on what this is, and how it's done, I'd suggest you look at this page..

    HTH

    Update: fixed the regexps.. many thanks to chromatic
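
    To see what that stripping does to real text, here is the snippet wrapped in a function (with the last substitution written with /g), next to a gentler variant that only removes the control characters XML 1.0 forbids. Both function names and the sample string are made up.

```perl
use strict;
use warnings;

# The three steps from the post above.
sub strip_to_printable_ascii {
    my ($line) = @_;
    $line =~ tr/\x80-\xff//d;    # drop every byte with the high bit set
    $line =~ tr/\x00-\x1f//d;    # drop control characters, tab/CR/LF included
    $line =~ s/[\r\n\t]//g;      # nothing left to match after the line above
    return $line;
}

# Gentler: drop only the control characters that XML 1.0 does not allow,
# keeping tab, LF, CR and all high-bit bytes.
sub strip_illegal_controls {
    my ($line) = @_;
    $line =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $line;
}

# Sample: an accented e (\xE9), a tab, a stray control character (\x1A).
my $sample = "caf\xE9\tau lait\x1A\n";

print strip_to_printable_ascii($sample), "\n";    # prints "cafau lait"
```

    Note that the first version also throws away the accented letters and every line break, so the output is one long line; the second keeps both and leaves the encoding question to the parser.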

Re: XML file won't parse properly
by gregor42 (Parson) on Apr 12, 2001 at 21:22 UTC
    I have a file that I need to parse using XML::Parser. Only when I get halfway through the file, the parsing stops with an error...I find that there are weird characters embedded in the text.

    First you may want to include a link to your file, so we know what you're talking about.

    Second you may want to define "weird characters", so we know what you're talking about.

    Third you may want to include your code so, (you guessed it) we know what you're talking about....

    And finally, you may want to read: Mirod's review of XML::Parser, I personally found this very helpful.



    Wait! This isn't a Parachute, this is a Backpack!

      gregor42, thanks.. I guess it seemed obvious to me.. but probably not to anyone else..

      First of all, the code is just a modification of the XML::Parser example code.

      #!/usr/bin/perl -w
      use strict;
      use XML::Parser;

      # initialize hash that will hold header info
      my $parser = XML::Parser->new(
          ErrorContext => 4,
          Handlers     => {
              Start => \&handle_start,
              End   => \&handle_end,
              Char  => \&handle_char,
          },
      );
      my $counter = 0;
      my @tagdesc;
      my %tags;

      # parse the file whose name we specified as a command-line parameter
      $parser->parsefile(shift);

      # write out each distinct tag path that was collected
      open(OUTPUT, ">tag.desc") or die "Cannot open tag.desc: $!";
      foreach my $keyval (keys %tags) {
          print OUTPUT $keyval, "\n";
      }
      close OUTPUT;

      sub handle_start {
          my $p       = shift;
          my $el      = shift;
          my %attribs = @_;
          if ($el eq 'product_data') { $counter++; }
          if ($counter) { push(@tagdesc, $el); }
      }

      sub handle_char {
          my ($p, $data) = @_;
          # print $data, "\n" if $counter;
      }

      sub handle_end {
          my $p  = shift;
          my $el = shift;
          if ($el eq 'product_data') {
              $counter--;
              # count this tag path, then reset for the next product_data
              my $str = join(':', @tagdesc);
              $tags{$str}++;
              @tagdesc = ();
          }
      }

      Unfortunately, the link isn't publicly available... and I hope I did an OK job of defining what the weird characters are in the previous post...

      finally, thanks for the XML::Parser review.. I did learn from it.. but unfortunately, not enough to solve this problem :o(.. as another question, where can I find XML::UM or anything to do with Unicode and Perl?
      thanks for taking the time out to help a complete newbie.. I do appreciate it..

Re: XML file won't parse properly
by Fastolfe (Vicar) on Apr 13, 2001 at 06:39 UTC
    Have you considered contacting your data source regarding this? XML parsers are built to be extremely strict and anal about their input to avoid turning XML into what HTML is today: ubiquitous malformed documents littering the web with user agents all interpreting bad mark-up in their own ways. If an XML parser is telling you the data is invalid/malformed, perhaps try it in a few parsers. If you're consistently getting errors, I would try to get the authors of the content to fix the problem.
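
    A small well-formedness check makes it easier to send the data source a concrete error report. This is only a sketch, assuming XML::Parser is installed; the wellformed_error name and the two sample documents are made up.

```perl
use strict;
use warnings;
use XML::Parser;

# Return an empty string if $xml is well-formed, otherwise the parser's
# error text (which includes line, column and byte offset).
sub wellformed_error {
    my ($xml) = @_;
    my $parser = XML::Parser->new(ErrorContext => 2);
    my $ok     = eval { $parser->parse($xml); 1 };
    return $ok ? '' : $@;
}

# One good document and one with a missing close tag.
for my $doc ('<a><b/></a>', '<a><b></a>') {
    my $err = wellformed_error($doc);
    print $err eq '' ? "ok\n" : "error: $err";
}
```

    For a second opinion from a different parser, xmllint --noout file.xml (from libxml2) performs the same kind of check from the command line.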