Perl300 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am trying to convert a text file into XML using the following code:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Writer;
use XML::Simple;
use XML::LibXML;

my $out;
my $xml = XML::Writer->new(OUTPUT => \$out, DATA_MODE => 1, DATA_INDENT => ' ');
$xml->xmlDecl();
$xml->startTag('doc');
my $check_1 = 0;

open(my $fh, "<", "20150625163139.xml")
    or die "Failed to open file: $!\n";
while(<$fh>) {
    chomp;
    next if !length;
    my ($string1, $string2, $subscript_name, $subscript_value) = /
        ^(.*?)::
        ([^\s]+)
        \.([^\s]+)\s+=
        \s(.*)
    /x;
    if ( $check_1 == 0 ) {
        $xml->startTag($string1);
        $check_1 += 1;
    }
    $xml->startTag($string2);
    $xml->dataElement($subscript_name => $subscript_value);
    $xml->endTag();
}
$xml->endTag();
$xml->endTag();
$xml->end();
print $out;
close $fh;
The file 20150625163139.xml contains 366 lines with format:
GI-eSTB-MIB-NPH::eSTBGeneralErrorCode.0 = INTEGER: 0
GI-eSTB-MIB-NPH::eSTBGeneralConnectedState.0 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBGeneralPlatformID.0 = INTEGER: 2076
GI-eSTB-MIB-NPH::eSTBGeneralFamilyID.0 = INTEGER: 25
GI-eSTB-MIB-NPH::eSTBGeneralModelID.0 = INTEGER: 60436
GI-eSTB-MIB-NPH::eSTBGeneralUnitAddressID.0 = STRING: 000-00802-49393-076
GI-eSTB-MIB-NPH::eSTBGeneralSettopMac.0 = STRING: b8:16:19:28:18:f3
GI-eSTB-MIB-NPH::eSTBGeneralRemodChan.0 = INTEGER: 3
GI-eSTB-MIB-NPH::eSTBGeneralSettopTime.0 = INTEGER: 1119302620 GPS
GI-eSTB-MIB-NPH::eSTBPurchaseStatusUnsentPurchases.0 = INTEGER: 0
GI-eSTB-MIB-NPH::eSTBPurchaseStatusUnackPurchases.0 = INTEGER: 0
GI-eSTB-MIB-NPH::eSTBPurchaseStatusLastSeqNumPurchases.0 = INTEGER: 0
GI-eSTB-MIB-NPH::eSTBPurchaseStatusLastReportBackTimePurchases.0 = INTEGER: 1118516578
GI-eSTB-MIB-NPH::eSTBPurchaseStatusIppvStatus.0 = INTEGER: false(2)
GI-eSTB-MIB-NPH::eSTBOobFrequency.0 = INTEGER: 75250000
GI-eSTB-MIB-NPH::eSTBOobCarrierLock.0 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBOobLostLockCount.0 = Counter32: 0
GI-eSTB-MIB-NPH::eSTBOobDataPresent.0 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBOobEMMDataPresent.0 = INTEGER: false(2)
GI-eSTB-MIB-NPH::eSTBOobSNRValue.0 = INTEGER: 24.9
GI-eSTB-MIB-NPH::eSTBOobSNRState.0 = INTEGER: good(4)
GI-eSTB-MIB-NPH::eSTBOobAGCValue.0 = INTEGER: 16
GI-eSTB-MIB-NPH::eSTBOobAGCState.0 = INTEGER: good(4)
GI-eSTB-MIB-NPH::eSTBOobNetworkPid.0 = INTEGER: 1911
GI-eSTB-MIB-NPH::eSTBOobEMMPid.0 = INTEGER: 5379
GI-eSTB-MIB-NPH::eSTBOobEMMProviderID.0 = INTEGER: 1
GI-eSTB-MIB-NPH::eSTBInBandNumberOfTuners.0 = INTEGER: 2
GI-eSTB-MIB-NPH::eSTBTunerIndex.1 = INTEGER: 1
GI-eSTB-MIB-NPH::eSTBTunerIndex.2 = INTEGER: 2
GI-eSTB-MIB-NPH::eSTBInBandTunerModulationMode.1 = INTEGER: qam256(3)
GI-eSTB-MIB-NPH::eSTBInBandTunerModulationMode.2 = INTEGER: qam256(3)
GI-eSTB-MIB-NPH::eSTBInBandTunerCarrierLock.1 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBInBandTunerCarrierLock.2 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBInBandTunerPCRLock.1 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBInBandTunerPCRLock.2 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBInBandTunerDataLock.1 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBInBandTunerDataLock.2 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBInBandTunerEMMDataPresent.1 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBInBandTunerEMMDataPresent.2 = INTEGER: true(1)
GI-eSTB-MIB-NPH::eSTBInBandTunerFrequency.1 = INTEGER: 195000000
GI-eSTB-MIB-NPH::eSTBInBandTunerFrequency.2 = INTEGER: 501000000
GI-eSTB-MIB-NPH::eSTBInBandTunerAGCValue.1 = INTEGER: 0
GI-eSTB-MIB-NPH::eSTBInBandTunerAGCValue.2 = INTEGER: 0
GI-eSTB-MIB-NPH::eSTBInBandTunerAGCState.1 = INTEGER: poor(2)
GI-eSTB-MIB-NPH::eSTBInBandTunerAGCState.2 = INTEGER: poor(2)
GI-eSTB-MIB-NPH::eSTBInBandTunerSNRValue.1 = INTEGER: 42.0
When I run the above code for this file, I get error:
Code point \u0016 is not a valid character in XML at ./<script_name>.pl line 34

Where line 34 is

$xml->dataElement($subscript_name => $subscript_value);

When I remove all the lines from the file 20150625163139.xml and keep only these two lines

GI-eSTB-MIB-NPH::eSTBGeneralErrorCode.0 = INTEGER: 0
GI-eSTB-MIB-NPH::eSTBGeneralConnectedState.0 = INTEGER: true(1)

The same code runs fine and generates following xml

<?xml version="1.0"?>
<doc>
 <GI-eSTB-MIB-NPH>
  <eSTBGeneralErrorCode>
   <0>INTEGER: 0</0>
  </eSTBGeneralErrorCode>
  <eSTBGeneralConnectedState>
   <0>INTEGER: true(1)</0>
  </eSTBGeneralConnectedState>
 </GI-eSTB-MIB-NPH>
</doc>
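To make the mapping from input line to element names easier to follow, here is a minimal sketch that applies the same pattern to a single line. The helper name `parse_line` is hypothetical and not part of the posted script; the regex is copied from it, with comments added.

```perl
use strict;
use warnings;

# Split one snmpwalk-style line into (MIB, object, index, value) using the
# same pattern as the posted script. Returns an empty list when the line
# does not match, so a caller can skip it instead of passing undef onward.
sub parse_line {
    my ($line) = @_;
    my @fields = $line =~ /
        ^(.*?)::        # MIB name, e.g. GI-eSTB-MIB-NPH
        ([^\s]+)        # object name, e.g. eSTBGeneralErrorCode
        \.([^\s]+)\s+=  # index after the last dot, e.g. 0
        \s(.*)          # the rest of the line, e.g. INTEGER: 0
    /x;
    return @fields;
}

my @f = parse_line('GI-eSTB-MIB-NPH::eSTBGeneralErrorCode.0 = INTEGER: 0');
print join(' | ', @f), "\n";
# prints: GI-eSTB-MIB-NPH | eSTBGeneralErrorCode | 0 | INTEGER: 0
```

The fourth field is passed to `dataElement` unchanged, so any control character in the value part of a line reaches XML::Writer as-is.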
I searched for the error "Code point \u0016 is not a valid character in XML at ./Call_to_snmpwalk.pl line 32". It seems to be caused by control characters in the text, which are not allowed in XML. So I have two options:

1) Remove these control characters from the file and then print: I have tried this using

perl -pe's/\x08//g' <20150625163139.xml >20150625163139.xml

But this gives error: Bad name after g' at <script_name>.pl line 13.

2) Generate an actual .xml file from the code, write the converted text into that file, and then read it.

Does anyone have any suggestions on point 1 or 2?

Replies are listed 'Best First'.
Re: How to remove error: Code point \u0016 is not a valid character in XML
by AnomalousMonk (Archbishop) on Jun 26, 2015 at 01:13 UTC
    ... I have tried this using

    perl -pe's/\x08//g' <20150625163139.xml >20150625163139.xml

    But this gives error: Bad name after g' at <script_name>.pl line 13.

    How are you invoking the above command? A Perl command invoked from the OS command line via the  -e switch should have a "script name" of "-e" and exactly one line. (Well, one line in Windows, anyway — what is your OS?) Also, you are still redirecting I/O to and from the same file at the same time, and this is still a bad idea (see hippo's previous reply).

    An error message like "Code point \u0016 is not a valid character ..." suggests a source file character encoding problem: what is the encoding of 20150625163139.xml? Proper specification of MODE in the
        open FILEHANDLE,MODE,EXPR;
    command (see open) should ensure proper reading of any Unicode encoding from its file, but I have no idea if XML::Writer handles Unicode strings properly.
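As a rough illustration of specifying MODE, the sketch below reads a file through an explicit decoding layer. The helper name `read_lines` is hypothetical, and which layer is correct depends on how the file was actually written, which the thread does not establish.

```perl
use strict;
use warnings;

# Read a file through an explicit I/O layer and return its lines.
# The layer is an assumption about the data: ':encoding(UTF-8)' for UTF-8
# text, ':encoding(ISO-8859-1)' for Latin-1, ':raw' for undecoded bytes.
sub read_lines {
    my ($path, $layer) = @_;
    open(my $fh, "<$layer", $path) or die "Failed to open $path: $!\n";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;
    return @lines;
}
```

A call might look like `my @lines = read_lines('input.txt', ':encoding(UTF-8)');`. Note that a decoding layer only changes how bytes become characters: byte 0x16 decodes to U+0016 in both UTF-8 and Latin-1, so the layer alone would not make that character acceptable to an XML writer.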


    Give a man a fish:  <%-(-(-(-<

Re: How to remove error: Code point \u0016 is not a valid character in XML
by stevieb (Canon) on Jun 25, 2015 at 22:40 UTC

    Either your sample data doesn't contain one of the 'bad' lines, or something else is wrong. I just installed the most recent XML modules on perl 5.18.2 and I don't get any errors or issues.

    Make sure you've posted at least one of the lines with the problem, and if you have, try upgrading your modules and see if that helps.

    -stevieb

Re: How to remove error: Code point \u0016 is not a valid character in XML
by graff (Chancellor) on Jun 27, 2015 at 03:21 UTC
    For one thing, I trust you've learned by now (from experience) that whenever you run any shell command like this:
    some_process < some.file > some.file
    The FIRST thing the shell does is truncate "some.file" (i.e. set its content to zero bytes); THEN it opens some.file as input to be read via the stdin of some_process. The result is: no data read by the process, because there's no longer any data in the file. Hope you have a backup copy...
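One way to avoid the truncation problem is to write to a temporary file and rename it over the original only after the output is complete. The sketch below does that from Perl; `rewrite_file` is a hypothetical helper name, and the filter passed to it is whatever substitution you need.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Rewrite $path by passing each line through $filter (a code ref).
# Output goes to a temporary file in the same directory, which replaces
# the original only after every line has been written successfully.
sub rewrite_file {
    my ($path, $filter) = @_;
    open(my $in, '<:raw', $path) or die "Cannot read $path: $!\n";
    my ($out, $tmp) = tempfile(DIR => dirname($path), UNLINK => 0);
    binmode $out;
    while (my $line = <$in>) {
        print {$out} $filter->($line);
    }
    close $in;
    close $out or die "Cannot write $tmp: $!\n";
    rename $tmp, $path or die "Cannot rename $tmp to $path: $!\n";
}
```

For example, `rewrite_file('some.file', sub { my $l = shift; $l =~ s/\x16//g; $l });` deletes every 0x16 byte. On the command line, `perl -i.bak -pe 's/\x16//g' some.file` has a similar effect and keeps a backup copy as some.file.bak.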

    For another, when you see something like \u0016, that's a hexadecimal value. You can "grep" for it using a Perl one-liner like the following:

    perl -CS -ne 'print if /\x{0016}/' < some.file
    Or, if the file is NOT UTF-16, you could just do:
    perl -ne 'print if /\x16/' some.file
    Of course, \u0016 isn't the only character that an XML library would reject, and if your file has a bunch of different ones, it gets tiresome fixing them one codepoint at a time as they get reported in error messages.
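To see every offending control character in one pass rather than one error message at a time, something like the following sketch could be used. `find_control_chars` is a hypothetical helper; it only looks at the C0 control range, not at every character XML rejects.

```perl
use strict;
use warnings;

# Return a hash ref mapping each C0 control character found in $text
# (other than tab, LF and CR, which XML 1.0 allows) to the list of
# line numbers on which it occurs.
sub find_control_chars {
    my ($text) = @_;
    my %seen;
    my $line_no = 0;
    for my $line (split /\n/, $text) {
        $line_no++;
        while ($line =~ /([\x00-\x08\x0B\x0C\x0E-\x1F])/g) {
            push @{ $seen{ sprintf('U+%04X', ord $1) } }, $line_no;
        }
    }
    return \%seen;
}

my $report = find_control_chars("ok\nbad\x16\x19here\nfine\n");
print "$_: lines @{ $report->{$_} }\n" for sort keys %$report;
# prints:
# U+0016: lines 2
# U+0019: lines 2
```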

    You can look up how valid vs. invalid XML characters are enumerated (many people use regexes - e.g. on stackoverflow), and run a diagnosis on your file(s) before feeding them to your script (or just add code to your script to filter out the bad characters, if you're sure that just deleting them is The Right Thing To Do).
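If deleting the bad characters really is acceptable, a filter along these lines could be added to the script. This is a sketch, not XML::Writer's own API: `strip_invalid_xml_chars` is a hypothetical name, and the character class is the complement of the `Char` production in the XML 1.0 specification.

```perl
use strict;
use warnings;

# Remove every character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
sub strip_invalid_xml_chars {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $text;
}
```

In the script from the question, the call would go where the value is written, for example `$xml->dataElement($subscript_name => strip_invalid_xml_chars($subscript_value));`. Keep in mind that this silently discards data.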

      Thank you for your response graff. Finally I was able to remove the character that was causing this error "Code point \u0016 is not a valid character in XML".

      I just added some code before

      my $out = IO::File->new()

      The code I added is as below:

      my $filename = "20150625163139.txt";
      my $temp_file = "Temp.txt";
      # strings(1) keeps only runs of printable characters
      `strings $filename > $temp_file`;
      open($fh, ">", $filename) or die "Could not open file '$filename' $!";
      open(my $fh1, "<", $temp_file) or die "Could not open file 'Temp.txt' $!";
      # copy the cleaned text back over the original file
      while(<$fh1>){
          print $fh $_;
      }
      close $fh;
      close $fh1;
      unlink $temp_file;

      So "strings $filename > Temp.txt" removes the character causing the error. As on many occasions, UNIX came to my help this time as well :-)

      Thank you all for your inputs! If there is any way to mark a node as closed, please let me know and I'll mark this one as closed.

        The usual way is to update the original question's title with (Solved).
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: How to remove error: Code point \u0016 is not a valid character in XML
by afoken (Chancellor) on Jun 26, 2015 at 09:46 UTC
      Thanks for all your inputs.

      @stevieb: I think the problem is at line 50 of the input file. I am guessing this from the error I see after adding :encoding(UTF-8) to the open call at line 17, which now looks like open(my $fh, "<:encoding(UTF-8)", "20150626102938.txt")

      (Thanks to AnomalousMonk for the suggestion)

      I am on perl v5.10.1 at present and will see if I can get it upgraded, but it'll be a while before I can get that done.

      @AnomalousMonk: I was trying to do this

      perl -pe's/\x08//g' <20150625163139.xml >20150625163139.xml

      from the code. And I tried using two different file names as well. I just kept getting same error so just left it there. But if I can get it working, then I'll ensure to use two different files there.

      I am sorry for causing confusion about 20150625163139.xml. It is just an existing text file (generated at run time) which I have to use to generate XML. I can change its extension to .txt though.

      I tried adding

      :encoding(UTF-8)

      in filehandle at line 17 which now looks like

      open(my $fh, "<:encoding(UTF-8)", "20150626102938.txt")

      Before adding the output of script was:

      Code point \u0016 is not a valid character in XML at ./Call_to_snmpwalk_V_1.pl line 34

      After adding the output of script was:

      utf8 "\xB8" does not map to Unicode at ./Call_to_snmpwalk_V_1.pl line 35, <$fh> line 50.
      utf8 "\xF3" does not map to Unicode at ./Call_to_snmpwalk_V_1.pl line 35, <$fh> line 50.
      Code point \u0016 is not a valid character in XML at ./Call_to_snmpwalk_V_1.pl line 34

      @sundialsvc4: I am sorry for the confusion caused by 20150625163139.xml. It's just a text file with no <?xml version="1.0" encoding="UTF-16" standalone="no"?> at the top. I tried adding that declaration manually at the top of the file just to give it a try, but it still gives the same error with and without :encoding(UTF-8) added to the open call.

      @afoken: Thanks for adding the context link.

        Hi all,

        I got the line in the input file which is causing this error:

        Code point \u0016 is not a valid character in XML at ./Call_to_snmpwalk_V_1.pl line 35

        The line is:

        GI-eSTB-MIB-NPH::eSTBOobNetworkAddress.0 = STRING: ¸^V^Y(^Xó

        I made a few changes to write the generated XML to a file instead of the console, and I can see that file generation stops at run time at the above line. The generated XML looks like:

        <doc>
            <GI-eSTB-MIB-NPH>
                <eSTBOobNetworkAddress>
                    <0>

        Updated code now is:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::Writer;
        use XML::Simple;
        use XML::LibXML;
        use IO::File;

        my $out = IO::File->new(">output.xml");
        my $xml = XML::Writer->new(OUTPUT => $out, DATA_MODE => 1, DATA_INDENT => 4);
        $xml->xmlDecl();
        $xml->startTag('doc');
        my $check_1 = 0;

        open(my $fh, "<", "20150626161859.txt")
            or die "Failed to open file: $!\n";
        while(<$fh>) {
            chomp;
            next if !length;
            my ($string1, $string2, $subscript_name, $subscript_value) = /
                ^(.*?)::
                ([^\s]+)
                \.([^\s]+)\s+=
                \s(.*)
            /x;
            if ( $check_1 == 0 ) {
                $xml->startTag($string1);
                $check_1 += 1;
            }
            $xml->startTag($string2);
            $xml->dataElement($subscript_name => $subscript_value);
            $xml->endTag();
        }
        $xml->endTag();
        $xml->endTag();
        $xml->end();
        close $fh;
        $out->close();
        At least I know what exact line/characters are causing trouble now. I am trying to find out how to avoid this and will update here if I finally get a solution for it.
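The six characters in the offending value, read as bytes, are b8 16 19 28 18 f3, which matches the eSTBGeneralSettopMac value b8:16:19:28:18:f3 in the sample data, so the value appears to be binary data that snmpwalk printed as a raw STRING. An alternative to deleting the control bytes is to render such values as hex before writing them. The sketch below assumes the file is read as bytes (no decoding layer); `hexify_binary_value` is a hypothetical helper.

```perl
use strict;
use warnings;

# If the payload after a "TYPE: " prefix contains anything other than
# printable ASCII, render the payload as colon-separated hex bytes.
# Values that are already printable are returned unchanged.
sub hexify_binary_value {
    my ($value) = @_;
    my ($type, $payload) = $value =~ /^(\w+:\s)(.*)$/s
        or return $value;
    return $value unless $payload =~ /[^\x20-\x7E]/;
    my $hex = join ':', map { sprintf '%02x', ord($_) } split //, $payload;
    return $type . $hex;
}

print hexify_binary_value("STRING: \xB8\x16\x19(\x18\xF3"), "\n";
# prints: STRING: b8:16:19:28:18:f3
```

Whether hex output or deletion is the right choice depends on what the consumer of the XML expects.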
Re: How to remove error: Code point \u0016 is not a valid character in XML
by locked_user sundialsvc4 (Abbot) on Jun 26, 2015 at 02:31 UTC

    That file doesn’t look terribly like XML to me.   Nevertheless, what about the <?XML ... line at the very top?   Does it include an encoding= clause?

    e.g.:   <?xml version="1.0" encoding="UTF-16" standalone="no"?>

    It certainly appears that the file does contain UTF ... but does the file declare, in its header, that it does?   If it does not, then that erroneous omission certainly could fool many XML parsers.