Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Building XML with tab delimited

by sannag (Sexton)
on Aug 10, 2016 at 22:54 UTC ( [id://1169537]=perlquestion: print w/replies, xml ) Need Help??

sannag has asked for the wisdom of the Perl Monks concerning the following question:

I am relative new to Perl. Please check to see what I am doing wrong.

I have following input tab delimited file, which need to be built to XML, 'pin'number is the unique identifier. If the 'Pin' matches then populate show multiple 'CHARGE' under the report.

Reason1 Reason2 Reason3 Pin Name Zip Date Time data1 data2 data3 Pin 1 data5 data6 data7 data8 data1 data2 data3 Pin 1 data5 data6 data9 data10 data1 data2 data3 Pin 1 data5 data6 data11 data12

I want it to build following XML

<XML_FILE> <REPORT TYPE="AB"> <REASON1>data1</REASON> <REASON2>data2</REASON2> <REASON3>data3</REASON3> <PERSON> <PIN>Pin 1</PIN> <NAME>data5</NAME> <ZIP>data6<ZIP> </PERSON> <CHARGE> <DATE>data7</DATE> <TIME>data8</TIME> </CHARGE> <CHARGE> <DATE>data9</DATE> <TIME>data10<TIME> </CHARGE> <CHARGE> <DATE>data11</DATE> <TIME>data12</TIME> </CHARGE> </REPORT> </XML_FILE>

Below is the code I have

use strict; use XML::LibXML; my $READFILENAME = "SomeDir\\data.txt"; my $WRITEFILENAME = "SomeDir\\test.xml"; my $doc = XML::LibXML::Document->new('1.0'); my $root = $doc->createElement("XML_FILE"); open (FILEWRITE, ">$WRITEFILENAME"); open (READFILE, $READFILENAME); my $copy_person_pin = "XX"; foreach (<READFILE>) { my $line = $_; chomp $line; my @data = split(/\t/,$line); my $reason1 = $data[0]; my $reason2 = $data[1]; my $reason3 = $data[2]; my $person_pin = $data[3]; my $name = $data[4]; my $zip = $data[5]; my $date = $data[6]; my $time = $data[7]; my $report = $doc->createElement("REPORT"); if ($person_pin ne $copy_person_pin) { # Build the Report tags # I had to put $report out of if loop so that $report is avalibl +e in else statement # my $report = $doc->createElement("REPORT"); $report->setAttribute('TYPE'=>'AB'); my @sortedReportTag = qw ( REASON1 REASON2 REASON3 ); my %reportHashTags; @reportHashTags { @sortedReportTag } = ($reason1, $reason2, $reason3 ); + buildXMLElements(\@sortedReportTag, \%reportHashTags, $report); $root-> appendChild($report); # Build the element for Person Tag my $person = $doc->createElement("PERSON"); my @sortedPersonTag = qw ( PIN NAME ZIP); my %personHashTags; @personHashTags { @sortedPersonTag } = ($person_pin, $name, $zip ); # Build the elements for Person Tag + buildXMLElements(\@sortedPersonTag, \%personHashTags, $person); $report-> appendChild($person); # Build the elements for Charge Tag my $charge = $doc->createElement("CHARGE"); my @sortedChargeTag = qw ( DATE TIME ); my %chargeHashTags; @chargeHashTags { @sortedChargeTag } = ($date, $time ); # Build the elements for Charge Tag + buildXMLElements(\@sortedChargeTag, \%chargeHashTags, $charge); $report-> appendChild($charge); $copy_person_pin = $person_pin; } else { my $charge = $doc->createElement("CHARGE"); my @sortedChargeTag = qw ( DATE TIME ); my %chargeHashTags; @chargeHashTags { @sortedChargeTag } = ($date, $time ); # Build the elements for Charge Tag + buildXMLElements(\@sortedChargeTag, \%chargeHashTags, $charge); $report-> appendChild($charge); } } $doc->setDocumentElement($root); # Write the XML to a file print FILEWRITE ($doc->toString()); close FILEWRITE; sub buildXMLElements() { my($elementTags, $hashTags, $parentElement) = @_; for my $name (@$elementTags) { my $reportTag = $doc->createElement($name); my $reportValue = $hashTags->{$name}; $reportTag->appendTextNode($reportValue); $parentElement->appendChild($reportTag); } }
The output I get is, basically only the first 'CHARGE' is showing up not the other two.
<XML_FILE> <REPORT TYPE="AB"> <REASON1>data1</REASON> <REASON2>data2</REASON2> <REASON3>data3</REASON3> <PERSON> <PIN>Pin 1</PIN> <NAME>data5</NAME> <ZIP>data6<ZIP> </PERSON> <CHARGE> <DATE>data7</DATE> <TIME>data8</TIME> </CHARGE> </REPORT> </XML_FILE>

Replies are listed 'Best First'.
Re: Building XML with tab delimited
by kcott (Archbishop) on Aug 11, 2016 at 08:43 UTC

    G'day sannag,

    Welcome to the Monastery.

    Your input is basically CSV, with tabs instead of commas, so you can use Text::CSV to parse it. If you have Text::CSV_XS installed, Text::CSV will use that and it should run faster.

    For creating the XML, I've used various XML::LibXML methods from XML::LibXML::Document, XML::LibXML::Node and XML::LibXML::Element.

    Here's the script:

    #!/usr/bin/env perl use strict; use warnings; use autodie qw{:all}; use Text::CSV; use XML::LibXML; my $tsv_file_in = 'pm_1169537_input.tsv'; my $xml_file_out = 'pm_1169537_output.xml'; my $dom = generate_xml($tsv_file_in); output_xml($dom, $xml_file_out); sub generate_xml { my ($tsv_file_in) = @_; open my $in_fh, '<', $tsv_file_in; my $csv = Text::CSV::->new({sep_char => "\t"}) or die 'Text::CSV::->new() FAILED: ', Text::CSV::->error_diag( +); my $dom = XML::LibXML::Document::->new(qw{1.0 UTF-8}); my $root = $dom->createElement('XML_FILE'); $dom->setDocumentElement($root); my @headers = map { uc } @{$csv->getline($in_fh)}; my %header_index_for = map { $headers[$_] => $_ } 0 .. $#headers; my ($last_pin, $report) = ('', undef); while (my $row = $csv->getline($in_fh)) { next if @$row == 1 && ! length $row->[0]; if ($row->[$header_index_for{PIN}] ne $last_pin) { $last_pin = $row->[$header_index_for{PIN}]; $report = XML::LibXML::Element->new('REPORT'); $report->setAttribute(TYPE => 'AB'); $root->addChild($report); for (qw{REASON1 REASON2 REASON3}) { my $reason = XML::LibXML::Element->new($_); $reason->appendText($row->[$header_index_for{$_}]); $report->addChild($reason); } my $person = XML::LibXML::Element->new('PERSON'); $report->addChild($person); for (qw{PIN NAME ZIP}) { my $person_datum = XML::LibXML::Element->new($_); $person_datum->appendText($row->[$header_index_for{$_} +]); $person->addChild($person_datum); } } my $charge = XML::LibXML::Element->new('CHARGE'); $report->addChild($charge); for (qw{DATE TIME}) { my $charge_datum = XML::LibXML::Element->new($_); $charge_datum->appendText($row->[$header_index_for{$_}]); $charge->addChild($charge_datum); } } return $dom; } sub output_xml { my ($dom, $xml_file_out) = @_; open my $out_fh, '>', $xml_file_out; print $out_fh $dom->toString(1); return; }

    The input, pm_1169537_input.tsv, is exactly what you posted.

    The output, pm_1169537_output.xml, looks exactly as you wanted (except for the erroneous end tag, </REASON>, which now appears correctly as </REASON1>):

    <?xml version="1.0" encoding="UTF-8"?> <XML_FILE> <REPORT TYPE="AB"> <REASON1>data1</REASON1> <REASON2>data2</REASON2> <REASON3>data3</REASON3> <PERSON> <PIN>Pin 1</PIN> <NAME>data5</NAME> <ZIP>data6</ZIP> </PERSON> <CHARGE> <DATE>data7</DATE> <TIME>data8</TIME> </CHARGE> <CHARGE> <DATE>data9</DATE> <TIME>data10</TIME> </CHARGE> <CHARGE> <DATE>data11</DATE> <TIME>data12</TIME> </CHARGE> </REPORT> </XML_FILE>

    Actually, you didn't show the XML declaration: modify to suit if want something different.

    Update: Oops! Spotted a bug in my code. Two instances of $headers[$header_index_for{PIN}] now corrected to $row->[$header_index_for{PIN}].

    — Ken

      Wow Ken! Thank you so much. This solution works. I have been programming in Perl in no more than 10 days....so there are lot of unknowns for me. This has been a great resource. Thanks again!
        "Wow Ken! Thank you so much. This solution works."

        You're welcome.

        "I have been programming in Perl in no more than 10 days....so there are lot of unknowns for me. This has been a great resource."

        While I did see your "I am relative new to Perl.", I wasn't aware just how new. Not a bad effort at all for a total beginner. Here's some additional resources (in no particular order).

        • If you haven't already done so, I recommend you bookmark "http://perldoc.perl.org/perl.html". This has links to tutorials, FAQs and documentation. [Hint: If you just want to look up a function, the "Perl functions A-Z" index (that's the "Functions" link in the sidebar) is much faster, and easier to navigate, than the perlfunc page, which is very large and can take substantially longer to load.]
        • Prefer lexical filehandles and the 3-argument form of open. The filehandles you've used are package variables, with global scope: in larger programs, where you've perhaps unwittingly used something with a non-specific name (like FILEWRITE) in more than one place, you can easily introduce hard-to-track-down bugs. Furthermore, when a lexical filehandle goes out of scope, Perl will automatically close it for you: one more thing you don't have to worry about and a potential point of error removed.
        • I/O, and other operations, should be checked for failure. This can be tedious and error-prone: let Perl handle this for you with the autodie pragma.
        • Use lexical variables in the smallest scope possible. See "perlsub: Private Variables via my()" for more on this.

        You can find examples (of all of the code-related points above) in my posted solution.

        — Ken

        Hey sannag,

        Aah, I didn't realize you were totally new to perl and I went ahead suggesting that you use references! Anyway, welcome to the monastery and have fun!

Re: Building XML with tab delimited
by choroba (Cardinal) on Aug 11, 2016 at 00:21 UTC
    XML::LibXML is great for XML processing, but not so great for producing it. Some people prefer XML::Writer for that, but for a fixed structure like here, you can also use Template:
    #!/usr/bin/perl use warnings; use strict; use Template; my %data; <> for 1, 2; # Skip header. while (<>) { chomp; my ($r1, $r2, $r3, $pin, $name, $zip, $date, $time) = split /\t/; $data{$pin}{reasons} = [ $r1, $r2, $r3 ]; @{ $data{$pin} }{qw{ name zip }} = ($name, $zip); push @{ $data{$pin}{charge} }, [ $date, $time ]; } my $template = 'Template'->new; $template->process(*DATA{IO}, { data => \%data }); __DATA__ <XML_FILE> [% FOR pin IN data.keys %] <REPORT TYPE="AB"> [%- i = 1 ; FOR r IN data.$pin.reasons %] <REASON[% i %]>[% r %]</REASON[% i %]>[% i = i + 1; END %] <PERSON> <PIN>[% pin %]</PIN> <NAME>[% data.$pin.name %]</NAME> <ZIP>[% data.$pin.zip %]</ZIP> </PERSON> [% FOR charge IN data.$pin.charge %] <CHARGE> <DATE>[% charge.0 %]</DATE> <TIME>[% charge.1 %]</TIME> </CHARGE> [% END %] </REPORT> [% END %]</XML_FILE>
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Thank you! to be honest I hit a mental block using XML:LibXML. I already invested quit a bit of time building the program in XML::LibXML if I fail to find a solution with in LibXML I will switch to XML:Writer. :(

      Thank you, choroba, for your solution.

      (I was able to run it on my Strawberry Perl installation.)

      I learned about the Template module, the front-end module to the Template Toolkit.

      I also learned about *DATA{IO}, which apparently points to the data below the special literal __DATA__.

      I used Super Search and found the Using the DATA file handle for ARGV Perl Monk Meditation. This cleared up my questions about the special literal __DATA__ values IO slot, *DATA{IO}, of the *DATA typeglob.

Re: Building XML with tab delimited
by Marshall (Canon) on Aug 11, 2016 at 04:18 UTC
    I do not have a solution, but I have some suggestions that may help.

    • I reformatted the code to a more consistent indentation style. You can use a different one, but consistency helps me in the understanding.
    • I converted the code to use the DATA segment for reading instead of an input file.
    • I added "use warnings;" this showed a number of issues. You need to skip these first 2 lines of the input file. Use of prototypes, e.g.sub buildXMLElements() {} should just be sub buildXMLElements {}. However this is not the main problem.
    • I added in some print statements that may help somebody else figure out what the problem is. I don't see it at the moment.
    My code produces the output to STDERR: And output file: reformatted code:
      Thank you very much for the effort. Once thing I noticed is it matters where I place the line "my $report = $doc->createElement("REPORT");" if I placed it outside the while loop and then remove the "else" to replace it with the statement "if ($person_pin eq $copy_person_pin)" then all my 'CHARGE' tag start appearing in the XML but then the XML struct is altered. I am beginning to believe XML:LibXML is very funky.
        This is actually a very normal thing. I saw the comment in the code. You cannot "sometimes" and "sometimes not" create a "my" variable in some conditional statement and expect that variable will around for use elsewhere. If you do that, you will only have use of it during that "if" clause only. If you violate that, then I think the results are undefined and unpredictable. Maybe I didn't understand what you meant here. This is a very complex module and I don't see the problem and admit such.
        Hello sannag,

        Marshall's helpful debugging should have gotten you closer, at least. :-) That said, do note that XML::LibXML has lots of backward compatibility interfaces and it's fairly complex to work with. There's one small change that can help your case:

        In your buildXMLElements sub call, you're passing in the parent element directly as it is and appending child elements into it. If you pass in a ref to it instead, you can see your <CHARGE> elements appear.

        HTH. Good luck and have fun!

Re: Building XML with tab delimited
by Anonymous Monk on Aug 11, 2016 at 03:34 UTC
    Get rid of prototypes, instead of sub name() { ... write sub name { ... and report what happens
      Well, tried it but made no difference :(

        What AnonyMonk is getting at is that you define  sub buildXMLElements() { ... } to take no arguments and then pass it a bunch of arguments. This only works because you call the function in a way that defeats prototype checking. If you had enabled warnings, Perl would have chided you about this situation:

        c:\@Work\Perl\monks>perl -le "use warnings; use strict; ;; func('hi there'); ;; sub func () { print @_; } " main::func() called too early to check prototype at -e line 1. hi there

        Prototypes can be very useful for what they're useful for. But why use them in a way that defeats their purpose and also means you have to disable warnings so it won't nag you endlessly? Please see Prototypes in perlsub, and also the very enlightening Far More than Everything You've Ever Wanted to Know about Prototypes in Perl -- by Tom Christiansen.


        Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1169537]
Approved by Paladin
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (6)
As of 2024-04-26 09:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found