OK, so here is a complete answer that saves each line in a separate file.
First a couple of remarks:
- when I want to write XML I rarely use any module. I know that XML::Writer is available, and that most transformation modules can be used too, but frankly I don't think they save much energy if you know what you are doing. I especially don't like XML::Simple for this kind of use as it makes it quite difficult to control the structure of the XML output. So I just used good ole print statements.
- one thing that might create bugs if you are not careful is special XML characters: you need to escape at least & and < or you risk your XML not being well-formed. If you create attributes, which is not the case here, you also need to escape either " or ' depending on which one you use as a delimiter.
- lastly I don't know in which encoding the input data comes but I'd be willing to bet that it is not UTF-8, or at least that if some day accented characters creep in they will not be in UTF-8, so I stuck an XML declaration specifying ISO-8859-1 as the encoding on top of each file (I know it augments the size of each one but it should not be too bad once the whole thing is tar.gz'd).
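To make the escaping remark concrete, here is a minimal sketch, separate from the script itself. The helper names escape_text and escape_attr are hypothetical, and escape_attr assumes the attribute value is delimited with double quotes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical helper: escape text that goes inside an element
sub escape_text
  { my $text= shift;
    $text=~ s/&/&amp;/g;   # & first, or the & of a fresh &lt; would be escaped again
    $text=~ s/</&lt;/g;
    return $text;
  }

# hypothetical helper: escape a value that will be wrapped in double quotes
sub escape_attr
  { my $value= escape_text( shift);
    $value=~ s/"/&quot;/g; # only needed because " is the delimiter here
    return $value;
  }

print "<name>", escape_text( 'Barney Rubble & all'), "</name>\n";
print '<size value="', escape_attr( '5" < 6"'), qq{"/>\n};
# prints:
# <name>Barney Rubble &amp; all</name>
# <size value="5&quot; &lt; 6&quot;"/>
```

If you delimit attributes with ' instead, escape ' (as &apos;) rather than ".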
So here it is:
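#!/usr/bin/perl
use strict;
use warnings;
my $file_nb="000"; # a string, so that ++ keeps the leading zeros
# write labels: the first line of DATA holds the tab-separated column names
my @labels= map { sanitize_label( $_) } split /\t/, <DATA>;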
my $file= "data-$file_nb.xml";
open( LABELS, ">$file") or die "cannot open $file: $!";
print LABELS qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n},
"<labels>",
map( { "<col>" . $_ . "</col>"} @labels),
"</labels>\n";
close LABELS;
# write data
while (<DATA>)
{ my %line;
chomp;
@line{@labels} = split /\t/;
$file_nb++;
my $file= "data-$file_nb.xml";
open( XML, ">$file") or die "cannot open $file: $!";
print XML qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n},
qq{<data record_no="$file_nb">},
map( { "<$_>" . xml_escape( $line{$_}), "</$_>"} @labels),
"</data>\n";
close XML;
}
# dumb way to make labels valid XML names: remove all non-word characters
sub sanitize_label
{ my $label= shift;
$label=~ s/[\W]//g;
return $label;
}
# just escape the minimum: < and &
sub xml_escape
{ my $text= shift;
$text=~ s/&/&amp;/g; # must come first, so the & in &lt; is not escaped again
$text=~ s/</&lt;/g;
return $text;
}
__DATA__
First Last
Fred Flintstone
Barney Rubble & all
Betty Rubble
Wilma Flintstone
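One detail of the script that may not be obvious: $file_nb starts as the string "000", so ++ uses Perl's string ("magic") increment and the zero padding survives in the file names. A minimal sketch of just that behaviour, reusing the data-NNN.xml naming from the script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $file_nb= "000";          # must stay a string: never use it as a number before the ++
for (1..3)
  { $file_nb++;              # string increment: "000" -> "001" -> "002" -> "003"
    print "data-$file_nb.xml\n";
  }
# prints:
# data-001.xml
# data-002.xml
# data-003.xml
```

Had the counter been initialised with the number 0, the names would come out as data-1.xml, data-2.xml and so on, and you would need sprintf( "%03d", ...) to pad them.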