PerlMonks  

Convert a Tab-Delimeted File to XML

by ehdonhon (Curate)
on Sep 29, 2001 at 19:48 UTC

ehdonhon has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm looking for an easy way to take a text file that contains tab-delimited data and convert it to an XML file (even better, multiple XML files -- one for each row). The first row contains the names of the fields, and the rest of the rows correspond to individual records.

I know there are a ton of XML packages out there, so my feeling is that somebody must have already tackled this problem. Can anybody point me in the right direction?

Thanks

Replies are listed 'Best First'.
Re: Convert a Tab-Delimeted File to XML
by merlyn (Sage) on Sep 29, 2001 at 20:17 UTC
    #!/usr/bin/perl
    use XML::Simple;

    my @file;
    my (@labels) = split /\s+/, <DATA>;
    while (<DATA>) {
        chomp;
        my %line;
        @line{@labels} = split /\t/;
        push @file, \%line;
    }
    print XMLout(\@file);

    __DATA__
    First	Last
    Fred	Flintstone
    Barney	Rubble
    Betty	Rubble
    Wilma	Flintstone
    generates
    <opt>
      <anon First="Fred" Last="Flintstone" />
      <anon First="Barney" Last="Rubble" />
      <anon First="Betty" Last="Rubble" />
      <anon First="Wilma" Last="Flintstone" />
    </opt>
    Season to taste.

    -- Randal L. Schwartz, Perl hacker

      I hate to point out something in Randal's code, but shouldn't the first split be on tabs and not on runs of whitespace? A label such as "First Name" could exist.
        Good point - personally, i hate tab-delimited files. If my labels can contain whitespace, then i will use something like a colon or a semicolon or maybe even -=:TOMMY_LEE:=-

        i think merlyn chose \s+ to take care of situations where tabs and spaces could be intermixed.

        it all boils down to TIMTOWTDI ;)

        jeffa
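
To make the difference concrete, here is a small sketch. The header line with "First Name" is made up for illustration, and the two helper subs are hypothetical names, not part of merlyn's code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical header line: two tab-separated labels, each containing a space.
my $header = "First Name\tLast Name\n";

# Splitting on any run of whitespace also breaks labels at embedded spaces.
sub labels_by_whitespace {
    my $line = shift;
    chomp $line;
    return split /\s+/, $line;
}

# Splitting on tabs only keeps embedded spaces inside a label.
sub labels_by_tab {
    my $line = shift;
    chomp $line;
    return split /\t/, $line;
}

my @ws  = labels_by_whitespace($header);
my @tab = labels_by_tab($header);
print "split /\\s+/: ", join( ' | ', @ws ),  "\n";
print "split /\\t/:  ", join( ' | ', @tab ), "\n";
```

The first split yields four labels, the second yields two. Note that a label such as "First Name" still is not a legal XML name, so it would need cleaning up before being used as an element or attribute name.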

Re: Convert a Tab-Delimeted File to XML
by mirod (Canon) on Sep 29, 2001 at 20:47 UTC

    AnyData can read data in various formats, including tab-delimited, treat them as relational tables, and export them in other formats, including XML.

    Something like adConvert('Tab','foo.tab','XML','foo.xml'); should work.

    Of course you can also do it simply by reading the tab-delimited file, using split to extract the individual fields and then printing them as you see fit. The requirement to export each single row in a different file is unlikely to be directly supported by any module.

Re: Convert a Tab-Delimeted File to XML
by darobin (Monk) on Sep 30, 2001 at 01:07 UTC

    You might want to take a look at XML::SAXDriver::CSV, it's a package that's meant precisely for this purpose. It has the added advantage that it is SAX based. This means that not only can it be used to interact with other XML modules easily, but making it write each row to a different file will be as easy as writing a SAX handler that does just that. It's pretty trivial, and a very powerful approach.

    -- darobin -- knowscape 2 coming soon --

Re: Convert a Tab-Delimeted File to XML
by mirod (Canon) on Sep 30, 2001 at 11:06 UTC

    OK, so here is a complete answer, that saves each line in a separate file.

    First a couple of remarks:

    • when I want to write XML I rarely use any module. I know that XML::Writer is available, and that most transformation modules can be used too, but frankly I don't think they save much energy if you know what you are doing. I especially don't like XML::Simple for this kind of use as it makes it quite difficult to control the structure of the XML output. So I just used good ole print statements.
    • one thing that might create bugs if you are not careful is special XML characters: you need to escape at least & and < or you risk your XML not being well-formed. If you create attributes, which is not the case here, you also need to escape either " or ' depending on which one you use as a delimiter.
    • lastly I don't know in which encoding the input data comes but I'd be willing to bet that it is not UTF-8, or at least that if some day accented characters creep in they will not be in UTF-8, so I stuck an XML declaration specifying ISO-8859-1 as the encoding on top of each file (I know it augments the size of each one but it should not be too bad once the whole thing is tar.gz'd).
    So here it is:

      #!/usr/bin/perl

      my $file_nb = "000";

      # write labels
      my @labels = split /\t/, <DATA>;
      @labels = map { sanitize_label($_) } @labels;
      my $file = "data-$file_nb.xml";
      open( LABELS, ">$file") or die "cannot open $file: $!";
      print LABELS qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n},
                   "<labels>",
                   map( { "<col>" . $_ . "</col>" } @labels),
                   "</labels>\n";
      close LABELS;

      # write data
      while (<DATA>) {
          my %line;
          chomp;
          @line{@labels} = split /\t/;
          $file_nb++;
          my $file = "data-$file_nb.xml";
          open( XML, ">$file") or die "cannot open $file: $!";
          print XML qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n},
                    qq{<data record_no="$file_nb">},
                    map( { "<$_>" . xml_escape( $line{$_}), "</$_>" } @labels),
                    "</data>\n";
          close XML;
      }

      # dumb way to make label valid XML names: remove all non word characters
      sub sanitize_label {
          my $label = shift;
          $label =~ s/[\W]//g;
          return $label;
      }

      # just escape the minimum: < and &
      sub xml_escape {
          my $text = shift;
          $text =~ s/&/&amp;/g;
          $text =~ s/</&lt;/g;
          return $text;
      }

      __DATA__
      First	Last
      Fred	Flintstone
      Barney	Rubble & all
      Betty	Rubble
      Wilma	Flintstone
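
The remark about attributes only matters once values end up inside attribute values rather than element content. A minimal sketch of that case, assuming double quotes as the delimiter; xml_escape_attr and the sample record are made up for illustration and are not part of the code above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape text for use inside a double-quoted attribute value.
# & is handled first so the entities added afterwards are not escaped again.
sub xml_escape_attr {
    my $text = shift;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical record, written as attributes instead of child elements.
my %line = ( First => 'Barney', Last => 'Rubble & "all"' );

my $attrs = '';
for my $name ( sort keys %line ) {
    $attrs .= qq{ $name="} . xml_escape_attr( $line{$name} ) . qq{"};
}
print "<data$attrs />\n";
```

With single quotes as the delimiter, the third substitution would replace ' with &apos; instead.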
(jeffa) Re: Convert a Tab-Delimeted File to XML
by jeffa (Bishop) on Sep 30, 2001 at 06:57 UTC
