RuyLopez has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've got a csv file, output from a database, which I need to change to an xml file thus:

colContents;nextColContents; ... ;lastColContents
colContents2;nextColContents2; ... ;lastColContents2
...

needs to become:
<row><col1>colContents</col1><col2>nextColContents</col2>...<colN>lastColContents</colN></row>
<row><col1>colContents2</col1><col2>nextColContents2</col2>...<colN>lastColContents2</colN></row>
...

I'm using a regex (er, well, several!) to do this thus:
(pseudo, for each line)
s/;/<row><col1>/
s/;/</col1><col2>/
...
s/;/</colN-1><colN>/
etc

Is the syntax available to more elegantly process the regex like:
(pseudo!)
s / allOccurencesOf';' / (list of replacements) <row><col1>,
</col1><col2>, ..., </colN-1><colN>

?

Thanks for help,

rl

Replies are listed 'Best First'.
Re: newbie regex question: substituting repeating occurences for different replacements
by gjb (Vicar) on Jul 08, 2003 at 12:12 UTC

    Rather than using regexes for this job, I'd recommend using Text::CSV or Text::CSV_XS to parse each line of the file into a list and simply add the tags on an element-by-element basis.

    IMHO this is a cleaner approach from a software engineering point of view, but more importantly you're sure that CSV will be handled correctly in all cases.
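
To illustrate the "all cases" point, here is a minimal sketch of how a plain split miscounts fields once a quoted value contains the separator. The sample line and the naive_fields helper are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: split on every semicolon, ignoring CSV quoting rules.
sub naive_fields {
    my ($line) = @_;
    return split /;/, $line;
}

# Assumed sample line: the second field is quoted and contains a semicolon.
my $line   = 'a;"b;c";d';
my @fields = naive_fields($line);

# A CSV-aware parser would treat this line as three fields; split sees four.
print scalar(@fields), " fields: ", join(' | ', @fields), "\n";
```

The quoted field is cut in two, which is exactly the kind of case a dedicated CSV parser is meant to handle.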

    Hope this helps, -gjb-

      Just to make a minor point (plus get my first post in for 2003! ;)), the original poster's data has semicolons as the separating character instead of the default comma, which can easily be specified in Text::CSV_XS's sep_char attribute in the new() method. Hope I'm not doing anyone's homework:
      #!/usr/bin/perl -w
      use strict;
      use Text::CSV_XS;

      my ( $csv, $xml );
      $csv = Text::CSV_XS->new( { 'sep_char' => ';' } );
      $xml = '';

      while( <DATA> ) {
          chomp;
          if ( $csv->parse( $_ ) ) {
              my ( $line, $n, @fields, $field );
              $line = '<row>';
              $n = 1;
              @fields = $csv->fields();
              foreach $field ( @fields ) {
                  $line .= "<col$n>$field</col$n>";
                  $n++;
              }
              $xml .= $line . "</row>\n";
          }
          else {
              print "parse() failed on this line: " . $csv->error_input() . "\n";
              # die?
          }
      }
      print $xml;

      __DATA__
      a;b;c;d;e
      f;g;h;i
      j;k
      l
      m;n;o
      Output:
      $ ./main.pl
      <row><col1>a</col1><col2>b</col2><col3>c</col3><col4>d</col4><col5>e</col5></row>
      <row><col1>f</col1><col2>g</col2><col3>h</col3><col4>i</col4></row>
      <row><col1>j</col1><col2>k</col2></row>
      <row><col1>l</col1></row>
      <row><col1>m</col1><col2>n</col2><col3>o</col3></row>
      
      There's probably an XML package out there should your required output become more complex or you want to guarantee that you are using a standardized and optimized solution.
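
One case where hand-built tags fall short is field content containing characters such as & or <. The following is only a sketch, assuming the same <row>/<colN> layout as above; xml_escape and row_to_xml are hypothetical helper names, and a real XML module would cover more cases (attributes, encodings) than this does.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, so later entities are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: wrap a list of field values in <row>/<colN> tags.
sub row_to_xml {
    my @fields = @_;
    my $n = 0;
    return '<row>'
         . join('', map { $n++; "<col$n>" . xml_escape($_) . "</col$n>" } @fields)
         . '</row>';
}

# Assumed sample values containing XML-special characters.
print row_to_xml('AT&T', 'a < b'), "\n";
```

Escaping the ampersand before the angle brackets matters; doing it last would mangle the entities produced by the other two substitutions.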

      Peace,

      Purdy

        Here is another approach which uses Text::CSV_XS and, gulp, CGI:
        use strict;
        use warnings;
        use Text::CSV_XS;
        use CGI::Pretty qw(-any);

        my $i;
        my $csv = Text::CSV_XS->new({sep_char => ';'});

        while (<DATA>) {
            $i = 0;
            warn and next unless $csv->parse($_);
            print CGI::row(map eval "CGI::col@{[++$i]}('$_')", $csv->fields);
        }

        __DATA__
        colContents;nextColContents;lastColContents
        colContents2;nextColContents2;lastColContents2
        colContents3;nextColContents3;lastColContents3
        a;b;c;d;e
        f;g;h;i
        j;k
        l
        m;n;o
        The -any pragma in CGI.pm is pretty nifty: you can actually use it to create XML. No guarantees on valid XML, of course. Just be sure to either prefix the calls with the CGI package name or use the OO interface (else Perl will complain that main has no such method). I was able to increment the <colN> tags by using an evil eval trick. That was the toughest part of this code. Drop that requirement and utilize CGI.pm's distributive shortcuts feature thingy:
        use Text::CSV_XS;
        use CGI::Pretty qw(-any);

        my $csv = Text::CSV_XS->new({sep_char => ';'});
        $csv->parse($_) and print CGI::row(CGI::col([$csv->fields])) while <DATA>;

        jeffa

        L-LL-L--L-LL-L--L-LL-L--
        -R--R-RR-R--R-RR-R--R-RR
        B--B--B--B--B--B--B--B--
        H---H---H---H---H---H---
        (the triplet paradiddle with high-hat)
        
Re: newbie regex question: substituting repeating occurences for different replacements
by dreadpiratepeter (Priest) on Jul 08, 2003 at 12:13 UTC
    Why use a regexp?
    use strict;

    my $str;
    while (<DATA>) {
        chomp;
        my @cols = split(/;/);
        for (my $i=1;$i<=@cols;$i++) {
            $cols[$i-1]="<col$i>$cols[$i-1]</col$i>";
        }
        $str .= '<row>' . join('',@cols) . "</row>\n";
    }
    print $str;
    exit;

    __DATA__
    a;b;c;d;e
    f;g;h;i
    j;k
    l
    m;n;o
    prints:
    <row><col1>a</col1><col2>b</col2><col3>c</col3><col4>d</col4><col5>e</col5></row>
    <row><col1>f</col1><col2>g</col2><col3>h</col3><col4>i</col4></row>
    <row><col1>j</col1><col2>k</col2></row>
    <row><col1>l</col1></row>
    <row><col1>m</col1><col2>n</col2><col3>o</col3></row>

    UPDATE: Doh! of course gjb is right. Text::CSV is a great module. While my code is better than using regexps, Text::CSV is better than using my code.

    -pete
    "Worry is like a rocking chair. It gives you something to do, but it doesn't get you anywhere."
Re: newbie regex question: substituting repeating occurences for different replacements
by AcidHawk (Vicar) on Jul 08, 2003 at 14:01 UTC

    This got me thinking... I often create XML, but always with the first tag being different for each record. Where the tag <row> exists here, I normally only need:

    <row>a</row> <row2>b</row2>
    So here is my stab at doing this with XML::Simple. Please comment on how I can refine this... but I do know it outputs what was asked for... ;-D

    #! /usr/bin/perl
    use strict;
    use warnings;
    use XML::Simple;
    use Data::Dumper;

    my (%h);
    my $xs = XML::Simple->new( keeproot => 1, noattr => 1, noescape => 1);

    open (FILE, ">>./Test.xml") or die "Cannot create Test.xml: $!\n";

    while (<DATA>) {
        delete $h{row};
        my $count = 0;
        chomp;
        my @line = split(/;/);
        foreach my $var (@line) {
            $count++;
            my $tag = "col" . $count;
            $h{'row'}{$tag} = $var;
        }
        print Dumper(\%h);
        print FILE $xs->XMLout(\%h);
    }
    close(FILE);

    __DATA__
    a;b;c;d;e
    f;g;h
    i; j

    -----
    Of all the things I've lost in my life, it's my mind I miss the most.
      Personally, i think you shouldn't concern yourself with incrementing XML tags (even though i played along and posted a solution myself). The first <foo> tag encountered is the first, and the second <foo> tag encountered is the second, and so on - as long as that's how you intend to read the data. XML::Simple handles this by exposing an option (forcearray). Since order will be preserved, why bother with incrementing the tags? (And besides, wouldn't that id number be better as an attribute instead?)
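
As a rough sketch of the attribute idea, every field can use the same <col> tag and carry its position in an attribute. The attribute name id and the row_with_ids helper are assumptions made for illustration, and the field values are not escaped here.

```perl
use strict;
use warnings;

# Hypothetical helper: emit every field as <col id="N"> instead of <colN>.
sub row_with_ids {
    my @fields = @_;
    my $n = 0;
    return '<row>'
         . join('', map { $n++; qq{<col id="$n">$_</col>} } @fields)
         . '</row>';
}

# Assumed sample fields.
print row_with_ids(qw(a b c)), "\n";
```

With a fixed tag name, a reader of the XML no longer needs to know the maximum number of columns in advance.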

      I posted a few solutions to converting CSV to XML over at CSV to XML (the quick and dirty way). Ultimately, CSV::XML looks the easiest, but i still dig using XML::Generator::DBI. One option i didn't try at the time, however, was DBD::AnyData. Here's one that follows the <colN> naming convention and uses the previous two modules ... but, caveats:

      1. commas are used instead of semi-colons - DBD::AnyData does not handle them, but could be written to do so, if desired. However, note that the C in CSV stands for comma. ;)
      2. ughh ... i have to specify the maximum number of columns to be expected. This is not something i advocate, but then again, this is all because the <colN> have to be incremented.
      Enough, here's the code:
      use strict;
      use warnings;
      use DBI;
      use XML::Generator::DBI;
      use XML::Handler::YAWriter;

      my $max  = 5;
      my $data = join(',',map"col$_",1..$max) . do {local $/;<DATA>};

      my $dbh = DBI->connect('dbi:AnyData(RaiseError=>1):');
      $dbh->func('test', 'CSV', [$data], 'ad_import');

      my $generator = XML::Generator::DBI->new(
          Handler => XML::Handler::YAWriter->new(AsFile => '-'),
          dbh     => $dbh,
          Indent  => 1,
      );
      $generator->execute('select * from test');

      __DATA__
      colContents,nextColContents,lastColContents
      colContents2,nextColContents2,lastColContents2
      colContents3,nextColContents3,lastColContents3
      a,b,c,d,e
      f,g,h,i
      j,k
      l
      m,n,o

      jeffa

      
Re: newbie regex question: substituting repeating occurences for different replacements
by eric256 (Parson) on Jul 08, 2003 at 16:11 UTC
    A regex like this:
    my $row = "hello;this;is;a;test";
    my $i = 1;
    print "row: $row\n";
    # each ';' closes the current column and opens the next one
    $row =~ s/;/"<\/col". $i++ ."><col$i>"/eg;
    $row = "<col1>" . $row . "</col$i>";
    print "row: $row\n";
    That works. Not the prettiest thing in the world though.
    I have to agree, though, that a regex is probably not the best solution for this problem.
    This one is kinda prettier :-)
    my $row = "hello;this;is;a;test";
    my $i = 1;
    print "row: $row\n";
    my @cols = split(/;/,$row);
    my $output = '';
    foreach my $col (@cols) {
        $output .= "<col$i>$col</col$i>";
        $i++;
    }
    print "row: $output\n";
    Note that both of these deal with just a single line, though they are easily expanded to cover multiple lines.
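
One way the multi-line expansion might look is to wrap the substitution in a sub and reset the counter for each line. This is only a sketch: line_to_row is a hypothetical name, the sample lines are made up, and plain semicolon-separated input with no quoting is assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one semicolon-separated line to a <row> element
# with s///eg, using a counter that starts over for every line.
sub line_to_row {
    my ($line) = @_;
    chomp $line;
    my $i = 1;
    # Each ';' closes the current column and opens the next one.
    $line =~ s{;}{ my $closed = $i++; "</col$closed><col$i>" }eg;
    return "<row><col1>$line</col$i></row>";
}

# Assumed sample input, one record per line.
my @lines = ("hello;this;is;a;test\n", "a;b\n");
print line_to_row($_), "\n" for @lines;
```

Keeping $i local to the sub is what makes the column numbering restart at 1 for each row.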
    Eric Hodges
      Or in a pinch,
      while(<DATA>) {
          my $i;
          chomp;
          # no comma after the map block: map BLOCK LIST
          print join('', map { $i++; "<col$i>$_</col$i>" } split /;/)."\n";
      }
      __DATA__
      hello;this;is;a;test
      this;is;another;simple;test

      Makeshifts last the longest.

        Cool. I always have a hard time figuring out when/how to use map. The more I see it the more it makes sense though. I think it's partly because I'm not used to magical vars like $_ yet.

        Eric Hodges