merging .csv text input based on matching column field values in multiple rows

abrg has asked for the wisdom of the Perl Monks concerning the following question:

i'm a relative PERL newbie looking to tap the wisdom of the monks for the first time. i have working code which:

1. reads in a .csv file with the Text::CSV_XS module
2. sets the columns to friendly variable names for later placement in a here doc
3. performs some string formatting and variable setting based on values in certain columns
4. for each line in the csv, generates a predictable XML doc structure using a simple here doc layout and inserts the field values into the required XML elements
5. outputs a physical XML file to an output directory with a file name which cooresponds to the "order_id" field for each row in the input .csv

input csv file looks like:

order_id, name, item_id, item_desc, price
1, bill smith, 11, red shoes, 39.99
2, john doe, 32, black hat, 21.59
3, jane lee, 12, green shoes, 29.99
2, john doe, 11, red shoes, 39.99
[download]

the xml file output for "order_id = 1" should look like:

<order>
  <order_id>1</order_id>
  <name>bill smith</name>
  <item>
    <item_id>11</item_id>
    <item_desc>red shoes</item_desc>
    <price>39.99</price>
  </item>
</order>
[download]

the xml file output for "order_id = 2" should look like:

<order>
  <order_id>2</order_id>
  <name>john doe</name>
  <item>
    <item_id>32</item_id>
    <item_desc>black hat</item_desc>
    <price>21.59</price>
  </item>
  <item>
    <item_id>11</item_id>
    <item_desc>red shoes</item_desc>
    <price>39.99</price>
  </item>
</order>
[download]

in the case of rows with matching "order_id" columns i want to:

1. determine how many matching rows there are in the input .csv
2. create a for loop to output as many new child XML elements as there are duplicate rows in the .csv (i do realize looping and other logic are not allowed inside here docs so i'll need to rethink that)
3. insert only the unique column data from the subsequent matching rows into additional, duplicate child elements (<item>) in the XML document
4. output the XML file to an output directory

hopefully i've provided clear and concise information describing the challenge. thanks for taking a look and for (hopefully) sharing your observations/thoughts on a solution.

Comment on merging .csv text input based on matching column field values in multiple rows Select or Download Code

Replies are listed 'Best First'.
Re: merging .csv text input based on matching column field values in multiple rows by ELISHEVA (Prior) on Feb 25, 2009 at 05:13 UTC
The basic algorithm is "sort and group": Read in your CVS file Sort the rows on id using sort - it is very flexible. Define a variable `$currentId` to store the current id. Initialize it as `undef` or a bogus value to insure that the first id read triggers the generation of XML for an order header. Loop through your records clean up bad data if id is the same as `$currentId`, generate the XML for an item if id is different, reset `$currentId` and print the XML for the order header. Best, beth	[reply] [d/l] [select]
Re^2: merging .csv text input based on matching column field values in multiple rows by abrg (Initiate) on Feb 25, 2009 at 18:40 UTC
beth, countzero, wol thanks for your help/encouragement. before i go much further, i should post the actual working code i currently have which has been modified to fit the input/output scenario above. #!/usr/bin/perl use strict; use warnings; use Text::CSV_XS; my $csv = Text::CSV_XS->new(); my $file = ''; if (defined $ARGV[0]) { $file = $ARGV[0]; } open(my $data, '<', $file) or die "Could not open '$file'\n"; while (my $line = <$data>) { if ($csv->parse($line)) { my @columns = $csv->fields(); ## set vars my $orderNumber = $columns[0]; my $name = $columns[1]; my $itemId = $columns[2]; my $itemDesc = $columns[3]; my $price = $columns[4]; open (OUTFILE, "> output/$orderNumber.xml") or die $! . " can' +t open the file\n"; #print OUTFILE <<XML; print OUTFILE "<order>\n"; print OUTFILE " <order_id>$orderNumber</order_id>\n"; print OUTFILE " <name>$name</name>\n"; print OUTFILE " <item>\n"; print OUTFILE " <item_id>$itemId</item_id>\n"; print OUTFILE " <item_desc>$itemDesc</item_desc>\n"; print OUTFILE " <price>$price</price>\n"; print OUTFILE " </item>\n"; print OUTFILE "</order>\n"; #XML close (OUTFILE); } else { warn "Line could not be parsed:\n"; } } [download] i've read the sort documentation but i'm still having trouble wrapping my brain around how to actually execute it with the Text::CSV module (forgive me, i realize this is a pretty basic concept). the File::Sort module seems interesting, though, i feel i should be able to solve this natively in PERL (also encountering some strange issues when trying to run this module). essentially, how do i get at the file array and then look inside it to the first column's value on each row? it would appear that my input file is being defined in scalar context where it should be an array. am i getting warmer? ;) thanks again	[reply] [d/l]
Re^3: merging .csv text input based on matching column field values in multiple rows by ELISHEVA (Prior) on Feb 25, 2009 at 19:28 UTC
Showing your code was definitely a good move. Now I have a better idea of why you are finding how to use sort so confusing. To "sort and group" you'll need three passes through the file lines: one to read in the lines, one to sort, and one to print. You'll also need to store the data from the "reading in" phase in an an array of array references. I've included some code demonstrating the algorithm. You'll have to make adaptions to fit your situation, but it should at least show you how to use the sort-n-group algorithm. I don't have the CVS module you are using so I've done a simple chomp and split to emulate it. You'll want to substitute that part with your code to read in CVS. I've also read in sample lines from <DATA> and printed to STDOUT for convenience. It also should make it easy to just download the code and see if it does indeed do what you want. Read more... (3 kB) Best, beth	[reply] [d/l]
Re^3: merging .csv text input based on matching column field values in multiple rows by wol (Hermit) on Feb 26, 2009 at 13:31 UTC
The "sort and group" principle shown in other nodes is sound, but it makes most sense when you're only looking at arrays. Here's an implementation that uses both arrays and hashes, to make an internal perl representation of the data that corresponds more closely (compared to a just arrays) to what you need to output. The output loop therefore becomes a bit simpler. Read more... (4 kB) -- use JAPH; print JAPH::asString();	[reply] [d/l] [select]
Re: merging .csv text input based on matching column field values in multiple rows by CountZero (Bishop) on Feb 25, 2009 at 07:17 UTC
I do realize looping and other logic are not allowed inside here docs so i'll need to rethink that For your next implementation, perhaps think of using a full blown templating solution, such as Template::Toolkit which allows the logic to reside (partially) inside the template. I had a similar problem and solved it by having the Perl script extract the data from the data sources, out it in a standard format (data structure) and then have Template::Toolkit-templates turn it into the required output-formats. Now, every time I need to have a new output format, I just have to write a new template and do not have to write a new script. Alternatively, have a look at XML::Code which allows easy writing of XML-code or XML::Generator which is a bit more involved but has more possibilities. Both modules will take care of escaping "forbidden" characters. Update: How could I forget XML::Writer? This venerable module has everything, but is not easy to use and I find it overkill for some more simple tasks. However YMMV. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply]
Re: merging .csv text input based on matching column field values in multiple rows by wol (Hermit) on Feb 25, 2009 at 16:12 UTC
I just thought I'd offer my appreciation for managing to post an effective question at your first attempt :-) More of that kind of thing, say I! -- use JAPH; print JAPH::asString();	[reply]
Re: merging .csv text input based on matching column field values in multiple rows by Anonymous Monk on Feb 27, 2009 at 06:25 UTC
matching columns? Sounds like you need a relational database. SQLite is easy to install and use.	[reply]
Re: merging .csv text input based on matching column field values in multiple rows by abrg (Initiate) on Feb 27, 2009 at 20:47 UTC
thanks again for the help on this. the sort/group syntax is making more and more sense to me as i study it. i've gotten my actual production script 95% complete, but i'm stuck now on how to count the number of <items> inside the hash/array data structure in wol's example. i need to be able to set some variables outside the inner-most foreach based on the number of <items>. what's the proper syntax to be able to get a count on that inner array from inside the first foreach?	[reply]
Re^2: merging .csv text input based on matching column field values in multiple rows by abrg (Initiate) on Feb 27, 2009 at 22:43 UTC
figured it out (doh!) `$itemCount = scalar @{$outer{$inner}};` [download]	[reply] [d/l]