syedarshi16 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I want to parse a big .csv file into .xls based on the keywords present on each csv line. There would be redundancy in the keywords which needs to be compiled and send to .xls file, in this way the .xls file will contain all the information with respect to each keywords in one line. I do not want to read .csv file again and again for each keyword as it takes too much of time.

for example:

.csv file format ARL6IP2,298757,Hyperalgesia,MESH:D006930 ARL6IP2,298757,Liver Diseases,MESH:D008107 ARL6IP2,298757,"Liver Failure, Acute",MESH:D017114 ARL6IP2,298757,Liver Neoplasms,MESH:D008113 CCL22,6367,Esophageal Neoplasms,MESH:D004938 CCL22,6367,Fatty Liver,MESH:D005234 CCL22,6367,Fetal Growth Retardation,MESH:D005317 CCL22,6367,Fever,MESH:D005334
.xls file format required () represents separate column
(ARL6IP2) (Hyperalgesia,Hyperalgesia,"Liver Failure, Acute",Liver Neop +lasms) (MESH:D006930,MESH:D008107,MESH:D017114,MESH:D008113) (CCL22) (Esophageal Neoplasms,Fatty Liver,Fetal Growth Retardation,Fev +er) (MESH:D004938,MESH:D005234,MESH:D005317,MESH:D005334)

Any help would be highly appreciated.
Thanks in advance
Arshi

Replies are listed 'Best First'.
Re: csv parsing
by ELISHEVA (Prior) on Sep 29, 2009 at 06:24 UTC

    Welcome to PerlMonks!

    It is customary to surround data samples, and not only code, with <code> tags. Please update your post accordingly. As I'm sure you will agree, it is very hard to read your post. If you aren't sure how to update your post, you can get to a page with an edit box by clicking on your post title and scrolling down.

    As for your questions, you can get a list of modules by querying CPAN's search facility. Here is a link with the search box filled in for you: Csv (116 hits). We also have a large number of threads at PerlMonks discussing CSV file parsing. Here is a Super Search query filled out to search for root nodes with "csv" in the title. Just press the search button and you will see a list of all of the nodes for the last year or so. You can also get to the Super Search screen by clicking on the link in the upper right corner of each PerlMonks page.

    Best, beth

    Update: added super search links.

Re: csv parsing
by BioLion (Curate) on Sep 29, 2009 at 08:42 UTC

    Without proper post formatting it is hard to see what you want, but at a guess your input looks like this? :

    ARL6IP2,298757,Hyperalgesia,MESH:D006930 ARL6IP2,298757,Liver Diseases,MESH:D008107 ARL6IP2,298757,"Liver Failure, Acute",MESH:D017114 ARL6IP2,298757,Liver Neoplasms,MESH:D008113 CCL22,6367,Esophageal Neoplasms,MESH:D004938 CCL22,6367,Fatty Liver,MESH:D005234 CCL22,6367,Fetal Growth Retardation,MESH:D005317 CCL22,6367,Fever,MESH:D005334

    In which case you need to build a hash based on your unique keys (gene names by the look of it). Hashes are perfect for dealing with data that is organised by unique identifiers.
    I would probably go about it like this :

    • open the file and read it line by line
    • parse each csv formatted line into the relevant parts
    • add the third and fourth columns (diease and MeSH term) to an array of arrays referenced by a unique hash key ( the first column )
    • when the file in finished, close it
    • open an Spreadsheet::WriteExcel sheet or a Spreadsheet::DataToExcel (maybe simpler?) and write out to it

    Maybe it would go something like this :

    use warnings; use strict; ## for parsing the CSV use Text::CSV::Simple; ## for writing to excel use Spreadsheet::WriteExcel; my $datafile = ... the one you gave earlier ... # Only capture feilds of interest my $parser = Text::CSV::Simple->new; $parser->want_fields(1, 3, 4, ); my @data = $parser->read_file($datafile); ## data is now read in (if you file is really big, you can do this on- +the-fly too) my %hash = (); ## for storing collated data foreach @data{ ## store terms acording to unique id push @{ $hash{$_->[0]}->[0] }, $_->[1]; ## disease term stored in an + array push @{ $hash{$_->[0]}->[1] }, $_->[2]; ## MeSH term also stored in +an array } ## now print out data # Create a new workbook called simple.xls and add a worksheet my $workbook = Spreadsheet::WriteExcel->new('simple.xls'); my $worksheet = $workbook->add_worksheet(); # The general syntax is write($row, $column, $token). Note that row an +d # column are zero indexed my $row = 0; for my $key ( keys %hash ){ $worksheet->write($row, 0, $key); ## id $worksheet->write($row, 1, ( join ', ', @{ $hash{$key}->[0] } )); ## + diseases $worksheet->write($row, 2, ( join ', ', @{ $hash{$key}->[1] } )); ## + MeSH ++$row; ## move to the next row } print "$0 completed : ".(scalar(localtime))."\n";

    I haven't tested this, but hopefully it will give you some ideas.

    Just a something something...
Re: csv parsing
by Bloodnok (Vicar) on Sep 29, 2009 at 11:12 UTC
    Using the interpretattion of your data provided by BioLion, the following snipper will provide the translation of the raw data using my favourite delimited text file parser Text::xSV:
    use warnings; use strict; use Text::xSV; my $csv = Text::xSV->new(fh => *DATA); my %res; while (my $row = $csv->get_row()) { my $index = shift @$row; $res{$index}->{id} = shift @$row; push @{$res{$index}->{desc}}, shift @$row; push @{$res{$index}->{mesh}}, shift @$row; } my $xls = Text::xSV->new(); $xls->print_row($_, join(q/,/, @{$res{$_}->{desc}}), join(q/,/, @{$res +{$_}->{mesh}})) for keys %res; __DATA__ ARL6IP2,298757,Hyperalgesia,MESH:D006930 ARL6IP2,298757,Liver Diseases,MESH:D008107 ARL6IP2,298757,"Liver Failure, Acute",MESH:D017114 ARL6IP2,298757,Liver Neoplasms,MESH:D008113 CCL22,6367,Esophageal Neoplasms,MESH:D004938 CCL22,6367,Fatty Liver,MESH:D005234 CCL22,6367,Fetal Growth Retardation,MESH:D005317 CCL22,6367,Fever,MESH:D005334
    $ perl tst.pl CCL22(Esophageal Neoplasms,Fatty Liver,Fetal Growth Retardation,Fever) +,(MESH:D004938,MESH:D005234,MESH:D005317,MESH:D005334) ARL6IP2(Hyperalgesia,Liver Diseases,Liver Failure, Acute,Liver Neoplas +ms),(MESH:D006930,MESH:D008107,MESH:D017114,MESH:D008113)
    Since, AFAIK, the .xls format is one of many M$ proprietary formats, I would then ...
    1. Modify my $xls = Text::xSV->new(); to read my $xls = Text::xSV->new(filename => 'foo.xls.csv'); - thus facilitating the saving of the translated file
    2. Use Win32::OLE to start an Excel session to import the generated file (foo.xls.csv) and save it to the target .xls filename - it's a bit hard to demonstrate on a Linux box (yes, I have no Windoze:-)), so any amount of google hits are there provide guidance for this step.
    BTW, use perldoc TextxSV to find how to change the args to the constructor in order to read from your actual file - basically it involves changing fh => *DATA to filename => path_to_file.

    Update 1:

    Oops, I really ought to have looked _closely_ at the results - missing delimiter between the 2 lists - now corrected

    Update 2:

    Having re-read your OP, I see that the brackets were merely placeholders indicating column/field limits, so bearing this in mind, I simplified the snippet (to utilise the print_row() call) and re-structure the print loop.

    A user level that continues to overstate my experience :-))
Re: csv parsing
by Anonymous Monk on Sep 29, 2009 at 06:22 UTC
Re: csv parsing
by ack (Deacon) on Sep 29, 2009 at 19:57 UTC

    As the other responders noted, I also am not entirely sure what you're looking for. But following the general trend regarding what you're after, here is another strategy for handling what I think you're looking for.

    The input file that I used to test the above code looks like this:

    With that input and the above code, the output to the terminal looks like the following:

    1: (CCL22),(Esophageal Neoplasms,Fatty Liver,Fetal Growth Retardation, +Fever),(MESH:D004938,MESH:D005234,MESH:D005317,MESH:D005334) 2: (ARL6IP2),(Hyperalgesia,Liver Diseases,"Liver Failure, Acute",Liver + Neoplasms),(MESH:D006930,MESH:D008107,MESH:D017114,MESH:D008113)

    As suggested by the other responders, this code uses a hash as the mechanism for storing the various codes (e.g., "ARL6IP2" or "CCL22"). Each entry in the has stores a reference to an array which, itself contains references to two other arrays. One of those two arrays is an array of what I designate as the @diseases (e.g., "Hyperalgesia" or "Liver Failure, Acute" associated with each $code) and the other is an array of what I designate as @meshes (i.e., the various "MESH:D006930" type of stuff associated with each $code).

    The one perhaps odd looking construct in the script is:

    $codes{$code} = [[@diseases],[@meshes]];

    The use of the square brackets in the interior are to ensure that all of the entries in the hash don't point to the exact same array. The use of the square brackets in this way is the usual recommended way to ensure that one doesn't continually point to the same data structure.

    I hope this helps show another way to do it.

    ack Albuquerque, NM