bluray has asked for the wisdom of the Perl Monks concerning the following question:

I tried to match .csv two files and then print the output in a third file. The output file should have all the entries of the first and second file. The two files were matched with respect to a column (ID -numerical). Also, in the output file, one more column is created (column[0]) and that will show whether the entry is present only in first file, second file or both. The first file have columns: <\p>

Group, Functional_Category, LT, IA, ID, Symbol, Description, TID, GID, LO, S1, S2, HL, Status, V1, V2, IN

The second file have less number of columns: ID, Symbol, Description, TID, GID, LO, S1, S2

In the output file, I am trying to get a column 'Match' as my first column. But, when I run the script, I am getting formatting issues. For example, the third row has again the column headers of second file repeated. Another problem I encountered is that I wanted "both" if there is a match between the two files for a particular entry. Instead of that, I am getting the numerical ID on the first column. It is not the case if there is not a match and the column entries were correct for those.

#!/usr/bin/perl use strict; use warnings; my %hash; my $infile="File1.csv"; open (my $in_fh, "<", $infile) or die ($!); my $line=<$in_fh>; chomp $line; my @columnheadings= split (/\t/,$line); while (my $line = <$in_fh>){ chomp $line; $line=~ s/\t/,/g; $line=~tr/[a-z]/[A-Z]/; my @columns=split (/,/,$line); my $Uniqueid=$columns[4]; my $newline; unshift(@columns, 'File1'); $newline=join(",",@columns); if (exists($hash{$Uniqueid})){ print "$hash{$Uniqueid}\n"; } $hash{$Uniqueid}=$newline; } my $discard=<$in_fh>; $infile="File2.csv"; open ($in_fh, "<", $infile) or die ($!); while (my $line= <$in_fh>){ chomp $line; $line=~s/\t/,/g; $line=~tr/[a-z]/[A-Z]/; my @columns= split (/,/,$line); my $Uniqueid=$columns[0]; my $newline; if (exists($hash{$Uniqueid})){ $line=~ s/^File1,/both,/; $newline=$line; } else { $newline="File2,,,,".$line; } $hash{$Uniqueid}=$newline; } $discard=<$in_fh>; my $outfile="OutputFile.csv"; open (my $out_fh, ">", $outfile) or die ($!); unshift (@columnheadings, ('Match')); my $headings=join (",", @columnheadings); print $out_fh "$headings\n"; use Data::Dumper; print Dumper \@columnheadings; foreach my $id (sort keys (%hash)) { my @columns=split(/,/,$hash{$id}); my $printline=$hash{$id}; while ($printline=~s/,,/,NA,/g) { }; print $out_fh "$printline\n"; delete $hash{$id}; }

Hi, Thanks for the reply. I am getting the same output files.

$VAR1 = 'Match', '"Group","Functional_Category","LT","IA","ID","Symbol","Description","TID","GID","LO","S1","S2","HL","Status","V1","V2","IN"' ;

The first columnheading was in single quotes.

Replies are listed 'Best First'.
Re: Issues with Column headings
by moritz (Cardinal) on Sep 18, 2011 at 17:46 UTC
    So, I am suspecting that the columnheadings were improperly read.

    There's an easy way to check that. After reading the column headings, insert this code:

    use Data::Dumper; print Dumper \@columnheadings;

    Run your script, and see if the output matches your expectations.

Re: Issues with Column headings
by choroba (Cardinal) on Sep 18, 2011 at 21:51 UTC
    Please, do not remove the original content. If you want to update it, indicate what parts have been updated. If you want to reply to a comment, use the Reply link. Also, when inserting code, wrap it between <c> and </c>.
Re: Issues with Column headings
by Marshall (Canon) on Sep 19, 2011 at 07:14 UTC
    I found the code a bit confusing and this tab delimited format is a bummer to work with. One of the problems will be that when you add "null" fields, if you don't quote these things, then it is really hard to see these fields in file3.

    Anyway I would suggest using some version of Text::CSV to parse the data and you can adjust the quoting characters to your liking (this module helps on both input and output). Tabs will probably get mangled in example data below, but I think you'll get the idea.

    #!/usr/bin/perl -w use strict; use Text::CSV_XS; open (my $FILE1, '<', "file1.csv") or die "cannot open file1 $!\n"; open (my $FILE2, '<', "file2.csv") or die "cannot open file3 $!\n"; open (my $FILE3, '>', "file3.csv") or die "cannot open file3 $!\n"; my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ , sep_char => "\t", always_quote =>1}); print $FILE3 "Match\t".<$FILE1>; # header for file3 <$FILE2>; # skip header line of file2 my %file1; my %file2; while (my $row = $csv->getline($FILE1)) { my @fields = @$row; my $id = $fields[4]; $file1{$id}=["",@fields]; } while (my $row = $csv->getline($FILE2)) { my @fields = @$row; my $id = $fields[0]; $file2{$id}=["","","","","",@fields,"","","","",""]; } foreach my $id1 (keys %file1) { if (exists $file2{$id1}) { $file1{$id1}[0] ="both"; #both files $csv->print ($FILE3, $file1{$id1}); } else { $file1{$id1}[0] ="1"; #file1 only $csv->print ($FILE3, $file1{$id1}); } } foreach my $id2 (keys %file2) { if (!exists $file1{$id2}) { $file2{$id2}[0] ="2"; #file2 only $csv->print ($FILE3, $file2{$id2}); } }
    file1:
    Group Functional_Category LT IA ID Symbol Descriptio +n TID GID LO S1 S2 HL Status V1 V2 IN q w a a 1 AA some description 10 11 1 s1 +a s2a 1 1 1 1 1 q w b b 2 BB another descrp 11 12 1 s1b + s2b 1 1 1 1 1 q w c c 5 CC despript A 12 13 1 s1c s +2c 1 1 1 1 1
    file2
    ID Symbol Description TID GID LO S1 S2 2 BB another descrp 11 12 1 s1b s2b 4 DD some more stuff 14 14 4 s1d s2d
    file3
    Match Group Functional_Category LT IA ID Symbol D +escription TID GID LO S1 S2 HL Status V1 V +2 IN "1" "q" "w" "a" "a" "1" "AA" "some description" + "10" "11" "1" "s1a" "s2a" "1" "1" "1" "1" + "1" "both" "q" "w" "b" "b" "2" "BB" "another descrp" + "11" "12" "1" "s1b" "s2b" "1" "1" "1" "1" + "1" "1" "q" "w" "c" "c" "5" "CC" "despript A" "12" + "13" "1" "s1c" "s2c" "1" "1" "1" "1" "1" "2" "" "" "" "" "4" "DD" "some more stuff" "14 +" "14" "4" "s1d" "s2d" "" "" "" "" ""
      Hi Marshall, Thanks for the reply and the code. I tried the code. I am getting the output file with only headers.
        You had a tab delimited file eg( split (/\t/,$line); ). Even though the separator is tab instead of comma, this is also called a CSV file. I also used a tab delimited file. The problem is that when I cut and past the file from my text editor into the <code></code> section here, the tabs get converted to spaces! Ugh!

        So, download my data file, and use Notepad or whatever text editor you like to convert the sequences of spaces between the columns back into single tab characters(no spaces before or after the tabs!). The file will look similar on the screen, but there will be a BIG difference in how it is parsed!

        I hate tab delimited files! Tab is a non-printing character and so it is hard to manually modify the files.

        Many of the data files that I work with use "|" for the delimiter and that usually works out well. Some databases will disallow "|" to be in the data fields and so parsing is easy.

        In my code sep_char => "\t" sets the separation character to be tab. The default is "," (comma).