v15 has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I have a table in two-column format, like this:
geneA T1
geneA T1
geneA T2
geneB T8
geneC T10
geneC T1
I want to transform it into a table like this
NAMES  T1  T2  T8  T10
geneA  +   +   -   -
geneB  -   -   +   -
geneC  +   -   -   +
In this case T1 and T2 are present for geneA, so we put a + sign, but T8 and T10 are absent, so we put a - sign. Similarly for the others. How can I do this? I tried something like the following, but I am stuck on what to do next:
#!/usr/bin/perl -w
use strict;
use warnings;
use List::MoreUtils qw(uniq);

my %gene2TF2val = ();
my @TF = ();
while(<>){
    chomp;
    my @s = split /\s+/,$_;
    push @TF , $s[1]; # pushing every TF into array @TF but this is still not unique list of transcription factors.
    $gene2TF2val{$s[1]}->{$s[0]} = "-";
}
@TF = uniq @TF;
Any help would be appreciated. Thanks

Replies are listed 'Best First'.
Re: transforming a table
by Athanasius (Archbishop) on Apr 04, 2016 at 06:32 UTC

    Hello v15, and welcome to the Monastery!

    I would suggest that you structure the main hash so that each gene name is keyed to an anonymous array of TF values. Then you can use the any function from List::Util to determine whether a given TF corresponds to a given gene:

    #! perl
    use strict;
    use warnings;
    use List::Util qw( any );

    my (%gene2TF2val, %TF);

    while (<DATA>)
    {
        my ($gene, $tf) = split;
        push @{ $gene2TF2val{ $gene } }, $tf;
        ++$TF{ $tf };
    }

    # Print table header
    print "\t$_" for sort tf_sort keys %TF;
    print "\n";

    # Print table contents
    for my $gene (sort keys %gene2TF2val)
    {
        # Print one line
        print $gene;
        for my $tf (sort tf_sort keys %TF)
        {
            print "\t", (any { $_ eq $tf } @{ $gene2TF2val{$gene} }) ? '+' : '-';
        }
        print "\n";
    }

    sub tf_sort
    {
        my ($pre_a, $num_a) = $a =~ /^(\D+)(\d+)/;
        my ($pre_b, $num_b) = $b =~ /^(\D+)(\d+)/;
        return $pre_a cmp $pre_b || $num_a <=> $num_b;
    }

    __DATA__
    geneA T1
    geneA T1
    geneA T2
    geneB T8
    geneC T10
    geneC T1

    Output:

    16:28 >perl 1585_SoPW.pl
            T1      T2      T8      T10
    geneA   +       +       -       -
    geneB   -       -       +       -
    geneC   +       -       -       +

    16:30 >

    (The trickiest part is writing the custom sort routine tf_sort to ensure that “T10” comes after “T8” — see sort.)
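
To make the sorting point concrete, here is a small sketch comparing Perl's default string sort with a prefix-then-number comparison modelled on tf_sort. It assumes every name is a non-digit prefix followed by digits (as in the sample data); the sub name by_prefix_then_number is only illustrative.

```perl
use strict;
use warnings;

my @tf = qw(T10 T1 T8 T2);

# The default sort compares strings character by character,
# so "T10" lands before "T2".
my @by_string = sort @tf;

# Split each name into a non-digit prefix and a number, then compare
# the prefixes as strings and the numbers numerically.
sub by_prefix_then_number {
    my ($pre_a, $num_a) = $a =~ /^(\D+)(\d+)/;
    my ($pre_b, $num_b) = $b =~ /^(\D+)(\d+)/;
    return $pre_a cmp $pre_b || $num_a <=> $num_b;
}
my @by_number = sort by_prefix_then_number @tf;

print "string order:  @by_string\n";   # T1 T10 T2 T8
print "numeric order: @by_number\n";   # T1 T2 T8 T10
```

Names that do not match the assumed pattern would leave the captures undefined and trigger warnings, so real data may need a fallback.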

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: transforming a table
by kennethk (Abbot) on Apr 04, 2016 at 13:42 UTC
    Given that you have two sets of non-numeric keys, I would think a hash of hashes is a more logical structure than a hash of arrays. I would also, as Athanasius suggests, separate your data structure construction from your output process. Essentially, build a hash of hashes, create sorted lists of your genes and transcription factors, and then print the table:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::MoreUtils qw(uniq);
    use 5.10.0;

    my %gene2TF2val;
    while (<DATA>) {
        my ($gene, $tf) = split;
        $gene2TF2val{$gene}{$tf}++;
    }

    # ID distinct transcription factors, sorted by value
    my @tf = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0]}
             uniq map keys %$_, values %gene2TF2val;

    # Print table header
    say join "\t", "NAMES", @tf;

    # Print table contents
    for my $gene (sort keys %gene2TF2val) {
        say join "\t", $gene, map $_ ? '+' : '-', @{$gene2TF2val{$gene}}{@tf};
    }

    __DATA__
    geneA T1
    geneA T1
    geneA T2
    geneB T8
    geneC T10
    geneC T1
    Note that the command line switch -w is (mostly) synonymous with use warnings. If you are not yet comfortable with map and Slices, you can store intermediate results in arrays:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::MoreUtils qw(uniq);
    use 5.10.0;

    my %gene2TF2val;
    while (<DATA>) {
        my ($gene, $tf) = split;
        $gene2TF2val{$gene}{$tf}++;
    }

    # ID distinct transcription factors, sorted by value
    my @tf;
    for my $tf (values %gene2TF2val) {
        push @tf, keys %$tf;
    }
    @tf = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0]} uniq @tf;

    # Print table header
    say join "\t", "NAMES", @tf;

    # Print table contents
    for my $gene (sort keys %gene2TF2val) {
        print $gene;
        for my $tf (@tf) {
            my $has = $gene2TF2val{$gene}{$tf} ? '+' : '-';
            print "\t$has";
        }
        print "\n";
    }

    __DATA__
    geneA T1
    geneA T1
    geneA T2
    geneB T8
    geneC T10
    geneC T1
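
For anyone unsure what the hash slice in the first version does, the following is a minimal sketch that isolates that one step. The hash %seen and its contents are made up for illustration; only the slice syntax itself mirrors the code above.

```perl
use strict;
use warnings;

# Hypothetical data: counts of each TF seen for one gene.
my %seen = (geneA => { T1 => 2, T2 => 1 });
my @tf   = qw(T1 T2 T8);

# A hash slice on a hash reference returns one value per requested key,
# in the order of @tf; keys that are absent yield undef.
my @counts = @{ $seen{geneA} }{@tf};

# Map each value to '+' when it is true (present) and '-' otherwise.
my @marks = map { $_ ? '+' : '-' } @counts;

print join("\t", 'geneA', @marks), "\n";
```

Splitting the slice and the map into two statements is purely for readability; the one-line form does the same work.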

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: transforming a table
by woland99 (Beadle) on Apr 04, 2016 at 17:42 UTC
    If I were doing it the pedestrian way, I would make two hashes. The first would be a plain hash of T-names. The second would be a hash of hashes: its keys would be gene names, and its values would be hashes indexed by T-names.
    Loop through the data, collecting every T-name encountered into the first hash and setting the "+" values in the second hash. For example, for "geneA T2", set $gene_present{geneA}->{T2} = '+'.
    Then loop through all the gene names in the keys of gene_present and all the keys of the T-name hash, and check whether each T-name key exists for that gene in gene_present; if not, set it to '-'.
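
The two-hash approach described above could be sketched roughly as follows. This is one possible reading of the description, not woland99's own code: the variable names %t_names, @rows and @lines are assumptions, the sample rows are hard-coded rather than read from a file, and the column ordering assumes T-names of the form T followed by a number.

```perl
use strict;
use warnings;

# Sample rows from the question; in practice these would be read from a file.
my @rows = ('geneA T1', 'geneA T1', 'geneA T2',
            'geneB T8', 'geneC T10', 'geneC T1');

my %t_names;        # plain hash: every T-name seen
my %gene_present;   # hash of hashes: gene name => { T-name => '+' or '-' }

# Pass 1: record each T-name and mark the pairs that occur.
for my $row (@rows) {
    my ($gene, $t) = split ' ', $row;
    $t_names{$t} = 1;
    $gene_present{$gene}{$t} = '+';
}

# Pass 2: fill every missing gene/T-name combination with '-'.
for my $gene (keys %gene_present) {
    for my $t (keys %t_names) {
        $gene_present{$gene}{$t} = '-'
            unless exists $gene_present{$gene}{$t};
    }
}

# Build the table, ordering columns by the numeric part of each T-name.
my @cols  = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } keys %t_names;
my @lines = (join "\t", 'NAMES', @cols);
for my $gene (sort keys %gene_present) {
    push @lines, join "\t", $gene, @{ $gene_present{$gene} }{@cols};
}
print "$_\n" for @lines;
```

Because every cell is filled in during the second pass, the printing step needs no presence check; the trade-off is that the hash stores an entry for every gene/T-name pair, including the absent ones.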