Re: Merging/Rearranging Tables
by liverpole (Monsignor) on Feb 09, 2007 at 18:36 UTC
Hi homeveg,
I think you will want to do this with a couple of passes.
In the first pass, read each table, keeping track of how many columns it has and saving the individual values for each ID.
In the second pass, for each ID, collect the corresponding values from each table, or fill in the appropriate number of "n.a." placeholders (if the ID isn't defined for that table).
For example:
# Strict
use strict;
use warnings;

# Libraries
use Data::Dumper;

# User-defined

# Define tables, each one a separate value in the hash "$ptables"
my $ptables = {
    'tab1' => [
        "ID column | column 1",
        "gene 1 | value 1.1",
        "gene 2 | value 2.1",
        "gene 4 | value 4.1",
        "gene 8 | value 8.1",
    ],
    'tab2' => [
        "ID column | column 1 | column2",
        "gene 1 | value 1.1 | value 1.2",
        "gene 3 | value 3.1 | value 3.2",
        "gene 4 | value 4.1 | value 4.2",
    ],
};

# Globals
my %output;
my %ncolumns;
my %values;
my @tables = (sort keys %$ptables);                # Get all table names

# Main program
# First pass -- parse each table to fetch all the IDs
print "=== Pass 1 ===\n";
foreach my $table (@tables) {
    my $ptab = $ptables->{$table};                 # Assign to table
    my @rows = split(/\s*\|\s*/, shift @$ptab);    # Get column headings
    shift @rows;                                   # Discard "ID column"
    my $ncols = @rows;                             # Find number of columns
    $ncolumns{$table} = $ncols;                    # Save # of columns
    print "Reading $table; $ncols col(s)...\n";    # Announce table name
    foreach my $line (@$ptab) {
        my ($id, @vals) = split(/\s*\|\s*/, $line);    # Get ID and values
        $output{$id} ||= [ ];                          # Placeholder for ID
        $values{$table}{$id} = [ @vals ];              # Save values for table/ID
    }
}

# Second pass -- process each ID, adding values from each table
my @ids = (sort keys %output);
print "=== Pass 2 ===\n";
foreach my $id (@ids) {
    print "Processing ID $id\n";
    my $pout = $output{$id};                   # Get current ID list
    foreach my $table (@tables) {
        my $ncols   = $ncolumns{$table};       # Get number of columns
        my $pvalues = $values{$table}{$id};    # Get values for table/ID
        if (defined($pvalues)) {
            push @$pout, @$pvalues;            # Save values
        } else {
            push @$pout, ( "n.a." ) x $ncols;  # Missing value = N/A
        }
    }
}

# Verify results
print "=== Verify results ===\n";
foreach my $id (@ids) {
    my $pvalues = $output{$id};
    printf "%12.12s | %s\n", $id, join(" | ", @$pvalues);
}
The output of which is:
gene 1 | value 1.1 | value 1.1 | value 1.2
gene 2 | value 2.1 | n.a. | n.a.
gene 3 | n.a. | value 3.1 | value 3.2
gene 4 | value 4.1 | value 4.1 | value 4.2
gene 8 | value 8.1 | n.a. | n.a.
Of course, you can always add more tables to the master hash $ptables.
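For instance, a hypothetical third table (the name and values below are made up purely for illustration) could be added to the same structure before the passes run, and the rest of the code would pick it up automatically:

# Hypothetical extra table, following the same "ID column | ..." layout
$ptables->{'tab3'} = [
    "ID column | column 1",
    "gene 2 | value 2.3",
    "gene 5 | value 5.3",
];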
Thank you a lot! It looks like a ready-made solution for me!
I still have some problems with hashes (but I am learning...), so I still have a question:
-----------
How do I fill the HoA with the values? Will the following code work for me?
# define files containing tables (tab-delimited text files)
my @files = ['tab1.txt','tab2.txt'];

# Define master hash "%ptables"
my %ptables;

# read files and add data to the HoA:
foreach my $file (@files) {
    my @string_array = Read_File($file);        # define arrays of strings
    my $hash_key =~ s/\.txt//;                  # generate hash key
    $ptables{$hash_key} = [ @string_array ];    # save all strings to the hash
}
"Thank you a lot!"
You're welcome a lot!
"Will the following code work for me?"
Well, I'm tempted to ask -- "what happens when you try?". The best way to learn is, after all, by trying.
I do see that you're trying to initialize @files from a list reference ['tab1.txt','tab2.txt'], which likely won't do what you're expecting (you'll get a single item in @files, which is itself a reference to the 2-item list).
Better to declare it like this:
my @files = ('tab1.txt','tab2.txt');
# Or, using the "quote-word" function "qw",
# which lets you omit the quotes and the comma:
my @files = qw( tab1.txt tab2.txt );
Furthermore, you're creating a variable $hash_key which you're never assigning to, but rather trying to perform a regex substitution on with:
my $hash_key =~ s/\.txt//; # generate hash key
That's why, when I run it with use strict; and use warnings; I get the error:
Use of uninitialized value in substitution (s///) at merge.pl line 16.
So I'm assuming what you want instead is to assign to the filename $file, and then perform the substitution to get the resulting hash key:
(my $hash_key = $file) =~ s/\.txt//; # generate hash key
A final thought: make liberal use of Data::Dumper to see what data a given data structure contains at any time. For example, to see the entire contents of $ptables after making each assignment:
$ptables{$hash_key} = [ @string_array ];    # save all strings to the hash

print Dumper(\%ptables);                    # use "\" to pass reference of hash
Update: fixed typo (thanks johngg).
Re: Merging/Rearranging Tables
by talexb (Chancellor) on Feb 09, 2007 at 17:22 UTC
It sounds like a HoA (hash of arrays) would do the job fine.
If this doesn't make sense to you, I'm sure some of the other monks will be able to jump in with sample code.
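For concreteness, here is a minimal sketch of the kind of hash-of-arrays layout meant here (the keys and values are placeholders, not tied to any particular table):

use strict;
use warnings;

# Hash of arrays: each gene ID maps to the list of values collected for it.
my %merged = (
    'gene 1' => [ 'value 1.1', 'value 1.2' ],
    'gene 2' => [ 'value 2.1', 'n.a.'      ],
);

# Append another value to an existing ID's list:
push @{ $merged{'gene 1'} }, 'value 1.3';

# Print one merged row per ID:
for my $id (sort keys %merged) {
    print join(' | ', $id, @{ $merged{$id} }), "\n";
}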
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
Thanks for the suggestion!
Well, I was looking at the HoA, but I had some problems with proper hash organisation; that's why I was trying hard with pure arrays.
Anyway, below I see the code answering my question.
Re: Merging/Rearranging Tables
by dsheroh (Monsignor) on Feb 09, 2007 at 18:12 UTC
If you're on a *nix system and don't absolutely have to do this with perl, the join command would take care of this quite simply... Dump the tables to files (sorted on Field1, as join does require input files to be sorted), then something like join -a1 -a2 -t\| file1 file2 should do it. The main disadvantage of this method is that join only works on two files at a time.
To do it in perl, your first idea is on the right track, and is one of the fastest/most efficient ways to do this, with a couple of minor optimizations:
- Make a single list of all unique ID values from all tables. As already suggested, a hash would be ideal for this, since it will eliminate any duplicates.
- Sort this list and sort all tables by ID.
- Walk the list of unique IDs and grab the matching record (if any) from each table. Since everything's sorted, you can just run a single query against each table and only advance a record when a match is found.
This gives you a total of two queries against each table: one to collect all IDs at the start, then one to grab the data by ID after that list is built.
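A rough Perl sketch of that sorted walk, assuming each table has already been read into an array of [id, values...] rows sorted by ID (all the variable names and sample data here are illustrative, not part of the original suggestion):

use strict;
use warnings;

# Illustrative input: one sorted array of [id, values...] rows per table.
my @tables = (
    [ [ 'gene 1', 'value 1.1' ],              [ 'gene 2', 'value 2.1' ]              ],
    [ [ 'gene 1', 'value 1.2', 'value 1.3' ], [ 'gene 3', 'value 3.2', 'value 3.3' ] ],
);
my @widths = (1, 2);    # number of value columns in each table

# Query 1: collect the sorted list of unique IDs across all tables.
my %seen = map { $_->[0] => 1 } map { @$_ } @tables;
my @ids  = sort keys %seen;

# Query 2: walk the IDs with one cursor per table, advancing a cursor
# only when its current row matches the ID being processed.
my @pos = (0) x @tables;
for my $id (@ids) {
    my @row = ($id);
    for my $t (0 .. $#tables) {
        my $current = $tables[$t][ $pos[$t] ];
        if ($current && $current->[0] eq $id) {
            push @row, @{$current}[ 1 .. $#$current ];
            $pos[$t]++;
        }
        else {
            push @row, ('n.a.') x $widths[$t];
        }
    }
    print join(' | ', @row), "\n";
}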
Unfortunately, I can only use a Windows-based PC. That's why I chose Perl.
My prototype script works more or less as you described, except that I had not thought before about making a single list of unique IDs. I first thought about it while I was writing my question here :)
And I was not clever enough to organize the final arrangement in a single query.
Thanks a lot!
Re: Merging/Rearranging Tables
by johngg (Canon) on Feb 09, 2007 at 17:58 UTC
"add additional string, corresponding to missing IDs to each table"
This step makes me wonder if you have a defined and ordered list of values that a gene ID might have. This would have a significant bearing on how to approach the problem.
Perhaps you could clarify this point. Cheers, JohnGG
My idea was to bring the tables to the same size, at minimum to the same number of rows. Then I can sort them and join them together without worrying about differences in the table dimensions.
Re: Merging/Rearranging Tables
by Cristoforo (Curate) on Feb 13, 2007 at 02:31 UTC
A little late, but I wanted to show a slightly different way. Really, much of it follows liverpole's ideas, but it saves the data as a string instead of an array of values.
Chris
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Sort::Naturally;
my %data;
my $table;
my %col_cnt;
@ARGV = glob "tab*.txt";
while (<>) {
    next if /^ID/;
    chomp;
    ($table = fileparse $ARGV) =~ s/\.txt$// unless $table;
    $col_cnt{$table} ||= tr/|//;
    my ($gene, $cols) = split /\s*\|\s*/, $_, 2;
    $data{$gene}{$table} = $cols;
    $table = '' if eof;
}

my @tables = nsort keys %col_cnt;

for my $gene (nsort keys %data) {
    print $gene;
    for my $table (@tables) {
        if ($data{$gene}{$table}) {
            print ' | ' . $data{$gene}{$table};
        }
        else {
            print ' | ' . join " | ", ("n.a.") x $col_cnt{$table};
        }
    }
    print "\n";
}
Update: Changed from storing the output in a string, waiting until the entire output was built, and then printing, to printing as the program proceeds.
Update: Changed from the built-in sort to 'nsort' (use Sort::Naturally). If the fields being sorted contain a number exceeding 9 as part of their name, (gene1 gene2 gene9 gene10) or (tab1 tab2 ... tab11 tab12), they won't sort properly using the default sort.
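For anyone unsure why that matters, here is a tiny standalone comparison (the IDs are made up just to show the ordering):

use strict;
use warnings;
use Sort::Naturally;

my @ids = qw( gene10 gene2 gene1 gene9 );

print join(' ', sort  @ids), "\n";    # gene1 gene10 gene2 gene9  (lexical)
print join(' ', nsort @ids), "\n";    # gene1 gene2 gene9 gene10  (natural)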
Hi,
thanks for your solution. It is very interesting! Can you comment it a bit?
The only minor modification I would make is to keep the header line:
while (<>) {
    # next if /^ID/;
    chomp;
    ($table = fileparse $ARGV) =~ s/\.txt$// unless $table;
    $col_cnt{$table} ||= tr/|//;
    my ($gene, $cols) = split /\s*\|\s*/, $_, 2;
    $data{$gene}{$table} = $cols;
    $table = '' if eof;
}
while (<>) {
    next if /^ID/;
    chomp;

    # If $ARGV = 'tab1.txt', $table will = 'tab1',
    # unless $table is already initialized.
    # $table is uninitialized on the first read
    # and after end of file for every file
    # (see below: $table = '' if eof;)
    ($table = fileparse $ARGV) =~ s/\.txt$// unless $table;

    # If the column count for this table is not already
    # stored, then use the transliteration operator to count
    # the number of pipes (# of columns) and store it.
    $col_cnt{$table} ||= tr/|//;

    my ($gene, $cols) = split /\s*\|\s*/, $_, 2;

    # Here columns are not stored in an array, but in a string.
    # If the data is:
    #   gene 3 | value 3.1 | value 3.2
    # then $cols would be:
    #   'value 3.1 | value 3.2'
    $data{$gene}{$table} = $cols;

    $table = '' if eof;
}
Does this explain it better? I suspect that part of your question is about $ARGV, where it comes from, and about the input (angle) operator (while (<>)) reading files from @ARGV.
Chris
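In case it helps, here is a tiny standalone sketch of how while (<>), @ARGV and $ARGV fit together (the file names are only examples):

use strict;
use warnings;

# The diamond operator <> reads every file named in @ARGV, line by line,
# moving to the next file automatically.  $ARGV holds the name of the file
# the current line came from, and eof (no parentheses) is true on the last
# line of each individual file.
@ARGV = ('tab1.txt', 'tab2.txt');    # example file names

while (<>) {
    chomp;
    print "$ARGV: $_\n";
    print "-- end of $ARGV --\n" if eof;
}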