FIJI42 has asked for the wisdom of the Perl Monks concerning the following question:

I'm new to Perl and was wondering if anyone could provide suggestions, relevant examples, or resources for a coding problem I'm having. I have two data files with tab-delimited columns, similar to the example below.

File#1:
GeneID  ColA  ColB
Gene01  5     15
Gene02  4     8
Gene03  25    5

File#2:
GeneID  ColA  ColC
Gene01  12    3
Gene03  5     20
Gene05  22    40
Gene06  88    2

The actual files I'm using have >50 columns and rows, but are similar in form to what's above. First, I want to read in the files, store the column names from each file in variables, and build one hash per file using the column-1 genes as keys and the concatenated values of the other 2 columns as values, so each row contributes exactly one key/value pair. My trouble is the third hash, %commongenes. I need to find the keys that appear in both hashes and store just those keys, with their associated values from both files, in the third hash. In the above example, that would be the following key/value pairs:

File1:          File2:
Gene01 5 15     Gene01 12 3
Gene03 25 5     Gene03 5 20

I know the following if statement is incorrect, but the concatenation of columns from both files (similar to what is below) is close in form to what I'd like to have.

if ($tmpArray1[0] eq $tmpArray2[0]) {
    $commongenes{$tmpArray2[0]} =
        $tmpArray1[1].':'.$tmpArray1[2].':'.$tmpArray2[1].':'.$tmpArray2[2];
}

Here is the main body of the code:

#!/usr/bin/perl -w
use strict;

my $file1 = $ARGV[0];
my $file2 = $ARGV[1];

open (FILE1, "<$file1") or die "Cannot open $file1 for processing!\n";
open (FILE2, "<$file2") or die "Cannot open $file2 for processing!\n";

my @fileLine1 = <FILE1>;
my @fileLine2 = <FILE2>;

my %file1_allgenes = ();
my %file2_allgenes = ();
my %commongenes    = ();

my ($file1_group0name, $file1_group1name, $file1_group2name) = ('', '', '');
my ($file2_group0name, $file2_group1name, $file2_group2name) = ('', '', '');

for (my $i = 0; $i <= $#fileLine1 && $i <= $#fileLine2; $i++) {
    chomp($fileLine1[$i]);
    chomp($fileLine2[$i]);
    my @tmpArray1 = split('\t', $fileLine1[$i]);
    my @tmpArray2 = split('\t', $fileLine2[$i]);

    if ($i == 0) {    ## Column names and/or letters
        $file1_group0name = substr($tmpArray1[0], 0, 6);
        $file1_group1name = substr($tmpArray1[1], 0, 4);
        $file1_group2name = substr($tmpArray1[2], 0, 4);
        $file2_group0name = substr($tmpArray2[0], 0, 6);
        $file2_group1name = substr($tmpArray2[1], 0, 4);
        $file2_group2name = substr($tmpArray2[2], 0, 4);
    }

    if ($i != 0) {    ## Concatenated values in 3 separate hashes
        if (! defined $file1_allgenes{$tmpArray1[0]}) {
            $file1_allgenes{$tmpArray1[0]} = $tmpArray1[1].':'.$tmpArray1[2];
        }
        if (! defined $file2_allgenes{$tmpArray2[0]}) {
            $file2_allgenes{$tmpArray2[0]} = $tmpArray2[1].':'.$tmpArray2[2];
        }
        if ($tmpArray1[0] eq $tmpArray2[0]) {
            $commongenes{$tmpArray2[0]} =
                $tmpArray1[1].':'.$tmpArray1[2].':'.$tmpArray2[1].':'.$tmpArray2[2];
        }
    }

    my @commongenes = %commongenes;
    print "@commongenes\n\n";
}
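For reference, the intersection the question asks for can be sketched over two already-built hashes. This is a minimal, hypothetical example (the hash contents mirror the sample files above; values are the remaining columns joined with ':'):

```perl
use strict;
use warnings;

# Hypothetical sample data mirroring the two files above: each value is
# the row's remaining columns joined with ':'.
my %file1_allgenes = (Gene01 => '5:15', Gene02 => '4:8',  Gene03 => '25:5');
my %file2_allgenes = (Gene01 => '12:3', Gene03 => '5:20', Gene05 => '22:40');

# Keep only the keys present in BOTH hashes, concatenating both values.
my %commongenes;
for my $gene (keys %file1_allgenes) {
    next unless exists $file2_allgenes{$gene};
    $commongenes{$gene} = join ':', $file1_allgenes{$gene}, $file2_allgenes{$gene};
}

print "$_ => $commongenes{$_}\n" for sort keys %commongenes;
# Gene01 => 5:15:12:3
# Gene03 => 25:5:5:20
```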

Replies are listed 'Best First'.
Re: Matching hash keys from different hashes and utilizing in new hash
by Laurent_R (Canon) on Oct 21, 2017 at 21:12 UTC
    Hi FIJI42,

    If I understand correctly, you're looking for records that have the same identifier (same first column) in file 1 and file 2, and you want to output the data of those common records.

    This can be much simpler.

    Start by reading the first file, store the data into a hash. Then read the second file line by line; if you find the identifier of a record of file 2 in the hash containing the data of file 1, then output it with the desired format. Something like that (untested because there is not enough sample data):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($file1, $file2) = @ARGV;
    my %hash_file1;

    open my $FILE1, "<", $file1 or die "Cannot open $file1 for processing!\n";
    while (my $line = <$FILE1>) {
        my ($key, @fields) = split /\s+/, $line;
        $hash_file1{$key} = join ":", @fields;
    }
    close $FILE1;

    open my $FILE2, "<", $file2 or die "Cannot open $file2 for processing!\n";
    while (my $line = <$FILE2>) {
        my ($key, @fields) = split /\s+/, $line;
        my $rest_of_line = join ":", @fields;
        if (exists $hash_file1{$key}) {    # this is a common record (same identifier)
            print $key, ":", $hash_file1{$key}, ":", $rest_of_line, "\n";
        }
    }
    close $FILE2;
    BTW, this should probably work with many more columns in your file.

      This method sort of backdoors the "header-row" handling, in that it assumes the $key in both header rows is the same and doesn't duplicate another valid $key in the data area.

      just saying :) ("Once bitten, twice shy")
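One way to avoid that assumption is to consume the header line explicitly before the read loop. A minimal sketch, using an in-memory filehandle to stand in for the real file:

```perl
use strict;
use warnings;

# In-memory filehandle standing in for the real tab-delimited file.
my $content = "GeneID\tColA\tColB\nGene01\t5\t15\nGene02\t4\t8\n";
open my $fh, '<', \$content or die "Cannot open in-memory file: $!\n";

my $header = <$fh>;    # consume and discard "GeneID ColA ColB"

my %genes;
while (my $line = <$fh>) {
    my ($key, @fields) = split /\s+/, $line;
    $genes{$key} = join ':', @fields;
}
close $fh;

print "$_ => $genes{$_}\n" for sort keys %genes;
# Gene01 => 5:15
# Gene02 => 4:8
```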

        Well, yes, maybe, but that's what I understand from the OP.

        I am doing this kind of processing (albeit usually much more complicated) all the time, but it very frequently (almost always) follows a remove-duplicates step. Here, we don't know enough about the input data.

      Thanks, this was very helpful. I forgot to add that I wanted to make a new hash with only the common keys and their associated column values (concatenated), but I believe I've got it. The reason for doing so was to split the columns apart in a subroutine I have for comparing the column values per key.

        Then you can just populate your new hash at the place near the end of the code where there is the print statement.

        But maybe you don't even need to populate a new hash since, at this point in the code, you have the two keys and the two strings representing the other columns; so you could quite probably make the comparison (or call the subroutine making the comparison) just there, instead of the print statement.
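As a sketch of that suggestion (sample data hard-coded in place of the real file reads, and the hash/variable names taken from the code above):

```perl
use strict;
use warnings;

# Assume %hash_file1 was built while reading file 1, as in the code above.
my %hash_file1 = (Gene01 => '5:15', Gene03 => '25:5');

my %commongenes;
# Inside the file-2 loop, instead of printing, store the combined values.
# These hard-coded lines stand in for reading file 2 line by line.
for my $line ("Gene01\t12\t3\n", "Gene03\t5\t20\n", "Gene05\t22\t40\n") {
    my ($key, @fields) = split /\s+/, $line;
    next unless exists $hash_file1{$key};
    $commongenes{$key} = join ':', $hash_file1{$key}, @fields;
}

# Later, split each value back apart for the comparison subroutine:
for my $gene (sort keys %commongenes) {
    my @cols = split /:/, $commongenes{$gene};
    print "$gene: @cols\n";
}
# Gene01: 5 15 12 3
# Gene03: 25 5 5 20
```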

Re: Matching hash keys from different hashes and utilizing in new hash
by kcott (Archbishop) on Oct 22, 2017 at 06:32 UTC

    G'day FIJI42,

    Welcome to the Monastery.

    Unfortunately, ">50 columns and rows" is somewhat vague: ">50 columns plus rows in total"? ">50 columns and >50 rows"? Something else? Also, both 51 and 51,000,000 satisfy ">50". In addition, providing an actual Perl data structure, for wanted or expected results, gives a much clearer picture of what you are trying to achieve.

    That said, here's the technique I might have used:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;

    use Text::CSV;
    use Data::Dump;

    die "Usage: $0 file1 file2" unless @ARGV == 2;

    my ($file1, $file2) = @ARGV;
    my $csv = Text::CSV::->new({sep_char => "\t"});

    my $gene_data_1 = get_gene_data($file1, $csv);
    my $gene_data_2 = get_gene_data($file2, $csv);

    my %gene_common;
    for (keys %$gene_data_1) {
        next unless exists $gene_data_2->{$_};
        push @{$gene_common{$_}}, $gene_data_1->{$_}, $gene_data_2->{$_};
    }

    dd $gene_data_1;
    dd $gene_data_2;
    dd \%gene_common;

    sub get_gene_data {
        my ($file, $csv) = @_;
        my %data;
        open my $fh, '<', $file;
        my $header = $csv->getline($fh);
        my @cols = @$header[1 .. $#$header];
        while (my $row = $csv->getline($fh)) {
            @{$data{$row->[0]}}{@cols} = @$row[1 .. $#$row];
        }
        return \%data;
    }

    Which outputs:

    {
      Gene01 => { ColA => 5, ColB => 15 },
      Gene02 => { ColA => 4, ColB => 8 },
      Gene03 => { ColA => 25, ColB => 5 },
    }
    {
      Gene01 => { ColA => 12, ColC => 3 },
      Gene03 => { ColA => 5, ColC => 20 },
      Gene05 => { ColA => 22, ColC => 40 },
      Gene06 => { ColA => 88, ColC => 2 },
    }
    {
      Gene01 => [{ ColA => 5, ColB => 15 }, { ColA => 12, ColC => 3 }],
      Gene03 => [{ ColA => 25, ColB => 5 }, { ColA => 5, ColC => 20 }],
    }

    Notes (bearing in mind your "New to Perl" comment):

    • You used strict. That's great, keep doing that.
    • You used the "-w" switch. That's less great. Prefer the warnings pragma as recommended in "perlrun: -w".
    • Consider using the autodie pragma. It saves having to handcraft your own '... or die "reason";' code, which is tedious, easy to forget or get wrong and, as such, error-prone; for instance, your current messages say what is wrong ("Can't open") but not why ("non-existent file"? "insufficient privileges"? "something else"?).
    • Text::CSV is the best module to use for this (or similar, comma-separated, pipe-separated, etc.) type of data. It's already solved the types of problems typically encountered: it's not a wheel you need to reinvent. If you also have Text::CSV_XS installed, it will run faster.
    • Data::Dump is only used to show the results: it's not part of the solution.
    • Use the 3-argument form of open with lexical (my) filehandles. See that documentation for details.
    • Use subroutines to abstract similar functions.
    • See "perldata: Slices" for information on array slices (e.g. @$header[1 .. $#$header]) and hash slices (e.g. @{$data{$row->[0]}}{@cols}).
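As a tiny illustration of the hash-slice assignment used in get_gene_data() above, one statement pairs a list of column names with a list of values:

```perl
use strict;
use warnings;

# One hash-slice assignment builds the inner hash for a row:
# the list of column names on the left is paired element-by-element
# with the list of values on the right.
my %data;
my @cols = qw(ColA ColB);
my $row  = ['Gene01', 5, 15];    # a parsed data row: key, then values

@{$data{$row->[0]}}{@cols} = @$row[1 .. $#$row];

print "$data{Gene01}{ColA} $data{Gene01}{ColB}\n";
# 5 15
```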

    — Ken

Re: Matching hash keys from different hashes and utilizing in new hash
by afoken (Chancellor) on Oct 21, 2017 at 21:17 UTC

    How about reading the tables into a database and using SQL instead? Your files look close enough to CSV that you should use Text::CSV (and especially Text::CSV_XS) for reading instead of parsing them manually. Add DBI and DBD::SQLite and you have a performant, serverless database. Part one of your program would read the CSV files and write them into the SQLite database. Or, even easier but slower, use DBI with DBD::CSV (which sits on top of Text::CSV) to make your CSV files appear as tables in a relational database. Part two would then just query the database.

    Update: Why a database? Because it can easily handle input files significantly larger than your available RAM. With pure hashes, you are limited by available RAM. You don't have to use SQLite, but it is a good start for tests. If things grow bigger, I would recommend using PostgreSQL. If you have a commercial RDBMS around (Oracle, MS SQL Server, ...), you may as well use that.
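A minimal sketch of that two-part approach, assuming DBD::SQLite is installed; an in-memory database and hard-coded rows keep the example self-contained (a real run would insert the parsed CSV rows and use a file-backed database):

```perl
use strict;
use warnings;
use DBI;

# ':memory:' keeps the sketch self-contained; use a filename in practice.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });

# Part one: load each file's rows into its own table.
$dbh->do('CREATE TABLE file1 (GeneID TEXT PRIMARY KEY, ColA INT, ColB INT)');
$dbh->do('CREATE TABLE file2 (GeneID TEXT PRIMARY KEY, ColA INT, ColC INT)');
my $ins1 = $dbh->prepare('INSERT INTO file1 VALUES (?, ?, ?)');
my $ins2 = $dbh->prepare('INSERT INTO file2 VALUES (?, ?, ?)');
$ins1->execute(@$_) for [qw(Gene01 5 15)], [qw(Gene02 4 8)], [qw(Gene03 25 5)];
$ins2->execute(@$_) for [qw(Gene01 12 3)], [qw(Gene03 5 20)], [qw(Gene05 22 40)];

# Part two: an inner join yields exactly the common genes.
my $rows = $dbh->selectall_arrayref(q{
    SELECT f1.GeneID, f1.ColA, f1.ColB, f2.ColA, f2.ColC
    FROM file1 f1 JOIN file2 f2 USING (GeneID)
    ORDER BY f1.GeneID
});
print join(':', @$_), "\n" for @$rows;
# Gene01:5:15:12:3
# Gene03:25:5:5:20

$dbh->disconnect;
```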

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      How about reading the tables into a database and using SQL instead? ... Or, even easier but slower, ...
      Yes, you could use a database. This might even be the best solution if the files are truly huge and can't fit into memory.

      But if speed matters (and assuming the data is not too large for the available memory), the hash solution I suggested would be completed long before the data is stored into the database and you even start to query the database.

Re: Matching hash keys from different hashes and utilizing in new hash
by choroba (Cardinal) on Oct 22, 2017 at 00:05 UTC
    Crossposted to StackOverflow, where I provided a solution that unearthed a warnings bug in older Perl versions. Note that it's considered polite to inform about crossposting to avoid duplicate work of people not attending both sites.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,