Nalababu has asked for the wisdom of the Perl Monks concerning the following question:

How do you compare two columns of data and print the results? The values in one column are repeated, so I want to compare them with the other column and print each matching value only once.
#!/usr/bin/perl -w
use strict;
use warnings;
use Getopt::Std;

my $file = "lethal_results2.txt"; # input file containing nubiscan results
my $ge   = "geneid.txt";          # file with TE name and length

open(FILE, "<", $file) or die("Unable to open file $file: $!");
open(GE,   "<", $ge)   or die("Unable to open $ge: $!");
my @file_data = <FILE>;
my @ge_data   = <GE>;             # input stored in an array
close(FILE);
close(GE);

foreach my $line (@ge_data) {
    my @line  = split(/\s+/, $line);
    my $start = $line[0];
    #print "$line[0] \n";
    foreach my $values (@file_data) {
        my @values = split(/\s+/, $values);
        my $id     = $values[0];
        #print "$values[0] \n";
        my $width = 11;
        if ($values[0] eq $line[0]) {
            #print "the gene id \n";
            print "$line[0] \n";
        }
    }
}
This code prints all the values just as they are; the split/compare does not print them the way I want. The column 1 data:
ENSMUSG00000050310
ENSMUSG00000025583
ENSMUSG00000021198
ENSMUSG00000052595
ENSMUSG00000015243
ENSMUSG00000024130
ENSMUSG00000031333
ENSMUSG00000026842
ENSMUSG00000026596
ENSMUSG00000020532
ENSMUSG00000026003
ENSMUSG00000023328
ENSMUSG00000029580
ENSMUSG00000054808
ENSMUSG00000026836
Column 2:
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000025583
ENSMUSG00000025583
ENSMUSG00000025583
ENSMUSG00000025583
ENSMUSG00000025583
In the second column the values are repeated, so I want to compare the values in both columns and then print those present in both only once, without repetition.

Re: Comparing two columns
by apl (Monsignor) on Jul 10, 2009 at 14:02 UTC
    If there is that much duplication, and that many columns,
    1. write a short script to split the one file (containing the two columns) into two files (A1, A2) of one column each (a sketch of this step follows below)
    2. use your native sort utility on A1 and A2 to produce two files (B1, B2) containing only unique data
    3. use your native diff utility to do the comparison between B1 and B2
    Never recreate what others have already done for you.
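    A minimal sketch of step 1, assuming the two columns sit together in a single whitespace-separated file (the names columns.txt, A1 and A2 are placeholders, not from the original post):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Step 1: split a two-column file into two one-column files.
    # "columns.txt", "A1" and "A2" are assumed names for illustration.
    open my $in, '<', 'columns.txt' or die "Cannot open columns.txt: $!";
    open my $a1, '>', 'A1'          or die "Cannot open A1: $!";
    open my $a2, '>', 'A2'          or die "Cannot open A2: $!";

    while (my $line = <$in>) {
        my ($col1, $col2) = split /\s+/, $line;
        print $a1 "$col1\n" if defined $col1;
        print $a2 "$col2\n" if defined $col2;
    }

    close $in;
    close $a1;
    close $a2;

    # Steps 2 and 3 are then along the lines of:
    #   sort -u A1 > B1 ; sort -u A2 > B2 ; diff B1 B2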
Re: Comparing two columns
by ig (Vicar) on Jul 10, 2009 at 15:52 UTC

    The following makes no assumption about which of your input files has duplicates. It demonstrates the use of a hash of arrays to capture multiple values for each key. You may be able to adapt it to produce the results you need.

    my %data;

    foreach my $line (@ge_data) {
        my ($id, $rest) = split(/\s+/, $line, 2);
        push(@{$data{$id}{ge_data}}, $rest);
    }
    foreach my $line (@file_data) {
        my ($id, $rest) = split(/\s+/, $line, 2);
        push(@{$data{$id}{file_data}}, $rest);
    }

    foreach my $id (sort keys %data) {
        next unless (exists $data{$id}{ge_data});
        next unless (exists $data{$id}{file_data});
        print "$id:\n";
        print "\tge_data:\n";
        print "\t\t" . join("\t\t", @{$data{$id}{ge_data}});
        print "\tfile_data:\n";
        print "\t\t" . join("\t\t", @{$data{$id}{file_data}});
    }
Re: Comparing two columns
by BioLion (Curate) on Jul 10, 2009 at 11:42 UTC

    Can you post some representative sample data please?

    It sounds like you would be better off using a hash for lookups, rather than iterating over a whole array for every line - see Tie::File::AsHash.
    If you have multiple entries for a given id, you might be better off rolling your own:

    push @{ $data{ $split_line[0] } }, $line;
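    For the simple "print each common id once" case, a minimal untested sketch along these lines, reusing the @ge_data and @file_data arrays from the original post, might do:

    # Index the ids from the gene file, then scan the results file,
    # printing each id that occurs in both, once only.
    my %in_ge = map { (split /\s+/)[0] => 1 } @ge_data;

    my %seen;
    foreach my $line (@file_data) {
        my ($id) = split /\s+/, $line;
        next unless defined $id && $in_ge{$id};   # present in both columns?
        print "$id\n" unless $seen{$id}++;        # report each match only once
    }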

    Just a something something...
      I have posted some sample data.

        The data still isn't really representative though is it!? :P

        Just store the file with the single entries as a hash:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Tie::File::AsHash;

        my $first  = "single_records.txt";
        my $second = "multiple_records.txt";

        warn "tie-ing $first...\n";
        tie my %hsh, 'Tie::File::AsHash', $first, split => qr/\s+/
            or die "Problem tying %hsh: $!";

        open(my $fh, '<', $second) or die "Failed to open $second : $!"; ## always check for success on fh

        while (<$fh>) {
            chomp(my $line = $_);
            my ($id) = split /\s+/, $line;    ## capture id
            ## compare to tied hash of single records
            if (exists $hsh{$id}) {
                print "$line matched $id : $hsh{$id}\n";
            }
        }

        ## tidy up
        close($fh) or die "Failed to close $second : $!";
        untie %hsh;

        Or you can do the other thing I suggested and hold an array ref of all the lines with a certain id, then print out all the lines when you see the id in the single record file.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $first  = "single_records.txt";
        my $second = "multiple_records.txt";

        open(my $fh, '<', $second) or die "Failed to open $second : $!"; ## always check for success on fh

        my %second_records = ();
        while (<$fh>) {
            my $line = $_;
            my ($id) = split /\s+/, $line;    ## capture id
            push @{ $second_records{$id} }, $line;
        }

        ## tidy up
        close($fh) or die "Failed to close $second : $!";

        ## open single record file and compare
        open($fh, '<', $first) or die "Failed to open $first : $!";
        while (<$fh>) {
            chomp(my $line = $_);
            my ($id) = split /\s+/, $line;    ## capture id
            if (exists $second_records{$id}) {
                print join '', "$line matches records:\n", @{ $second_records{$id} };
            }
        }
        close($fh) or die "Failed to close $first : $!";

        Hope this helps.

        Just a something something...
Re: Comparing two columns
by mzedeler (Pilgrim) on Jul 10, 2009 at 20:23 UTC

    This snippet just prints the keys common to the two files file1.txt and file2.txt:

    use strict;
    use warnings;

    sub get_file {
        open my $FILE, '<', shift or die $!;
        return map { chomp; $_ => $_ } <$FILE>;
    }

    my %a = get_file 'file1.txt';
    my %b = get_file 'file2.txt';

    {
        no warnings 'uninitialized';
        print "$_\n" for grep { $_ } @a{ keys %b };
    }

    The above can also easily be done using sort, uniq and grep.

Re: Comparing two columns
by perliff (Monk) on Jul 11, 2009 at 09:05 UTC
    You can use the native unix sort and comm commands if you have the column data in separate files.

    e.g. if file1 contains the column 1 data and file2 contains the column 2 data, type the following on the command line.

    sort file1 > file1.sorted
    sort file2 > file2.sorted
    comm -12 file1.sorted file2.sorted | sort | uniq > result.txt
    The result.txt file contains the values that are present in both files (the repeated lines are removed by the sort and uniq). Or you can use it inside the program, like this:
    system ("sort file2 > file2.sorted"); system ("sort file1 > file1.sorted"); my $result = `comm -12 file1.sorted file2.sorted | sort | uniq`;
    Now $result contains the identifiers that are common to both files, without repetition.
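    If you then want the ids as a list, a small untested follow-up might be:

    # Split the captured backtick output (in $result above) into individual ids.
    my @common_ids = split /\n/, $result;
    print scalar(@common_ids), " common identifiers\n";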

    perliff

    ----------------------

    -with perl on my side