Re: Removing partially duplicated lines from a file
by perldigious (Priest) on Jul 26, 2016 at 16:23 UTC
#!/usr/bin/perl
use warnings;
use strict;
open(my $in_fh, '<', 'input.txt') or die $!;
open(my $out_fh, '>', 'output.txt') or die $!;
my %seen_lines;
while (<$in_fh>)
{
chomp;
my @columns = split;
if ($columns[1] and $columns[1] =~ /^HLA-A/)
{
my $HLA_Peptide = $columns[1] . $columns[2];
print $out_fh "$_\n" if (!exists $seen_lines{$HLA_Peptide});
$seen_lines{$HLA_Peptide} = 1;
}
else
{
print $out_fh "$_\n";
}
}
close $out_fh;
close $in_fh;
I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
| [reply] [d/l] |
Thank you. This works well, but I don't understand all of your code. For example, why do we need to say
if ($columns[1] and $columns[1] =~ /^HLA-A/)
i.e. with the same variable used twice? Also, I would like to store the output in a variable rather than print it to a file. I know this probably seems like a minor change to your great code, but I can't make it work. Could you help, please? (My novice-level skills are showing.)
| [reply] [d/l] |
The line you asked about says: if $columns[1] holds a true value (any value Perl evaluates as true) and that value is a string beginning with "HLA-A", then take the following actions. So $columns[1] is tested twice for two different things: first that it has a value at all, then that the value matches the pattern. I included the first check because I assumed use warnings; would complain about any line that has no element at index 1 of @columns. I didn't actually try it without the check, but I assumed that would happen for at least the all "---" lines.
As for the code changes you requested:
#!/usr/bin/perl
use warnings;
use strict;
open(my $in_fh, '<', 'input.txt') or die $!;
my $output;
my %seen_lines;
while (<$in_fh>)
{
chomp;
my @columns = split;
if ($columns[1] and $columns[1] =~ /^HLA-A/)
{
my $HLA_Peptide = $columns[1] . $columns[2];
$output .= "$_\n" if (!exists $seen_lines{$HLA_Peptide});
$seen_lines{$HLA_Peptide} = 1;
}
else
{
$output .= "$_\n";
}
}
close $in_fh;
print $output;
EDIT: I did just try it without that first check and I was correct: it throws warnings without it. There may be a better way to avoid the warning (it occurs to me that a false value such as "0" or an empty string in that column would fail my check too), but I use this trick a lot to appease use warnings; or "-w". I wonder whether there is something like exists, which I use a lot for hashes, that is meant for checking whether an array element exists?
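On that closing question, here is a minimal sketch of two common ways to test whether an array slot is populated without relying on truthiness. It assumes whitespace-split input as in the scripts above; the helper names is_hla_a_row and has_second_column are made up for illustration. Perl's exists does accept array elements, but its documentation discourages that use, so the sketch sticks to defined and an element count.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the second whitespace-separated
# field is present and starts with "HLA-A".
sub is_hla_a_row {
    my ($line) = @_;
    my @columns = split ' ', $line;
    # defined guards against the undef that triggers the warning,
    # yet still lets false-but-present values such as "0" through
    # to the pattern match.
    return (defined $columns[1] && $columns[1] =~ /^HLA-A/) ? 1 : 0;
}

# Hypothetical helper: the same kind of guard, based on element count.
sub has_second_column {
    my ($line) = @_;
    my @columns = split ' ', $line;
    return @columns > 1 ? 1 : 0;
}

print is_hla_a_row("117 HLA-A*11:01 YVNVNMGLK"), "\n";
print is_hla_a_row("-----------"), "\n";
print has_second_column("0 0"), "\n";
```

Either guard can stand in for the plain truth test in the if condition; the difference only matters when the second column can legitimately hold "0" or an empty string.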
| [reply] [d/l] [select] |
Okay. I am commenting here just because I thought of another way to solve this problem. What if you sort the lines before you try to eliminate the duplicates? That way the same lines will fall right next to each other, and you can just skip them by comparing this line to the previous line. If the two are the same, then you can skip that because it's a duplicate. This is a good idea if you don't expect to have a lot of duplicate lines and you plan to sort the output later on. Might as well sort it now and eliminate the duplicates in one step. ;-)
use strict;
use warnings;
my $ff = 'robots.txt';
my $fh;
my @lines;
# Read the entire file and
# store lines in an array
open $fh, "<", $ff or die "Sorry, can't open file - $ff\n";
{
local $/;
@lines = split("\n", <$fh>);
}
close $fh;
# Get rid of duplicate lines
@lines = sort(@lines);
my $L;
my $prev = '';
foreach $L (@lines)
{
print($L . "\n") if ($prev ne $L);
$prev = $L;
}
| [reply] [d/l] |
Re: Removing partially duplicated lines from a file
by NetWallah (Canon) on Jul 26, 2016 at 16:39 UTC
Would this one-liner suffice ?:
>perl -ane "next if $seen{$F[1].$F[2]}++;print" InputFile.txt
-----------------------------------------------------------------------------------
Pos HLA Peptide Core Of Gp Gl Ip Il Icore Identity Score Aff(nM) %Rank BindLevel
117 HLA-A*11:01 YVNVNMGLK YVNVNMGLK 0 0 0 0 0 YVNVNMGLK GQ924620_HBe_C_ 0.62268 59.3 0.40 <= SB
28 HLA-A*11:01 WGMDIDPYK WGMDIDPYK 0 0 0 0 0 WGMDIDPYK GQ924620_HBe_C_ 0.44617 400.4 1.60 <= WB
133 HLA-A*11:01 HISCLTFGR HISCLTFGR 0 0 0 0 0 HISCLTFGR GQ924620_HBe_C_ 0.43660 444.0 1.70 <= WB
47 HLA-A*02:05 YVNVNMGLK FLPSDFFPS 0 0 0 0 0 FLPSDFFPS X02763_HBe_A_po 0.77090 11.9 0.08 <= SB
40 HLA-A*02:05 ATVELLSFL ATVELLSFL 0 0 0 0 0 ATVELLSFL X02763_HBe_A_po 0.75279 14.5 0.10 <= SB
1 HLA-A*02:05 MQLFHLCLI MQLFHLCLI 0 0 0 0 0 MQLFHLCLI X02763_HBe_A_po 0.66669 36.8 0.30 <= SB
9 HLA-A*02:05 IISCTCPTV IISCTCPTV 0 0 0 0 0 IISCTCPTV X02763_HBe_A_po 0.52206 176.1 1.40 <= WB
147 HLA-A*02:05 YLVSFGVWI YLVSFGVWI 0 0 0 0 0 YLVSFGVWI X02763_HBe_A_po 0.51724 185.5 1.40 <= WB
55 HLA-A*02:05 SVRDLLDTA SVRDLLDTA 0 0 0 0 0 SVRDLLDTA X02763_HBe_A_po 0.49966 224.4 1.70 <= WB
114 HLA-A*02:05 VVNYVNTNV VVNYVNTNV 0 0 0 0 0 VVNYVNTNV X02763_HBe_A_po 0.48729 256.6 1.80 <= WB
93 HLA-A*02:05 ELMTLATWV ELMTLATWV 0 0 0 0 0 ELMTLATWV X02763_HBe_A_po 0.46686 320.0 2.50
8 HLA-A*02:05 LIISCTCPT LIISCTCPT 0 0 0 0 0 LIISCTCPT X02763_HBe_A_po 0.45053 381.9 2.50
117 HLA-A*11:01 IISCTCPTV YVNVNMGLK 0 0 0 0 0 YVNVNMGLK AB219428_HBe_B_ 0.62268 59.3 0.40 <= SB
UPDATE: If you want to maintain the headers, try this:
perl -ane "next unless not $seen{$F[1].$F[2]}++ or m/^\s*\D/;print" InputFile.txt
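For readers unfamiliar with the switches, the header-preserving one-liner can be written out as a short script. This is a sketch rather than NetWallah's code: -n supplies the read loop and -a splits each line on whitespace into @F. The sub name dedupe_by_hla_peptide and its list-of-lines interface are assumptions made so the logic can be tried on in-memory data, and the header test deliberately differs from the one-liner's regex, as the comment explains.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines that survive de-duplication.
sub dedupe_by_hla_peptide {
    my @lines = @_;
    my (%seen, @kept);
    for my $line (@lines) {
        my @F = split ' ', $line;                       # the split that -a performs
        my $key = join '', map { $_ // '' } @F[1, 2];   # HLA . Peptide
        my $first_time = !$seen{$key}++;
        # Assumption: header and separator lines start with a character that
        # is neither a digit nor whitespace. This differs from the one-liner's
        # /^\s*\D/, which would also match a data row that begins with a space.
        my $is_header = $line =~ /^\s*[^\d\s]/;
        push @kept, $line if $first_time || $is_header;
    }
    return @kept;
}

my @sample = (
    "Pos HLA Peptide",
    "117 HLA-A*11:01 YVNVNMGLK",
    " 28 HLA-A*11:01 YVNVNMGLK",
    "Pos HLA Peptide",
);
print "$_\n" for dedupe_by_hla_peptide(@sample);
```

On the sample data the second HLA-A*11:01 row is dropped because its HLA and Peptide columns repeat the first row, while both header lines are kept.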
"Software interprets lawyers as damage, and routes around them" - Larry Wall
| [reply] [d/l] [select] |
Re: Removing partially duplicated lines from a file
by Mandrake (Chaplain) on Jul 27, 2016 at 09:19 UTC
If I understood the question, something like the below should work.
#!/bin/perl -w
use strict;
open(TMP, '<', 'input.txt') or die("could not open the input file: $!\n");
my (%hash, @columns);
for (<TMP>) {
chomp;
@columns = split ;
next unless ($columns[1] && $columns[2]);
if (not exists $hash{$columns[1].$columns[2]}) {
$hash{$columns[1].$columns[2]}=1;
print $_."\n" ;
}
}
close TMP;
| [reply] [d/l] |
Re: Removing partially duplicated lines from a file
by Anonymous Monk on Jul 26, 2016 at 15:53 UTC
"I am a bit flummoxed, having tried and failed with regex!"
What did you try?
| [reply] |
sub removeDuplicatesFromOutputText {
my $origfile = $_[0]; # eg. HLA-A_0_HBe_for_8_sids.txt;
my %hTmp;
my $outfile = '';
my $tempout;
open(my $in, '<', $origfile) or die "Cannot open $origfile: $!";
while (my $line = <$in>) {
if ($line =~ /^\s+\d+/){
next if $line =~ m/^\s*$/;
}
$line=~s/^\s+//;
$line=~s/\s+$//;
# Append only lines not seen before. Appending a stale $tempout for
# duplicates would repeat the previous line.
$outfile .= qq{$line\n} unless ($hTmp{$line}++);
}
return $outfile;
}
| [reply] [d/l] |