Re: How to match duplicate lines in a text file and extract only one of those lines to a new file

This does the job:

my %data;

while (<DATA>)
{
    chomp;
    
    my ($firstnum, $secondnum, $thingy, @bits) = split /\s/;    
    my $key = sprintf("%s\x00%s\x00%s", $firstnum, $secondnum, $thingy);

    for my $i (0 .. $#bits)
    {
        $data{$key}[$i] = [] unless exists $data{$key}[$i];
        push @{ $data{$key}[$i] }, $bits[$i];
    }
}

foreach my $key (sort keys %data)
{
    print join q[ ], split "\x00", $key;
    print q[ ];
    print join q[ ], map { join '/', @$_ } @{ $data{$key} };    
    print "\n";
}

__DATA__
1 51 Brahui A C A A T 
1 51 Brahui A C A G T 
3 51 Brahui A C A G C 
3 51 Brahui A C G A T 
5 51 Brahui A C G A T 
5 51 Brahui A C G G C 
7 51 Brahui A C G A T 
7 51 Brahui A C G G T 
9 51 Brahui A C G G T 
9 51 Brahui A C G G T

But don't just copy that as-is. Try to understand how it works. What you want to look at is:

"I/O Operators" in perlop.
split, join and map - see perlfunc.
perllol to teach you about nested data structures.
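To make the nested structure concrete, here is a minimal sketch of what ends up stored under one key. The helper name `group_columns` is hypothetical and not part of the code above; it only isolates the "array of columns, each column an array of values" idea that perllol describes.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of rows (array refs) into a list of
# columns, where each column holds the values seen at that position.
sub group_columns {
    my @rows = @_;
    my @columns;
    for my $row (@rows) {
        # push autovivifies $columns[$i] as an array ref on first use
        push @{ $columns[$_] }, $row->[$_] for 0 .. $#$row;
    }
    return \@columns;
}

my %data;
my $key = join "\x00", 1, 51, 'Brahui';    # same key format as the post
$data{$key} = group_columns([qw(A C A A T)], [qw(A C A G T)]);

# $data{$key} is now [ [A,A], [C,C], [A,A], [A,G], [T,T] ]
print join(' ', map { join '/', @$_ } @{ $data{$key} }), "\n";
```

Because `push` creates the inner array on first use, the explicit `exists` check in the original loop is harmless but not strictly needed.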

perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

Replies are listed 'Best First'.
Re^2: How to match duplicate lines in a text file and extract only one of those lines to a new file by danica (Initiate) on Apr 04, 2012 at 13:26 UTC
Hiya, thank you so much for your help. I tried running your code just to see how it works. One thing I noticed in the output is that the first column doesn't seem to get transformed. Some duplicates also seem to have been missed. Like so:

1 Brahui A C/C A/A A/G T/T
100 Hazara A C G A T C C
100 Hazara G C A A T C T
102 Hazara A C/C G/G A/G
Re^3: How to match duplicate lines in a text file and extract only one of those lines to a new file by aaron_baugher (Curate) on Apr 04, 2012 at 14:37 UTC
In your original sample data, every line began with two integers and then a text string. Now you seem to be running it on lines that begin with a single integer and a text string, so his code is picking up the first allele as part of the key. That is why the first allele column is never merged, and why lines that differ in that column are not treated as duplicates.

Aaron B.
My Woefully Neglected Blog, where I occasionally mention Perl.
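One way to handle both layouts is to make the number of leading key fields a parameter. This is only a sketch under that assumption: `collapse_rows` and `$key_width` are hypothetical names, and the two input lines are the Hazara lines that came through unmerged in the output quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: $key_width says how many leading fields identify a row.
sub collapse_rows {
    my ($key_width, @lines) = @_;
    my %data;    # key => [ [values in column 0], [values in column 1], ... ]

    for my $line (@lines) {
        my @fields = split ' ', $line;    # split on runs of whitespace
        my $key = join "\x00", splice(@fields, 0, $key_width);
        push @{ $data{$key}[$_] }, $fields[$_] for 0 .. $#fields;
    }

    return map {
        my $key = $_;
        join ' ', split(/\x00/, $key), map { join '/', @$_ } @{ $data{$key} };
    } sort keys %data;
}

# One integer plus a name, so the key is two fields wide.
print "$_\n" for collapse_rows(2,
    '100 Hazara A C G A T C C',
    '100 Hazara G C A A T C T',
);
```

With a key width of 2 the two lines share a key and are merged into one row; with a width of 3 they would stay separate, as in the output above.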
Re^4: How to match duplicate lines in a text file and extract only one of those lines to a new file by danica (Initiate) on Apr 05, 2012 at 09:26 UTC
Oh yes of course! Thank you for pointing out such an obvious mistake!