
Seems like you're pretty close, but you're making it harder than it needs to be. Also, it sounds like you want to output just the first line that is found for a given url, but your code will always preserve the last line for a given url.

Here's a simpler version that preserves the first line -- I'm including the input data with the script (you'll be reading it from a file), and I'm just printing the output to STDOUT (you can open an output file, or just redirect stdout when you run the script):

use strict;
use warnings;

my %file_hash;
my @lines = <DATA>;

for my $line (@lines) {
    chomp($line);
    my ($name, $url, $text) = split('@', $line);

    # Keep only the first line seen for each url.
    $file_hash{$url} = $line unless ( $file_hash{$url} );
}

for my $key ( sort keys %file_hash ) {
    print "$file_hash{$key}\n";
}

__DATA__
name1@url1@text1
name1@url1@text1
name1@url1@text11
name2@url2@text2
name2@url2@text21
name3@url3@text3
(BTW, this sort of thing would normally be done using while(<>) to read and process the input data one line at a time, rather than reading the whole file into memory and then looping over it with for my $line (@lines) -- but it's not a big deal in this case.)
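For completeness, here's a minimal sketch of that one-line-at-a-time approach (run it as "perl script.pl input.file > output.file"):

use strict;
use warnings;

my %file_hash;

# <> reads from the files named on the command line,
# or from STDIN if none are given -- one line at a time.
while ( my $line = <> ) {
    chomp $line;
    my ( $name, $url, $text ) = split '@', $line;
    $file_hash{$url} = $line unless exists $file_hash{$url};
}

print "$file_hash{$_}\n" for sort keys %file_hash;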

You might be interested in looking at this command-line utility: col-uniq -- remove lines that match on selected column(s). It would produce the output you want from your particular input file like this:

col-uniq -d '@' -c 2 input.file > output.file
But in order for that to work, you'd need to make sure the input data was sorted according to the url field, whereas the input doesn't need to be sorted for the snippet above to work.
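If the input isn't already sorted on the url field, a standard Unix sort on the second @-delimited column would take care of that (this assumes col-uniq can read from STDIN; if not, sort to a temporary file first):

sort -t '@' -k2,2 input.file | col-uniq -d '@' -c 2 > output.file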

Re^4: removing duplicates lines plus strings from a file
by kirpy (Initiate) on Sep 19, 2011 at 06:52 UTC

    thank you for your comments ... it works perfectly ... just wanted to ask one more thing: is it possible to do the same using regex matching? And would that preserve the original ordering of the lines, which is lost by using a hash?

      You would use regex matching if you were trying to spot "partial duplications" in the column of interest, but then you have a lot more work to do: you need to specify exactly what qualifies as a "duplication", and condition the data to meet that spec. For example, if you want "foo@http://url1@bar" to be considered a duplicate of "foo@https://url1@bar2", you would use a regex to eliminate the irrelevant differences (here, the http vs. https scheme), as sketched below.
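
      A minimal sketch of that idea, normalizing the url with a regex before using it as the hash key (the stripped-scheme rule here is just an assumption for illustration -- your spec may differ):

      use strict;
      use warnings;

      my %file_hash;

      while ( my $line = <DATA> ) {
          chomp $line;
          my ( $name, $url, $text ) = split '@', $line;

          # Normalize away the differences we've decided are
          # irrelevant -- here, the http:// vs. https:// scheme.
          ( my $key = $url ) =~ s{^https?://}{}i;

          $file_hash{$key} = $line unless exists $file_hash{$key};
      }

      print "$file_hash{$_}\n" for sort keys %file_hash;

      __DATA__
      foo@http://url1@bar
      foo@https://url1@bar2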

      As for preserving original line ordering: the "col-uniq" utility does that, but at the cost of letting duplicates through when the input hasn't been pre-sorted on the column of interest.
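
      In plain Perl, though, you can keep the original ordering without sorting at all: print each line the first time its url turns up, and let a %seen hash remember which urls have already been printed. A minimal sketch:

      use strict;
      use warnings;

      my %seen;

      while ( my $line = <DATA> ) {
          my ( $name, $url, $text ) = split '@', $line;

          # Print the line in its original position the first time
          # this url appears; skip it on every later appearance.
          print $line unless $seen{$url}++;
      }

      __DATA__
      name1@url1@text1
      name2@url2@text2
      name1@url1@text11
      name3@url3@text3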