
Seems like you're pretty close, but you're making it harder than it needs to be. Also, it sounds like you want to output just the first line that is found for a given url, but your code will always preserve the last line for a given url.

Here's a simpler version that preserves the first line -- I'm including the input data with the script (you'll be reading it from a file), and I'm just printing the output to STDOUT (you can open an output file, or just redirect stdout when you run the script):

use strict;
use warnings;

my %file_hash;
my @lines = <DATA>;

for my $line (@lines) {
    chomp($line);
    my ($name, $url, $text) = split('@', $line);

    # Keep only the first line seen for each url.
    $file_hash{$url} = $line unless ( $file_hash{$url} );
}

for my $key ( sort keys %file_hash ) {
    print "$file_hash{$key}\n";
}

__DATA__
name1@url1@text1
name1@url1@text1
name1@url1@text11
name2@url2@text2
name2@url2@text21
name3@url3@text3
(BTW, this sort of thing would normally be done using while(<>) to read and process the input data one line at a time, rather than reading the whole file into memory and then looping over it with for my $line (@lines) -- but it's not a big deal in this case.)
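For completeness, here's a minimal sketch of that one-line-at-a-time approach (run it as "perl script.pl input.file > output.file"):

use strict;
use warnings;

my %file_hash;

# <> reads from the files named on the command line,
# or from STDIN if none are given -- one line at a time.
while ( my $line = <> ) {
    chomp $line;
    my ( $name, $url, $text ) = split '@', $line;
    $file_hash{$url} = $line unless exists $file_hash{$url};
}

print "$file_hash{$_}\n" for sort keys %file_hash;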

You might be interested in looking at this command-line utility: col-uniq -- remove lines that match on selected column(s). It would produce the output you want from your particular input file like this:

col-uniq -d '@' -c 2 input.file > output.file
But in order for that to work, you'd need to make sure the input data was sorted according to the url field, whereas the input doesn't need to be sorted for the snippet above to work.
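If the input isn't already sorted on the url field, a standard Unix sort on the second @-delimited column would take care of that (this assumes col-uniq can read from STDIN; if not, sort to a temporary file first):

sort -t '@' -k2,2 input.file | col-uniq -d '@' -c 2 > output.file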

Re^4: removing duplicates lines plus strings from a file
by kirpy (Initiate) on Sep 19, 2011 at 06:52 UTC

    thank you for your comments ... it works perfectly ... just wanted to ask one more thing: is it possible to do the same using regex matching? And would that preserve the original ordering of the lines, which is lost by using a hash?

      You would use regex matching if you were trying to spot "partial duplications" in the column of interest, but then you have a lot more work to do: you need to specify exactly what qualifies as a "duplication", and condition the data to meet that spec. For example, if you want "foo@http://url1@bar" to be considered a duplicate of "foo@https://url1@bar2", you would use a regex to eliminate the irrelevant differences (here, the http vs. https scheme), as sketched below.
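
      A minimal sketch of that idea, normalizing the url with a regex before using it as the hash key (the stripped-scheme rule here is just an assumption for illustration -- your spec may differ):

      use strict;
      use warnings;

      my %file_hash;

      while ( my $line = <DATA> ) {
          chomp $line;
          my ( $name, $url, $text ) = split '@', $line;

          # Normalize away the differences we've decided are
          # irrelevant -- here, the http:// vs. https:// scheme.
          ( my $key = $url ) =~ s{^https?://}{}i;

          $file_hash{$key} = $line unless exists $file_hash{$key};
      }

      print "$file_hash{$_}\n" for sort keys %file_hash;

      __DATA__
      foo@http://url1@bar
      foo@https://url1@bar2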

      As for preserving original line ordering: the "col-uniq" utility does that, but at the cost of letting duplicates through when the input hasn't been pre-sorted on the column of interest.
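
      In plain Perl, though, you can keep the original ordering without sorting at all: print each line the first time its url turns up, and let a %seen hash remember which urls have already been printed. A minimal sketch:

      use strict;
      use warnings;

      my %seen;

      while ( my $line = <DATA> ) {
          my ( $name, $url, $text ) = split '@', $line;

          # Print the line in its original position the first time
          # this url appears; skip it on every later appearance.
          print $line unless $seen{$url}++;
      }

      __DATA__
      name1@url1@text1
      name2@url2@text2
      name1@url1@text11
      name3@url3@text3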