in reply to Re: removing duplicates lines plus strings from a file
in thread removing duplicates lines plus strings from a file

I tried doing the following:

my ($url, $name, $text, @lines, $key, $value, $line);
my ($Docs) = "temp.txt";

# temp.txt is of the format:
#   name1@url1@text1
#   name1@url1@text1
#   name1@url1@text11
#   name2@url2@text2
#   name2@url2@text21
#   name3@url3@text3, etc.

my %file_hash;

open (FILE, $Docs);
@lines = <FILE>;
close FILE;

foreach $line (@lines) {
    chomp($line);
    ($name, $url, $text) = split('@', $line);
    chomp($url);
    $key   = $url;
    $value = $line;
    $file_hash{$key} = $value;
}

open (OUT, ">$Docs");
for my $key (keys %file_hash) {
    print OUT "$file_hash{$key}\n";
}
close OUT;

My guess as to what is happening here: since I am using a hash, I am actually storing only the last match for each $key, so I am not getting what I want in the OUT file.
The OUT file that I need is:
name1@url1@text1
name2@url2@text2
name3@url3@text3, etc.
Thank you for your patience.

Re^3: removing duplicates lines plus strings from a file
by graff (Chancellor) on Sep 19, 2011 at 01:17 UTC
    Seems like you're pretty close, but you're making it harder than it needs to be. Also, it sounds like you want to output just the first line that is found for a given url, but your code will always preserve the last line for a given url.

    Here's a simpler version that preserves the first line -- I'm including the input data with the script (you'll be reading it from a file), and I'm just printing the output to STDOUT (you can open an output file, or just redirect stdout when you run the script):

    use strict;
    use warnings;

    my %file_hash;
    my @lines = <DATA>;

    for my $line (@lines) {
        chomp($line);
        my ($name, $url, $text) = split('@', $line);
        $file_hash{$url} = $line unless ( $file_hash{$url} );
    }

    for my $key ( sort keys %file_hash ) {
        print "$file_hash{$key}\n";
    }

    __DATA__
    name1@url1@text1
    name1@url1@text1
    name1@url1@text11
    name2@url2@text2
    name2@url2@text21
    name3@url3@text3
    (BTW, this sort of thing would normally be done using while(<>) to read and process the input data one line at a time, rather than reading the whole file into memory and then looping over it with for my $line (@lines) -- but it's not a big deal in this case.)
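    For example, a minimal one-pass sketch along those lines (assuming the input file name is passed on the command line, or the data comes in on STDIN) could look like this:

    use strict;
    use warnings;

    my %file_hash;
    while (my $line = <>) {          # read one line at a time from STDIN or the named file(s)
        chomp $line;
        my ($name, $url, $text) = split('@', $line);
        $file_hash{$url} = $line unless $file_hash{$url};   # keep only the first line seen for each url
    }
    print "$file_hash{$_}\n" for sort keys %file_hash;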

    You might be interested in looking at this command-line utility: col-uniq -- remove lines that match on selected column(s). It would produce the output you want from your particular input file like this:

    col-uniq -d '@' -c 2 input.file > output.file
    But in order for that to work, you'd need to make sure the input data was sorted according to the url field, whereas the input doesn't need to be sorted for the snippet above to work.

      Thank you for your comments... it works perfectly. Just one more thing: is it possible to do the same using regex matching? And would that preserve the original ordering of the lines, which is lost by using the hash?

        You would use regex matching if you were trying to spot "partial duplications" in the column of interest, but then you have a lot more work to do to specify what qualifies as "duplication", and to condition the data to meet the spec -- e.g. if you want "foo@http://url1@bar" to be considered a duplicate of "foo@https://url1@bar2", then you would use a regex to eliminate the irrelevant differences.
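        As a hypothetical sketch of that idea (assuming the only irrelevant difference you care about is the http/https scheme), you could strip the scheme with a substitution before using the url as the hash key:

        use strict;
        use warnings;

        my %file_hash;
        while (my $line = <DATA>) {
            chomp $line;
            my ($name, $url, $text) = split('@', $line);
            (my $url_key = $url) =~ s{^https?://}{};    # treat http:// and https:// urls as the same key
            $file_hash{$url_key} = $line unless $file_hash{$url_key};
        }
        print "$file_hash{$_}\n" for sort keys %file_hash;

        __DATA__
        foo@http://url1@bar
        foo@https://url1@bar2

        With that data, only "foo@http://url1@bar" is printed, because the second line normalizes to the same key.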

        As for preserving original line ordering, the "col-uniq" utility does that, at the expense of preserving duplicate values when the input hasn't been sorted in advance according to the column of interest.
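        If you want to keep the original ordering in the Perl script itself, one way (a sketch, keeping the "first line wins" behaviour from above) is to print each line as it is read and use a hash only to remember which urls have already been seen:

        use strict;
        use warnings;

        my %seen;
        while (my $line = <>) {
            chomp $line;
            my ($name, $url, $text) = split('@', $line);
            next if $seen{$url}++;      # skip any url that has already been printed
            print "$line\n";            # output stays in the original input order
        }

        That also avoids the drawback just mentioned: the input doesn't need to be pre-sorted, and duplicates are still removed.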