kirpy has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file -- say temp.txt -- in the following format ---

name1@url1@text1
name1@url1@text1
name1@url1@text2
name2@url2@text2
name3@url3@text3
name4@url4@text4, etc

I need to remove the duplicate lines -- here line 2, which duplicates line 1 -- and also remove line 3, since it has the same url1 as line 1. Hence the file that is required now is ---

name1@url1@text1
name2@url2@text2
name3@url3@text3
name4@url4@text4, etc
Any assistance would be appreciated.

Re: removing duplicates lines plus strings from a file
by Perlbotics (Archbishop) on Sep 18, 2011 at 10:46 UTC

    If you already have some code, please show it to us. If not, try the following recipe, and if that fails, feel free to come back with the code sample that troubles you.

    In case your only problem is to remove duplicate URLs, the following recipe might be helpful:

    • open file for reading (1)
    • read file line by line, while
      • extract URL (hint: chomp, split, perlre)
      • print (2) line if URL has never been encountered (hint: perlfaq4)
      • remember that URL has now been seen (hint: $seen{$url})
    (1,2): Open another file for writing and print to that filehandle, unless output to STDOUT is sufficient.
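
    A minimal sketch of that recipe, assuming the input is temp.txt and output to STDOUT is sufficient (untested, adapt as needed):

    use strict;
    use warnings;

    my %seen;
    open my $in, '<', 'temp.txt' or die "Cannot open temp.txt: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($name, $url, $text) = split '@', $line;   # extract the URL (second field)
        next if $seen{$url}++;                        # skip lines whose URL was already seen
        print "$line\n";                              # first line for this URL
    }
    close $in;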

    Update: Or perform a Super Search with this query for more inspiration...

    Update: In response to code presented below:

    • Fix1: Change $file_hash{$key} = $value; to $file_hash{$key} //= $value; to save first match only.
    • Fix2: Change for my $key (keys %file_hash) to for my $key (sort keys %file_hash) to potentially recover original order of entries.
    • Better: Have a look at the original hint again. It lets you process the file without keeping the whole contents in memory, which is an advantage when processing huge files.
    • Extra-Hint: perltidy
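
    Applied to that code, the two fixes would look roughly like this (fragment only, untested; //= needs perl 5.10 or later):

    $file_hash{$key} //= $value;               # Fix1: keep the first line seen for each URL

    for my $key (sort keys %file_hash) {       # Fix2: sorted keys give a predictable order
        print OUT "$file_hash{$key}\n";
    }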

      I tried doing the following ---
      my ($url, $name, $text, @lines, $key, $value, $line);
      my ($Docs) = "temp.txt";
      # temp.txt is of the format ---
      # name1@url1@text1
      # name1@url1@text1
      # name1@url1@text11
      # name2@url2@text2
      # name2@url2@text21
      # name3@url3@text3, etc...
      my %file_hash;
      open (FILE, $Docs);
      @lines = <FILE>;
      close FILE;
      foreach $line (@lines) {
          chomp($line);
          ($name, $url, $text) = split('@', $line);
          chomp($url);
          $key   = $url;
          $value = $line;
          $file_hash{$key} = $value;
      }
      open (OUT, ">$Docs");
      for my $key (keys %file_hash) {
          print OUT "$file_hash{$key}\n";
      }
      close OUT;

      My guess as to what is happening here is that, since I am using the hash, I am actually storing the last match for each $key, and so I am not getting what I want in the OUT file.
      the OUT file that I need is ---
      #name1@url1@text1
      #name2@url2@text2
      #name3@url3@text3, etc...
      thank you all for your patience........

      Thank you for your comments (..hints).. I got that, and it works perfectly. Just wanted to ask one more thing: is it possible to do the same using regex matching, and will it be able to preserve the original ordering of the lines that is lost by using the hash?

Re: removing duplicates lines plus strings from a file
by zentara (Cardinal) on Sep 18, 2011 at 11:16 UTC
      I tried doing the following ---
      my ($url, $name, $text, @lines, $key, $value, $line);
      my ($Docs) = "temp.txt";
      ### start of comment
      # temp.txt is of the format ---
      # name1@url1@text1
      # name1@url1@text1
      # name1@url1@text11
      # name2@url2@text2
      # name2@url2@text21
      # name3@url3@text3, etc...
      ## end of comment
      my %file_hash;
      open (FILE, $Docs);
      @lines = <FILE>;
      close FILE;
      foreach $line (@lines) {
          chomp($line);
          ($name, $url, $text) = split('@', $line);
          chomp($url);
          $key   = $url;
          $value = $line;
          $file_hash{$key} = $value;
      }
      open (OUT, ">$Docs");
      for my $key (keys %file_hash) {
          print OUT "$file_hash{$key}\n";
      }
      close OUT;

      My guess as to what is happening here is that, since I am using the hash, I am actually storing the last match for each $key, and so I am not getting what I want in the OUT file.
      the OUT file that I need is ---
      name1@url1@text1
      name2@url2@text2
      name3@url3@text3, etc...
      thank you for your patience........
        Seems like you're pretty close, but you're making it harder than it needs to be. Also, it sounds like you want to output just the first line that is found for a given url, but your code will always preserve the last line for a given url.

        Here's a simpler version that preserves the first line -- I'm including the input data with the script (you'll be reading it from a file), and I'm just printing the output to STDOUT (you can open an output file, or just redirect stdout when you run the script):

        use strict;
        use warnings;

        my %file_hash;
        my @lines = <DATA>;

        for my $line (@lines) {
            chomp($line);
            my ($name, $url, $text) = split('@', $line);
            $file_hash{$url} = $line unless ( $file_hash{$url} );
        }

        for my $key ( sort keys %file_hash ) {
            print "$file_hash{$key}\n";
        }

        __DATA__
        name1@url1@text1
        name1@url1@text1
        name1@url1@text11
        name2@url2@text2
        name2@url2@text21
        name3@url3@text3
        (BTW, this sort of thing would normally be done using while(<>) to read and process the input data one line at a time, rather than reading the whole file into memory and then looping over it with for my $line (@lines) -- but it's not a big deal in this case.)
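
        For example, a line-at-a-time version might look like this (a sketch; it assumes the data arrives on STDIN or in files named on the command line, and it keeps the original line order instead of sorting):

        use strict;
        use warnings;

        my %seen;
        while (my $line = <>) {                      # process one line at a time
            chomp $line;
            my ($name, $url, $text) = split '@', $line;
            print "$line\n" unless $seen{$url}++;    # print only the first line per URL
        }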

        You might be interested in looking at this command-line utility: col-uniq -- remove lines that match on selected column(s). It would produce the output you want from your particular input file like this:

        col-uniq -d '@' -c 2 input.file > output.file
        But in order for that to work, you'd need to make sure the input data was sorted according to the url field, whereas the input doesn't need to be sorted for the snippet above to work.
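
        If the input were not already sorted that way, one way to arrange it (assuming GNU sort is available) would be something like:

        sort -t '@' -k 2,2 input.file | col-uniq -d '@' -c 2 > output.file
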
Re: removing duplicates lines plus strings from a file
by pvaldes (Chaplain) on Sep 18, 2011 at 11:37 UTC

    remove duplicated lines

    mmmh maybe like this... untested but you can try

    use File::Copy;

    copy("myfile", "/path/to/mynewfile") or die "copy failed: $!";

    # collapse runs of identical consecutive lines in the copy
    open my $fh, '<', "/path/to/mynewfile" or die $!;
    my $content = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    $content =~ s/^(.+\n)\1+/$1/mg;          # keep only the first line of each run
    open $fh, '>', "/path/to/mynewfile" or die $!;
    print $fh $content;
    close $fh;