kirpy has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file -- say temp.txt -- in the following format ---

name1@url1@text1
name1@url1@text1
name1@url1@text2
name2@url2@text2
name3@url3@text3
name4@url4@text4, etc

I need to remove the duplicate lines -- here line 2, which duplicates line 1 -- and also remove line 3, since it has the same url1 as line 1. Hence the file that is required now is ---

name1@url1@text1
name2@url2@text2
name3@url3@text3
name4@url4@text4, etc
Any assistance would be appreciated.

Re: removing duplicates lines plus strings from a file
by Perlbotics (Archbishop) on Sep 18, 2011 at 10:46 UTC

    If you already have some code, please show it to us. If not, try the following recipe, and if that fails, feel free to come back with the code sample that troubles you.

    In case your only problem is to remove duplicate URLs, the following recipe might be helpful:

    • open file for reading (1)
    • read file line by line, while
      • extract URL (hint: chomp, split, perlre)
      • print (2) line if URL has never been encountered (hint: perlfaq4)
      • remember that URL has now been seen (hint: $seen{$url})
    (1,2): Open another file for writing and print to that filehandle, unless output to STDOUT is sufficient.
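
    A minimal sketch of that recipe, assuming the input is temp.txt and output to STDOUT is sufficient (untested, adapt as needed):

    use strict;
    use warnings;

    my %seen;
    open my $in, '<', 'temp.txt' or die "Cannot open temp.txt: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($name, $url, $text) = split '@', $line;   # extract the URL (second field)
        next if $seen{$url}++;                        # skip lines whose URL was already seen
        print "$line\n";                              # first line for this URL
    }
    close $in;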

    Update: Or perform a Super Search with this query for more inspiration...

    Update: In response to code presented below:

    • Fix1: Change $file_hash{$key} = $value; to $file_hash{$key} //= $value; to save first match only.
    • Fix2: Change for my $key (keys %file_hash) to for my $key (sort keys %file_hash) to potentially recover original order of entries.
    • Better: Have a look at the original hint again. It lets you process the file without keeping the whole contents in memory, which is an advantage when processing huge files.
    • Extra-Hint: perltidy
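
    Applied to that code, the two fixes would look roughly like this (fragment only, untested; //= needs perl 5.10 or later):

    $file_hash{$key} //= $value;               # Fix1: keep the first line seen for each URL

    for my $key (sort keys %file_hash) {       # Fix2: sorted keys give a predictable order
        print OUT "$file_hash{$key}\n";
    }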

      I tried doing the following ---
      my ($url, $name, $text, @lines, $key, $value, $line);
      my ($Docs) = "temp.txt";
      # temp.txt is of the format ---
      # name1@url1@text1
      # name1@url1@text1
      # name1@url1@text11
      # name2@url2@text2
      # name2@url2@text21
      # name3@url3@text3, etc...
      my %file_hash;
      open (FILE, $Docs);
      @lines = <FILE>;
      close FILE;
      foreach $line (@lines) {
          chomp($line);
          ($name, $url, $text) = split('@', $line);
          chomp($url);
          $key   = $url;
          $value = $line;
          $file_hash{$key} = $value;
      }
      open (OUT, ">$Docs");
      for my $key (keys %file_hash) {
          print OUT "$file_hash{$key}\n";
      }
      close OUT;

      My guess as to what is happening here is that, since I am using the hash, I am actually storing the last match for each $key, and so I am not getting what I want in the OUT file.
      the OUT file that I need is ---
      #name1@url1@text1
      #name2@url2@text2
      #name3@url3@text3, etc...
      thank you all for your patience........

      Thank you for your comments (..hints).. I got that, and it works perfectly. Just wanted to ask one more thing: is it possible to do the same using regex matching, and will it be able to preserve the original ordering of the lines that is lost by using the hash?

Re: removing duplicates lines plus strings from a file
by zentara (Cardinal) on Sep 18, 2011 at 11:16 UTC
      I tried doing the following ---
      my ($url, $name, $text, @lines, $key, $value, $line);
      my ($Docs) = "temp.txt";
      ### start of comment
      # temp.txt is of the format ---
      # name1@url1@text1
      # name1@url1@text1
      # name1@url1@text11
      # name2@url2@text2
      # name2@url2@text21
      # name3@url3@text3, etc...
      ## end of comment
      my %file_hash;
      open (FILE, $Docs);
      @lines = <FILE>;
      close FILE;
      foreach $line (@lines) {
          chomp($line);
          ($name, $url, $text) = split('@', $line);
          chomp($url);
          $key   = $url;
          $value = $line;
          $file_hash{$key} = $value;
      }
      open (OUT, ">$Docs");
      for my $key (keys %file_hash) {
          print OUT "$file_hash{$key}\n";
      }
      close OUT;

      My guess as to what is happening here is that, since I am using the hash, I am actually storing the last match for each $key, and so I am not getting what I want in the OUT file.
      the OUT file that I need is ---
      name1@url1@text1
      name2@url2@text2
      name3@url3@text3, etc...
      thank you for your patience........
        Seems like you're pretty close, but you're making it harder than it needs to be. Also, it sounds like you want to output just the first line that is found for a given url, but your code will always preserve the last line for a given url.

        Here's a simpler version that preserves the first line -- I'm including the input data with the script (you'll be reading it from a file), and I'm just printing the output to STDOUT (you can open an output file, or just redirect stdout when you run the script):

        use strict;
        use warnings;

        my %file_hash;
        my @lines = <DATA>;

        for my $line (@lines) {
            chomp($line);
            my ($name, $url, $text) = split('@', $line);
            $file_hash{$url} = $line unless ( $file_hash{$url} );
        }

        for my $key ( sort keys %file_hash ) {
            print "$file_hash{$key}\n";
        }

        __DATA__
        name1@url1@text1
        name1@url1@text1
        name1@url1@text11
        name2@url2@text2
        name2@url2@text21
        name3@url3@text3
        (BTW, this sort of thing would normally be done using while(<>) to read and process the input data one line at a time, rather than reading the whole file into memory and then looping over it with for my $line (@lines) -- but it's not a big deal in this case.)
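
        For example, a line-at-a-time version might look like this (a sketch; it assumes the data arrives on STDIN or in files named on the command line, and it keeps the original line order instead of sorting):

        use strict;
        use warnings;

        my %seen;
        while (my $line = <>) {                      # process one line at a time
            chomp $line;
            my ($name, $url, $text) = split '@', $line;
            print "$line\n" unless $seen{$url}++;    # print only the first line per URL
        }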

        You might be interested in looking at this command-line utility: col-uniq -- remove lines that match on selected column(s). It would produce the output you want from your particular input file like this:

        col-uniq -d '@' -c 2 input.file > output.file
        But in order for that to work, you'd need to make sure the input data was sorted according to the url field, whereas the input doesn't need to be sorted for the snippet above to work.
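
        If the input were not already sorted that way, one way to arrange it (assuming GNU sort is available) would be something like:

        sort -t '@' -k 2,2 input.file | col-uniq -d '@' -c 2 > output.file
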
Re: removing duplicates lines plus strings from a file
by pvaldes (Chaplain) on Sep 18, 2011 at 11:37 UTC

    remove duplicated lines

    mmmh maybe like this... untested but you can try

    use File::Copy;

    copy("myfile", "/path/to/mynewfile") or die "copy failed: $!";

    # collapse runs of identical consecutive lines in the copy
    open my $fh, '<', "/path/to/mynewfile" or die $!;
    my $content = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    $content =~ s/^(.+\n)\1+/$1/mg;          # keep only the first line of each run
    open $fh, '>', "/path/to/mynewfile" or die $!;
    print $fh $content;
    close $fh;