Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
My program generates files like the one below, and my question is: how can I check whether the program is adding a duplicated block like this:
@:1107532758::1
lll Lets mad.gif

twice, and if it is, I need to find the duplicates and delete them. I have been trying with regular expressions, without luck; I need some help!
Thanks very much!
@:1107530184::1
kkkkkkkkkkkkmkmkmk kkkkkk confused.gif
@:1107530257:1107530439:1
kmkmkm <br>kmkmkm <br> <br>Fri Feb 4 10:17:37 2005 <br> mad.gif
@:1107530709::1
ygyg ygygygyg lol.gif
@:1107530717::1
ygyg ygygygyg lol.gif
@:1107530963::1
cool help cool.gif
@:1107532649:1107532689:1
k <br>kkkkkkkkkkkkkkkkk <br> <br>Fri Feb 4 10:57:29 2005 <br> lol lol.gif
@:1107532758::1
lll Lets mad.gif
@:1107532976::1
lll Lets mad.gif

2005-02-05 Edited by Arunbear: Changed title from 'Regular Expression', as per Monastery guidelines

Replies are listed 'Best First'.
Re: Regular Expression to find duplicate text blocks
by Random_Walk (Prior) on Feb 04, 2005 at 17:00 UTC

    A more usual way to test for duplicates is to use a hash. You can set the input record separator to @: to make this easy. There are a couple of ways to go about this, depending on whether order is important to you, whether you need to know when a duplicate was removed, and whether the entire file can be read into memory at one time.

    # Here is an example to get you started, in untested code.
    my ($in, $out, %hash);
    open $in,  "<", $file_name;   # you set the name somewhere else
    open $out, ">", $out_file;
    local $/ = "@:";              # set the input record separator
    while (<$in>) {               # was <$fh>: read from the handle we opened
        if ($hash{$_}++) {        # will be undef the first time, then > 0
            print "Warning: duplicate found\n$_";
        } else {
            print $out $_;
        }
    }

    If the file is too big to hold in the hash, then perhaps use the first line of each record as a key and store the offset in the output file where the record can be found. If you see that key again, read back the record you wrote earlier to see whether the entire record matches. Of course the value under each key would be an array of offsets, to allow multiple differing records with the same first line.
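    The offset scheme above could be sketched roughly as follows. This is an untested illustration only, in the same spirit as the code above: in.txt and out.txt are placeholder file names, and the record layout is assumed to match the @:-separated format shown in the question.

```perl
#!/usr/bin/perl
# Untested sketch: key on each record's first line, remember the byte
# offsets where matching records were written to the output file, and
# only read a record back when a first line repeats.
use strict;
use warnings;

my %offsets;    # first line of record => array of offsets in out.txt
open my $in,  '<',  'in.txt'  or die "in.txt: $!";    # placeholder names
open my $out, '+>', 'out.txt' or die "out.txt: $!";
local $/ = "@:";    # records are separated by @:

RECORD: while (my $rec = <$in>) {
    my ($first) = $rec =~ /^([^\n]*)/;
    for my $off (@{ $offsets{$first} || [] }) {
        seek $out, $off, 0;                  # jump to the candidate record
        read $out, my $old, length $rec;
        next RECORD if defined $old and $old eq $rec;   # true duplicate, skip it
    }
    seek $out, 0, 2;                         # back to end of file to append
    push @{ $offsets{$first} }, tell $out;   # remember where this record starts
    print $out $rec;
}
```

    Seeking between each read and write keeps the single read/write handle consistent; whether this actually beats the simple whole-record hash depends on how large the records are.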

    If you control the writing program, perhaps you can fix it not to write duplicate records in the first place.

    You would also get more out of the Monastery if you had a quick look here: How do I post a question effectively?

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!

      Random Walk is quite right... This is just to further demonstrate the case where the file is too big to hold in memory.

      If each of your 'blocks' begins with an @:1107532976::1-like pattern, and each of those is expected to be unique, then just focus on the lines that contain that pattern:

      use strict;
      use warnings;

      my %entries = ();
      while (<DATA>) {
          chomp;
          if (/^\@:\d+/) {    # if line starts with at-colon-digit(s)
              # $_ is the current line, $. is the current line number
              if (exists $entries{$_}) {
                  print "Line $. duplicates line $entries{$_} [$_]\n";
              } else {
                  $entries{$_} = $.;
              }
          }
      }
      1;
      __DATA__
      @:1107530184::1
      kkkkkkkkkkkkmkmkmk kkkkkk confused.gif
      @:1107530257:1107530439:1
      kmkmkm <br>kmkmkm <br> <br>Fri Feb 4 10:17:37 2005 <br> mad.gif
      @:1107530709::1
      ygyg ygygygyg lol.gif
      @:1107530717::1
      ygyg ygygygyg lol.gif
      @:1107530963::1
      cool help cool.gif
      @:1107532649:1107532689:1
      k <br>kkkkkkkkkkkkkkkkk <br> <br>Fri Feb 4 10:57:29 2005 <br> lol lol.gif
      @:1107530257:1107530439:1
      kmkmkm <br>kmkmkm <br> <br>Fri Feb 4 10:17:37 2005 <br> mad.gif
      @:1107532758::1
      lll Lets mad.gif
      @:1107532976::1
      lll Lets mad.gif
      Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: Regular Expression to find duplicate text blocks
by Tanktalus (Canon) on Feb 04, 2005 at 16:16 UTC

    Just out of curiosity - what regular expression have you already tried? What is the code you've got so far to attempt this?

Re: Regular Expression to find duplicate text blocks
by jdalbec (Deacon) on Feb 05, 2005 at 15:46 UTC
    I'm going to assume that the timestamp lines are unique and that duplicate blocks don't have to have the same timestamps.
    #! /usr/bin/perl
    use strict;
    use warnings;

    my ($key, %hash, @keys);
    while (<DATA>) {
        if (/^\@:\d{10,}:(?:\d{10,})?:1$/) {
            chomp;
            $key = $_;
            # needed to process keys in file order
            push @keys, $key;
        } else {
            $hash{$key} .= $_;
        }
    }

    my ($value, %revhash);
    # changed to foreach to process keys in file order
    # while (($key, $value) = each %hash) {
    foreach $key (@keys) {
        $value = $hash{$key};
        if (exists $revhash{$value}) {
            print "$key is a duplicate of $revhash{$value}\n";
        } else {
            $revhash{$value} = $key;
        }
    }
    __DATA__
    @:1107530184::1
    kkkkkkkkkkkkmkmkmk kkkkkk confused.gif
    @:1107530257:1107530439:1
    kmkmkm <br>kmkmkm <br> <br>Fri Feb 4 10:17:37 2005 <br> mad.gif
    @:1107530709::1
    ygyg ygygygyg lol.gif
    @:1107530717::1
    ygyg ygygygyg lol.gif
    @:1107530963::1
    cool help cool.gif
    @:1107532649:1107532689:1
    k <br>kkkkkkkkkkkkkkkkk <br> <br>Fri Feb 4 10:57:29 2005 <br> lol lol.gif
    @:1107532758::1
    lll Lets mad.gif
    @:1107532976::1
    lll Lets mad.gif
    And the output is:
    @:1107530717::1 is a duplicate of @:1107530709::1
    @:1107532976::1 is a duplicate of @:1107532758::1
    Update: If you only want to eliminate one block from each pair of consecutive duplicate blocks then this might work:
    undef $/;
    my $file = <DATA>;
    $file =~ s/^\@:\d{10,}:(?:\d{10,})?:1\n(.*?)
               (^\@:\d{10,}:(?:\d{10,})?:1\n\1)/$2/msxg;
    print $file;