objects and duplicates

lkperl has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

First of all thank you for your help. I use strict in my script, but this doesn't appear here.

I have a text file with sentences, one per line. A Perl script extracts each sentence and stores it in a temporary array (@temp). Then I use the following code to extract the duplicates.

@duplicates = grep $seen{$_}++, @temp;

# here we count how many times each duplicate appears
%seen = ();
    
foreach my $item (@duplicates) {
    $seen{$item}++;
}

@unique_duplicates = keys %seen;

Everything works fine.

Now I'd like to make my code object-oriented. The script basically reads each sentence and creates a new object:

$record=Entry->new();
$record->id(1);
$record->duplicate(0);
$record->src("This is a duplicate.");

push @records, $record;

Here we additionally have an ID, a 'duplicate' flag set to 0, and the sentence.

I push the $record to an array for later use. And this is basically where my problems start. I'm used to working with arrays and hashes, but here each array element is a hash-based object:

print @records

gives you

Entry=HASH(0x183efe8)Entry=HASH(0x1835288) etc.

I have lots of sentences to process (up to 500'000), so I'd prefer to avoid too many loops and extract the duplicates in one pass.

So to summarize, I'd need to identify the duplicates and set the flag 'duplicate' to the number of times the duplicate sentence appears.

Before doing the object-oriented code, the approach was simple. But here it is getting more complicated.

I thank you for your help.

Larry


Replies are listed 'Best First'.
Re: objects and duplicates by wfsp (Abbot) on Apr 27, 2008 at 17:49 UTC
Each element of `@records` is an `Enter` object. You have some setter methods (`$record->id(1)`); now all you need are some getter methods. Loop over the array and call each getter. This just shows the `id` attribute.

#!/usr/local/bin/perl
use strict;
use warnings;

my @records;

my $record = Enter->new;
$record->set_id(1);

push @records, $record;

for my $rec (@records){
    printf qq{id: %s\n}, $rec->get_id;
}

package Enter;

sub new {
    my ($class) = @_;
    my $self = {};
    bless $self, $class;
    return $self;
}

sub set_id {
    my ($self, $id) = @_;
    $self->{id} = $id;
}

sub get_id {
    return shift->{id};
}
Re: objects and duplicates by stiller (Friar) on Apr 27, 2008 at 18:20 UTC
First, you have a bug in the ordering of `$seen{$_} ...` a few lines before you do `%seen = ();` You can reduce the amount of work (and code) by doing:

my %seen;
$seen{$_}++ for @temp;

Now you have one entry in %seen for each sentence, and you know how many times each sentence occurred. Now you can use wfsp's package to make the objects.
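As a minimal sketch of the counting step described above: once `%seen` holds the totals, the list of distinct duplicated sentences falls out of a single `grep` over its keys. The sample sentences and the `sort` (added only to make the output order predictable) are assumptions for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Sample data standing in for the poster's @temp array (assumed).
my @temp = (
    'This is a duplicate.',
    'This is unique.',
    'This is a duplicate.',
    'This is a duplicate.',
);

# One pass over the input: count every sentence.
my %seen;
$seen{$_}++ for @temp;

# Sentences that occur more than once, each listed a single time.
my @unique_duplicates = sort grep { $seen{$_} > 1 } keys %seen;

# Report each duplicated sentence with its total number of occurrences.
printf "%d x %s\n", $seen{$_}, $_ for @unique_duplicates;
```

Here `%seen` stores the total number of occurrences, not the number of extra copies, so a sentence that appears three times is reported with a count of 3.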
Re^2: objects and duplicates by Anonymous Monk on Apr 27, 2008 at 18:45 UTC
Wfsp and stiller, thank you both. It worked. I now use

for $record (@records){
    $duplicates{$_}++ for $record->src;
}

to store each sentence in a hash, with the number of times it appears. This is great. I still need to change the $record->duplicate of each object to the number of times the sentence appears. Do I have to write a new loop for this, or can we do it at the same time we count the duplicate sentences? I was thinking of something like this:

for $record (@records){
    $duplicates{$_}++ for $record->src;
    $record->duplicate++;
}

Thank you
Re^3: objects and duplicates by pc88mxer (Vicar) on Apr 27, 2008 at 18:50 UTC
Does `$record->src` return a string or a list? If it returns a string you'll want:

my %count;
for my $record (@records) {
    if ($count{$record->src}++) {
        $record->duplicate(1); # or however you set the duplicate flag
    }
}

Note: if a string is duplicated, the first `Entry` object with that `src` value will not have its `duplicate` flag set, but all later matching `Entry` objects will.
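The original question asks for `duplicate` to hold the number of times the sentence appears, on every matching record including the first. That total is only known once all sentences have been seen, so one way to get it is a second linear pass. The sketch below assumes a hypothetical `Entry` class whose accessors set a field when given an argument and return it otherwise, mirroring the calls shown in the question; the real class may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the poster's Entry class (assumed interface).
{
    package Entry;
    sub new       { my ($class) = @_; return bless {}, $class }
    sub id        { my $s = shift; $s->{id}        = shift if @_; return $s->{id} }
    sub duplicate { my $s = shift; $s->{duplicate} = shift if @_; return $s->{duplicate} }
    sub src       { my $s = shift; $s->{src}       = shift if @_; return $s->{src} }
}

# Build a few records the way the question does.
my @records;
my $id = 0;
for my $sentence ('This is a duplicate.', 'This is unique.', 'This is a duplicate.') {
    my $record = Entry->new;
    $record->id(++$id);
    $record->duplicate(0);
    $record->src($sentence);
    push @records, $record;
}

# Pass 1: count how often each sentence occurs.
my %count;
$count{ $_->src }++ for @records;

# Pass 2: copy the total onto every record whose sentence repeats.
for my $record (@records) {
    my $n = $count{ $record->src };
    $record->duplicate($n) if $n > 1;
}

printf "%d: duplicate=%d %s\n", $_->id, $_->duplicate, $_->src for @records;
```

Both passes are linear in the number of records, so the second loop adds one hash lookup per record rather than any nested comparison of sentences.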
Re^4: objects and duplicates by lkperl (Initiate) on Apr 27, 2008 at 19:19 UTC
Re^5: objects and duplicates by pc88mxer (Vicar) on Apr 27, 2008 at 19:50 UTC
Re^3: objects and duplicates by stiller (Friar) on Apr 27, 2008 at 18:56 UTC
Just jot down the pseudo code, e.g.:

read the file, each line into a hash, incrementing the number of occurrences of that sentence.
create an object from each sentence, for which I already know the number of occurrences...
and so on

hth
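The pseudo code above might translate into something like the following sketch. An in-memory filehandle stands in for the real text file, and a directly blessed hash stands in for whatever constructor the `Entry` class actually provides; both are assumptions made to keep the example self-contained.

```perl
use strict;
use warnings;

# In-memory "file" standing in for the real sentence file (assumed data).
my $text = "This is a duplicate.\nThis is unique.\nThis is a duplicate.\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

# Read the file once, remembering line order and counting each sentence.
my (@sentences, %count);
while (my $line = <$fh>) {
    chomp $line;
    push @sentences, $line;
    $count{$line}++;
}
close $fh;

# Create one object per sentence; the count is already known at this point.
my $id = 0;
my @records = map {
    bless +{
        id        => ++$id,
        duplicate => $count{$_} > 1 ? $count{$_} : 0,
        src       => $_,
    }, 'Entry'
} @sentences;

printf "%d: duplicate=%d %s\n", @{$_}{qw(id duplicate src)} for @records;
```

Counting before constructing means no object ever has to be revisited to correct its `duplicate` value.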
Re: objects and duplicates by dragonchild (Archbishop) on Apr 27, 2008 at 20:45 UTC
Why do you want to make your code OO? Most programs in the world are not OO and do just fine. What benefit do you get from keeping information about the sentence in one place? As for your problem, you want to use overload, specifically stringify (or '""') and cmp. That way, your objects behave as you'd expect them to when you do something like `if ( $records[10] eq $records[20] ) { print "Sentence is '$records[10]'\n" }`

My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
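A minimal sketch of the overload idea, assuming a hypothetical `Entry` class that keeps its sentence in a `src` hash key (the real class may store it differently). With `""` overloaded, an object used as a string or as a hash key becomes its sentence, so the original `grep $seen{$_}++` idiom can run directly on the objects.

```perl
use strict;
use warnings;

# Hypothetical Entry class; the 'src' key and constructor arguments are assumptions.
{
    package Entry;
    use overload
        '""'  => sub { $_[0]->{src} },      # stringify to the sentence
        'cmp' => sub {
            my ($x, $y, $swapped) = @_;
            my $result = "$x" cmp "$y";     # compare the plain strings
            return $swapped ? -$result : $result;
        };
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
}

my @records = map { Entry->new(src => $_) }
    'This is a duplicate.', 'This is unique.', 'This is a duplicate.';

# eq is derived from the overloaded cmp, so objects compare by sentence.
print "Sentence is '$records[0]'\n" if $records[0] eq $records[2];

# Hash keys are stringified, so the familiar idiom works on objects too.
my %seen;
my @duplicates = grep { $seen{$_}++ } @records;
print scalar(@duplicates), " duplicate object(s) found\n";
```

Note that `@duplicates` holds the second and later objects for each repeated sentence, the same behaviour as the original array-based code.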