lkperl has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

First of all thank you for your help. I use strict in my script, but this doesn't appear here.

I have a text file with sentences, one per line. A Perl script extracts each sentence and stores it in a temporary array (@temp). Then I use the following code to extract the duplicates.

@duplicates = grep $seen{$_}++, @temp;
# here we count how many times each duplicate appears
%seen = ();
foreach my $item (@duplicates) {
    $seen{$item}++;
}
@unique_duplicates = keys %seen;

Everything works fine.

Now I'd like to make my code object-oriented. The script basically reads each sentence and creates a new object:

$record = Entry->new();
$record->id(1);
$record->duplicate(0);
$record->src("This is a duplicate.");
push @records, $record;

Here we additionally have an ID, a flag 'duplicate' set to 0, and the sentence.

I push the $record onto an array for later use. And this is basically where my problems start. I'm used to working with arrays and hashes, but here each array element is a hash:

print @records

gives you

Entry=HASH(0x183efe8)Entry=HASH(0x1835288) etc.

I have lots of sentences to process (up to 500'000), so that's why I'd prefer to avoid too many loops and extract the duplicates in one pass.

So to summarize, I'd need to identify the duplicates and set the flag 'duplicate' to the number of times the duplicate sentence appears.

Before doing the object-oriented code, the approach was simple. But here it is getting more complicated.

I thank you for your help.

Larry

Replies are listed 'Best First'.
Re: objects and duplicates
by wfsp (Abbot) on Apr 27, 2008 at 17:49 UTC
    Each element of @records is an Entry object. You have some setter methods ($record->id(1)); now all you need are some getter methods. Loop over the array and call each getter. This just shows the id attribute.

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my @records;
    my $record = Entry->new;
    $record->set_id(1);
    push @records, $record;

    for my $rec (@records){
        printf qq{id: %s\n}, $rec->get_id;
    }

    package Entry;

    sub new {
        my ($class) = @_;
        my $self = {};
        bless $self, $class;
        return $self;
    }

    sub set_id {
        my ($self, $id) = @_;
        $self->{id} = $id;
    }

    sub get_id {
        return shift->{id};
    }
Re: objects and duplicates
by stiller (Friar) on Apr 27, 2008 at 18:20 UTC
    First, you have a bug in the ordering of $seen{$_} ... a few lines before you do %seen = ();

    You can reduce the amount of work (and code) by doing:

    my %seen; $seen{$_}++ for @temp;

    Now you have one entry in %seen for each sentence, and you know how many times each sentence occurred. Now you can use wfsp's package to make the objects.
      Wfsp and stiller, thank you both. It worked. I now use
      for $record (@records){ $duplicates{$_}++ for $record->src; }

      to store each sentence in a hash, with the number of times it appears. This is great.

      I still need to change the $record->duplicate of each object to the number of times the sentence appears. Do I have to write a new loop for this, or can we do it at the same time we count the duplicate sentences?

      I was thinking of something like this:

      for $record (@records){ $duplicates{$_}++ for $record->src; $record->duplicate++; }

      Thank you

        Does $record->src return a string or a list? If it returns a string you'll want:
        my %count;
        for my $record (@records) {
            if ($count{$record->src}++) {
                $record->duplicate(1);    # or however you set the duplicate flag
            }
        }

        Note: if a string is duplicated, the first Entry object with that src value will not have its duplicate flag set, but all later matching Entry objects will.
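If every record, including the first occurrence, should carry the total number of times its sentence appears, the total is not known until all records have been seen, so a second loop seems hard to avoid. The following is only a sketch under assumptions: the Entry class is a hypothetical stand-in whose accessors set a field when given an argument and return it otherwise, and the sample sentences are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the poster's Entry class: each accessor
# sets the field when given an argument and always returns the value.
{
    package Entry;
    sub new       { my ($class) = @_; my $self = {}; return bless $self, $class }
    sub id        { my $s = shift; $s->{id}        = shift if @_; return $s->{id} }
    sub src       { my $s = shift; $s->{src}       = shift if @_; return $s->{src} }
    sub duplicate { my $s = shift; $s->{duplicate} = shift if @_; return $s->{duplicate} }
}

# Made-up sample sentences standing in for the lines of the text file.
my @sentences = (
    'This is a duplicate.',
    'This is unique.',
    'This is a duplicate.',
);

# Build the records the way the original post does.
my @records;
my $next_id = 1;
for my $sentence (@sentences) {
    my $record = Entry->new;
    $record->id($next_id++);
    $record->duplicate(0);
    $record->src($sentence);
    push @records, $record;
}

# Pass 1: count how often each sentence occurs.
my %count;
$count{ $_->src }++ for @records;

# Pass 2: store the total on every record whose sentence occurs more than once.
for my $record (@records) {
    my $n = $count{ $record->src };
    $record->duplicate($n) if $n > 1;
}

printf "%d: duplicate=%d %s\n", $_->id, $_->duplicate, $_->src for @records;
```

Both passes are linear in the number of records, so even 500'000 sentences mean only two cheap loops over the array plus one hash lookup per record.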
        Just jot down the pseudo code: e.g.:
        • read the file, each line into a hash, incrementing the number of occurrences of that sentence.
        • create an object from each sentence, for which I already know the number of occurrences...
        • and so on
        hth
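The pseudo code above could be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual script: the Entry class with a hash-style constructor is hypothetical, and an in-memory filehandle with made-up sentences stands in for the real text file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical minimal Entry class that takes its fields at construction time.
{
    package Entry;
    sub new {
        my ($class, %args) = @_;
        my $self = { %args };
        return bless $self, $class;
    }
    sub id        { return $_[0]{id} }
    sub src       { return $_[0]{src} }
    sub duplicate { return $_[0]{duplicate} }
}

# Made-up data; an in-memory filehandle stands in for the real file.
my $text = "This is a duplicate.\nThis is unique.\nThis is a duplicate.\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

# Step 1: read each line, keep the order, and count the occurrences.
my (@temp, %seen);
while (my $line = <$fh>) {
    chomp $line;
    push @temp, $line;
    $seen{$line}++;
}
close $fh;

# Step 2: create one object per sentence; the count is already known here.
my @records;
my $id = 0;
for my $sentence (@temp) {
    my $n = $seen{$sentence};
    push @records, Entry->new(
        id        => ++$id,
        src       => $sentence,
        duplicate => $n > 1 ? $n : 0,
    );
}

printf "%d: duplicate=%d %s\n", $_->id, $_->duplicate, $_->src for @records;
```

Counting before the objects exist means the duplicate value can be set once, at construction, instead of being patched afterwards.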
Re: objects and duplicates
by dragonchild (Archbishop) on Apr 27, 2008 at 20:45 UTC
    Why do you want to make your code OO? Most programs in the world are not OO and do just fine. What benefit do you get from keeping information about the sentence in one place?

    As for your problem, you want to use overload, specifically stringify (or '""') and cmp. That way, your objects behave as you'd expect them to when you do something like:

    if ( $records[10] eq $records[20] ) { print "Sentence is '$records[10]'\n" }
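A minimal sketch of that idea, assuming a hypothetical Entry class that keeps its sentence in a src field (the constructor and sample sentences are made up for illustration). With '""' and cmp overloaded, eq is derived from cmp automatically, and an object used as a hash key stringifies to its sentence.

```perl
#!/usr/bin/perl
use strict;
use warnings;

{
    package Entry;    # hypothetical class, not the poster's real one
    use overload
        # Stringify to the sentence itself.
        '""'  => sub { return $_[0]{src} },
        # Compare by sentence; the third argument says whether the operands were swapped.
        'cmp' => sub {
            my ($left, $right, $swapped) = @_;
            my $result = "$left" cmp "$right";
            return $swapped ? -$result : $result;
        };

    sub new {
        my ($class, %args) = @_;
        my $self = { %args };
        return bless $self, $class;
    }
}

my @records;
push @records, Entry->new(src => $_)
    for ('This is a duplicate.', 'This is unique.', 'This is a duplicate.');

# eq is autogenerated from the overloaded cmp.
if ( $records[0] eq $records[2] ) {
    print "Sentence is '$records[0]'\n";
}

# Used as hash keys, the objects stringify to their sentences,
# so the usual counting idiom works on the objects directly.
my %count;
$count{$_}++ for @records;
print "$_ => $count{$_}\n" for sort keys %count;
```

One thing to keep in mind with this design: hash keys are plain strings, so %count maps sentences to counts and does not hold the objects themselves.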


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?