Find duplicate elements from an array

by tej (Scribe)
on Nov 22, 2010 at 02:29 UTC

tej has asked for the wisdom of the Perl Monks concerning the following question:

I am working on an array in which each entry contains a name and a page number.

I want to match duplicate names and remove the second one. However, I want to add the page number of the second element in front of the first one.

E.g. the array contains elements like:

Barmparas, Galinos, 1120

Barmparas, Galinos, 1410

Beal, Alan, 866

Beall, Dawson, 1062

I want to change the array to:

Barmparas, Galinos, 1120, 1410

Beal, Alan, 866

Beall, Dawson, 1062

Can anybody please help me?

Thank you

Replies are listed 'Best First'.
Re: Find duplicate elements from an array
by PeterPeiGuo (Hermit) on Nov 22, 2010 at 02:36 UTC

    You have to show some effort. Here are some tips:

    A hash is perfect in this case. Use the "entry" (the name part) as the key, and put the page numbers in an array. So we are talking about a hash of arrays here.

    Traverse the file line by line, and use a simple regex to separate the "entry" from the page number.
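
    A minimal sketch along those lines (my own illustration of the tips above, assuming the lines look like the ones in the question, with the page number as the last comma-separated field):

        use strict;
        use warnings;

        my %pages;    # "entry" (the name part) => array of its page numbers

        while (my $line = <DATA>) {
            chomp $line;
            # a simple regex to separate the entry from the trailing page number
            if ($line =~ /^(.*?),\s*(\d+)\s*$/) {
                push @{ $pages{$1} }, $2;
            }
        }

        # one line per entry, with all of its page numbers appended
        for my $entry (sort keys %pages) {
            print "$entry, ", join(', ', @{ $pages{$entry} }), "\n";
        }

        __DATA__
        Barmparas, Galinos, 1120
        Barmparas, Galinos, 1410
        Beal, Alan, 866
        Beall, Dawson, 1062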

    Peter (Guo) Pei

Re: Find duplicate elements from an array
by LanX (Saint) on Nov 22, 2010 at 04:51 UTC
    You're most likely using the wrong data structure. Better try a hash of arrays...

    Cheers Rolf

      Yep, a HoA keyed by name is what you need. Simply unshift the page number onto the beginning of the array (if you want the second page number before the first - you specify this, but your example doesn't back it up):

      use strict;
      use warnings;
      use Data::Dumper;

      my %HoA = ();

      while (<DATA>) {
          if (/(.+)\s+(\d+)\s*$/) {
              unshift(@{$HoA{$1}}, $2);
          }
      }

      print Dumper(\%HoA);

      __DATA__
      Barmparas, Galinos, 1120
      Barmparas, Galinos, 1410
      Beal, Alan, 866
      Beall, Dawson, 1062

      Output:

      $VAR1 = {
                'Beal, Alan,' => [
                                   '866'
                                 ],
                'Beall, Dawson,' => [
                                      '1062'
                                    ],
                'Barmparas, Galinos,' => [
                                           '1410',
                                           '1120'
                                         ]
              };
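
      If the goal is the merged lines from the question rather than a dumped hash, one possible follow-up (reusing the %HoA built above; note the pages come out newest-first because of the unshift) would be something like:

          for my $name (sort keys %HoA) {
              print "$name ", join(', ', @{ $HoA{$name} }), "\n";
          }
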
        Thank you so much.. mjscott2702
Re: Find duplicate elements from an array
by Anonymous Monk on Nov 22, 2010 at 03:16 UTC

      I don't want to remove duplicate elements directly; I need some information from them. Using a hash will help me remove duplicate elements, but what about the page numbers? How can I add them at the appropriate place?

Re: Find duplicate elements from an array
by sundialsvc4 (Abbot) on Nov 22, 2010 at 13:47 UTC

    When I see “match duplicate names,” and assuming a reasonable data size, I immediately think of hash tables. A hash whose key is the name, and whose value is an arrayref pointing to a list of page numbers (or, if you are also worried about duplicates there, another hashref).
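
    A rough sketch of that second variant (my own illustration, assuming the input format from the question), where the inner hash acts as a set so repeated page numbers collapse:

        use strict;
        use warnings;

        my %pages;    # name => { page number => 1 }, the inner hash acts as a set

        while (my $line = <>) {
            chomp $line;
            next unless $line =~ /^(.*?),\s*(\d+)\s*$/;
            $pages{$1}{$2} = 1;
        }

        for my $name (sort keys %pages) {
            print "$name, ", join(', ', sort { $a <=> $b } keys %{ $pages{$name} }), "\n";
        }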

    If you have large data volumes, there is another excellent approach ... one that pre-dates the digital computer. That approach is: sorting. If you sort the data file, first by name and then by page number (an operation which can be done “unexpectedly fast,” even for huge files), the resulting files have the following useful characteristics (a sketch of such a merge pass follows the list):

    1. All occurrences of any particular key-value will be adjacent. So you can detect “gaps” without searching.
    2. Entirely-duplicate records are also adjacent, so you can detect them by comparing the current record to the preceding one.
    3. If you need to compare, merge, diff, etc. two arbitrarily-large files, simply sort them both the same way and the problem becomes trivial.
    4. Most people want their output reports nicely sorted ... and presto, you’ve got that.
    5. No matter how much data you’ve got, it all takes a constant amount of memory and a linear amount of time. (Excluding the sort process itself, which might well be faster than the rest of it.)
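
    Here is one possible sketch of that merge pass over already-sorted input (my own illustration, assuming the file has already been run through the system sort, e.g. "sort input.txt | perl merge.pl", with the page number last on each line). It only ever holds one name's pages in memory:

        use strict;
        use warnings;

        my ($prev_name, @pages);

        while (my $line = <>) {
            chomp $line;
            next unless $line =~ /^(.*?),\s*(\d+)\s*$/;
            my ($name, $page) = ($1, $2);

            # a new name means the previous group is complete; print it and reset
            if (defined $prev_name && $name ne $prev_name) {
                print "$prev_name, ", join(', ', @pages), "\n";
                @pages = ();
            }
            $prev_name = $name;
            push @pages, $page;
        }

        # flush the final group
        print "$prev_name, ", join(', ', @pages), "\n" if defined $prev_name;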

      If you have large data volumes, there is another excellent approach ... one that pre-dates the digital computer. That approach is: sorting.
      Wait, you're saying if the data set is large, instead of using an algorithm with a linear expected running time, you opt for one that's Ω(N log N)?
      (Excluding the sort-process itself, which might well be faster than the rest of it.)
      The larger the dataset is, the more dominant the sorting will be. Even if it's written in the best optimized way possible, running in C, if the dataset is large enough, Ω(N log N) will take more time than a linear algorithm that's driven by turtles.

      Not my kind of advice.

        When data sizes become large, the fact that “memory is virtual” becomes extremely critical. When a system begins to “thrash,” actual running times rise exponentially ... a phenomenon with an obvious “elbow” in the time-curve which is not-so-affectionately known as “hitting the wall.” (Or, if you prefer, “the windshield.” As in, “one of those big ugly bugs with very colorful and juicy guts that are specially evolved to direct themselves, at the moment of their demise, directly and precisely at the middle of the driver’s side of the glass, about halfway up.” Or maybe a migrating swarm of them, and yours is the lucky automobile. Ugh.)

        An algorithm based on external disk-sorting, then, will avoid that issue. The sort program can be relied upon to use available memory efficiently, without overwhelming that resource when the file becomes very large. The programs that are subsequently written to use these known-to-be identically-sorted files are, as I said, “trivial.”

        It’s so important that there are companies (http://www.syncsort.com) that do nothing else but “sort huge things faster than anyone else knows how to do.” “When you need them, you will pay them.”

        The same strategy also shines when you are otherwise dealing with very large random-access disk files. Each access to such a file might entail one or more “seeks,” and this purely-physical consideration can drive completion times into the dungeon. An algorithm based on sorting can operate hundreds, or even thousands(!), of times faster.

        It won’t shine ... it may well be more costly ... when one of these resource-driven constraints does not exist, as is often the case. Is this, then, “a silver-bullet strategy?” No, it is somewhat of a pain in the asterisk. Memory is so plentiful and cheap these days that it is often a non-issue (although this is also why a program that seems to be very successful in system-acceptance testing falls flat on its asterisk in production). This is a strategy that you need to be familiar with. “When you need it, you will know.”

Re: Find duplicate elements from an array
by JavaFan (Canon) on Nov 22, 2010 at 14:21 UTC
    $"=", ";/\d+$/,say"$`@{$`{$`}}"for grep{/\d+$/;push@{$`{$`}},$&;@{$`{$ +`}}==1}<>
