
This node falls below the community's minimum standard of quality and will not be displayed.

Re^3: Dear Monks
by Limbic~Region (Chancellor) on Mar 16, 2011 at 13:49 UTC
    sivaraman,
    But i want the output like

    Then you did a poor job of explaining your problem. I can see other things you haven't told us, such as ordering. Do you want the most frequent items first? Should the country codes (assuming US, GB and UK are country codes) be listed alphabetically on a line, in the order they appeared in the file, or does it not matter at all?

    In your first example, you had two lines that were 'abcd,US' and your desired output was '2 abcd US'. What would you have wanted if the input was

    abcd,US
    abcd,US
    abcd,GB

    Based on your poor description so far, I would guess '3 abcd US, GB', but I wonder whether it is important to know that 2 of the 3 came from US.

    In general, hashes allow you to keep track of distinct items (the abcd portion) and arrays help you maintain order. When you do a better job of explaining what you want, someone will likely jump in and provide a solution but you should be able to start exploring bart's solution on your own as well.
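    A minimal sketch of that idea: a hash counts each distinct item while an array remembers first-seen order. The sub name summarize and the 'US(2)' output format are assumptions for illustration only, not something the original poster asked for.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one summary line per key, with keys and
# country codes kept in the order they first appeared in the input.
sub summarize {
    my (%count, %codes, @keys);
    for my $line (@_) {
        my ($k, $v) = split /\s*,\s*/, $line;
        push @keys, $k unless exists $count{$k};           # array keeps key order
        push @{ $codes{$k} }, $v unless $count{$k}{$v}++;  # hash tracks distinct codes
    }
    return map {
        my $k     = $_;
        my $total = 0;
        $total += $_ for values %{ $count{$k} };
        my $list = join ', ',
            map { sprintf '%s(%d)', $_, $count{$k}{$_} } @{ $codes{$k} };
        "$total $k $list";
    } @keys;
}

print "$_\n" for summarize('abcd,US', 'abcd,US', 'abcd,GB');
```

    Keeping the per-country counts in the hash means the "2 of the 3 came from US" question can be answered either way once the requirement is known.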

    Cheers - L~R

Re^3: Dear Monks
by raybies (Chaplain) on Mar 16, 2011 at 15:00 UTC
    I agree your original question was wrong if you wanted output like that, but here's how I did it (you're lucky I was bored this morning...):

    use strict;
    use warnings;

    my %hh;

    # Read in: count each (key, country) pair.
    while (<DATA>) {
        chomp;
        my ($k, $v) = split /\s*,\s*/;
        $hh{$k}{$v}++;
    }

    # Now print: total count, key, then the sorted country list.
    foreach my $k (sort keys %hh) {
        my $tally = 0;
        $tally += $_ for values %{ $hh{$k} };
        my $tlist = join ', ', sort keys %{ $hh{$k} };
        print "$tally\t$k\t$tlist\n";
    }

    __DATA__
    abcd, GB
    abcd, UK
    abcd, US
    addd, US

    (Hopefully there aren't any typos. I wrote and tested this on another machine and typed it in by hand...)

    Update: fixed a typo in the code: "== split" should be "= split".

      Dear Monk,

      I am extremely sorry for not describing the problem clearly. Your previous suggestion was really helpful. Our file size is more than 30 million, so the script throws an out-of-memory error. Kindly suggest how to resolve this issue. Thank you in advance.

        sivaraman,

        Thank you for recognizing that your lack of description is leading us into providing inadequate solutions. You also have to recognize that changing requirements (5 million to 30 million is a significant difference) will also lead to us wasting time and energy.

        You still have done an inadequate job of describing all of the requirements in order for us to provide a solution that meets your requirements. Each time a monk provides a new solution based on your "it didn't work because X", you reply with "that didn't work because Y". In order for us to help, you need to define all the parameters of the problem first. I mentioned a number of things in Re^3: Dear Monks. Since your current issue seems to be memory, consider a few more: What operating system? How much physical memory? Is perl 32 or 64 bit?
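        One way to collect some of those details from perl itself is sketched below. It assumes only the core Config module; the sub name platform_summary is made up for illustration, and the amount of physical memory is not covered because finding it is OS-specific.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: reports the OS name perl was built for, the perl
# version, its architecture name, and the pointer size in bytes
# (4 suggests a 32-bit build, 8 a 64-bit build).
sub platform_summary {
    return sprintf "OS: %s, perl %s, archname: %s, pointer size: %d bytes",
        $^O, $Config{version}, $Config{archname}, $Config{ptrsize};
}

print platform_summary(), "\n";
```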

        There is a relatively simple solution if order doesn't matter, but since you haven't removed that as a constraint for us, it is rather difficult to guess what will satisfy your unwritten requirements.

        Additionally, you have to understand that PerlMonks is not a free script writing service. We expect you to show effort. If you don't know where the Perl documentation is, please see Perldoc online, though a local copy was probably installed and is available from the command line. With that said, I would be happy to provide a solution once you do a better job of describing the requirements, but I am not going to keep guessing with "try this".

        Cheers - L~R

        Have a look at DB_File or AnyDBM_File ... you will need to store the hash on disk instead of in memory if it's too big to fit there.
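        A rough sketch of the tied-hash idea, assuming AnyDBM_File with whichever DBM backend your perl provides. DBM files store only flat strings, so this sketch packs each key's tally and country list into one value; the sub name add_record, the value layout and the temporary file location are all assumptions for illustration. Some backends (SDBM_File in particular) limit the size of each key/value pair, so this only suits short country lists.

```perl
use strict;
use warnings;
use Fcntl;                         # exports O_RDWR and O_CREAT
use AnyDBM_File;
use File::Temp qw(tempdir);

# Hypothetical helper: update the "tally<TAB>code, code, ..." string stored
# under $k. Works on any hash reference, tied or not.
sub add_record {
    my ($db, $k, $v) = @_;
    my ($tally, $list) = split /\t/, ($db->{$k} // "0\t"), 2;
    my %seen = map { $_ => 1 } split /, /, $list;
    $seen{$v} = 1;
    $db->{$k} = join "\t", $tally + 1, join(', ', sort keys %seen);
}

my $dir = tempdir(CLEANUP => 1);   # assumption: a scratch directory is fine
tie my %count, 'AnyDBM_File', "$dir/counts", O_RDWR | O_CREAT, 0666
    or die "Cannot tie DBM file: $!";

for my $line ('abcd, US', 'abcd, US', 'abcd, GB', 'addd, US') {
    my ($k, $v) = split /\s*,\s*/, $line;
    add_record(\%count, $k, $v);
}

# DBM key order is not defined; sort externally if order matters.
while (my ($k, $packed) = each %count) {
    my ($tally, $list) = split /\t/, $packed, 2;
    print "$tally\t$k\t$list\n";
}

untie %count;
```

        The data now lives on disk rather than in RAM, at the cost of a disk access per record, so expect it to be much slower than an in-memory hash.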

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

        If I were you, I'd sort the file using a system utility like the Linux sort command (as sundial and others have suggested). Then all your "abcd" lines would already be grouped. You could then print them as you encountered them, tracking only one "abcd" symbol at a time and writing to a file or to the screen each time the symbol changed.

        If your data took this form:

        abcd, GS
        abcd, GT
        abcd, HI
        abcd, HI
        abcd, UK
        abcd, US
        abce, AK
        abce, AZ
        abce, GB
        abcf, UT
        abcf, US

        Can you see how you wouldn't need to keep track of every symbol (abcd, abce, abcf) in one hash at once? You could simply read a line at a time and tally as you go, and every time you noticed that you were no longer reading abcd but some different symbol, you'd print the finished group and reinitialize the tracking set for the new one...
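        A minimal sketch of that loop, assuming the input has already been sorted so that equal symbols are adjacent (for example by piping the file through the system sort first) and that each line looks like "symbol, code". The sub name tally_sorted and the tab-separated output are illustrative choices that mirror the earlier script.

```perl
use strict;
use warnings;

# Hypothetical helper: reads key-sorted "symbol, code" lines from $in and
# writes one "tally<TAB>symbol<TAB>codes" line to $out per symbol.
# Only the current symbol's codes are held in memory at any time.
sub tally_sorted {
    my ($in, $out) = @_;
    my ($current, $tally, %codes);

    my $flush = sub {
        return unless defined $current;
        print {$out} join("\t", $tally, $current, join(', ', sort keys %codes)), "\n";
    };

    while (my $line = <$in>) {
        chomp $line;
        my ($k, $v) = split /\s*,\s*/, $line;
        next unless defined $v;            # skip blank or malformed lines
        if (!defined $current or $k ne $current) {
            $flush->();                    # symbol changed: emit finished group
            $current = $k;
            $tally   = 0;
            %codes   = ();
        }
        $tally++;
        $codes{$v} = 1;
    }
    $flush->();                            # emit the final group
}

# Demo on an in-memory copy of the sample data above.
my $sample = join '', map { "$_\n" }
    'abcd, GS', 'abcd, GT', 'abcd, HI', 'abcd, HI', 'abcd, UK', 'abcd, US',
    'abce, AK', 'abce, AZ', 'abce, GB', 'abcf, UT', 'abcf, US';
open my $in, '<', \$sample or die "Cannot open in-memory input: $!";
tally_sorted($in, \*STDOUT);
close $in;
```

        For a real file, passing \*STDIN as the input handle and feeding the script from the sort command would avoid loading anything but the current group.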

        Why don't you try that, and if you still can't get it, come back and ask more questions. Good Luck... --Ray