walkingthecow has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a CSV file that I need to process. I don't have any code to really post, because I can't think of how to start this one... Anyway, I will post a sample from the CSV file, and what I am trying to get.


Sample:

,,,,
Group: ASDF_LKJ_SanFrancisco_Group(S),,SanFrancisco_Group(S),,
,Users (9),LNUM,AccountName,
,,54090,dlosa,
,,54090,crownroy,
,,1915,jackdan,
,,1315,makersmar,
,,1265,whisgin,
,,2415,cdasde,
,,54090,sfafdf,
,,54090,asdfdsf,
,,54090,afdasdfa,
,,,,
Group: ASDF_LKJ_Phoenix_Group(S),,Phoenix_Group(S),,
,Users (1),LNUM,AccountName,
,,36203,test_haackk,
,,,,
Group: ASDF_LKJ_NewYork_Group(S),,NewYork_Group(S),,
,Users (5),LNUM,AccountName,
,,7997,barb,
,,64081,cant,
,,64081,chan,
,,48293,chanjas,
,,64081,nish,
,,,,
Group: ASDF_LKJ_Portland_Group(S),,Portland_Group(S),,
,Users (3),LNUM,AccountName,
,,,Test_K,
,,,Test_l,
,,,Test_x,
As you can see, each seperate set is seperated by a header that starts with Group: and the next line that starts with ,Users.

I am trying to just print each of those out to a file, then count the numbers underneath and find the most common occurence. It would save the most common number and the output would look something like this:


Desired Output: ,,,,
Group: ASDF_LKJ_SanFrancisco_Group(S),,SanFrancisco_Group(S),,
,Users (9),LNUM,AccountName,
54090
,,,,
Group: ASDF_LKJ_Phoenix_Group(S),,Phoenix_Group(S),,
,Users (1),LNUM,AccountName,
36203
,,,,
Group: ASDF_LKJ_NewYork_Group(S),,ASNewYork_Group(S),,
,Users (5),LNUM,AccountName,
64081
Group: ASDF_LKJ_Portland_Group(S),,Portland_Group(S),,
,Users (3),LNUM,AccountName,



Basically, I just want to find the most commonly used numbers under each header, and print it out, discarding the other numbers.

Replies are listed 'Best First'.
Re: Counting Random Elements
by broomduster (Priest) on Sep 03, 2008 at 22:34 UTC
    You have probably read this on PM somewhere else, but I will repeat it in case you missed it: You are very likely to get help with code you post, and not so likely to get someone to write it for you.

    So how about some code? Try something. See how well it works (or not). Then post it along with an explanation of what you really want it to do and how your expectations are not being met.

    And see How (Not) to Ask a Question for tips on what folks here expect to see when you ask for help.

      I understand what you're saying, and I hope it does not seem like I am asking anyone to write me code. I am just trying to figure out where to start, and was asking some help there. I read the CSV file into an array and read it line-by-line, but not sure how to count occurence of the numbers, nor how to break them into sets based on the headers.

        You should almost never "read the file into an array". In general for large files that leads to poor performance.

        In the particular case of CSV files you should not process them line by line! CSV generally allows line breaks within fields so a row of data may span more than one line. Not a common case I grant you, but the Text::xSV module correctly handles that case along with the more common case of a line is a row.

        To solve any programming problem it helps to start with a high level sketch of how things are going to go so even coding up what you have described and using comments to indicate where work needs to be done would be considered as a good starting point.


        Perl reduces RSI - it saves typing
Re: Counting Random Elements
by GrandFather (Saint) on Sep 03, 2008 at 22:54 UTC

    CSV of course immediately triggers "use a csv module", so that part is easy: Text::xSV or one of the CSV modules.

    "most common occurence" should trigger "hash for uniqueness". Consider:

    my @items = qw(foo poot bar foo baz poot foo); my %counts; ++$counts{$_} for @items; print "$_: $counts{$_}\n" for sort keys %counts;

    The trick then is to figure out how to process groups coming at you a row at a time. There are a bundle of ways of doing that. A simple way is to perform the processing for the previous group each time you see a new group header (unless there was no previous group):

    use strict; my @rowTypes = (1, 2, 3, 3, 3, 1, 2, 3, 3, 1, 2, 1, 3); my $count; for my $rowType (@rowTypes) { if ($rowType == 1) { # New group next unless defined $count; # No processing for first or empty + group print "$count "; $count = 0; # Reset count. Set to undef to skip empty group next; } ++$count if $rowType == 3; } print "$count " if defined $count; # Need to process the last group

    Perl reduces RSI - it saves typing
Re: Counting Random Elements
by apl (Monsignor) on Sep 04, 2008 at 01:18 UTC
    Don't worry about writing Perl. Solve your problem by hand, and write down the steps you took. Write it in English, or in pseudo-code.

    Once you work the problem by hand, and explain it to someone else (even if that someone else is you), you'll be better able to start coding your solution.
Re: Counting Random Elements
by oko1 (Deacon) on Sep 03, 2008 at 22:48 UTC
    > I don't have any code to really post, because I can't think of how to start this one...
    

    <hint type="broad"> Try looping over your input file, collecting the data in a hash, and analyzing it.</hint>

    If you then post the code here, and explain exactly where you're getting stuck (this assumes that all your questions haven't yet been answered - although effort is usually the thing that resolves questions), you'll find a number of capable people willing to help.

    Please note that "help" here on PerlMonks is not defined as "we'll do it for you". It is, usually, defined as "you do it, and we'll assist as necessary."

    
    -- 
    Human history becomes more and more a race between education and catastrophe. -- HG Wells
    
Re: Counting Random Elements
by gone2015 (Deacon) on Sep 03, 2008 at 23:48 UTC

    OK, so you need to identify the start and end of each group, and count the number of times within each group that each number occurs, and print out the most popular ?

    What do you want to do if several are equally popular ?

    Are those blank lines before each group header, or blank columns at the start of each group header ?

    You want to read the file, line by line. Unless the CSV is guaranteed simple, you should consider using CSV module to parse it for you. Anything involving recognising strings, probably you want to use a hash for. (With numbers you could use an array, but if the number range is bit or sparse, use a hash anyway.) Then unpeel the hash to extract the collected data.

    So... something along the lines of:

    use strict ; use warnings ; use List::Util qw(max) ; open my $STUFF, "<", "stuff.csv" or die "$!" ; my %number_ids = () ; while (my $line = <$STUFF>) { $line =~ s/\s+\z// ; $line =~ s/\A\s+// ; my @csv = split(/\s*\,\s*/, $line) ; if (defined($csv[2]) && ($csv[2] =~ m/^\d+$/)) { $number_ids{$csv[2]}++ ; } elsif (defined($csv[4]) && ($csv[4] =~ m/^group\:/i)) { show_most_popular(\%number_ids) ; print $line, "\n" ; %number_ids = () ; } else { # OK, so what do we do with peculiar lines ? } ; } ; show_most_popular(\%number_ids) ; sub show_most_popular { my ($r_ids) = @_ ; return if !%$r_ids ; my $max = max(values %$r_ids) ; my @popular = () ; while (my ($id, $count) = each %$r_ids) { if ($count == $max) { push @popular, $id ; } ; } ; print join(',', sort @popular), "\n" ; } ;
    appears to do what you asked for.

    Various bits of Perl magic:

    • $line =~ s/\s+\z// will remove any trailing whitespace, including any form of newline known to me. This is more general than chomp IMHO.
    • $line =~ s/\A\s+// smacks any leading whitespace.
    • my @csv = split(/\s*\,\s*/, $line) smacks any whitespace around the separators.
    • $number_ids{$csv[2]}++ -- if the entry for the id already exists, increment it. Otherwise, create the entry, undefined, and increment, which gives 1.
    • worth being clear that although the field looks numeric, it's a string at this point. That means that ids '123' and '0000123' would be different.
    • show_most_popular(\%number_ids) passes the hash by reference, to be tidy.
    • <code>return if !%$r_ids<code> -- returns if the hash referred to by $r_ids is empty (OK, so that's obvious)

    OK, so I was at a loose end.

    Update: I should also have noted that the code is assuming that the 'Group:' lines include the apparently floating ',,,,'. Also, that when knocking out this kind of thing -- scan a file to recognise and pick out stuff to process -- it is important to trap stuff that hasn't been recognised; the odd's are that you'll find bugs either in the code or in your understanding of the data.