meredib has asked for the wisdom of the Perl Monks concerning the following question:

I have a spreadsheet where I’m looking for consistent data. The CSV file is over 7,000 lines and it would take me too long to look through each line, and I need to look at it daily. What I’m looking for is a way to limit the data, but I have no clue where to start. The only thing I could think of would be removing rows where a specific number of zeros appear in the row, but that’s not going to help me in the long run. The data looks like:
JOBINIT,2,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655,0,0,0,0,0,0,0,0,0,0
INVPSWD,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,192,440,192,440,188,450,180,450,189,447,192,440,192,440,188,450,152,438,0,0,36,400,192,440
PWDEXPR,4,222,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,0,0,0,0,0,0,0,0,0,0,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655
UNDFUSER,1,0,1,0,3,4199,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
The output that I would want would be:
INSAUTH,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655,0,0,0,0,0,0,0,0,0,0
INSAUTH,192,440,192,440,188,450,180,450,189,447,192,440,192,440,188,450,152,438,0,0,36,400,192,440
INSAUTH,0,0,0,0,0,0,0,0,0,0,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655
I’m really just looking for some kind of idea of how to go about this, or if you have something already created I would like to look at it. Thank you for all your help.

Replies are listed 'Best First'.
Re: Remove array if data inconsistent
by moritz (Cardinal) on Apr 22, 2009 at 14:11 UTC
    The only thing I could think of would be removing rows where a specific number of zeros are in the row, but that’s not going to help me in the long run.

    So what are the criteria for filtering the lines? Should all lines that start with INSAUTH be preserved?

    We can help you to implement something when you tell us what you want to do, but we don't know your business and can't help you if you can't even give us a description in natural language.

      Oh geez, sorry about that. No, the first column does not matter. It's the remaining columns that I care about; I only want to see rows that have values greater than zero consistently. For instance, let's say a row has 15 values, the first 5 are zero, and of the last 10, 8 have values greater than zero.
        Ok, then the first step is to parse the file. That's been done many times before, and you should re-use such a solution, like Text::CSV or Text::CSV_XS.

        Then you have each row as an array, and you only have to iterate over them and check the condition you described.

        So do it, and when you run into problems come back with more specific questions.
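        To make that concrete, the condition described above (most fields non-zero) can be sketched without any modules, since the sample rows contain no quoted fields; for real-world CSV, Text::CSV or Text::CSV_XS would replace the `split`. The 0.3 fraction and the `consistent` name are illustrative choices, not from the thread:

```perl
use strict;
use warnings;

# A row is "consistent" when at least $min_frac of its numeric fields
# are non-zero. The 0.3 threshold is an illustrative guess -- tune it.
sub consistent {
    my ( $min_frac, @fields ) = @_;
    return 0 unless @fields;
    my $nonzero = grep { $_ != 0 } @fields;
    return $nonzero / @fields >= $min_frac;
}

my @rows = (
    'JOBINIT,2,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0',
    'INSAUTH,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655,0,0,0,0,0,0,0,0,0,0',
    'INVPSWD,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0',
);

for my $row (@rows) {
    my ( $label, @fields ) = split /,/, $row;    # first column ignored
    print "$row\n" if consistent( 0.3, @fields );
}
```

Only the INSAUTH row survives the 0.3 cut here; the other two rows have just two non-zero fields out of twenty-four.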

Re: Remove array if data inconsistent
by johngg (Canon) on Apr 22, 2009 at 14:41 UTC

    Perhaps you could discard any lines where the number of zeros exceeds some threshold that you choose. This will not check for adjacent non-zero values but may be sufficient for your needs.

    use strict;
    use warnings;

    open my $csvFH, q{<}, \ <<EOD or die qq{open: <<HEREDOC: $!\n};
JOBINIT,2,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655,0,0,0,0,0,0,0,0,0,0
INVPSWD,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,192,440,192,440,188,450,180,450,189,447,192,440,192,440,188,450,152,438,0,0,36,400,192,440
PWDEXPR,4,222,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,0,0,0,0,0,0,0,0,0,0,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655
UNDFUSER,1,0,1,0,3,4199,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
EOD

    READFILE: while ( <$csvFH> )
    {
        chomp;
        {
            no warnings q{numeric};
            next READFILE if 15 <= grep $_ == 0, split m{,};
        }
        print qq{$_\n};
    }

    close $csvFH or die qq{close: <<HEREDOC: $!\n};

    The output.

    INSAUTH,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655,0,0,0,0,0,0,0,0,0,0
    INSAUTH,192,440,192,440,188,450,180,450,189,447,192,440,192,440,188,450,152,438,0,0,36,400,192,440
    INSAUTH,0,0,0,0,0,0,0,0,0,0,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655

    I hope this is helpful.

    Cheers,

    JohnGG

    Update: Fixed code formatting problem caused by a TAB and changed 0s to zeros for readability.

      I've never used grep before and it seems to be what I was looking for all along. I changed your script a little, but this is what I did. I added an option when I run the script, $occurance, because it's not going to be the same number every time. Thanks a ton for everyone's help.
      open IN, '<' . $inFlNm or die "Unable to open >$inFlNm< : $!\n";

      READFILE: while ( <IN> )
      {
          chomp;
          next READFILE if $occurance > grep $_ != 0, split m{,};
          print OUT qq{$_\n};
      }

      close IN;
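      Rather than parsing the threshold from positional arguments by hand, it could also come from a named command-line option via the core Getopt::Long module; a sketch, where the `--min-nonzero` option name and the `passes` helper are illustrative, not from the original post:

```perl
use strict;
use warnings;
use Getopt::Long;

# A row passes when at least $threshold of its numeric fields are non-zero.
sub passes {
    my ( $threshold, @fields ) = @_;
    return $threshold <= grep { $_ != 0 } @fields;
}

# --min-nonzero is an illustrative option name; $occurance keeps the
# variable name from the post above.
my $occurance = 3;    # default threshold
GetOptions( 'min-nonzero=i' => \$occurance )
    or die "usage: $0 [--min-nonzero N] file.csv\n";

while ( my $line = <> ) {
    chomp $line;
    my ( undef, @fields ) = split /,/, $line;    # discard the label column
    print "$line\n" if passes( $occurance, @fields );
}
```

Run as e.g. `perl filter.pl --min-nonzero 8 data.csv`.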
Re: Remove array if data inconsistent
by przemo (Scribe) on Apr 22, 2009 at 14:24 UTC

    As moritz pointed out, you must know what concepts the data represent to decide what is important; e.g. a series of zeros might be useless, as you suggest, while in a sound application looking for silence it would be crucial.

    My advice is to spend a little time looking at what might be promising for you. Experiment with regular-expression filtering and see if the obtained results are interesting. If so, go on and filter/group/sort more, ad infinitum. Something like

    perl -ne 'print if /^INSAUTH\b/'

    would produce the output you mention.

    Further, Text::CSV is a possible improvement for data-analysis preparation.

Re: Remove array if data inconsistent
by jhourcle (Prior) on Apr 22, 2009 at 14:59 UTC

    It's not a direct answer to your question, which I think people have already commented on -- but once you've pared down the list, you might want to consider making some sort of visualization of the lines, so that you can quickly scan them visually and see what might be worth investigating further.

    For the data that you're dealing with, I'd probably look at using sparklines -- there are a few CPAN modules to generate them.

    Then, you can look at a page of graphs, and see which ones are stable / going up / random / etc.
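    For a quick look without installing anything, a rough terminal sparkline can also be built from the Unicode block characters; a sketch, where the `sparkline` helper and the eight-level scaling are illustrative choices rather than any particular CPAN module's API:

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Eight Unicode block characters, lowest to highest (U+2581 .. U+2588).
my @blocks = map { chr } 0x2581 .. 0x2588;

# Scale each value into one of the eight blocks; flat data maps to the
# lowest block rather than dividing by zero.
sub sparkline {
    my @vals = @_;
    my ( $min, $max ) = ( sort { $a <=> $b } @vals )[ 0, -1 ];
    my $range = ( $max - $min ) || 1;
    return join '',
        map { $blocks[ int( ( $_ - $min ) / $range * $#blocks ) ] } @vals;
}

my @rows = (
    'JOBINIT,2,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0',
    'INSAUTH,192,440,192,440,188,450,180,450,189,447,192,440,192,440,188,450,152,438,0,0,36,400,192,440',
);

for my $row (@rows) {
    my ( $name, @vals ) = split /,/, $row;
    printf "%-8s %s\n", $name, sparkline(@vals);
}
```

A page of such lines makes stable vs. spiky records visible at a glance.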

      Along the same lines, you could always simply sort the different lines into two different output files if you are worried about record keeping. For instance, once you decide how you want to filter/screen the data (probably by one of the methods discussed in the previous posts), you could put in a simple if-else check to send the 'consistent' data lines to a file named "good_data.csv" and the 'inconsistent' data lines to a file named "discarded_data.csv" (or whichever names you like), so that you can review the discarded data at a later time if need be, or recover any lines that were filtered incorrectly. This also might help you find any errors or bugs in your code later on as you test it more thoroughly.

      I am thinking something along these lines (pseudocode) (in case you aren't terribly familiar with syntax):

      open INPUT,       '<', 'name_of_input_file'  or die $!;
      open GOOD_OUTPUT, '>', 'good_data.csv'       or die $!;
      open BAD_OUTPUT,  '>', 'discarded_data.csv'  or die $!;

      while ( my $line = <INPUT> ) {
          chomp $line;
          if ( data_meets_good_condition($line) ) {    # your filtering check here
              print GOOD_OUTPUT "$line\n";
          }
          else {
              print BAD_OUTPUT "$line\n";
          }
      }

      That's a little rudimentary (and verbose if you are a fan of golf) and needs an appropriate logical check in the if condition, but the primary idea is the if-else split, so that you don't totally wipe data by accident. Again, in terms of filtering the data, some of the methods discussed above are probably better.

Re: Remove array if data inconsistent
by kennethk (Abbot) on Apr 22, 2009 at 14:50 UTC
    How is this file generated? I would suggest that depending on how these "inconsistent" records are generated and their relative abundance, you may want to spit out an error message instead of dropping data. Specifically, if these are the results of a human data entry failure, quietly dropping records may be counter-productive.