meredib has asked for the wisdom of the Perl Monks concerning the following question:

I have a spreadsheet where I’m looking for consistent data. The CSV file is over 7,000 lines and it would take me too long to look through each line, and I need to look at it daily. What I’m looking for is a way to limit the data, but I have no clue where to start. The only thing I could think of would be removing rows where a specific number of zeros appear in the row, but that’s not going to help me in the long run. The data looks like:
JOBINIT,2,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655,0,0,0,0,0,0,0,0,0,0
INVPSWD,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,192,440,192,440,188,450,180,450,189,447,192,440,192,440,188,450,152,438,0,0,36,400,192,440
PWDEXPR,4,222,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,0,0,0,0,0,0,0,0,0,0,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655
UNDFUSER,1,0,1,0,3,4199,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
The output that I would want would be:
INSAUTH,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655,0,0,0,0,0,0,0,0,0,0
INSAUTH,192,440,192,440,188,450,180,450,189,447,192,440,192,440,188,450,152,438,0,0,36,400,192,440
INSAUTH,0,0,0,0,0,0,0,0,0,0,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655
I’m really just looking for some kind of idea of how to go about this, or if you have something already created I would like to look at it. Thank you for all your help.

Replies are listed 'Best First'.
Re: Remove array if data inconsistent
by moritz (Cardinal) on Apr 22, 2009 at 14:11 UTC
    The only thing I could think of would be removing rows where a specific number of zeros are in the row, but that’s not going to help me in the long run.

    So what are the criteria for filtering the lines? Should all lines that start with INSAUTH be preserved?

    We can help you to implement something when you tell us what you want to do, but we don't know your business and can't help you if you can't even give us a description in natural language.

      Oh geez, sorry about that. No, the first column does not matter. It's the remaining columns that I care about; I only want to see rows that have values greater than zero consistently. For instance, let's say a row has 15 values, the first 5 are zero, and of the last 10, 8 have values greater than zero.
        Ok, then the first step is to parse the file. That's been done many times before, and you should re-use such a solution, like Text::CSV or Text::CSV_XS.

        Then you have each row as an array, and you only have to iterate over them and check the condition you described.

        So do it, and when you run into problems come back with more specific questions.
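        To make that concrete, the condition described above (most fields non-zero) can be sketched without any modules, since the sample rows contain no quoted fields; for real-world CSV, Text::CSV or Text::CSV_XS would replace the `split`. The 0.3 fraction and the `consistent` name are illustrative choices, not from the thread:

```perl
use strict;
use warnings;

# A row is "consistent" when at least $min_frac of its numeric fields
# are non-zero. The 0.3 threshold is an illustrative guess -- tune it.
sub consistent {
    my ( $min_frac, @fields ) = @_;
    return 0 unless @fields;
    my $nonzero = grep { $_ != 0 } @fields;
    return $nonzero / @fields >= $min_frac;
}

my @rows = (
    'JOBINIT,2,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0',
    'INSAUTH,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655,0,0,0,0,0,0,0,0,0,0',
    'INVPSWD,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0',
);

for my $row (@rows) {
    my ( $label, @fields ) = split /,/, $row;    # first column ignored
    print "$row\n" if consistent( 0.3, @fields );
}
```

Only the INSAUTH row survives the 0.3 cut here; the other two rows have just two non-zero fields out of twenty-four.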

Re: Remove array if data inconsistent
by johngg (Canon) on Apr 22, 2009 at 14:41 UTC

    Perhaps you could discard any lines where the number of zeros exceeds some threshold that you choose. This will not check for adjacent non-zero values but may be sufficient for your needs.

    use strict;
    use warnings;

    open my $csvFH, q{<}, \ <<EOD or die qq{open: <<HEREDOC: $!\n};
JOBINIT,2,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655,0,0,0,0,0,0,0,0,0,0
INVPSWD,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,192,440,192,440,188,450,180,450,189,447,192,440,192,440,188,450,152,438,0,0,36,400,192,440
PWDEXPR,4,222,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
INSAUTH,0,0,0,0,0,0,0,0,0,0,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655
UNDFUSER,1,0,1,0,3,4199,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
EOD

    READFILE: while ( <$csvFH> )
    {
        chomp;
        {
            no warnings q{numeric};
            next READFILE if 15 <= grep $_ == 0, split m{,};
        }
        print qq{$_\n};
    }

    close $csvFH or die qq{close: <<HEREDOC: $!\n};

    The output.

    INSAUTH,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655,0,0,0,0,0,0,0,0,0,0
    INSAUTH,192,440,192,440,188,450,180,450,189,447,192,440,192,440,188,450,152,438,0,0,36,400,192,440
    INSAUTH,0,0,0,0,0,0,0,0,0,0,5,7107,4,9343,5,7716,1,0,3,254,1,0,8,5655

    I hope this is helpful.

    Cheers,

    JohnGG

    Update: Fixed code formatting problem caused by a TAB and changed 0s to zeros for readability.

      I've never used grep before and it seems to be what I was looking for all along. I changed your script a little, but this is what I did. I added an option when I run the script, $occurance, because it's not going to be the same number every time. Thanks a ton for everyone's help.
      open IN, '<' . $inFlNm or die "Unable to open >$inFlNm< : $!\n";

      READFILE: while ( <IN> )
      {
          chomp;
          next READFILE if $occurance > grep $_ != 0, split m{,};
          print OUT qq{$_\n};
      }

      close IN;
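      Rather than parsing the threshold from positional arguments by hand, it could also come from a named command-line option via the core Getopt::Long module; a sketch, where the `--min-nonzero` option name and the `passes` helper are illustrative, not from the original post:

```perl
use strict;
use warnings;
use Getopt::Long;

# A row passes when at least $threshold of its numeric fields are non-zero.
sub passes {
    my ( $threshold, @fields ) = @_;
    return $threshold <= grep { $_ != 0 } @fields;
}

# --min-nonzero is an illustrative option name; $occurance keeps the
# variable name from the post above.
my $occurance = 3;    # default threshold
GetOptions( 'min-nonzero=i' => \$occurance )
    or die "usage: $0 [--min-nonzero N] file.csv\n";

while ( my $line = <> ) {
    chomp $line;
    my ( undef, @fields ) = split /,/, $line;    # discard the label column
    print "$line\n" if passes( $occurance, @fields );
}
```

Run as e.g. `perl filter.pl --min-nonzero 8 data.csv`.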
Re: Remove array if data inconsistent
by przemo (Scribe) on Apr 22, 2009 at 14:24 UTC

    As moritz pointed out, you must know what concepts the data represent to decide what is important; e.g. a series of zeros might be useless, as you suggest, while in a sound application looking for silence it would be crucial.

    My advice is to spend a little time looking at what might be promising for you. Experiment with regular-expression filtering and see if the obtained results are interesting. If so, go on and filter/group/sort more, ad infinitum. Something like

    perl -ne 'print if /^INSAUTH\b/'

    would produce the output you mention.

    Further, Text::CSV is a possible improvement for data-analysis preparation.

Re: Remove array if data inconsistent
by jhourcle (Prior) on Apr 22, 2009 at 14:59 UTC

    It's not a direct answer to your question, which I think people have already commented on -- but once you've pared down the list, you might want to consider making some sort of visualization of the lines, so that you can quickly scan them visually and see what might be worth investigating further.

    For the data that you're dealing with, I'd probably look at using sparklines -- there are a few CPAN modules to generate them.

    Then, you can look at a page of graphs, and see which ones are stable / going up / random / etc.
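    For a quick look without installing anything, a rough terminal sparkline can also be built from the Unicode block characters; a sketch, where the `sparkline` helper and the eight-level scaling are illustrative choices rather than any particular CPAN module's API:

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Eight Unicode block characters, lowest to highest (U+2581 .. U+2588).
my @blocks = map { chr } 0x2581 .. 0x2588;

# Scale each value into one of the eight blocks; flat data maps to the
# lowest block rather than dividing by zero.
sub sparkline {
    my @vals = @_;
    my ( $min, $max ) = ( sort { $a <=> $b } @vals )[ 0, -1 ];
    my $range = ( $max - $min ) || 1;
    return join '',
        map { $blocks[ int( ( $_ - $min ) / $range * $#blocks ) ] } @vals;
}

my @rows = (
    'JOBINIT,2,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0',
    'INSAUTH,192,440,192,440,188,450,180,450,189,447,192,440,192,440,188,450,152,438,0,0,36,400,192,440',
);

for my $row (@rows) {
    my ( $name, @vals ) = split /,/, $row;
    printf "%-8s %s\n", $name, sparkline(@vals);
}
```

A page of such lines makes stable vs. spiky records visible at a glance.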

      Along the same lines, you could always simply sort the different lines into two different output files if you are worried about record keeping. For instance, once you decide how you want to filter/screen the data (probably by one of the methods discussed in the previous posts), you could put in a simple if-else check to send the 'consistent' data lines to a file named "good_data.csv" and the 'inconsistent' data lines to a file named "discarded_data.csv" (or whichever names you like), so that you can review the discarded data at a later time if need be, or recover any lines that were filtered incorrectly. This also might help you find any errors or bugs in your code later on as you test it more thoroughly.

      I am thinking something along these lines (pseudocode) (in case you aren't terribly familiar with syntax):

      open INPUT,       '<', 'name_of_input_file'  or die $!;
      open GOOD_OUTPUT, '>', 'good_data.csv'       or die $!;
      open BAD_OUTPUT,  '>', 'discarded_data.csv'  or die $!;

      while ( my $line = <INPUT> ) {
          chomp $line;
          if ( data_meets_good_condition($line) ) {    # your filtering check here
              print GOOD_OUTPUT "$line\n";
          }
          else {
              print BAD_OUTPUT "$line\n";
          }
      }

      That's a little rudimentary (and verbose if you are a fan of golf) and needs an appropriate logical check in the if condition, but the primary idea is the if-else split, so that you don't totally wipe data by accident. Again, in terms of filtering the data, some of the methods discussed above are probably better.

Re: Remove array if data inconsistent
by kennethk (Abbot) on Apr 22, 2009 at 14:50 UTC
    How is this file generated? I would suggest that depending on how these "inconsistent" records are generated and their relative abundance, you may want to spit out an error message instead of dropping data. Specifically, if these are the results of a human data entry failure, quietly dropping records may be counter-productive.