When parsing any data, the first step is to think about the format and what separates the data. Here the format is space-separated tokens. Each token is an identifier, optionally followed by a comma and then comma-separated values.
The data appears to be very regular, and that makes it easy to parse. Don't overcomplicate things. The first-step tokenizer should just split each line into tokens based upon whitespace. Each token can then be split on ",". No fancy regex stuff appears to be required here. Use the easiest tool to get the job done.

Examples:

    UNITS,PPM
    TZONE,HST,10
    BEGIN_FILE
    DH1,150031001,9,8,5,6,5,5,8,9,8,7,4,-999,5
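Here is a minimal sketch of that two-stage split, using a shortened line built from the example tokens in this post. The variable names (`$line`, `@records`) are just for illustration and are not part of the script I posted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line assembled from the example tokens (DH1 token shortened).
my $line = 'UNITS,PPM TZONE,HST,10 BEGIN_FILE DH1,150031001,9,8,5';

# Stage 1: split the line into tokens on whitespace.
# Stage 2: split each token into fields on commas.
my @records = map { [ split /,/, $_ ] } split ' ', $line;

foreach my $fields (@records) {
    my ($id, @values) = @$fields;    # first field is the identifier
    print "$id: ", scalar(@values), " value(s)\n";
}
```

A token with no comma, such as BEGIN_FILE, simply comes back as a one-element list, so no special case is needed for it.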
When you see a new measurement variable like CO or NO2, just keep track of that change and print the data collected so far, if any. It appears that you are counting the number of 24-hour measurement days from reporting stations for particular types of measurements, in particular PM 2.5, whatever that means. I don't see any need to pay attention to the start or end of data flags, as all that appears to be necessary is to pay attention to the tokens with lots of commas in them - so just count commas!
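If you want to take "count commas" literally, `tr/,//` returns the number of commas in a string without modifying it. This is only a sketch: `count_commas` is a made-up helper name, and what cutoff counts as "lots" is something you have to decide from your own data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: number of commas in one token.
sub count_commas {
    my ($token) = @_;
    my $commas = ($token =~ tr/,//);   # tr with an empty replacement only counts
    return $commas;
}

# Tokens taken from the examples in this post.
my @tokens = (
    'BEGIN_FILE',
    'TZONE,HST,10',
    'DH1,150031001,9,8,5,6,5,5,8,9,8,7,4,-999,5',
);

foreach my $token (@tokens) {
    printf "%-12s %d comma(s)\n", (split /,/, $token)[0], count_commas($token);
}
```

Splitting on "," and counting the resulting fields amounts to the same test, since a token with N commas normally yields N+1 fields (split drops trailing empty fields, so the two can differ if a token ends in commas).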
The code below does that. I just cut-n-pasted your data into a __DATA__ segment to run my code, then chopped most of it off for posting here. Again, I have no idea what date you want - this data has lots and lots of dates and times! My count of stations reporting doesn't agree with your output line, but that is probably because there are extra conditions that you didn't explain.
    #!/usr/bin/perl -w
    use strict;

    # Slurp all of __DATA__ so the script also works if the data spans several lines
    my $data = do { local $/; <DATA> };
    my @data = split(/\s+/,$data);

    #print "$_\n" foreach @data; #run to see what data looks like

    my %stns;                 # station id => number of long data tokens seen
    my $variable = undef;     # current measurement variable (CO, NO2, ...)

    foreach my $token (@data)
    {
        my @tokens = split(/,/,$token);
        if ($tokens[0] eq 'VARIABLE')
        {
            print_line();     # flush counts for the previous variable
            $variable = $tokens[1];
        }
        if ( @tokens > 15)    #stations with 24 hour data
        {
            $stns{$tokens[0]}++;
        }
    }
    print_line();             #for the last data set

    sub print_line
    {
        return if (!defined($variable)) ;  #no data yet
        return if (!keys %stns);           #no 24 point data
        print "$variable DATE? ";
        print "$_ $stns{$_} " foreach (sort keys %stns);
        print "\n";
        %stns = ();           # reset for the next variable
    }

    =prints
    CO DATE? DH1 2 KA5 2
    NO2 DATE? KA5 2 WB6 2
    OZONE DATE? SI2 2
    PM10 DATE? DH1 2 KA5 2 PC 2 WB6 2
    PM2.5 DATE? DH1 2 HL11 2 KA5 2 KH19 2 KN12 2 MV17 2 OV20 2 PA16 2 PC 2 SI2 2
    SO2 DATE? DH1 2 HL11 2 KA5 2 KN12 2 MV17 2 OV20 2 PA16 2 PE10 2 WB6 2
    WD DATE? DH1 2 HL11 2 KA5 2 KN12 2 MV17 2 OV20 2 PA16 2 PC 2 PE10 2 SI2 2 WB6 2
    WS DATE? DH1 2 HL11 2 KA5 2 KN12 2 MV17 2 OV20 2 PA16 2 PC 2 PE10 2 SI2 2 WB6 2
    =cut

    __DATA__
    BEGIN_FILE FORMAT_VERSION,2 AGENCY,HI1 FILENAME,090913.HI1 MORE OF YOUR DATA
In reply to Re: Data Parsing help for newbie HELP ME!!
by Marshall
in thread Data Parsing help for newbie HELP ME!!
by Anonymous Monk