coding1227 has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks,

I have a formatting & data printing question for you. I have a file with hundreds of lines containing columns of organized information. The fields are separated by a tab or by multiple spaces. In each line, the only “valid sets” are those that have exactly this structure: SATID XX VAL1 XX VAL2 XXX SIGNAL XX.

However, my file sometimes has “invalid sets” that do not exactly match the structure of the “valid sets” described above. This “bad data” can occur once or several times in a single line. An example of an “invalid” set could be:
SATID 18 SATID 17 VAL1 49 VAL2 038

Therefore, is there a way that Perl could automatically find the “invalid sets” and replace them with blank spaces (so as to preserve the original spacing/alignment of columns in the large file)?

For example, if I have the following raw data:
Timestamp: 00:55:46 SATID 17 VAL1 49 VAL2 038 SIGNAL 39 SATID 18 SATID 17 VAL1 49 VAL2 038 SATID 19 VAL1 69 VAL2 015 SIGNAL NA SATID 39 SATID 28 VAL1 36 VAL2 073 SIGNAL 21
The “corrected” data line should be:
Timestamp: 00:55:46 SATID 17 VAL1 49 VAL2 038 SIGNAL 39                                    SATID 19 VAL1 69 VAL2 015 SIGNAL NA SATID 39 SATID 28 VAL1 36 VAL2 073 SIGNAL 21
I’m not sure how to tackle this… I think that using an array would be the way to go, but I’m not sure how to do it so that the matching structure is enforced & the correct substitution (when needed) is made.


Does anyone have any ideas/examples that could be of help? Below is the code that I have so far:
#!/usr/bin/perl -l
use strict;
use warnings;

my @lines;
while(<DATA>) {
    push (@lines, $_);
}
print @lines; # see if it worked
__DATA__
Timestamp: 00:55:46 SATID 17 VAL1 49 VAL2 038 SIGNAL 39 SATID 18 SATID 17 VAL1 49 VAL2 038 SATID 19 VAL1 69 VAL2 015 SIGNAL NA SATID 39 SATID 28 VAL1 36 VAL2 073 SIGNAL 21

Thanks =)

Replies are listed 'Best First'.
Re: Preserving "Valid" Data?
by tybalt89 (Monsignor) on Apr 04, 2017 at 03:24 UTC

    Assuming your SATID sections are all the same length.

    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1186932
    use strict;
    use warnings;

    while(<DATA>) {
      # Each 35-character SATID section is kept only if it ends in "SIGNAL xx";
      # otherwise it is replaced by the same number of blanks.
      s/SATID.{21}(?:(SIGNAL ..)|.{9})/ $1 ? $& : ' ' x length $& /ge;
      print;
    }
    __DATA__
    Timestamp: 00:55:46 SATID 17 VAL1 49 VAL2 038 SIGNAL 39 SATID 18 SATID 17 VAL1 49 VAL2 038 SATID 19 VAL1 69 VAL2 015 SIGNAL NA SATID 39 SATID 28 VAL1 36 VAL2 073 SIGNAL 21

    UPDATE: I think more test cases are needed.
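For data where the sections are not all the same length, one possible variation is to check each section against the full structure instead of counting characters. The following is only a sketch under assumptions that are not confirmed in the thread: a section is taken to run from one SATID up to the next SATID (or the end of the line), and any section that is not exactly SATID/VAL1/VAL2/SIGNAL, each followed by one value, is blanked. The names `blank_invalid_sets` and `$valid` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One complete set: SATID, VAL1, VAL2 and SIGNAL, each followed by a value.
# Trailing whitespace is allowed because a chunk includes its separator.
my $valid = qr/^SATID\s+\S+\s+VAL1\s+\S+\s+VAL2\s+\S+\s+SIGNAL\s+\S+\s*$/;

# Hypothetical helper: blank every chunk that is not a complete set,
# keeping the line length (and so the column alignment) unchanged.
sub blank_invalid_sets {
    my ($line) = @_;
    # A chunk starts at SATID and stops just before the next SATID.
    $line =~ s{(SATID(?:(?!SATID).)*)}{
        my $chunk = $1;    # copy first: the inner match below resets $1
        $chunk =~ $valid ? $chunk : ' ' x length $chunk;
    }ge;
    return $line;
}

my $raw = "Timestamp: 00:55:46 SATID 17 VAL1 49 VAL2 038 SIGNAL 39 "
        . "SATID 18 SATID 17 VAL1 49 VAL2 038 "
        . "SATID 19 VAL1 69 VAL2 015 SIGNAL NA\n";
print blank_invalid_sets($raw);
```

Under these assumptions a lone "SATID 39" in front of an otherwise complete set is blanked on its own and the complete set after it is kept, which may or may not be what is wanted for the last part of the sample line.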

      thanks tybalt89! This seems to work very well! =)

      Ok... one last question:
      Say I have a file containing the following data lines:
      Timestamp: 00:47:14 SATID 13 VAL1 28 VAL2 227 SIGNAL 37                                     SATID 15 VAL1 22 VAL2 265 SIGNAL 30 SATID 16 VAL1 22 VAL2 265 SIGNAL 30
      Timestamp: 00:48:14 SATID 13 VAL1 28 VAL2 227 SIGNAL 37                                     SATID 15 VAL1 22 VAL2 265 SIGNAL NA SATID 16 VAL1 22 VAL2 265 SIGNAL 30
      Timestamp: 00:49:14                                     SATID 14 VAL1 22 VAL2 265 SIGNAL 30
      Can I use perl so that it will automatically "fill in" the missing cells with the ID of the SATID for its corresponding column, while filling in the VAL1, VAL2 and SIGNAL with "nan"?
      For instance, the desired output would be:
      Timestamp: 00:47:14 SATID 13 VAL1 28 VAL2 227 SIGNAL 37 SATID 14 VAL1 nan VAL2 nan SIGNAL nan SATID 15 VAL1 22 VAL2 265 SIGNAL 30 SATID 16 VAL1 22 VAL2 265 SIGNAL 30
      Timestamp: 00:48:14 SATID 13 VAL1 28 VAL2 227 SIGNAL 37 SATID 14 VAL1 nan VAL2 nan SIGNAL nan SATID 15 VAL1 22 VAL2 265 SIGNAL nan SATID 16 VAL1 22 VAL2 265 SIGNAL 30
      Timestamp: 00:49:14 SATID 13 VAL1 nan VAL2 nan SIGNAL nan SATID 14 VAL1 22 VAL2 265 SIGNAL 30 SATID 15 VAL1 nan VAL2 nan SIGNAL nan SATID 16 VAL1 nan VAL2 nan SIGNAL nan
      Thanks again! =)

        Try this. I changed two of the nan to just na to maintain column alignment.

        #!/usr/bin/perl
        # http://perlmonks.org/?node_id=1186932
        use strict;
        use warnings;

        my $high = 0;
        while(<DATA>) {
          # remember the highest SATID seen so far
          $high < $_ and $high = $_ for /SATID (\d\d)/g;
          # middle
          s/SATID (\d\d) .{17}SIGNAL \d\d \K {35}(?= SATID (\d\d))/SATID @{[ $1 + $2 >> 1]} VAL1 na VAL2 nan SIGNAL na/g;
          # beginning
          s/\d:\d\d:\d\d \K {35}(?= SATID (\d\d))/SATID @{[ $1 - 1]} VAL1 na VAL2 nan SIGNAL na/;
          # end
          while( /.*SATID (\d\d)/ and $1 < $high ) {
            my $nextnumber = $1 + 1;
            s/$/ SATID $nextnumber VAL1 na VAL2 nan SIGNAL na/;
          }
          print;
        }
        __DATA__
        Timestamp: 00:47:14 SATID 13 VAL1 28 VAL2 227 SIGNAL 37                                     SATID 15 VAL1 22 VAL2 265 SIGNAL 30 SATID 16 VAL1 22 VAL2 265 SIGNAL 30
        Timestamp: 00:48:14 SATID 13 VAL1 28 VAL2 227 SIGNAL 37                                     SATID 15 VAL1 22 VAL2 265 SIGNAL NA SATID 16 VAL1 22 VAL2 265 SIGNAL 30
        Timestamp: 00:49:14                                     SATID 14 VAL1 22 VAL2 265 SIGNAL 30
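If column alignment does not have to be preserved, a different approach is to parse every line into a hash keyed by SATID and then rebuild each line with the full list of IDs. This is a rough sketch, not the method used above. It assumes that the set of expected IDs is simply every SATID that occurs anywhere in the input, and that a missing set should be written with "nan" for all three values. The helper names `parse_line` and `fill_missing` are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the timestamp and a hash of
# SATID => [VAL1, VAL2, SIGNAL] for every complete set in the line.
sub parse_line {
    my ($line) = @_;
    my ($stamp) = $line =~ /^Timestamp:\s+(\S+)/;
    my %set;
    while ($line =~ /SATID\s+(\d+)\s+VAL1\s+(\S+)\s+VAL2\s+(\S+)\s+SIGNAL\s+(\S+)/g) {
        $set{$1} = [$2, $3, $4];
    }
    return ($stamp, \%set);
}

# Rebuild every line so that each SATID seen anywhere in the input
# appears, using "nan" for the values a line does not have.
sub fill_missing {
    my @parsed = map { [ parse_line($_) ] } @_;
    my %seen;
    $seen{$_} = 1 for map { keys %{ $_->[1] } } @parsed;
    my @ids = sort { $a <=> $b } keys %seen;
    return map {
        my ($stamp, $set) = @$_;
        join ' ', "Timestamp: $stamp", map {
            my ($v1, $v2, $sig) = @{ $set->{$_} // [ ('nan') x 3 ] };
            "SATID $_ VAL1 $v1 VAL2 $v2 SIGNAL $sig";
        } @ids;
    } @parsed;
}

my @input = (
    'Timestamp: 00:47:14 SATID 13 VAL1 28 VAL2 227 SIGNAL 37   SATID 15 VAL1 22 VAL2 265 SIGNAL 30 SATID 16 VAL1 22 VAL2 265 SIGNAL 30',
    'Timestamp: 00:49:14   SATID 14 VAL1 22 VAL2 265 SIGNAL 30',
);
print "$_\n" for fill_missing(@input);
```

Because the lines are rebuilt with single spaces, the output columns only line up when all values have the same width, and an existing "SIGNAL NA" is passed through unchanged. It also needs the whole input in memory, since the list of IDs is only known after every line has been read.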