in reply to Match Line And Combine Into One Line

G'day jlope043,

Your match needs to capture the common and unique parts of each line. You can then output the common part once followed by all the unique parts joined with spaces. Here's the guts of what you need:

#!/usr/bin/env perl -l use strict; use warnings; my %reformat; my $re = qr{^(H\d+,\d+,)(.*)$}; while (<DATA>) { chomp; /$re/; push @{ $reformat{$1} }, $2; } print $_, join ' ', @{ $reformat{$_} } for keys %reformat; __DATA__ H123456,20151209,THIS IS A TEST H123456,20151209,TO COMBINE ALL H123456,20151209,MY MATCHING LINES H123456,20151209,INTO THE FIRST LINE H123456,20151209,THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS H654321,20151209,ACCT INTO THE H654321,20151209,TOP LINE OF THE ACCT H432165,20151209,SINGLE LINE FOR THIS ONE

Output:

H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO +THE FIRST LINE THAT MATCHES. H432165,20151209,SINGLE LINE FOR THIS ONE H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE AC +CT

You may want some additional ordering to your output but I don't know what that is: sort by 'H' number value; the order that 'H' numbers appear; something else.

The code I've shown is very basic. You can probably find explanations for any part of it in perlintro; however, feel free to ask if you need further help.

— Ken

Replies are listed 'Best First'.
Re^2: Match Line And Combine Into One Line
by jlope043 (Acolyte) on Jul 21, 2016 at 18:45 UTC

    Hi Ken, thank you and sorry I should have included the headers of the INPUT file

    ACCOUNT,DATE,NOTE H123456,20151209,THIS IS A TEST

    All my accounts begin with an alphanumeric which is the reason the H is present, on my other scripts I have I usually use a my $find to simplify my search for the account, example below.

    my $find = '^(H0|HT)'

    One question I do have is I normally write my scripts to export to a new file in this case what would be the correct format to do so? I have this but I think I am missing something

    use strict; use warnings; my %reformat; my $re = qr{^(H\d+,\d+,)(.*)$}; open (NEW, ">", "Notes_Test_OUTPUT.txt" ) or die "could not open:$ +!"; open (FILE, "<", "Notes_Test.txt") or die "could not open:$!"; while (<FILE>) { chomp; /$re/; push @{ $reformat{$1} }, $2; } print NEW if $_, join ' ', @{ $reformat{$_} } for keys %reformat; close (FILE); close (NEW);

      Firstly, this code won't compile as it contains a syntax error. Do not just post untested code! If you don't understand an error message, post the error you're getting and ask. Here's the offending line:

      print NEW if $_, join ' ', @{ $reformat{$_} } for keys %reformat;

      Take a look at "perlsyn: Statement Modifiers". The very first sentence starts with:

      Any simple statement may optionally be followed by a SINGLE modifier, ...

      "SINGLE" is emphasised for a very good reason: you can only use one statement modifier per statement. In the line I've identified, you've used two: if and for. Had you tried to run your code, you would have got a syntax error similar to the one in this example:

      $ perl -e 'my @x = qw{a b}; print if $_ for @x' syntax error at -e line 1, near "$_ for " Execution of -e aborted due to compilation errors.

      You have another issue that isn't an error but which would generate warning messages. The problem is that you haven't accounted for the file header line. You can skip this line with the simple expedient of adding this as the first line of your while loop:

      next if $. == 1;

      $. is a special variable that holds the line count. Line 1 is the header line and next will effectively ignore it. See "perlvar: Variables related to filehandles" for a more detailed description.

      It's good that you've used the 3-argument form of open; it's less good that you've chosen global package variables to hold the filehandles and, indeed worse, that you've not chosen meaningful names. Once you get into the habit of using names like FILE, you'll use them often and, in all likelihood, multiple times in the same script or module: this is highly error-prone and can lead to bugs that are hard to track down. Instead, use lexical variables, with meaningful names, in the smallest possible scope; this greatly reduces the chances of errors and, in many cases, means you don't even need to use close as Perl will do this for you.

      It's also good that you're checking for I/O errors with "or die 'error message'" code; however, hand-crafting these messages is tedious and it's easy to leave out important information or forget to add them altogether. If you use the autodie pragma, Perl will perform this task for you: less work for you and less chances of errors.

      Putting all that together, along with your additional information, here's a new version of the script. Although not shown, my original script was pm_1168253_reformat_input.pl, this one's called pm_1168253_reformat_input_WITH_FILES.pl.

      #!/usr/bin/env perl -l use strict; use warnings; use autodie; my $input_file = 'pm_1168253_reformat_input_INPUT.txt'; my $output_file = 'pm_1168253_reformat_input_OUTPUT.txt'; my %reformat; my $re = qr{^(H\d+,\d+,)(.*)$}; { open my $in_fh, '<', $input_file; while (<$in_fh>) { next if $. == 1; chomp; /$re/; push @{ $reformat{$1} }, $2; } } { open my $out_fh, '>', $output_file; print $out_fh $_, join ' ', @{ $reformat{$_} } for keys %reformat; }

      Note the anonymous blocks. The filehandles go out of scope once these blocks are exited: their reference counts are reduced to zero and Perl performs an implicit close.

      Here's the input file:

      $ cat pm_1168253_reformat_input_INPUT.txt ACCOUNT,DATE,NOTE H123456,20151209,THIS IS A TEST H123456,20151209,TO COMBINE ALL H123456,20151209,MY MATCHING LINES H123456,20151209,INTO THE FIRST LINE H123456,20151209,THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS H654321,20151209,ACCT INTO THE H654321,20151209,TOP LINE OF THE ACCT H432165,20151209,SINGLE LINE FOR THIS ONE

      And here's the output file before and after running the script:

      $ cat pm_1168253_reformat_input_OUTPUT.txt cat: pm_1168253_reformat_input_OUTPUT.txt: No such file or directory $ pm_1168253_reformat_input_WITH_FILES.pl $ cat pm_1168253_reformat_input_OUTPUT.txt H432165,20151209,SINGLE LINE FOR THIS ONE H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO +THE FIRST LINE THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE AC +CT

      As before, you may need a different ordering for your output but I'm still in the dark as to what you require.

      — Ken

        Thank you very much Ken and sorry about all the confusion. This stuff is not as easy as I thought it would be, so trying to explain my goal or outcome is difficult at times. But this worked exactly how I want it to. Thank you again for all your help and explanation.

      This is a follow-up to my earlier post.

      Firstly, I hope I didn't give the impression that there was anything special about anonymous blocks and scoping. That's standard block behaviour. From "perlsub: Private Variables via my()":

      The my operator declares the listed variables to be lexically confined to the enclosing block, conditional (if /unless /elsif /else ), loop (for /foreach /while /until /continue), subroutine, eval, or do/require/use'd file. ...

      I used anonymous blocks to demonstrate scoping issues; however, in a real-world application, far from being anonymous, they probably would be named to allow reuse: perhaps, &read_input and &write_output, called in a loop iterating input filenames from the command line. I made some modifications to (a copy of) the earlier script, to demonstrate:

      #!/usr/bin/env perl -l use strict; use warnings; use autodie; for (@ARGV) { my ($input_file, $output_file) = ($_, $_ . '__OUTPUT.txt'); my (%reformat, @order); read_input($input_file, \%reformat, \@order); write_output($output_file, \%reformat, \@order); } { my $re; INIT { $re = qr{^(H\d+,\d+,)(.*)$} } sub read_input { my ($input_file, $reformat, $order) = @_; open my $in_fh, '<', $input_file; while (<$in_fh>) { next if $. == 1; chomp; /$re/; my ($key, $str_part) = ($1, $2); push @$order, $key unless exists $reformat->{$key}; push @{ $reformat->{$key} }, $str_part; } return; } } sub write_output { my ($output_file, $reformat, $order) = @_; open my $out_fh, '>', $output_file; print $out_fh $_, join ' ', @{ $reformat->{$_} } for @$order; return; }

      I've been testing like this (without any problems):

      $ pm_1168253_reformat_input_WITH_FILES_PRODUCTON.pl pm_1168253_reforma +t_input_INPUT.txt pm_1168253_reformat_input_INPUT_CLONE.txt; cat pm_1 +168253_reformat_input_INPUT.txt__OUTPUT.txt; cat pm_1168253_reformat_ +input_INPUT_CLONE.txt__OUTPUT.txt H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO +THE FIRST LINE THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE AC +CT H432165,20151209,SINGLE LINE FOR THIS ONE H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO +THE FIRST LINE THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE AC +CT H432165,20151209,SINGLE LINE FOR THIS ONE

      Notes:

      — Ken

      What makes you think that you are missing something?

        Hi Corion, well first I found a few mistakes I made, but my lines are wrapping and not breaking into new lines for each account. Corrected code below but not working as expected.

        use strict; use warnings; my %reformat; my $re = qr{^(H\d+,\d+,)(.*)$}; open (NEW, ">", "Notes_Test_OUTPUT.txt" ) or die "could not open:$ +!"; open (FILE, "<", "Notes_Test.txt") or die "could not open:$!"; while (<FILE>) { chomp; /$re/; push @{ $reformat{$1} }, $2; print NEW $_, join ' ', @{ $reformat{$_} } for keys %reformat; } close (FILE); close (NEW);