jlope043 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Everyone, I am very new to Perl and have been learning it for the last few months now, but I have a scenario here that is a little out of my skill level and seeking help from the Pros on here. I have a report that I was given and need to reformat the report into a cleaner file to import into my system. I haven't a clue how to start to write this code which is why I didn't include one. I know I will need something like a MATCH or FIND, maybe a CHOMP and so one. What I want to do is combine the lines into the matching account. I have included a copy of my INPUT file and the format of my expected OUPUT file. Any help and explanation of the code would be great! Thank you in advance.

input

H123456,20151209,THIS IS A TEST H123456,20151209,TO COMBINE ALL H123456,20151209,MY MATCHING LINES H123456,20151209,INTO THE FIRST LINE H123456,20151209,THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS H654321,20151209,ACCT INTO THE H654321,20151209,TOP LINE OF THE ACCT H432165,20151209,SINGLE LINE FOR THIS ONE

expected output

H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO +THE FIRST LINE THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE AC +CT. H432165,20151209,SINGLE LINE FOR THIS ONE

Replies are listed 'Best First'.
Re: Match Line And Combine Into One Line
by kcott (Archbishop) on Jul 21, 2016 at 18:03 UTC

    G'day jlope043,

    Your match needs to capture the common and unique parts of each line. You can then output the common part once followed by all the unique parts joined with spaces. Here's the guts of what you need:

    #!/usr/bin/env perl -l use strict; use warnings; my %reformat; my $re = qr{^(H\d+,\d+,)(.*)$}; while (<DATA>) { chomp; /$re/; push @{ $reformat{$1} }, $2; } print $_, join ' ', @{ $reformat{$_} } for keys %reformat; __DATA__ H123456,20151209,THIS IS A TEST H123456,20151209,TO COMBINE ALL H123456,20151209,MY MATCHING LINES H123456,20151209,INTO THE FIRST LINE H123456,20151209,THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS H654321,20151209,ACCT INTO THE H654321,20151209,TOP LINE OF THE ACCT H432165,20151209,SINGLE LINE FOR THIS ONE

    Output:

    H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO +THE FIRST LINE THAT MATCHES. H432165,20151209,SINGLE LINE FOR THIS ONE H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE AC +CT

    You may want some additional ordering to your output but I don't know what that is: sort by 'H' number value; the order that 'H' numbers appear; something else.

    The code I've shown is very basic. You can probably find explanations for any part of it in perlintro; however, feel free to ask if you need further help.

    — Ken

      Hi Ken, thank you and sorry I should have included the headers of the INPUT file

      ACCOUNT,DATE,NOTE H123456,20151209,THIS IS A TEST

      All my accounts begin with an alphanumeric which is the reason the H is present, on my other scripts I have I usually use a my $find to simplify my search for the account, example below.

      my $find = '^(H0|HT)'

      One question I do have is I normally write my scripts to export to a new file in this case what would be the correct format to do so? I have this but I think I am missing something

      use strict; use warnings; my %reformat; my $re = qr{^(H\d+,\d+,)(.*)$}; open (NEW, ">", "Notes_Test_OUTPUT.txt" ) or die "could not open:$ +!"; open (FILE, "<", "Notes_Test.txt") or die "could not open:$!"; while (<FILE>) { chomp; /$re/; push @{ $reformat{$1} }, $2; } print NEW if $_, join ' ', @{ $reformat{$_} } for keys %reformat; close (FILE); close (NEW);

        Firstly, this code won't compile as it contains a syntax error. Do not just post untested code! If you don't understand an error message, post the error you're getting and ask. Here's the offending line:

        print NEW if $_, join ' ', @{ $reformat{$_} } for keys %reformat;

        Take a look at "perlsyn: Statement Modifiers". The very first sentence starts with:

        Any simple statement may optionally be followed by a SINGLE modifier, ...

        "SINGLE" is emphasised for a very good reason: you can only use one statement modifier per statement. In the line I've identified, you've used two: if and for. Had you tried to run your code, you would have got a syntax error similar to the one in this example:

        $ perl -e 'my @x = qw{a b}; print if $_ for @x' syntax error at -e line 1, near "$_ for " Execution of -e aborted due to compilation errors.

        You have another issue that isn't an error but which would generate warning messages. The problem is that you haven't accounted for the file header line. You can skip this line with the simple expedient of adding this as the first line of your while loop:

        next if $. == 1;

        $. is a special variable that holds the line count. Line 1 is the header line and next will effectively ignore it. See "perlvar: Variables related to filehandles" for a more detailed description.

        It's good that you've used the 3-argument form of open; it's less good that you've chosen global package variables to hold the filehandles and, indeed worse, that you've not chosen meaningful names. Once you get into the habit of using names like FILE, you'll use them often and, in all likelihood, multiple times in the same script or module: this is highly error-prone and can lead to bugs that are hard to track down. Instead, use lexical variables, with meaningful names, in the smallest possible scope; this greatly reduces the chances of errors and, in many cases, means you don't even need to use close as Perl will do this for you.

        It's also good that you're checking for I/O errors with "or die 'error message'" code; however, hand-crafting these messages is tedious and it's easy to leave out important information or forget to add them altogether. If you use the autodie pragma, Perl will perform this task for you: less work for you and less chances of errors.

        Putting all that together, along with your additional information, here's a new version of the script. Although not shown, my original script was pm_1168253_reformat_input.pl, this one's called pm_1168253_reformat_input_WITH_FILES.pl.

        #!/usr/bin/env perl -l use strict; use warnings; use autodie; my $input_file = 'pm_1168253_reformat_input_INPUT.txt'; my $output_file = 'pm_1168253_reformat_input_OUTPUT.txt'; my %reformat; my $re = qr{^(H\d+,\d+,)(.*)$}; { open my $in_fh, '<', $input_file; while (<$in_fh>) { next if $. == 1; chomp; /$re/; push @{ $reformat{$1} }, $2; } } { open my $out_fh, '>', $output_file; print $out_fh $_, join ' ', @{ $reformat{$_} } for keys %reformat; }

        Note the anonymous blocks. The filehandles go out of scope once these blocks are exited: their reference counts are reduced to zero and Perl performs an implicit close.

        Here's the input file:

        $ cat pm_1168253_reformat_input_INPUT.txt ACCOUNT,DATE,NOTE H123456,20151209,THIS IS A TEST H123456,20151209,TO COMBINE ALL H123456,20151209,MY MATCHING LINES H123456,20151209,INTO THE FIRST LINE H123456,20151209,THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS H654321,20151209,ACCT INTO THE H654321,20151209,TOP LINE OF THE ACCT H432165,20151209,SINGLE LINE FOR THIS ONE

        And here's the output file before and after running the script:

        $ cat pm_1168253_reformat_input_OUTPUT.txt cat: pm_1168253_reformat_input_OUTPUT.txt: No such file or directory $ pm_1168253_reformat_input_WITH_FILES.pl $ cat pm_1168253_reformat_input_OUTPUT.txt H432165,20151209,SINGLE LINE FOR THIS ONE H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO +THE FIRST LINE THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE AC +CT

        As before, you may need a different ordering for your output but I'm still in the dark as to what you require.

        — Ken

        This is a follow-up to my earlier post.

        Firstly, I hope I didn't give the impression that there was anything special about anonymous blocks and scoping. That's standard block behaviour. From "perlsub: Private Variables via my()":

        The my operator declares the listed variables to be lexically confined to the enclosing block, conditional (if /unless /elsif /else ), loop (for /foreach /while /until /continue), subroutine, eval, or do/require/use'd file. ...

        I used anonymous blocks to demonstrate scoping issues; however, in a real-world application, far from being anonymous, they probably would be named to allow reuse: perhaps, &read_input and &write_output, called in a loop iterating input filenames from the command line. I made some modifications to (a copy of) the earlier script, to demonstrate:

        #!/usr/bin/env perl -l use strict; use warnings; use autodie; for (@ARGV) { my ($input_file, $output_file) = ($_, $_ . '__OUTPUT.txt'); my (%reformat, @order); read_input($input_file, \%reformat, \@order); write_output($output_file, \%reformat, \@order); } { my $re; INIT { $re = qr{^(H\d+,\d+,)(.*)$} } sub read_input { my ($input_file, $reformat, $order) = @_; open my $in_fh, '<', $input_file; while (<$in_fh>) { next if $. == 1; chomp; /$re/; my ($key, $str_part) = ($1, $2); push @$order, $key unless exists $reformat->{$key}; push @{ $reformat->{$key} }, $str_part; } return; } } sub write_output { my ($output_file, $reformat, $order) = @_; open my $out_fh, '>', $output_file; print $out_fh $_, join ' ', @{ $reformat->{$_} } for @$order; return; }

        I've been testing like this (without any problems):

        $ pm_1168253_reformat_input_WITH_FILES_PRODUCTON.pl pm_1168253_reforma +t_input_INPUT.txt pm_1168253_reformat_input_INPUT_CLONE.txt; cat pm_1 +168253_reformat_input_INPUT.txt__OUTPUT.txt; cat pm_1168253_reformat_ +input_INPUT_CLONE.txt__OUTPUT.txt H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO +THE FIRST LINE THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE AC +CT H432165,20151209,SINGLE LINE FOR THIS ONE H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO +THE FIRST LINE THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE AC +CT H432165,20151209,SINGLE LINE FOR THIS ONE

        Notes:

        — Ken

        What makes you think that you are missing something?

Re: Match Line And Combine Into One Line
by pryrt (Abbot) on Jul 21, 2016 at 18:03 UTC

    split on commas; make a key from the H-string and the second number (presumably a date); add text to the hash value for that key (include a space before the added text if there's already text in the key). If you want to keep the same order as the first match of a given key, you'll need to keep the keys in an array as well. Once done, loop thru the keys and output.

    code

    use strict; use warnings; my %hash = (); my @order = (); foreach (<DATA>) { chomp; my ($h,$d,$txt) = split /,/; my $k = join ',', $h, $d; push @order, $k unless exists $hash{$k}; $hash{$k} .= ' ' if $hash{$k}; $hash{$k} .= $txt; } foreach my $k (@order) { print "$k,$hash{$k}\n"; } __DATA__ H123456,20151209,THIS IS A TEST H123456,20151209,TO COMBINE ALL H123456,20151209,MY MATCHING LINES H123456,20151209,INTO THE FIRST LINE H123456,20151209,THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS H654321,20151209,ACCT INTO THE H654321,20151209,TOP LINE OF THE ACCT H432165,20151209,SINGLE LINE FOR THIS ONE

    output

    H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO +THE FIRST LINE THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE AC +CT H432165,20151209,SINGLE LINE FOR THIS ONE

    If you can guarantee that they keys will always stay together, you can change the logic to not need a second loop, and just output the value of the old key once the key has changed; don't even need a hash in that version, just a $k and $txt

Re: Match Line And Combine Into One Line
by Anonymous Monk on Jul 21, 2016 at 18:04 UTC
    #!/usr/bin/perl # http://perlmonks.org/?node_id=1168253 use strict; use warnings; $_ = join '', <DATA>; 1 while s/^(\w+,\w+,)(.*)\K\n\1/ /m; print; __DATA__ H123456,20151209,THIS IS A TEST H123456,20151209,TO COMBINE ALL H123456,20151209,MY MATCHING LINES H123456,20151209,INTO THE FIRST LINE H123456,20151209,THAT MATCHES. H654321,20151209,MATCH LINES FOR THIS H654321,20151209,ACCT INTO THE H654321,20151209,TOP LINE OF THE ACCT H432165,20151209,SINGLE LINE FOR THIS ONE
Re: Match Line And Combine Into One Line
by Anonymous Monk on Jul 23, 2016 at 09:30 UTC

    Maybe you can use SQL for this, if indeed someone gave you some report. Just a suggestion, of course./p>