Match Line And Combine Into One Line

jlope043 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Match Line And Combine Into One Line by kcott (Archbishop) on Jul 21, 2016 at 18:03 UTC
G'day jlope043,

Your match needs to capture the common and unique parts of each line. You can then output the common part once followed by all the unique parts joined with spaces. Here's the guts of what you need:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;

    my %reformat;
    my $re = qr{^(H\d+,\d+,)(.*)$};

    while (<DATA>) {
        chomp;
        /$re/;
        push @{ $reformat{$1} }, $2;
    }

    print $_, join ' ', @{ $reformat{$_} } for keys %reformat;

    __DATA__
    H123456,20151209,THIS IS A TEST
    H123456,20151209,TO COMBINE ALL
    H123456,20151209,MY MATCHING LINES
    H123456,20151209,INTO THE FIRST LINE
    H123456,20151209,THAT MATCHES.
    H654321,20151209,MATCH LINES FOR THIS
    H654321,20151209,ACCT INTO THE
    H654321,20151209,TOP LINE OF THE ACCT
    H432165,20151209,SINGLE LINE FOR THIS ONE

Output:

    H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO THE FIRST LINE THAT MATCHES.
    H432165,20151209,SINGLE LINE FOR THIS ONE
    H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE ACCT

You may want some additional ordering to your output but I don't know what that is: sort by 'H' number value; the order that 'H' numbers appear; something else.

The code I've shown is very basic. You can probably find explanations for any part of it in perlintro; however, feel free to ask if you need further help.

— Ken
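The two orderings mentioned above (sorted by key, or order of first appearance) could be sketched roughly as follows. This is only an illustration: `group_lines` and the shortened sample data are hypothetical and not part of the script in the thread, and the sort shown is a plain string sort, which matches numeric order only when all account numbers have the same width.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): group notes by their
# "ACCOUNT,DATE," prefix and remember the order of first appearance.
sub group_lines {
    my @lines = @_;
    my (%notes, @seen);
    for my $line (@lines) {
        # Skip anything that does not look like an account line.
        next unless $line =~ /^(H\d+,\d+,)(.*)$/;
        push @seen, $1 unless exists $notes{$1};
        push @{ $notes{$1} }, $2;
    }
    return (\%notes, \@seen);
}

# Shortened, made-up sample in the same format as the thread's data.
my @sample = (
    'H654321,20151209,MATCH LINES',
    'H123456,20151209,THIS IS',
    'H654321,20151209,FOR THIS',
    'H123456,20151209,A TEST',
);
my ($notes, $seen) = group_lines(@sample);

# Option 1: string sort of the "ACCOUNT,DATE," keys.
print $_, join(' ', @{ $notes->{$_} }), "\n" for sort keys %$notes;

# Option 2: the order in which each key first appeared in the input.
print $_, join(' ', @{ $notes->{$_} }), "\n" for @$seen;
```

Plain `keys %reformat` gives neither ordering; hash key order in Perl is not predictable, which is why an explicit `sort` or a separate array is needed.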
Re^2: Match Line And Combine Into One Line by jlope043 (Acolyte) on Jul 21, 2016 at 18:45 UTC
Hi Ken, thank you, and sorry, I should have included the headers of the INPUT file:

    ACCOUNT,DATE,NOTE
    H123456,20151209,THIS IS A TEST

All my accounts begin with an alphanumeric, which is the reason the H is present. In my other scripts I usually use a `my $find` to simplify my search for the account, example below.

    my $find = '^(H0\|HT)'

One question I do have: I normally write my scripts to export to a new file. In this case, what would be the correct format to do so? I have this but I think I am missing something:

    use strict;
    use warnings;
    my %reformat;
    my $re = qr{^(H\d+,\d+,)(.*)$};
    open (NEW, ">", "Notes_Test_OUTPUT.txt" ) or die "could not open:$!";
    open (FILE, "<", "Notes_Test.txt") or die "could not open:$!";
    while (<FILE>) {
        chomp;
        /$re/;
        push @{ $reformat{$1} }, $2;
    }
    print NEW if $_, join ' ', @{ $reformat{$_} } for keys %reformat;
    close (FILE);
    close (NEW);
Re^3: Match Line And Combine Into One Line by kcott (Archbishop) on Jul 21, 2016 at 20:25 UTC
Firstly, this code won't compile as it contains a syntax error. Do not just post untested code! If you don't understand an error message, post the error you're getting and ask. Here's the offending line:

    print NEW if $_, join ' ', @{ $reformat{$_} } for keys %reformat;

Take a look at "perlsyn: Statement Modifiers". The very first sentence starts with: "Any simple statement may optionally be followed by a SINGLE modifier, ..."

"SINGLE" is emphasised for a very good reason: you can only use one statement modifier per statement. In the line I've identified, you've used two: `if` and `for`. Had you tried to run your code, you would have got a syntax error similar to the one in this example:

    $ perl -e 'my @x = qw{a b}; print if $_ for @x'
    syntax error at -e line 1, near "$_ for "
    Execution of -e aborted due to compilation errors.

You have another issue that isn't an error but which would generate warning messages. The problem is that you haven't accounted for the file header line. You can skip this line with the simple expedient of adding this as the first line of your `while` loop:

    next if $. == 1;

`$.` is a special variable that holds the line count. Line 1 is the header line and `next` will effectively ignore it. See "perlvar: Variables related to filehandles" for a more detailed description.

It's good that you've used the 3-argument form of open; it's less good that you've chosen global package variables to hold the filehandles and, indeed worse, that you've not chosen meaningful names. Once you get into the habit of using names like `FILE`, you'll use them often and, in all likelihood, multiple times in the same script or module: this is highly error-prone and can lead to bugs that are hard to track down. Instead, use lexical variables, with meaningful names, in the smallest possible scope; this greatly reduces the chances of errors and, in many cases, means you don't even need to use `close` as Perl will do this for you.

It's also good that you're checking for I/O errors with "`or die 'error message'`" code; however, hand-crafting these messages is tedious and it's easy to leave out important information or forget to add them altogether. If you use the autodie pragma, Perl will perform this task for you: less work for you and less chance of errors.

Putting all that together, along with your additional information, here's a new version of the script. Although not shown, my original script was `pm_1168253_reformat_input.pl`; this one's called `pm_1168253_reformat_input_WITH_FILES.pl`.

    #!/usr/bin/env perl -l
    use strict;
    use warnings;
    use autodie;

    my $input_file = 'pm_1168253_reformat_input_INPUT.txt';
    my $output_file = 'pm_1168253_reformat_input_OUTPUT.txt';

    my %reformat;
    my $re = qr{^(H\d+,\d+,)(.*)$};

    {
        open my $in_fh, '<', $input_file;
        while (<$in_fh>) {
            next if $. == 1;
            chomp;
            /$re/;
            push @{ $reformat{$1} }, $2;
        }
    }

    {
        open my $out_fh, '>', $output_file;
        print $out_fh $_, join ' ', @{ $reformat{$_} } for keys %reformat;
    }

Note the anonymous blocks. The filehandles go out of scope once these blocks are exited: their reference counts are reduced to zero and Perl performs an implicit `close`.

Here's the input file:

    $ cat pm_1168253_reformat_input_INPUT.txt
    ACCOUNT,DATE,NOTE
    H123456,20151209,THIS IS A TEST
    H123456,20151209,TO COMBINE ALL
    H123456,20151209,MY MATCHING LINES
    H123456,20151209,INTO THE FIRST LINE
    H123456,20151209,THAT MATCHES.
    H654321,20151209,MATCH LINES FOR THIS
    H654321,20151209,ACCT INTO THE
    H654321,20151209,TOP LINE OF THE ACCT
    H432165,20151209,SINGLE LINE FOR THIS ONE

And here's the output file before and after running the script:

    $ cat pm_1168253_reformat_input_OUTPUT.txt
    cat: pm_1168253_reformat_input_OUTPUT.txt: No such file or directory
    $ pm_1168253_reformat_input_WITH_FILES.pl
    $ cat pm_1168253_reformat_input_OUTPUT.txt
    H432165,20151209,SINGLE LINE FOR THIS ONE
    H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO THE FIRST LINE THAT MATCHES.
    H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE ACCT

As before, you may need a different ordering for your output but I'm still in the dark as to what you require.

— Ken
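The implicit `close` at the end of a block, as described in this reply, can be checked with a small sketch along these lines. The temporary directory and the file name `scope_demo.txt` are assumptions made for the illustration; the thread's script uses fixed filenames instead.

```perl
use strict;
use warnings;
use autodie;
use File::Temp qw(tempdir);

# Hypothetical scratch location, removed automatically at exit.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/scope_demo.txt";

{
    open my $out_fh, '>', $path;
    print {$out_fh} "H432165,20151209,SINGLE LINE FOR THIS ONE\n";
    # No explicit close: $out_fh is destroyed at the closing brace,
    # which flushes the buffer and closes the file.
}

# The data should now be on disk, so a fresh handle can read it back.
my $content = do {
    open my $in_fh, '<', $path;
    local $/;    # slurp mode: read the whole file in one go
    <$in_fh>;
};

print $content;
```

If `$out_fh` were a package filehandle such as `NEW`, it would stay open until closed explicitly or until the program ended, so the read-back would depend on when the buffer happened to be flushed.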
Re^4: Match Line And Combine Into One Line by jlope043 (Acolyte) on Jul 21, 2016 at 23:45 UTC
Re^3: Match Line And Combine Into One Line [Follow-up] by kcott (Archbishop) on Jul 22, 2016 at 14:04 UTC
This is a follow-up to my earlier post.

Firstly, I hope I didn't give the impression that there was anything special about anonymous blocks and scoping. That's standard block behaviour. From "perlsub: Private Variables via my()": "The `my` operator declares the listed variables to be lexically confined to the enclosing block, conditional (`if`/`unless`/`elsif`/`else`), loop (`for`/`foreach`/`while`/`until`/`continue`), subroutine, `eval`, or `do`/`require`/`use`'d file. ..."

I used anonymous blocks to demonstrate scoping issues; however, in a real-world application, far from being anonymous, they probably would be named to allow reuse: perhaps, `&read_input` and `&write_output`, called in a loop iterating input filenames from the command line. I made some modifications to (a copy of) the earlier script, to demonstrate:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;
    use autodie;

    for (@ARGV) {
        my ($input_file, $output_file) = ($_, $_ . '__OUTPUT.txt');
        my (%reformat, @order);
        read_input($input_file, \%reformat, \@order);
        write_output($output_file, \%reformat, \@order);
    }

    {
        my $re;
        INIT { $re = qr{^(H\d+,\d+,)(.*)$} }

        sub read_input {
            my ($input_file, $reformat, $order) = @_;
            open my $in_fh, '<', $input_file;
            while (<$in_fh>) {
                next if $. == 1;
                chomp;
                /$re/;
                my ($key, $str_part) = ($1, $2);
                push @$order, $key unless exists $reformat->{$key};
                push @{ $reformat->{$key} }, $str_part;
            }
            return;
        }
    }

    sub write_output {
        my ($output_file, $reformat, $order) = @_;
        open my $out_fh, '>', $output_file;
        print $out_fh $_, join ' ', @{ $reformat->{$_} } for @$order;
        return;
    }

I've been testing like this (without any problems):

    $ pm_1168253_reformat_input_WITH_FILES_PRODUCTON.pl pm_1168253_reformat_input_INPUT.txt pm_1168253_reformat_input_INPUT_CLONE.txt; cat pm_1168253_reformat_input_INPUT.txt__OUTPUT.txt; cat pm_1168253_reformat_input_INPUT_CLONE.txt__OUTPUT.txt
    H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO THE FIRST LINE THAT MATCHES.
    H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE ACCT
    H432165,20151209,SINGLE LINE FOR THIS ONE
    H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO THE FIRST LINE THAT MATCHES.
    H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE ACCT
    H432165,20151209,SINGLE LINE FOR THIS ONE

Notes:

- If you don't know about `@ARGV`, see "perlvar: Variables related to filehandles".
- The anonymous block (with the `INIT` block) is described in the "Persistent variables with closures" section of "perlsub: Persistent Private Variables". While you're there, take a look at the "Persistent variables via state()" section (it's just above). Note: The `state` function was introduced in Perl v5.10.0 (see "perl5100delta: state() variables": "A new class of variables has been introduced. ...").
- `INIT`, and friends, are described in "perlmod: BEGIN, UNITCHECK, CHECK, INIT and END".
- There is now a line (in `&read_input`) to capture the same order as the expected output in your OP:

      push @$order, $key unless exists $reformat->{$key};

— Ken
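For comparison with the closure-plus-`INIT` approach, the `state` alternative mentioned in the notes might look something like the sketch below. `split_line` is a hypothetical helper invented for this illustration; it is not part of the script above, and it additionally returns an empty list for lines that do not match, which the original code does not do.

```perl
use strict;
use warnings;
use feature 'state';    # state variables need Perl v5.10.0 or later

# Hypothetical helper: split one line into its "ACCOUNT,DATE," key
# and the note text, or return an empty list if the line doesn't match.
sub split_line {
    my ($line) = @_;
    # Initialised once, on the first call, then kept between calls.
    state $re = qr{^(H\d+,\d+,)(.*)$};
    return $line =~ $re ? ($1, $2) : ();
}

my ($key, $note) = split_line('H123456,20151209,THIS IS A TEST');
print "$key | $note\n";
```

The pattern stays private to the subroutine, as with the closure version, but no enclosing block or `INIT` is needed.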
Re^3: Match Line And Combine Into One Line by Corion (Patriarch) on Jul 21, 2016 at 18:52 UTC
What makes you think that you are missing something?
Re^4: Match Line And Combine Into One Line by jlope043 (Acolyte) on Jul 21, 2016 at 19:00 UTC
Re^5: Match Line And Combine Into One Line by Corion (Patriarch) on Jul 21, 2016 at 19:04 UTC
Re: Match Line And Combine Into One Line by pryrt (Abbot) on Jul 21, 2016 at 18:03 UTC
Split on commas; make a key from the H-string and the second number (presumably a date); add text to the hash value for that key (include a space before the added text if there's already text in the key). If you want to keep the same order as the first match of a given key, you'll need to keep the keys in an array as well. Once done, loop thru the keys and output.

Code:

    use strict;
    use warnings;

    my %hash = ();
    my @order = ();
    foreach (<DATA>) {
        chomp;
        my ($h,$d,$txt) = split /,/;
        my $k = join ',', $h, $d;
        push @order, $k unless exists $hash{$k};
        $hash{$k} .= ' ' if $hash{$k};
        $hash{$k} .= $txt;
    }

    foreach my $k (@order) {
        print "$k,$hash{$k}\n";
    }

    __DATA__
    H123456,20151209,THIS IS A TEST
    H123456,20151209,TO COMBINE ALL
    H123456,20151209,MY MATCHING LINES
    H123456,20151209,INTO THE FIRST LINE
    H123456,20151209,THAT MATCHES.
    H654321,20151209,MATCH LINES FOR THIS
    H654321,20151209,ACCT INTO THE
    H654321,20151209,TOP LINE OF THE ACCT
    H432165,20151209,SINGLE LINE FOR THIS ONE

Output:

    H123456,20151209,THIS IS A TEST TO COMBINE ALL MY MATCHING LINES INTO THE FIRST LINE THAT MATCHES.
    H654321,20151209,MATCH LINES FOR THIS ACCT INTO THE TOP LINE OF THE ACCT
    H432165,20151209,SINGLE LINE FOR THIS ONE

If you can guarantee that the keys will always stay together, you can change the logic to not need a second loop, and just output the value of the old key once the key has changed; you don't even need a hash in that version, just a `$k` and `$txt`.
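The hash-free variant described in the last paragraph could be sketched like this, assuming lines for one key are always adjacent in the input. `merge_adjacent` is a hypothetical name, and it returns the merged lines rather than printing them so the logic is easy to check. It also passes a limit of 3 to `split`, which differs from the code above, so that commas inside the note text are kept.

```perl
use strict;
use warnings;

# Hypothetical helper: merge consecutive lines that share "ACCOUNT,DATE".
# Assumes lines for one key are always adjacent in the input.
sub merge_adjacent {
    my @lines = @_;
    my @out;
    my ($k, $txt);    # current key and its accumulated text
    for my $line (@lines) {
        # Limit of 3 keeps any commas inside the note itself.
        my ($h, $d, $note) = split /,/, $line, 3;
        my $key = "$h,$d";
        if (defined $k && $key eq $k) {
            $txt .= " $note";    # same key: append to the running text
        }
        else {
            push @out, "$k,$txt" if defined $k;    # key changed: emit the old group
            ($k, $txt) = ($key, $note);
        }
    }
    push @out, "$k,$txt" if defined $k;    # emit the final group
    return @out;
}

print "$_\n" for merge_adjacent(
    'H123456,20151209,THIS IS A TEST',
    'H123456,20151209,TO COMBINE ALL',
    'H432165,20151209,SINGLE LINE FOR THIS ONE',
);
```

The trade-off is that a key reappearing later in the file starts a new output line instead of being merged with its earlier group, so this only suits input that is already grouped.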
Re: Match Line And Combine Into One Line by Anonymous Monk on Jul 21, 2016 at 18:04 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1168253
    use strict;
    use warnings;

    $_ = join '', <DATA>;
    1 while s/^(\w+,\w+,)(.*)\K\n\1/ /m;
    print;

    __DATA__
    H123456,20151209,THIS IS A TEST
    H123456,20151209,TO COMBINE ALL
    H123456,20151209,MY MATCHING LINES
    H123456,20151209,INTO THE FIRST LINE
    H123456,20151209,THAT MATCHES.
    H654321,20151209,MATCH LINES FOR THIS
    H654321,20151209,ACCT INTO THE
    H654321,20151209,TOP LINE OF THE ACCT
    H432165,20151209,SINGLE LINE FOR THIS ONE
Re: Match Line And Combine Into One Line by Anonymous Monk on Jul 23, 2016 at 09:30 UTC
Maybe you can use SQL for this, if indeed someone gave you some report. Just a suggestion, of course.
