Re: amalgamate similar lines
by Limbic~Region (Chancellor) on Jan 09, 2006 at 13:55 UTC
|
Anonymous Monk,
You likely want to be using a CSV parsing module like Text::CSV_XS or Text::x_SV, but for this example I will be using split. I have chosen to use split because I have made several assumptions about your problem.
Assumptions:
- Each record is contained on a single line
- Each record is pipe delimited and no field contains any imbedded delimiters
- Each record is comprised of 3 fields
- Preservation of record ordering is important
- 2 or more records with the first 2 fields in common are desired to be joined
- These records may be anywhere in the file and are not necessarily adjacent
- Joining records means concatenating the 3rd fields with commas in the order the records appeared in the file
- Concattenated records in the output will be identified by commas in the 3rd field. This assumes no commas appear in the 3rd field prior to merging.
- The joined record will appear at the first occurence in the output
- The machine running the program will have sufficient memory to hold required information in memory
#!/usr/bin/perl
use strict;
use warnings;
my $input = $ARGV[0] || 'sample.txt';
open(my $fh, '<', $input) or die "Unable to open $input for reading: $
+!";
my %data;
while ( <$fh> ) {
chomp;
my @field = split /\|/, $_, 3;
my $key = join '|', @field[0,1];
$data{$key}{line} = $. if ! exists $data{$key};
push @{ $data{$key}{records} }, $field[2];
}
for ( sort { $data{$a}{line} <=> $data{$b}{line} } keys %data ) {
if ( @{ $data{$_}{records} } > 1 ) {
my $field3 = join ',', @{ $data{$_}{records} };
print join '|', $_, $field3;
}
else {
print join '|', $_, $data{$_}{records}[0];
}
print "\n";
}
Please forgive me for the rather tedious solution. I wanted to point out the importance of clearly and concisely stating the problem and assumptions.
Update: Simplified code and clarified assumptions | [reply] [d/l] |
|
- The original 3rd field will not contain commas
I don't see where you assume that; AFAICT your solution will work whether or not that's true. Perhaps you are just pointing out that the operation will not be reversable if there are existing commas?
- The joined record will appear at the first occurence
Implicitly, you are also assuming that records should be merged regardless of their position in the file; it's possible that only adjacent records should be candidates for merging.
| [reply] |
|
ysth,
With regards to the first assumption you called into question, that should have read:
Concattenated records in the output will be identified by commas in the 3rd field. This assumes no commas appear in the 3rd field prior to merging. updated
With regards to second assumption you mentioned. You are correct that since the AM only stated where the first two fields were the same that I assumed that meant they could appear anywhere in the file. That is the point of my post - to clearly state what is desired.
| [reply] [d/l] |
Re: amalgamate similar lines
by g0n (Priest) on Jan 09, 2006 at 13:58 UTC
|
while (<DATA>)
{
my ($onetwo,$three) = $_=~/(.*)\|(.*)/;
push @{$hash{$onetwo}},$three;
}
for (sort keys %hash)
{
print $_."|",join ",",@{$hash{$_}};
print "\n";
}
__DATA__
aaa|bbb|ccc
ddd|eee|fff
ddd|eee|xxxxx
hhh|iiii|jjjjjj
--------------------------------------------------------------
"If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."
John Brunner, "The Shockwave Rider".
| [reply] [d/l] |
Re: amalgamate similar lines
by smokemachine (Hermit) on Jan 09, 2006 at 13:36 UTC
|
perl -ne 'chomp; $hash{$1}.=$2."," if /^([^\|]+\|[^\|]+)\|([^\|]+)$/;
+END{chop$hash{$_} foreach keys %hash; print "$_|$hash{$_}\n" foreach
+keys %hash}' file_name
| [reply] [d/l] |
Re: amalgamate similar lines
by wfsp (Abbot) on Jan 09, 2006 at 13:46 UTC
|
#!/bin/perl5
use strict;
use warnings;
use Data::Dumper;
my %data;
while (my $record = <DATA>){
chomp $record;
my ($fld1, $fld2, $fld3) = split /\|/, $record;
push @{$data{$fld1}{$fld2}}, $fld3;
}
for my $fld1 (sort keys %data){
for my $fld2 (keys %{$data{$fld1}}){
print
"$fld1|$fld2|",
join(',', @{$data{$fld1}{$fld2}}),
"\n";
}
}
__DATA__
aaa|bbb|ccc
ddd|eee|fff
ddd|eee|xxxxx
hhh|iiii|jjjjjj
output:
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
aaa|bbb|ccc
ddd|eee|fff,xxxxx
hhh|iiii|jjjjjj
> Terminated with exit code 0.
| [reply] [d/l] [select] |
Re: amalgamate similar lines
by McDarren (Abbot) on Jan 09, 2006 at 13:52 UTC
|
I'd use the first two fields as a key to a hash, something like the following:
#!/usr/bin/perl -w
use strict;
use Data::Dumper::Simple;
my %first_two;
while (<DATA>) {
chomp;
my ($key, $remainder) = $_ =~ /([a-z]+\|[a-z]+)\|([a-z]+)/;
if (exists $first_two{$key}) {
$first_two{$key} .= ",$remainder";
}
else {
$first_two{$key} = $_;
}
}
print Dumper(%first_two);
__DATA__
aaa|bbb|ccc
ddd|eee|fff
ddd|eee|xxxxx
hhh|iiii|jjjjjj
Which gives:
%first_two = (
'aaa|bbb' => 'aaa|bbb|ccc',
'hhh|iiii' => 'hhh|iiii|jjjjjj',
'ddd|eee' => 'ddd|eee|fff,xxxxx'
);
There are probably many more elegant ways to do it, but that's just the first thing that occurred to me.
Cheers,
Darren :)
| [reply] [d/l] [select] |
Re: amalgamate similar lines
by Aristotle (Chancellor) on Jan 09, 2006 at 19:09 UTC
|
my @line;
my %third_col;
while( <> ) {
chomp;
my @col = split /\|/, $_, -1;
my $key = join '|', @col[ 0, 1 ];
push @line, $key if not exists %third_col{ $key };
push @{ $third_col{ $key } }, $col[ 2 ];
}
for ( @line ) {
print $_ . '|' . join ',', @{ $third_col{ $_ } };
}
Update: fixed a sigil, thanks to Roy Johnson.
Makeshifts last the longest. | [reply] [d/l] |
Re: amalgamate similar lines
by Perl Mouse (Chaplain) on Jan 09, 2006 at 13:39 UTC
|
my %info;
while (<>) {
my ($key, $value) = /([^|]*\|[^|]*)\|(.*)/;
push @{$info{$key}}, $value;
}
while (my ($key, $value) = each %info) {
local $" = ",";
print "$key|@$value\n";
}
__END__
| [reply] [d/l] |
|
| [reply] |
|
| [reply] |
Re: amalgamate similar lines
by DungeonKeeper (Novice) on Jan 09, 2006 at 14:22 UTC
|
my %h = ();
while(<>) {
chop;
my @fld = split( /\|/ );
my $key = shift @fld . "|" . shift @fld;
$h{ $key } &&= $h{ key } . ',';
$h{ $key } .= shift @fld;
shift @fld and die "Analysis error: too many fields $_\n";
}
for my $k ( keys %h ) {
print $k . '|' . $h { $k } . "\n";
}
| [reply] [d/l] |
Re: amalgamate similar lines
by Roy Johnson (Monsignor) on Jan 09, 2006 at 19:50 UTC
|
my @lines;
my %seen;
while (<DATA>) {
chomp;
my ($onetwo,$three) = $_=~/(.*)\|(.*)/;
if ($seen{$onetwo}) {
$lines[$seen{$onetwo} - 1] .= ",$three";
}
else {
$seen{$onetwo} = push @lines, $_;
}
}
print "$_\n" for @lines;
__DATA__
aaa|bbb|ccc
ddd|eee|fff
ddd|eee|xxxxx
hhh|iiii|jjjjjj
Caution: Contents may have been coded under pressure.
| [reply] [d/l] |