Pattern Matching - a Tricky one

John007 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Pattern Matching - a Tricky one by kyle (Abbot) on Jan 02, 2009 at 16:12 UTC
I don't recommend this solution because it uses string eval, it changes the order of the output data (and a little bit of the format), it reads the entire input file to do its work, and I wrote it without the aid of my usual morning chemicals. I'm not proud. The TIMTOWTDI made me do it. On the plus side, it doesn't do anything so perverse as try to parse a nested data structure with regular expressions, and it uses recursion, if you're into that sort of thing. Read more... (2 kB)	[reply] [d/l] [select]
Re: Pattern Matching - a Tricky one by kdj (Friar) on Jan 02, 2009 at 15:38 UTC
A solution using `seek` may be better, but I came up with this for finding the customer2 data: #!/usr/local/bin/perl use strict; use warnings; while ( <DATA> ) { next if ($_ !~ /bakjob(\d)_details/); print "bakjob$1:\n"; my $found = 0; while ($found < 2) { my $line = <DATA>; if ($line =~ /customer2/) { $found++; my @customer2_data = split(/=/, $line); print "$found- $customer2_data[1]"; } } } END; __DATA__ bakjob1_details = { credit = { customer1= 2000.0, customer2 = -1500.0 customer3 = 0.0, }, debit = { customer1= 50000.0, customer2 = -2000.0, customer3 = 0.0, } }, bakjob2_details = { credit = { customer1= 1000.0, customer2 = 200.0, customer3 = 500.0, }, debit = { customer1= 600.0, customer2 = 659.0, customer3 = 887.0, } } [download] Prints: `% testing.pl bakjob1: 1- -1500.0 2- -2000.0, bakjob2: 1- 200.0, 2- 659.0,` [download]	[reply] [d/l] [select]
Re^2: Pattern Matching - a Tricky one by linuxer (Curate) on Jan 02, 2009 at 15:55 UTC
`my $found = 0; while ($found < 2) { my $line = <DATA>; if ($line =~ /customer2/) { $found++; my @customer2_data = split(/=/, $line); print "$found- $customer2_data[1]"; } }` [download] Don't you run into a infinite loop with this? In case `$found` is less than 2 and `<DATA>` has reached its end-of-file, there will be no more matches and `$found` won't be incremented... So it will loop quite a while ;o)) So, maybe this would be better: `my $found = 0; while ( $found < 2 and my $line = <DATA> ) { if ( $line =~ m/customer2/ ) { # ... } }` [download] Even if this case only appears, if the input data is corrupt, I would try to avoid such construct.	[reply] [d/l] [select]
Re^3: Pattern Matching - a Tricky one by kdj (Friar) on Jan 02, 2009 at 16:32 UTC
Ah, good call on the infinite loop. Sometimes I forget that you can't trust your input. There's a bit more I could have done with this, such as detecting/capturing whether I was in a credit or a debit section, as that would presumably be good to know. But, this has the basics at least.	[reply]
Re: Pattern Matching - a Tricky one by pobocks (Chaplain) on Jan 02, 2009 at 16:08 UTC
I'd suggest matching after the unique id... This works, but is oogly. `#!/usr/bin/perl @chumptastic = <DATA>; $chumpstain = join "\n", @chumptastic; $chumpstain =~ m/bakjob2_details .+? credit .+? # The ? makes these minimal matches; # thus catching the nearest match. customer2\ =\ (-?\d+\.?\d),/sx; # Modified s to treat it as one string # (. matches newlines) print "$1\n"; __DATA__ # Your Data goes here` [download] Compressed, the regex looks like this: `/bakjob2_details.+?credit.+?customer2 = (-?\d+\.?\d),/s` `for(split(" ","tsuJ rehtonA lreP rekcaH")){print reverse . " "}print "\b.\n";`	[reply] [d/l] [select]
Re: Pattern Matching - a Tricky one by linuxer (Curate) on Jan 02, 2009 at 17:46 UTC
My idea for your problem looks like this: #!/usr/bin/perl use strict; use warnings; my $datafile = 'account.data'; my $searched_customer = 'customer2'; # modifications of values my %modify = ( credit => -100, debit => 250, ); my %indent = ( bakjob => ' 'x2, type => ' 'x4, customer => ' 'x6, ); # my %re = ( bakjob => qr{^$indent{bakjob}bakjob(\d+)_details}, type => qr{^ (debit\|credit) =}, ); #open my $fh, '<', $datafile or die $!; my $fh = DATA; # don't start before this line my $line_num = 4; while ( my $line = <$fh> ) { # only parse complete bakjob blocks after line $line_num if ( $. > $line_num && $line =~ m/$re{bakjob}/ .. $line =~ m/^$indent{bakjob}},?/ ) { # parse internal structure if ( $line =~ m/$re{type}/ ) { print $line; parse( $fh, $1 ); } else { print $line; } } # maybe this is wanted, if you need to get all lines printed: # if ( $. < $line_num ) { # print $line; # elsif ( $line =~ m/$re{bakjob}/ .. $line =~ m/^$indent{bakjob}}, +?/ ) { # # parse internal structure # .... # } # else { # print $line; # } } sub parse { my ( $fh, $type ) = @_; while ( my $line = <$fh> ) { chomp $line; my ( $customer, $value ) = split m{\s=\s*\|,}, $line; $customer =~ s/^\s+//; if ( $customer eq $searched_customer ) { printf "$indent{customer}%s = %.1f\n", $customer, $value+$ +modify{$type}; } else { print $line, "\n"; } last if $line =~ m/^$indent{type}},?/; } } __DATA__ bakjob1_details = { credit = { customer1= 2000.0, customer2 = -1500.0 customer3 = 0.0, }, debit = { customer1= 50000.0, customer2 = -2000.0, customer3 = 0.0, } }, bakjob2_details = { credit = { customer1= 1000.0, customer2 = 200.0, customer3 = 500.0, }, debit = { customer1= 600.0, customer2 = 659.0, customer3 = 887.0, } } [download] updates: typo fixed; thanks kyle	[reply] [d/l]
Re: Pattern Matching - a Tricky one by toolic (Bishop) on Jan 02, 2009 at 21:27 UTC
If you have control over the syntax of your text data file, read further; otherwise, stop reading now. I do not recognize the syntax of your data being any standard syntax. However, if you could format your data according to some standard syntax, such as XML, then you could leverage the power of existing parsers, such as the CPAN module XML::Twig	[reply]
Re: Pattern Matching - a Tricky one by kennethk (Abbot) on Jan 02, 2009 at 15:15 UTC
If you need to work with a specific known line number and have a known record width, perhaps you should consider using seek to control your input rather than reading in the entire file. This could also be used to estimate a start position to avoid a fair bit of reading time. Otherwise, a sample of your input file might be helpful (or is that the hash you provided? - if so, do you have the option of changing your input format?). Update: Upon further consideration, did you not mean a given line number by "specific line"? Is "bakjob1_details" a unique identifier? Does every record header (and only record headers) end with "_details"? It seems like this should all be done in a database, but assuming that the answer to all the above is yes, the following should function: Read more... (4 kB)	[reply] [d/l]
Re: Pattern Matching - a Tricky one by Perlbotics (Archbishop) on Jan 03, 2009 at 00:05 UTC
A fuzzy approach... Advantage: may handle large files (no slurping). 'Regexp's have to handle a single line only. Disadvantage: bakjob_details, credit/debit, customer2 have to appear on distinct lines (fragile, relies on proper input). use strict; my %replacement = ("bakjob1_details.debit" => 123.45, "bakjob2_details.credit" => -543.21 ); sub new_value { my ($key, $value) = @_; if (defined $replacement{$key}) { $value = $replacement{$key}; } else { warn "WARNING: '$key' unmodified ($value)\n"; } return $value; } my ($bjob, $payment); while ( <DATA> ) { $bjob = $1, next if /\b(bakjob[12]_details)\b/; $payment = $1, next if /\b(credit\|debit)\b/; # next unless /\bcustomer2\b/; #speed-up? s{^ (\s customer2 \s* = \s) (\S+?) (\s?,?\s*) $ } { $1 . new_value("$bjob.$payment",$2) . $3 }ex; } continue { print; } __DATA__ bakjob1_details = { credit = { customer1= 2000.0, customer2 = -1500.0 # comma missing? customer3 = 0.0, }, debit = { customer1= 50000.0, customer2 = -2000.0, customer3 = 0.0, } }, bakjob2_details = { credit = { customer1= 1000.0, customer2 = 200.0, customer3 = 500.0, }, debit = { customer1= 600.0, customer2 = 659.0, customer3 = 887.0, } } [download]	[reply] [d/l]