in reply to Alternatives for index() ... substr() ?

zarath:

Generally, I use a combination of techniques to parse text files. The methods I use depend on how the structure of the text varies in the file.

When the structure is very regular, you can use substr or unpack to parse it. Unpack has the advantage that it essentially lets me do multiple substring extractions in a single statement, plus it can do a few simple type conversions. But if there are only a couple of fields, I often fall back on split or substr.
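As a rough sketch of the unpack approach, here is how a fixed-width prefix might be pulled apart in one statement. The column widths (10-character date, 8-character time, 6-character severity) and the sub name parse_fixed are assumptions made up for illustration; real data would need its own template.

```perl
use strict;
use warnings;

# Hypothetical fixed-width layout (an assumption, not from any real spec):
#   cols 0-9  date, col 10 space, cols 11-18 time, col 19 space,
#   cols 20-25 severity, col 26 space, remainder is the message.
sub parse_fixed {
    my ($line) = @_;
    # A10 = 10 chars with trailing whitespace stripped, x = skip one byte,
    # A* = everything that is left.
    my ($date, $time, $severity, $msg) = unpack 'A10 x A8 x A6 x A*', $line;
    return ($date, $time, $severity, $msg);
}

my $line = '2017-11-16 11:42:20 FATAL: Error while copying';

# The same four fields with unpack ...
my @fields = parse_fixed($line);
print join('|', @fields), "\n";

# ... and just the date with substr, for comparison.
print substr($line, 0, 10), "\n";
```

Note that unpack will die if the line is shorter than the template's skip positions, so a length check up front may be worthwhile on messy input.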

When the data has variable structure, then I'll follow up with regular expressions to parse out the tougher bits. Regular expressions with capture groups are a very powerful method to let you quickly chop text into pieces. Here's a quick example:

    use strict;
    use warnings;

    while (my $line = <DATA>) {
        next if $line =~ /^\s*$/;
        my ($date,$time,$severity,$msg) = split /\s+/, $line, 4;
        my ($r1_file, $r1_src, $r1_dst, $r2_file, $r2_src, $r2_dst);
        if ($msg =~ /^Error while copying (.*) from (.*) to (.*), error was/) {
            ($r1_file, $r1_src, $r1_dst) = ($1, $2, $3);
        }
        if ($msg =~ /^Error while copying (.*?) from (.*?) to (.*?), error was/) {
            ($r2_file, $r2_src, $r2_dst) = ($1, $2, $3);
        }
        print "FILE:\t<$r1_file>\n\t<$r2_file>\n";
        print "SRC:\t<$r1_src>\n\t<$r2_src>\n";
        print "DST:\t<$r1_dst>\n\t<$r2_dst>\n\n";
    }

    __DATA__
    2017-11-16 11:42:20 FATAL: Error while copying MX000017105279_2448299.1523788.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:
    2017-11-16 11:42:21 FATAL: Error while copying MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:
    2017-11-16 11:42:21 FATAL: Error while copying Research Data from JOE MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:
    2017-11-16 11:42:21 FATAL: Error while copying MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive from JOE\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:
    2017-11-16 11:42:21 FATAL: Error while copying MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive to BOB\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:

See how easy the code with the regular expressions looks? It's pretty nice to be able to say:

    if ($msg =~ /^Error while copying (.*) from (.*) to (.*), error was/) {
        my ($filename, $source_dir, $dest_dir) = ($1, $2, $3);
        ... do something ...
    }

Perl can see that we want to match an expression that starts with "Error while copying ", followed by a chunk of text, followed by " from ", followed by more text, followed by " to ", yet more text, and ending up with ", error was". Since we used parentheses to gather the three unspecified chunks of text, we have captured three strings. If the match was successful, we then assign the first captured chunk ($1) into $filename, the second captured chunk ($2) into $source_dir, and the third ($3) into $dest_dir.
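A small variation you may find handy: in list context, a successful match returns the captured groups directly, so you can skip $1, $2 and $3 altogether. The sketch below wraps that in a sub; the name parse_copy_error and the sample message are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (filename, source_dir, dest_dir),
# or an empty list when the message does not match.
sub parse_copy_error {
    my ($msg) = @_;
    # In list context the match returns the three captures on success
    # and an empty list on failure.
    my ($filename, $source_dir, $dest_dir) =
        $msg =~ /^Error while copying (.*) from (.*) to (.*), error was/;
    return unless defined $filename;
    return ($filename, $source_dir, $dest_dir);
}

# Made-up sample message (backslashes are literal inside single quotes here).
my @parts = parse_copy_error(
    'Error while copying a.txt from C:\in\ to C:\out\, error was: disk full');

if (@parts) {
    print "FILE: <$parts[0]>\nSRC:  <$parts[1]>\nDST:  <$parts[2]>\n";
}
else {
    print "no match\n";
}
```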

While regular expressions can greatly simplify your code, you must test them thoroughly, or you can be surprised by the results:

    $ perl pm_1204144.pl
    FILE:   <MX000017105279_2448299.1523788.IN.EDI>
            <MX000017105279_2448299.1523788.IN.EDI>
    SRC:    <D:\EnecoEDIELArchive\B2B_ELEK\>
            <D:\EnecoEDIELArchive\B2B_ELEK\>
    DST:    <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>
            <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

    FILE:   <MX000017105328_3626588.1523787.IN.EDI>
            <MX000017105328_3626588.1523787.IN.EDI>
    SRC:    <D:\EnecoEDIELArchive\B2B_ELEK\>
            <D:\EnecoEDIELArchive\B2B_ELEK\>
    DST:    <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>
            <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

    FILE:   <Research Data from JOE MX000017105328_3626588.1523787.IN.EDI>
            <Research Data>
    SRC:    <D:\EnecoEDIELArchive\B2B_ELEK\>
            <JOE MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\>
    DST:    <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>
            <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

    FILE:   <MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive>
            <MX000017105328_3626588.1523787.IN.EDI>
    SRC:    <JOE\B2B_ELEK\>
            <D:\EnecoEDIELArchive from JOE\B2B_ELEK\>
    DST:    <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>
            <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

    FILE:   <MX000017105328_3626588.1523787.IN.EDI>
            <MX000017105328_3626588.1523787.IN.EDI>
    SRC:    <D:\EnecoEDIELArchive to BOB\B2B_ELEK\>
            <D:\EnecoEDIELArchive>
    DST:    <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>
            <BOB\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

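In each pair, the first line comes from the greedy (.*) pattern and the second from the non-greedy (.*?) pattern. As you can see, some oddball data (a stray " from " or " to " inside a file or directory name) can confuse either one, so you need to test your regular expressions carefully. Trying to catch everything with a single regular expression can get complicated, so don't be afraid to use several cases in different if/elsif/else branches.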
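Here's a minimal sketch of what a couple of branches might look like. It assumes, based only on the sample lines in this thread, that the source is usually a drive-letter path and the destination a UNC path; that assumption, the sub name parse_copy_msg, and the "rule" labels are all mine, so adjust them to whatever your real data looks like.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hashref describing the match,
# or undef when no branch recognizes the message.
sub parse_copy_msg {
    my ($msg) = @_;

    # Case 1 (assumed layout): source starts with a drive letter such as D:\
    # and destination starts with a UNC prefix such as \\HOST.
    # Anchoring on those prefixes keeps a stray " from " or " to " inside
    # a name from splitting the fields in the wrong place.
    if ($msg =~ /^Error while copying (.+) from ([A-Za-z]:\\.*) to (\\\\.*), error was/) {
        return { rule => 'drive-to-unc', file => $1, src => $2, dst => $3 };
    }
    # Case 2: generic fallback, the same greedy pattern as in the post.
    elsif ($msg =~ /^Error while copying (.*) from (.*) to (.*), error was/) {
        return { rule => 'generic', file => $1, src => $2, dst => $3 };
    }
    # Anything else is left for the caller to log or inspect by hand.
    else {
        return;
    }
}

# Made-up message with a stray " from " inside the source directory.
# Inside single quotes, \\ yields one backslash.
my $msg = 'Error while copying a.edi from D:\\Archive from JOE\\in\\ '
        . 'to \\\\HOST\\share\\, error was:';

if (my $r = parse_copy_msg($msg)) {
    printf "%s: file=<%s> src=<%s> dst=<%s>\n", @{$r}{qw(rule file src dst)};
}
else {
    print "unparsed: $msg\n";
}
```

The stricter case goes first so that well-formed lines get the most reliable split, and the looser pattern only sees what is left over. Anything that falls through both is worth eyeballing before you write a third branch for it.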

...roboticus

When your only tool is a hammer, all problems look like your thumb.