zarath:

Generally, I use a combination of techniques to parse text files. The methods I use depend on how the structure of the text varies in the file.

When the structure is very regular, you can use substr or unpack to parse it. Unpack has the advantage that it essentially lets me do multiple substring extractions in a single statement, plus a few simple type conversions. But if there are only a couple of fields, I often fall back on split or substr.
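To make that concrete, here is a minimal sketch of the three approaches pulling the same leading fields out of one line. The sample line is made up, and the column widths (10, 8 and 5 characters) are assumptions about that layout, not something taken from your files.

```perl
use strict;
use warnings;

# Made-up sample line; the fixed column widths used below are an assumption.
my $line = '2017-11-16 11:42:20 FATAL some message text';

# substr: one call per field (offset, length)
my $date_s = substr $line, 0, 10;
my $time_s = substr $line, 11, 8;
my $sev_s  = substr $line, 20, 5;

# unpack: every field in one statement. A<n> takes n characters and strips
# trailing spaces, x skips one byte, A* takes whatever is left.
my ($date_u, $time_u, $sev_u, $rest_u) = unpack 'A10 x A8 x A5 x A*', $line;

# split: fine when the fields are simply whitespace-separated; the limit of 4
# keeps the rest of the line together as the last field.
my ($date_p, $time_p, $sev_p, $rest_p) = split ' ', $line, 4;

print "substr: $date_s | $time_s | $sev_s\n";
print "unpack: $date_u | $time_u | $sev_u | $rest_u\n";
print "split:  $date_p | $time_p | $sev_p | $rest_p\n";
```

All three agree on this line. They only start to differ when a field is padded, empty, or contains spaces of its own, which is usually what decides between them.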

When the data has variable structure, then I'll follow up with regular expressions to parse out the tougher bits. Regular expressions with capture groups are a very powerful method to let you quickly chop text into pieces. Here's a quick example:

    use strict;
    use warnings;

    while (my $line = <DATA>) {
        next if $line =~ /^\s*$/;
        my ($date,$time,$severity,$msg) = split /\s+/, $line, 4;
        my ($r1_file, $r1_src, $r1_dst, $r2_file, $r2_src, $r2_dst);
        # r1: greedy captures (.*)
        if ($msg =~ /^Error while copying (.*) from (.*) to (.*), error was/) {
            ($r1_file, $r1_src, $r1_dst) = ($1, $2, $3);
        }
        # r2: non-greedy captures (.*?)
        if ($msg =~ /^Error while copying (.*?) from (.*?) to (.*?), error was/) {
            ($r2_file, $r2_src, $r2_dst) = ($1, $2, $3);
        }
        print "FILE:\t<$r1_file>\n\t<$r2_file>\n";
        print "SRC:\t<$r1_src>\n\t<$r2_src>\n";
        print "DST:\t<$r1_dst>\n\t<$r2_dst>\n\n";
    }

    __DATA__
    2017-11-16 11:42:20 FATAL: Error while copying MX000017105279_2448299.1523788.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:
    2017-11-16 11:42:21 FATAL: Error while copying MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:
    2017-11-16 11:42:21 FATAL: Error while copying Research Data from JOE MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:
    2017-11-16 11:42:21 FATAL: Error while copying MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive from JOE\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:
    2017-11-16 11:42:21 FATAL: Error while copying MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive to BOB\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:

See how easy the code with the regular expressions looks? It's pretty nice to be able to say:

    if ($msg =~ /^Error while copying (.*) from (.*) to (.*), error was/) {
        my ($filename, $source_dir, $dest_dir) = ($1, $2, $3);
        ... do something ...
    }

Perl can see that we want to match an expression that starts with "Error while copying ", followed by a chunk of text, followed by " from ", followed by more text, followed by " to ", yet more text, and ending with ", error was". Since we used parentheses to gather the three unspecified chunks of text, we have captured three strings. If the match was successful, we then assign the first captured chunk ($1) to $filename, the second captured chunk ($2) to $source_dir, and the third ($3) to $dest_dir.
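The same capture can also be written as a list assignment, because a match in list context returns the captured groups (and an empty list when it fails). This is just an alternative idiom, not what the script above does. The helper name split_copy_msg and the shortened message are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (file, source, destination) in list context,
# or an empty list when the message doesn't match.
sub split_copy_msg {
    my ($msg) = @_;
    return $msg =~ /^Error while copying (.*) from (.*) to (.*), error was/;
}

# Shortened, made-up message in the same shape as the log data.
my $msg = 'Error while copying a.EDI from C:\in\ to C:\out\, error was: x';

# The list assignment yields the number of values returned, so it doubles as
# the success test: 3 on a match, 0 otherwise.
if (my ($filename, $source_dir, $dest_dir) = split_copy_msg($msg)) {
    print "file=$filename src=$source_dir dst=$dest_dir\n";
}
else {
    print "no match\n";
}
```

The benefit is that the captures go straight into named variables, so there is no chance of reading a stale $1 left over from an earlier successful match.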

While regular expressions can greatly simplify your code, you must test them thoroughly, or you can be surprised by the results:

    $ perl pm_1204144.pl
    FILE:   <MX000017105279_2448299.1523788.IN.EDI>
            <MX000017105279_2448299.1523788.IN.EDI>
    SRC:    <D:\EnecoEDIELArchive\B2B_ELEK\>
            <D:\EnecoEDIELArchive\B2B_ELEK\>
    DST:    <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>
            <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

    FILE:   <MX000017105328_3626588.1523787.IN.EDI>
            <MX000017105328_3626588.1523787.IN.EDI>
    SRC:    <D:\EnecoEDIELArchive\B2B_ELEK\>
            <D:\EnecoEDIELArchive\B2B_ELEK\>
    DST:    <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>
            <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

    FILE:   <Research Data from JOE MX000017105328_3626588.1523787.IN.EDI>
            <Research Data>
    SRC:    <D:\EnecoEDIELArchive\B2B_ELEK\>
            <JOE MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\>
    DST:    <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>
            <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

    FILE:   <MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive>
            <MX000017105328_3626588.1523787.IN.EDI>
    SRC:    <JOE\B2B_ELEK\>
            <D:\EnecoEDIELArchive from JOE\B2B_ELEK\>
    DST:    <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>
            <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

    FILE:   <MX000017105328_3626588.1523787.IN.EDI>
            <MX000017105328_3626588.1523787.IN.EDI>
    SRC:    <D:\EnecoEDIELArchive to BOB\B2B_ELEK\>
            <D:\EnecoEDIELArchive>
    DST:    <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>
            <BOB\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

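As you can see, some oddball data can confuse your regular expressions, so you need to test them carefully. In each pair above, the first value comes from the greedy (.*) version, which splits at the last " from " and " to " it can find, and the second from the non-greedy (.*?) version, which splits at the first. Neither gets every line right. Trying to catch everything with a single regular expression can get complicated, so don't be afraid to use several cases in different if/elsif/else branches.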
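As a rough sketch of what several cases could look like: try a stricter pattern first, then fall back to a looser one. The helper name parse_copy_error is made up, and the first case assumes that the source always starts with a drive letter and the destination is always a UNC path, which holds for the sample data but may not hold for your real logs.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the script above. Returns (file, source,
# destination), or an empty list when no case matches.
sub parse_copy_error {
    my ($msg) = @_;

    # Case 1: assumes the source starts with a drive letter (D:\...) and the
    # destination is a UNC path (\\server\...), as in the sample data.
    if ($msg =~ /^Error while copying (.+) from ([A-Za-z]:\\.*) to (\\\\.*), error was/) {
        return ($1, $2, $3);
    }
    # Case 2: anything else falls back to the looser non-greedy pattern.
    elsif ($msg =~ /^Error while copying (.+?) from (.+?) to (.+?), error was/) {
        return ($1, $2, $3);
    }
    return;
}

# Two of the awkward messages from the test data. The quoted heredoc
# terminator means the backslashes are taken literally.
my @messages = split /\n/, <<'END';
Error while copying Research Data from JOE MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:
Error while copying MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive to BOB\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:
END

for my $msg (@messages) {
    my ($file, $src, $dst) = parse_copy_error($msg)
        or next;    # skip messages that no case recognises
    print "FILE:\t<$file>\nSRC:\t<$src>\nDST:\t<$dst>\n\n";
}
```

Anchoring on the drive letter and the leading double backslash gives the regex something firmer to split on than the bare words " from " and " to ", but it is still only as good as those assumptions, so the advice about testing applies here too.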

...roboticus

When your only tool is a hammer, all problems look like your thumb.


In reply to Re: Alternatives for index() ... substr() ? by roboticus
in thread Alternatives for index() ... substr() ? by zarath
