
We're migrating some legacy flat files to databases. The flat files are structured like the __DATA__ block shown below. Tags are short uppercase strings followed by an equals sign and whitespace. Data fields are mostly unstructured text (we have verified that uppercase TAGn strings never occur in the data). I want to capture both the tags and the data to build hashes.

Ideally, I want to use method 1 and create the regex on the fly from tags read from other files. As you can see, my attempt fails miserably (does not split where I want to split).

So I tried method 2 and made the regex manually. That one almost works, except that I end up with an extra (empty) value in $stuff2[0], which makes a subsequent %hash = @stuff2 break.

OK, the input is not that ugly, but I thought the solution would be easier. What am I missing?

use 5.010;
use strict;
use warnings;

$/ = undef;
my $data = <DATA>;

#method 1: build regex on the fly (read tags from files)
say '---------- method 1 ----------';
my @tags = qw(
                 TAG1
                 TAG2
                 TAG3
                 TAG4
            );

my $tags_re = join "|", @tags;
$tags_re = qr{ $tags_re };
say $tags_re;

my @stuff = split /($tags_re)=\s*/, $data;
say "#$_#"      for @stuff;

# method 2: static regex
say '---------- method 2 ----------';
my @stuff2 = split /(TAG1|TAG2|TAG2|TAG4)=\s*/, $data;
say "#$_#"      for @stuff2;

__DATA__
TAG1= data
TAG2= more data
TAG3= even more data that sometimes has = and
runs on to more
than one line
TAG4= still more

OUTPUT:

---------- method 1 ----------
(?-xism: TAG1|TAG2|TAG3|TAG4 )
#TAG1= data
#
#TAG2#
#more data
#
#TAG3#
#even more data that sometimes has = and
runs on to more
than one line
TAG4= still more
#
---------- method 2 ----------
##
#TAG1#
#data
#
#TAG2#
#more data
TAG3= even more data that sometimes has = and
runs on to more
than one line
#
#TAG4#
#still more
#
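
Two things appear to explain the output above. In method 1, qr{ $tags_re } is compiled without the /x flag, so the spaces inside the braces are literal: the first alternative becomes " TAG1" and the last becomes "TAG4 ", which is why only TAG2 and TAG3 split. In method 2, the pattern lists TAG2 twice and leaves out TAG3, so TAG3 never splits; the empty $stuff2[0] is what split normally returns when the string starts with a separator. Here is a minimal sketch of one way to handle both. It assumes the data sits in a heredoc (only to keep the example self-contained), and the names @fields and %record are made up for illustration.

```perl
use 5.010;
use strict;
use warnings;

# Sample data copied from the post; a heredoc stands in for __DATA__.
my $data = <<'END';
TAG1= data
TAG2= more data
TAG3= even more data that sometimes has = and
runs on to more
than one line
TAG4= still more
END

# Tag names; in the post these would be read from other files.
my @tags = qw(TAG1 TAG2 TAG3 TAG4);

# Build the alternation with no literal spaces around it.
# quotemeta guards against tags containing regex metacharacters.
my $tags_re = join '|', map { quotemeta($_) } @tags;
$tags_re = qr{$tags_re};

# The capture group makes split return the tags along with the data.
my @fields = split /($tags_re)=\s*/, $data;

# split yields an empty leading field when the string begins with a
# separator; drop it so the list pairs up as tag, value, tag, value.
shift @fields if @fields && $fields[0] eq '';

my %record = @fields;

# Trim the trailing newline that each value keeps.
s/\s+\z// for values %record;

say "$_ => [$record{$_}]" for sort keys %record;
```

Another option would be to keep the spaces and add the /x flag, as in qr{ $tags_re }x, which tells the regex engine to ignore unescaped whitespace in the pattern.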

In reply to problems splitting ugly input data by Anonymous Monk
