Well then, you've got a problem, but I'm going to propose a solution. First, I'll try to explain the problem.
If one label is "Programming Languages:", another is "Author:", and another is "Date Created:", you clearly cannot count on the labels being free of whitespace. And if your data fields are "C++, Java", "John", "2004-01-05 10:23...", you cannot count on the data being free of whitespace either. Your fields aren't of fixed width. And your delimiter (the colon) sits in the middle of each record, between label and data, so it's more of an anchor than a delimiter, which doesn't help much. That leaves you with no good way of determining where a data field ends and where the next label starts... unless, of course, you're lucky enough to know all the possible labels.
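To make the problem concrete, here is a small sketch of what happens if you just split on the colon. The sample line is my own assumption, stitched together from the labels and values mentioned above; your real data may be laid out differently.

```perl
use strict;
use warnings;

# Hypothetical sample line (an assumption, built from the labels and values above).
my $text = 'Programming Languages: C++, Java Author: John Date Created: 2004-01-05 10:23';

# Naive attempt: treat the colon as a delimiter.
my @pieces = split /:\s*/, $text;

print "[$_]\n" for @pieces;
# Prints:
# [Programming Languages]
# [C++, Java Author]
# [John Date Created]
# [2004-01-05 10]
# [23]
```

Each data field comes back glued to the label that follows it, and the colon inside the time cuts the last field in two.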
Maybe you could instead skim for known labels. That would be helpful. For example, if you know that the only labels in the text are "Programming Languages", "Author", and "Date Created", you could compose your regular expression like this:
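    my $labels = qr/Programming Languages|Author|Date Created/;
    my $re     = qr/($labels):(.+?)(?=$labels|$)/;

    # Match in scalar context so that /g steps through $text one record at a
    # time. A list assignment in the loop condition would grab every match at
    # once and then start over from the beginning on each pass.
    while ( $text =~ m/$re/g ) {
        my ( $label, $data ) = ( $1, $2 );
        print "Label: $label\tData: $data\n";
    }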
This will capture the known label into $label on each iteration, and then the field following the label into $data. Each match stops as soon as the lookahead assertion finds the next known label, or the end of the string.
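Here is a self-contained sketch of the same idea that you can run as-is. The sample text, the @known list, and the @records array are my own assumptions for illustration. It also makes three small changes that may or may not suit your data: labels are passed through quotemeta and tried longest-first, surrounding whitespace is trimmed from each field, and the lookahead requires the colon after the next label so that a bare word like "Author" inside a data field doesn't end the field early.

```perl
use strict;
use warnings;

# Hypothetical sample record (an assumption, built from the labels and values above).
my $text = 'Programming Languages: C++, Java Author: John Date Created: 2004-01-05 10:23';

# Known labels. quotemeta guards against regex metacharacters in a label,
# and sorting longest-first keeps a short label from shadowing a longer one.
my @known  = ( 'Programming Languages', 'Author', 'Date Created' );
my $labels = join '|',
             map  { quotemeta($_) }
             sort { length($b) <=> length($a) } @known;

# $1 is the label, $2 is the field. The field ends just before the next
# "label:" or at the end of the string. /s lets a field span newlines.
my $re = qr/($labels):\s*(.+?)\s*(?=(?:$labels):|\z)/s;

my @records;
while ( $text =~ /$re/g ) {
    push @records, [ $1, $2 ];
}

printf "Label: %-22s Data: %s\n", @$_ for @records;
```

With that sample text this yields three label/data pairs, and the colon inside "10:23" causes no trouble, because only a known label followed by a colon can end a field.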
Dave