echoangel911 has asked for the wisdom of the Perl Monks concerning the following question:
Hi Monks,
my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
my $inputexample = 'w8b8cm512swno';
is there any easy way to extract NM, NM, CH, SW in one array and 8 8 512 no in another array based on this template or any similar formats?
Re: text extraction question
by ikegami (Patriarch) on Dec 05, 2006 at 19:06 UTC
use strict;
use warnings;
my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
my $inputexample = 'w8b8cm512swno';
my %type_matchers = (
NM => qr/\d+/,
CH => qr/\d+/,
SW => qr/yes|no/,
);
my @field_names;
my @field_types;
while ($templateformat =~ /([^<]+)<([^>]+)>/g) {
push(@field_names, $1);
push(@field_types, $2);
}
# my $re = '^';
# $re .= "\Q$field_names[$_]\E((?:(?!\Q$field_names[$_+1]\E).)*)"
# for 0..$#field_names-1;
# $re .= "\Q$field_names[-1]\E(.*)\\z";
# $re = qr/$re/s;
my $re = '^';
$re .= "\Q$field_names[$_]\E($type_matchers{$field_types[$_]})"
for 0..$#field_names;
$re = qr/$re/s;
my @field_values = $inputexample =~ $re
or die("Input \"$inputexample\" doesn't match the format defined by template \"$templateformat\"\n");
local $, = "\t";
local $\ = "\n";
print(@field_names);
print(@field_values);
outputs
w b cm sw
8 8 512 no
Updated: Replaced the commented paragraph with the one that follows due to a better understanding of the question. Both give the same answer.
Re: text extraction question
by liverpole (Monsignor) on Dec 05, 2006 at 19:10 UTC
use strict;
use warnings;
use Data::Dumper;
my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
my $inputexample = 'w8b8cm512swno';
my @first_array = ($templateformat =~ /<([^>]*)>/g);
my @second_array = ($inputexample =~ /\d+/g);
printf "First array = %s\n", Dumper(\@first_array);
printf "Second array = %s\n", Dumper(\@second_array);
# Displays:
#
# First array = $VAR1 = [
# 'NM',
# 'NM',
# 'CH',
# 'SW'
# ];
#
# Second array = $VAR1 = [
# '8',
# '8',
# '512'
# ];
s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Whoops ... you're absolutely right!
My brain directly converted it from "no" to "number(s)" on input (as in and 8 8 512 numbers), so I didn't even look to see if it was part of the string.
I guess that validates blue_cowdawg's comments all the more, as far as the original question being a little bit vague.
s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Re: text extraction question
by wulvrine (Friar) on Dec 05, 2006 at 19:24 UTC
echoangel911
The simplest way would be to use a regular expression, for example
/w(.+)b(.+)cm(.+)sw(\S*)\s*$/
Which would mean find anything (not assuming digits) after the w, the b, the cm, and the sw.
The final (\S*)\s*$ matches any run of non-whitespace characters (\S*), followed by any whitespace (\s*), followed by the end of the line ($). This captures anything after the 'sw' tag that isn't whitespace, but leaves any trailing spacing (spaces, tabs, etc.) at the end of the line out of the capture. The captures themselves are stored in the variables $1 through $4.
Here is an example
#! /usr/bin/perl
use strict;
use warnings;
my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
my $inputexample = 'w8b8cm512swno';
if ($templateformat =~ /w(.+)b(.+)cm(.+)sw(\S*)\s*$/ ) {
print "template found\n";
my $first = $1;
my $second = $2;
my $third = $3;
my $fourth = $4;
print "first=$first, second=$second, third=$third, fourth=$fourth\n";
}
if ($inputexample =~ /w(.+)b(.+)cm(.+)sw(\S*)\s*$/ ) {
print "input found\n";
my $first = $1;
my $second = $2;
my $third = $3;
my $fourth = $4;
print "first=$first, second=$second, third=$third, fourth=$fourth\n";
}
Output is:
template found
first=<NM>, second=<NM>, third=<CH>, fourth=<SW>
input found
first=8, second=8, third=512, fourth=no
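The same captures can also be collected in one step by evaluating the match in list context. This is only a small sketch: the input line with trailing whitespace is a made-up test case (not from the original question), used to show that the spacing stays out of the last capture.

```perl
use strict;
use warnings;

# Hypothetical input: the original example plus trailing spaces, a tab and a newline.
my $line = "w8b8cm512swno  \t\n";

# In list context the match returns the four captures directly,
# so there is no need to copy $1 through $4 by hand.
my @captures = $line =~ /w(.+)b(.+)cm(.+)sw(\S*)\s*$/;

# Brackets make any stray whitespace in a capture visible.
print join(',', map { "[$_]" } @captures), "\n";
```

Note that the (.+) groups are greedy, so this relies on the tags w, b, cm and sw not appearing inside the values themselves.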
I hope that helps!
s&&VALKYRIE &&& print $_^q|!4 =+;' *|
Re: text extraction question
by blue_cowdawg (Monsignor) on Dec 05, 2006 at 18:56 UTC
is there any easy way to extract NM, NM, CH, SW in one array and 8 8 512 no in another array based on this template or any similar formats?
Looks like a job for regexen. What have you tried? Also, I'm
not 100% sure I understand what your input really looks like. Your $templateformat and
$inputexample snippets don't do much to
clarify things. Are you apt to see NM, CH, SW tokens in
one line of input and numerics in another?
If you'd provide a larger sample set (not much larger)
of the input you're trying to parse, it would be easier
to help you...
Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
As I understand it,
$templateformat is a list of fieldname<fieldtype> records. It defines the record format to which the data will adhere. As such, one shouldn't hardcode b, w, etc.
The goal is to parse format strings such as $templateformat and use the info obtained to extract the values from records such as $inputexample.
I had to take some guesses at what NM (number), CH (appears numerical??) and SW (switch) match, but they can easily be changed.
My solution.
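One way to package that idea is sketched below. It is a rough illustration under stated assumptions, not a definitive implementation: compile_template and parse_record are hypothetical helper names, the %type_matchers patterns are the same guesses at what NM, CH and SW mean, and the trailing \z anchor is an added assumption that a record contains nothing after its last field.

```perl
use strict;
use warnings;

# Assumed type-to-pattern table; the meanings of NM, CH and SW are guesses.
my %type_matchers = (
    NM => qr/\d+/,
    CH => qr/\d+/,
    SW => qr/yes|no/,
);

# Hypothetical helper: turn a template into (\@names, $regex).
sub compile_template {
    my ($template) = @_;
    my (@names, @parts);
    while ($template =~ /([^<]+)<([^>]+)>/g) {
        my ($name, $type) = ($1, $2);
        my $matcher = $type_matchers{$type}
            or die "Unknown field type \"$type\"\n";
        push @names, $name;
        # Quote the literal field name, capture the typed value.
        push @parts, quotemeta($name) . "($matcher)";
    }
    my $re = join '', @parts;
    return (\@names, qr/^$re\z/);
}

# Hypothetical helper: parse one record into a name => value hash,
# returning undef when the record does not fit the template.
sub parse_record {
    my ($names, $re, $record) = @_;
    my @values = $record =~ $re
        or return;
    my %fields;
    @fields{@$names} = @values;
    return \%fields;
}

my ($names, $re) = compile_template('w<NM>b<NM>cm<CH>sw<SW>');
my $fields = parse_record($names, $re, 'w8b8cm512swno');
print "$_=$fields->{$_}\n" for @$names;
```

Keying the hash by field name (w, b, cm, sw) rather than by type avoids the collision between the two NM fields.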
Re: text extraction question
by throop (Chaplain) on Dec 05, 2006 at 21:19 UTC
I'm going to assume that everything but the angle brackets is reliably alphanumeric. I shoved $inputexample into $_ to unclutter the code.
use strict;
use warnings;
my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
$_ = 'w8b8cm512swno';
my(@values, @names, $val);
# In the example, @h will be ('w', 'NM', 'b', 'NM', 'cm', 'CH', 'sw', 'SW')
my @h = $templateformat =~ /(\w+)<(\w+)>/g;
for (my $ix=0; $ix + 1 < @h; $ix += 2){
push(@names, $h[$ix +1]);
# Match up to the next piece of template
if($ix + 2 < @h){
($val, $_) = /$h[$ix](.+)($h[$ix +2].+)/ or die 'bad middle'}
# or match to the end if there's no next piece
else{
($val) = /$h[$ix](.+)/ or die 'bad end'};
push(@values, $val)};
local($,, $\ ) = ("\t", "\n");
print(@names);
print(@values);
gives
NM NM CH SW
8 8 512 no
You might want to add more robust error-checking. I could have shoved the push into @values inside the if/else. I'd have saved creating the $val variable but I'd have duplicated the push.
throop
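On the error-checking point, here is a sketch of one alternative. It is a swapped-in technique rather than the loop above: it builds a single anchored pattern with quotemeta and non-greedy captures instead of consuming $_ piece by piece. It assumes that no value contains the text of the next field's label; the variable names @labels and @types are illustrative only.

```perl
use strict;
use warnings;

my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
my $inputexample   = 'w8b8cm512swno';

# Collect the literal labels (w, b, cm, sw) and the type tags (NM, NM, CH, SW).
my (@labels, @types);
while ($templateformat =~ /(\w+)<(\w+)>/g) {
    push @labels, $1;
    push @types,  $2;
}

# One pattern for the whole record: each quoted label is followed by a
# non-greedy capture, so a value stops at the next label.
my $re = join '', map { quotemeta($_) . '(.+?)' } @labels;

# Anchoring with ^ and \z means a malformed record fails as a whole
# instead of failing somewhere in the middle of a loop.
my @values = $inputexample =~ /^$re\z/
    or die "Input \"$inputexample\" does not match \"$templateformat\"\n";

print "@types\n";
print "@values\n";
```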
Re: text extraction question
by MaxKlokan (Monk) on Dec 06, 2006 at 08:23 UTC
This should do the trick:
my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
my $inputexample = 'w8b8cm512swno';
my @splittemplate = split /[^A-Z]+/,$templateformat;
# my @splitinput = split /\D+/,$inputexample;   # original version, superseded by the next line (see Update)
my @splitinput = split /[^(\d)(no)]+/,$inputexample;
Update:
Include "no" as an element of the second array.
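Two caveats about the split approach are worth spelling out. First, [^(\d)(no)] is a character class (anything that is not a digit, a parenthesis, n or o), not an alternation. Second, split returns an empty first element here because both strings begin with a separator. A minimal sketch that sidesteps both, assuming the values are either digits or the words yes/no (that assumption is not stated in the original question):

```perl
use strict;
use warnings;

my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
my $inputexample   = 'w8b8cm512swno';

# split keeps a leading empty field when the string starts with a
# separator, so filter out zero-length elements.
my @types = grep { length } split /[^A-Z]+/, $templateformat;

# Assumption: a value is a run of digits or the word yes/no.
# A global match in list context returns every captured value.
my @values = $inputexample =~ /(\d+|yes|no)/g;

print "@types\n";
print "@values\n";
```

This would misfire if a field label itself contained "yes" or "no", so a template-driven pattern is safer for anything beyond the example.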