text extraction question

echoangel911 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: text extraction question by ikegami (Patriarch) on Dec 05, 2006 at 19:06 UTC
    use strict;
    use warnings;

    my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
    my $inputexample   = 'w8b8cm512swno';

    # One pattern per field type.
    my %type_matchers = (
        NM => qr/\d+/,
        CH => qr/\d+/,
        SW => qr/yes|no/,
    );

    # Split the template into parallel lists of field names and field types.
    my @field_names;
    my @field_types;
    while ($templateformat =~ /([^<]+)<([^>]+)>/g) {
        push(@field_names, $1);
        push(@field_types, $2);
    }

    # my $re = '^';
    # $re .= "\Q$field_names[$_]\E((?:(?!\Q$field_names[$_+1]\E).)*)"
    #     for 0..$#field_names-1;
    # $re .= "\Q$field_names[-1]\E(.*)\\z";
    # $re = qr/$re/s;

    # Build one regex: each literal field name followed by a capture for its type.
    my $re = '^';
    $re .= "\Q$field_names[$_]\E($type_matchers{$field_types[$_]})"
        for 0..$#field_names;
    $re = qr/$re/s;

    my @field_values = $inputexample =~ $re
        or die("Input \"$inputexample\" doesn't match the format defined by template \"$templateformat\"\n");

    local $, = "\t";
    local $\ = "\n";
    print(@field_names);
    print(@field_values);

outputs

    w   b   cm  sw
    8   8   512 no

Updated: Replaced the commented paragraph with the one that follows due to a better understanding of the question. Both give the same answer.
Re: text extraction question by liverpole (Monsignor) on Dec 05, 2006 at 19:10 UTC
Hi echoangel911,

Is the following something like what you're looking for...?

    use strict;
    use warnings;
    use Data::Dumper;

    my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
    my $inputexample   = 'w8b8cm512swno';

    # Everything between angle brackets in the template
    my @first_array  = ($templateformat =~ /<([^>]*)>/g);
    # Every run of digits in the input
    my @second_array = ($inputexample =~ /\d+/g);

    printf "First array = %s\n", Dumper(\@first_array);
    printf "Second array = %s\n", Dumper(\@second_array);

    # Displays:
    #
    # First array = $VAR1 = [
    #           'NM',
    #           'NM',
    #           'CH',
    #           'SW'
    #         ];
    #
    # Second array = $VAR1 = [
    #           '8',
    #           '8',
    #           '512'
    #         ];

s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Re^2: text extraction question by johngg (Canon) on Dec 05, 2006 at 19:51 UTC
From the OP's "and 8 8 512 no in another array", I think you may have missed extracting the "no" in the second array, although from the scanty problem description it's not at all clear that it is wanted.

Cheers,

JohnGG
Re^3: text extraction question by liverpole (Monsignor) on Dec 05, 2006 at 19:58 UTC
Whoops ... you're absolutely right!

My brain directly converted it from "no" to "number(s)" on input (as in "and 8 8 512 numbers"), so I didn't even look to see if it was part of the string.

I guess that validates blue_cowdawg's comments all the more, as far as the original question being a little bit vague.
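For reference, here is a minimal sketch of how the second match could be widened so that the trailing switch value is picked up too. It assumes the switch is always the literal word `yes` or `no`, which the original question does not confirm.

```perl
use strict;
use warnings;

my $inputexample = 'w8b8cm512swno';

# Assumption: every value is either a run of digits or the word yes/no.
my @second_array = ($inputexample =~ /\d+|yes|no/g);

print join(' ', @second_array), "\n";
```

Because this scans the whole string without looking at the field tags, a tag that itself contained `no` or `yes` would also be returned, so it only suits formats like the one shown.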
Re^4: text extraction question by johngg (Canon) on Dec 05, 2006 at 20:34 UTC
Re: text extraction question by wulvrine (Friar) on Dec 05, 2006 at 19:24 UTC
echoangel911,

The simplest way would be to use a regular expression, for example `/w(.+)b(.+)cm(.+)sw(\S*)\s*$/`, which means: find anything (not assuming digits) after the `w`, the `b`, the `cm`, and the `sw`. The final `(\S*)\s*$` captures any non-whitespace (`\S`), followed by any whitespace (`\s`), followed by end of line (`$`). This takes anything after the `sw` tag that ISN'T whitespace, but leaves any extra spacing (spaces, tabs, etc.) at the end of the line out of the match. The matches themselves are stored in the variables `$1` through `$4`.

Here is an example:

    #! /usr/bin/perl
    use strict;
    use warnings;

    my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
    my $inputexample   = 'w8b8cm512swno';

    # The same pattern is applied to the template and to the input.
    if ($templateformat =~ /w(.+)b(.+)cm(.+)sw(\S*)\s*$/) {
        print "template found\n";
        my $first  = $1;
        my $second = $2;
        my $third  = $3;
        my $fourth = $4;
        print "first=$first, second=$second, third=$third, fourth=$fourth\n";
    }

    if ($inputexample =~ /w(.+)b(.+)cm(.+)sw(\S*)\s*$/) {
        print "input found\n";
        my $first  = $1;
        my $second = $2;
        my $third  = $3;
        my $fourth = $4;
        print "first=$first, second=$second, third=$third, fourth=$fourth\n";
    }

Output is:

    template found
    first=<NM>, second=<NM>, third=<CH>, fourth=<SW>
    input found
    first=8, second=8, third=512, fourth=no

I hope that helps!

s&&VALKYRIE &&& print $_^q|!4 =+;' *|
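Since `.+` accepts any text between the tags, the pattern above matches the template string as well as real data. A stricter variant is sketched below. It assumes that `w`, `b` and `cm` always carry digits and that `sw` carries `yes` or `no`; those are guesses, not something the question states.

```perl
use strict;
use warnings;

# Assumption: w, b and cm carry digits; sw carries yes or no.
my $record_re = qr/^w(\d+)b(\d+)cm(\d+)sw(yes|no)\s*$/;

for my $line ('w8b8cm512swno', 'w<NM>b<NM>cm<CH>sw<SW>') {
    # In list context the match returns the four captures, or nothing on failure.
    if (my @fields = $line =~ $record_re) {
        print "matched: @fields\n";
    }
    else {
        print "no match: $line\n";
    }
}
```

With these assumptions the data line matches and the template line is rejected, at the cost of hardcoding both the tags and the value types.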
Re: text extraction question by blue_cowdawg (Monsignor) on Dec 05, 2006 at 18:56 UTC
> is there any easy way to extract NM, NM, CH, SW in one array and 8 8 512 no in another array based on this template or any similar formats?

Looks like a job for regexen. What have you tried?

Also, I'm not 100% sure I understand what your input really looks like. Your `$templateformat` and `$inputexample` snippets don't do much to clarify things. Are you apt to see NM, CH, SW tokens in one line of input and numerics in another?

If you'd provide a larger sample set (not much larger) of the input you're trying to parse, it would be easier to help you...

Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re^2: text extraction question by ikegami (Patriarch) on Dec 05, 2006 at 20:29 UTC
As I understand it, `$templateformat` is a list of `fieldname<fieldtype>` records. It defines the record format to which the data will adhere. As such, one shouldn't hardcode `b`, `w`, etc. The goal is to parse format strings such as `$templateformat` and use the info obtained to extract the values from records such as `$inputexample`.

I had to take some guesses at what `NM` (number), `CH` (appears numerical??) and `SW` (switch) match, but they can easily be changed.

My solution.
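The idea of parsing the format once and reusing the result for many records can be sketched as a small helper. `compile_template` is a hypothetical name introduced here for illustration, and the patterns in `%type_matchers` are the same guesses as above.

```perl
use strict;
use warnings;

# Assumed pattern per field type; guesses that can be changed freely.
my %type_matchers = (
    NM => qr/\d+/,
    CH => qr/\d+/,
    SW => qr/yes|no/,
);

# Hypothetical helper: turn a template into field names, field types and a regex.
sub compile_template {
    my ($template) = @_;
    my (@names, @types);
    while ($template =~ /([^<]+)<([^>]+)>/g) {
        push @names, $1;
        push @types, $2;
    }
    # Literal field name, then a capture group using the pattern for its type.
    my $pattern = join '',
        map { quotemeta($names[$_]) . "($type_matchers{$types[$_]})" } 0 .. $#names;
    return (\@names, \@types, qr/^$pattern\z/);
}

my ($names, $types, $re) = compile_template('w<NM>b<NM>cm<CH>sw<SW>');

# The second record is made-up sample data in the same format.
for my $record ('w8b8cm512swno', 'w16b4cm64swyes') {
    my @values = $record =~ $re
        or die "Record \"$record\" does not match the template\n";
    print join("\t", @values), "\n";
}
```

Keeping the type patterns in one hash means a new template only needs new entries there, not a new hand-written regex.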
Re: text extraction question by throop (Chaplain) on Dec 05, 2006 at 21:19 UTC
I'm going to assume that everything but the angle brackets is reliably alphanumeric. I shoved `$inputexample` into `$_` to unclutter the code.

    use strict;
    use warnings;

    my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
    $_ = 'w8b8cm512swno';
    my(@values, @names, $val);

    # In the example, @h will be ('w', 'NM', 'b', 'NM', 'cm', 'CH', 'sw', 'SW')
    my @h = $templateformat =~ /(\w+)<(\w+)>/g;

    for (my $ix=0; $ix + 1 < @h; $ix += 2){
        push(@names, $h[$ix +1]);
        # Match up to the next piece of template
        if($ix + 2 < @h){
            ($val, $_) = /$h[$ix](.+)($h[$ix +2].+)/ or die 'bad middle';
        }
        # or match to the end if there's no next piece
        else{
            ($val) = /$h[$ix](.+)/ or die 'bad end';
        }
        push(@values, $val);
    }

    local($,, $\ ) = ("\t", "\n");
    print(@names);
    print(@values);

gives

    NM  NM  CH  SW
    8   8   512 no

You might want to add more robust error-checking. I could have shoved the push into `@values` inside the if/else. I'd have saved creating the `$val` variable but I'd have duplicated the push.

throop
Re: text extraction question by MaxKlokan (Monk) on Dec 06, 2006 at 08:23 UTC
This should do the trick:

    my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
    my $inputexample = 'w8b8cm512swno';
    my @splittemplate = split /[^A-Z]+/,$templateformat;

~~`my @splitinput = split /\D+/,$inputexample;`~~

    my @splitinput = split /[^(\d)(no)]+/,$inputexample;

Update: Include "no" as an element of the second array.
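One thing to watch with the two `split` calls above: both strings begin with a separator (`w<` and `w`), and `split` keeps an empty leading field in that case, so each array starts with an empty string. A minimal sketch of one way to drop it, keeping the same approach; the simplified character class `[^\dno]` is an assumption that the only non-numeric value is `no`.

```perl
use strict;
use warnings;

my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
my $inputexample   = 'w8b8cm512swno';

# A separator at the very start of the string yields an empty first field.
my @raw = split /[^A-Z]+/, $templateformat;
print scalar(@raw), " elements before filtering\n";

# grep { length } discards empty strings but keeps a literal '0'.
my @splittemplate = grep { length } split /[^A-Z]+/, $templateformat;
# Assumption: values consist only of digits and the letters n and o.
my @splitinput    = grep { length } split /[^\dno]+/, $inputexample;

print "@splittemplate\n";
print "@splitinput\n";
```

Inside a character class the parentheses in `[^(\d)(no)]` are literal characters rather than grouping, which is why the class can be written as `[^\dno]` here.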

