text extraction question

by echoangel911 (Sexton)
on Dec 05, 2006 at 18:48 UTC

echoangel911 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
my $inputexample = 'w8b8cm512swno';

Is there any easy way to extract NM, NM, CH, SW in one array and 8 8 512 no in another array based on this template or any similar formats?

Replies are listed 'Best First'.
Re: text extraction question
by ikegami (Patriarch) on Dec 05, 2006 at 19:06 UTC
    use strict;
    use warnings;

    my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
    my $inputexample = 'w8b8cm512swno';

    my %type_matchers = (
        NM => qr/\d+/,
        CH => qr/\d+/,
        SW => qr/yes|no/,
    );

    my @field_names;
    my @field_types;
    while ($templateformat =~ /([^<]+)<([^>]+)>/g) {
        push(@field_names, $1);
        push(@field_types, $2);
    }

    # my $re = '^';
    # $re .= "\Q$field_names[$_]\E((?:(?!\Q$field_names[$_+1]\E).)*)"
    #     for 0..$#field_names-1;
    # $re .= "\Q$field_names[-1]\E(.*)\\z";
    # $re = qr/$re/s;

    my $re = '^';
    $re .= "\Q$field_names[$_]\E($type_matchers{$field_types[$_]})"
        for 0..$#field_names;
    $re = qr/$re/s;

    my @field_values = $inputexample =~ $re
        or die("Input \"$inputexample\" doesn't match the format defined by template \"$templateformat\"\n");

    local $, = "\t";
    local $\ = "\n";
    print(@field_names);
    print(@field_values);

    outputs

    w       b       cm      sw
    8       8       512     no

    Updated: Replaced the commented paragraph with the one that follows due to a better understanding of the question. Both give the same answer.
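
The same idea can be wrapped up for reuse. What follows is only a minimal sketch, not ikegami's code: `make_extractor` is a hypothetical name, the type table repeats the guesses made above for NM, CH and SW, and the pattern is anchored at both ends with `\A` and `\z` so that trailing junk is rejected (the original only anchors the start).

```perl
use strict;
use warnings;

# Hypothetical helper: compile a template such as 'w<NM>b<NM>cm<CH>sw<SW>'
# into a closure that extracts the field values from one record.
sub make_extractor {
    my ($template, $type_matchers) = @_;

    my $re = '';
    while ($template =~ /([^<]+)<([^>]+)>/g) {
        my ($name, $type) = ($1, $2);
        my $matcher = $type_matchers->{$type}
            or die "Unknown field type '$type' in template '$template'\n";
        # quotemeta protects field names that contain regex metacharacters.
        $re .= quotemeta($name) . "($matcher)";
    }
    my $compiled = qr/\A$re\z/;

    # In list context the closure returns the captured values,
    # or an empty list when the record does not match.
    return sub { my ($record) = @_; return $record =~ $compiled };
}

my $extract = make_extractor(
    'w<NM>b<NM>cm<CH>sw<SW>',
    { NM => qr/\d+/, CH => qr/\d+/, SW => qr/yes|no/ },   # assumed meanings
);

my @values = $extract->('w8b8cm512swno');
print "@values\n";

my @bad = $extract->('w8b8cm512swmaybe');
print "no match\n" unless @bad;
```

Compiling once and returning a closure means the template is parsed a single time even when many records are processed.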

Re: text extraction question
by liverpole (Monsignor) on Dec 05, 2006 at 19:10 UTC
    Hi echoangel911,

    Is the following something like what you're looking for...?

    use strict;
    use warnings;
    use Data::Dumper;

    my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
    my $inputexample = 'w8b8cm512swno';

    my @first_array = ($templateformat =~ /<([^>]*)>/g);
    my @second_array = ($inputexample =~ /\d+/g);

    printf "First array = %s\n", Dumper(\@first_array);
    printf "Second array = %s\n", Dumper(\@second_array);

    # Displays:
    #
    # First array = $VAR1 = [
    #           'NM',
    #           'NM',
    #           'CH',
    #           'SW'
    #         ];
    #
    # Second array = $VAR1 = [
    #           '8',
    #           '8',
    #           '512'
    #         ];

    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
      From the OP:

      and 8 8 512 no in another array

      I think you may have missed extracting the "no" in the second array, although from the scanty problem description it's not at all clear that it might be wanted.

      Cheers,

      JohnGG

        Whoops ... you're absolutely right!

        My brain directly converted it from "no" to "number(s)" on input (as in "and 8 8 512 numbers"), so I didn't even look to see if it was part of the string.

        I guess that validates blue_cowdawg's comments all the more, as far as the original question being a little bit vague.


        s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
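
For readers who do want the trailing "no", here is one possible adjustment to the global match above. It is a sketch under an assumption the thread never confirms: that every value is either a run of digits or a yes/no switch at the very end of the record. `extract_values` is a hypothetical name.

```perl
use strict;
use warnings;

# Assumption: a value is a run of digits, or "yes"/"no" at the end of the record.
sub extract_values {
    my ($record) = @_;
    # With /g in list context, the match returns every capture in order.
    return $record =~ /(\d+|(?:yes|no)\z)/g;
}

my @second_array = extract_values('w8b8cm512swno');
print join(', ', @second_array), "\n";
```

Anchoring the switch with `\z` keeps a stray "no" inside a field name from being picked up by accident.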
Re: text extraction question
by wulvrine (Friar) on Dec 05, 2006 at 19:24 UTC
    echoangel911
    The simplest way would be to use a regular expression, for example
    /w(.+)b(.+)cm(.+)sw(\S*)\s*$/
    This captures whatever follows the w, the b, the cm, and the sw (without assuming digits).
    The final (\S*)\s*$ matches any run of non-whitespace characters (\S*), followed by optional whitespace (\s*), followed by the end of the line ($).
    It takes anything after the 'sw' tag that ISN'T whitespace, and leaves any extra spacing (spaces, tabs, etc.) at the end of the line out of the match. The matches themselves are stored in the variables $1 through $4.
    Here is an example

    #! /usr/bin/perl
    use strict;
    use warnings;

    my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
    my $inputexample = 'w8b8cm512swno';

    if ($templateformat =~ /w(.+)b(.+)cm(.+)sw(\S*)\s*$/ ) {
        print "template found\n";
        my $first = $1;
        my $second = $2;
        my $third = $3;
        my $fourth = $4;
        print "first=$first, second=$second, third=$third, fourth=$fourth\n";
    }

    if ($inputexample =~ /w(.+)b(.+)cm(.+)sw(\S*)\s*$/ ) {
        print "input found\n";
        my $first = $1;
        my $second = $2;
        my $third = $3;
        my $fourth = $4;
        print "first=$first, second=$second, third=$third, fourth=$fourth\n";
    }

    Output is:

    template found
    first=<NM>, second=<NM>, third=<CH>, fourth=<SW>
    input found
    first=8, second=8, third=512, fourth=no

    I hope that helps!

    s&&VALKYRIE &&& print $_^q|!4 =+;' *|
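
A possible tightening of the hardcoded pattern, offered as a sketch rather than as wulvrine's code: it assumes the first three fields are always digits (the thread only guesses at this), anchors the start of the string, and takes the captures in list context instead of copying $1 through $4. `parse_record` is a hypothetical name.

```perl
use strict;
use warnings;

# Assumption: the w, b and cm fields hold digits only.
sub parse_record {
    my ($record) = @_;
    # In list context a successful match returns ($1, $2, $3, $4);
    # a failed match returns the empty list.
    return $record =~ /^w(\d+)b(\d+)cm(\d+)sw(\S*)\s*$/;
}

my @fields = parse_record('w8b8cm512swno');
print "first=$fields[0], second=$fields[1], third=$fields[2], fourth=$fields[3]\n";

# With \d+ the template string itself no longer matches,
# because '<NM>' is not a number.
my @none = parse_record('w<NM>b<NM>cm<CH>sw<SW>');
print "template not matched\n" unless @none;
```

Using \d+ instead of .+ also avoids surprises from greedy matching when a field name happens to occur more than once in a record.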
Re: text extraction question
by blue_cowdawg (Monsignor) on Dec 05, 2006 at 18:56 UTC
        is there any easy way to extract NM, NM, CH, SW in one array and 8 8 512 no in another array based on this template or any similar formats?

    Looks like a job for regexen. What have you tried? Also, I'm not 100% sure I understand what your input really looks like. Your $templateformat and $inputexample snippets don't do much to clarify things. Are you apt to see NM, CH, SW tokens in one line of input and numerics in another?

    If you'd provide a larger sample set (not much larger) of the input you're trying to parse, it would be easier to help you...


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg

      As I understand it,

      $templateformat is a list of fieldname<fieldtype> records. It defines the record format to which the data will adhere. As such, one shouldn't hardcode b, w, etc.

      The goal is to parse format strings such as $templateformat and use the info obtained to extract the values from records such as $inputexample.

      I had to take some guesses at what NM (number), CH (appears numerical??) and SW (switch) match, but that can easily be changed.

      My solution.
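
The reading of the template as a list of fieldname<fieldtype> records can be sketched on its own, separately from value extraction. This is an illustration under that reading only; `parse_template` is a hypothetical name, and the check for malformed templates goes beyond anything stated in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split a template into [fieldname, fieldtype] pairs.
sub parse_template {
    my ($template) = @_;
    my @fields;
    # \G makes each match start where the previous one ended; /c keeps
    # pos() in place when the final attempt fails.
    while ($template =~ /\G([^<>]+)<([^<>]+)>/gc) {
        push @fields, [ $1, $2 ];
    }
    # If pos() stopped short of the end, part of the template was not
    # a fieldname<fieldtype> record.
    die "Malformed template '$template'\n"
        if (pos($template) // 0) != length $template;
    return @fields;
}

for my $field (parse_template('w<NM>b<NM>cm<CH>sw<SW>')) {
    my ($name, $type) = @$field;
    print "$name is of type $type\n";
}
```

Keeping the pairs in an array rather than a hash preserves field order and copes with repeated types such as NM.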

Re: text extraction question
by throop (Chaplain) on Dec 05, 2006 at 21:19 UTC
    I'm going to assume that everything but the anglebrackets is reliably alphanumeric. I shoved $inputexample into $_ to unclutter the code.
    use strict;
    use warnings;

    my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
    $_ = 'w8b8cm512swno';
    my(@values, @names, $val);

    # In the example, @h will be ('w', 'NM', 'b', 'NM', 'cm', 'CH', 'sw', 'SW')
    my @h = $templateformat =~ /(\w+)<(\w+)>/g;
    for (my $ix=0; $ix + 1 < @h; $ix += 2){
        push(@names, $h[$ix +1]);
        # Match up to the next piece of template
        if($ix + 2 < @h){
            ($val, $_) = /$h[$ix](.+)($h[$ix +2].+)/ or die 'bad middle';
        }
        # or match to the end if there's no next piece
        else{
            ($val) = /$h[$ix](.+)/ or die 'bad end';
        }
        push(@values, $val);
    }
    local($,, $\ ) = ("\t", "\n");
    print(@names);
    print(@values);
    gives
    NM      NM      CH      SW
    8       8       512     no
    You might want to add more robust error-checking. I could have shoved the push into @values inside the if/else. I'd have saved creating the $val variable but I'd have duplicated the push.

    throop
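
One way the "more robust error-checking" might look is sketched below. It is a hypothetical variant, not throop's code: `extract_in_order` is an invented name, it walks the record with `substr` and `index` instead of regexes, and it assumes a value never contains the field name that follows it. Unlike the greedy `.+` above, it stops at the first occurrence of the next name.

```perl
use strict;
use warnings;

# Hypothetical variant of the piece-by-piece approach with explicit checks.
sub extract_in_order {
    my ($template, $record) = @_;
    my @names = $template =~ /(\w+)<\w+>/g;
    my @values;
    my $pos = 0;
    for my $i (0 .. $#names) {
        # The record must continue with the expected field name.
        substr($record, $pos, length $names[$i]) eq $names[$i]
            or die "Expected '$names[$i]' at offset $pos in '$record'\n";
        $pos += length $names[$i];
        # The value runs up to the next field name, or to the end of the
        # record; searching from $pos + 1 forces at least one character.
        my $end = $i < $#names ? index($record, $names[$i + 1], $pos + 1)
                               : length $record;
        die "Missing '$names[$i + 1]' after offset $pos in '$record'\n" if $end < 0;
        die "Empty value for '$names[$i]' in '$record'\n" if $end <= $pos;
        push @values, substr($record, $pos, $end - $pos);
        $pos = $end;
    }
    return @values;
}

print join("\t", extract_in_order('w<NM>b<NM>cm<CH>sw<SW>', 'w8b8cm512swno')), "\n";
```

Each failure names the field involved, which is usually more helpful than 'bad middle' when a long batch of records is being parsed.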

Re: text extraction question
by MaxKlokan (Monk) on Dec 06, 2006 at 08:23 UTC
    This should do the trick:
    my $templateformat = 'w<NM>b<NM>cm<CH>sw<SW>';
    my $inputexample = 'w8b8cm512swno';
    my @splittemplate = split /[^A-Z]+/,$templateformat;
    # my @splitinput = split /\D+/,$inputexample;    # original version, replaced by the line below
    my @splitinput = split /[^(\d)(no)]+/,$inputexample;
    Update:
    Include "no" as an element of the second array.
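
Two details of the split approach are worth checking before relying on it. Because both strings begin with a separator, `split` returns an empty string as the first element of each array. Also, `[^(\d)(no)]` is a character class, so the parentheses are literal and `no` stands for the individual characters n and o rather than the word. The sketch below, with the hypothetical helper `split_nonempty`, drops the empty fields and writes the class without the parentheses.

```perl
use strict;
use warnings;

# Hypothetical helper: split on a separator pattern and drop empty fields.
sub split_nonempty {
    my ($separator, $string) = @_;
    return grep { length } split $separator, $string;
}

my @types  = split_nonempty(qr/[^A-Z]+/,  'w<NM>b<NM>cm<CH>sw<SW>');
my @values = split_nonempty(qr/[^\dno]+/, 'w8b8cm512swno');

print "@types\n";
print "@values\n";
```

This still treats any n or o as part of a value, so it only works while no field name contains those letters.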
