apocalyptica has asked for the wisdom of the Perl Monks concerning the following question:

yes, this is the only type of question I ever ask.

Suppose I have a line that begins with nine whitespaces followed by a particular value, followed by 10 more whitespaces, then another value, so it looks like this:
         none          lt2dpmnt
(This is merely an example; this occurs many times throughout the data file, and I need to get each and every instance of it out so it can be put in a table.)

I want to get those two values and put them into variables. I have tried this in a variety of ways:
($meanH1, $meanH2) = ($1, $2) if /^s{9}/;
This didn't work, so I tried this:
($meanH1, $meanH2) = ($1, $2) if /^s+/;
This too did not work. Really, it seems almost laughably simple, and I am rather embarrassed by my inability to get it right. Little help?

Replies are listed 'Best First'.
Re: regular expressions query
by shemp (Deacon) on Jun 30, 2004 at 17:54 UTC
    In a regex, whitespace is \s
    An 's' matches the literal character 's'. So one way to do it would be:
    ($thing1, $thing2) = ($1, $2) if /^\s{9}(\S+)\s{10}(\S+)/;
    You need to include the things you're trying to capture, i.e. the (\S+)
    \S means anything except whitespace.

    BUT, this is much better suited to using split:
    ($thing1, $thing2) = split;
    Using split without any args is a special case that splits $_ on whitespace, as if you had written split ' ', so leading whitespace is skipped.
    You should look into how split works; I think your post the other day would have worked better with split also.
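
    To make the difference concrete, here is a minimal sketch comparing the pattern from the question with the corrected pattern and with split. The sample line is rebuilt from the example in the question, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Sample line from the question: 9 spaces, a value, 10 spaces, a value.
my $line = (' ' x 9) . 'none' . (' ' x 10) . 'lt2dpmnt';

# /^s{9}/ asks for nine literal 's' characters, so it cannot match here.
my $literal_s = $line =~ /^s{9}/ ? 1 : 0;

# \s is whitespace; the parentheses are what fill $1 and $2.
my ($re1, $re2);
($re1, $re2) = ($1, $2) if $line =~ /^\s{9}(\S+)\s{10}(\S+)/;

# split ' ' skips leading whitespace, then splits on runs of whitespace.
my ($sp1, $sp2) = split ' ', $line;

print "literal s matched: $literal_s\n";
print "regex: '$re1' '$re2'\n";
print "split: '$sp1' '$sp2'\n";
```

    Both the regex and the split version should leave 'none' and 'lt2dpmnt' in the two variables, while the unescaped 's' pattern never matches.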
Re: regular expressions query
by Anonymous Monk on Jun 30, 2004 at 17:54 UTC

    Let us assume that this string is in $_

    ($meanH1, $meanH2) = /^\s{9}(.*?)\s{10}(.*?)$/;
    # Or you may be able to generalize it a bit more with:
    ($meanH1, $meanH2) = /^\s*(\S+)\s+(\S+)\s*$/;
    # Or, if the second option is true, you could even use:
    ($meanH1, $meanH2) = split;
    # Which is a shorthand version for
    ($meanH1, $meanH2) = split ' ', $_;

    All of the above are rather basic examples of regex and are well documented in perlre (perldoc or perldoc.com)

    Ted
Re: regular expressions query
by hmerrill (Friar) on Jun 30, 2004 at 18:29 UTC
    Like most things in Perl, there are usually many different ways to accomplish the same thing. Others have given good suggestions using regular expressions, split, etc. But I don't think anyone has mentioned unpack yet.

    If your situation involves fixed length records where each field occupies the same columns on each record, then unpack will work for you.

    The Perl Cookbook p.297 has recipe 8.15 titled "Reading Fixed-Length Records" which describes using unpack:

    # $RECORDSIZE is the length of a record, in bytes.
    # $TEMPLATE is the unpack template for the record
    # FILE is the file to read from
    # @FIELDS is an array, one element per field
    until ( eof(FILE) ) {
        read(FILE, $record, $RECORDSIZE) == $RECORDSIZE
            or die "short read\n";
        @FIELDS = unpack($TEMPLATE, $record);
    }
    Now to relate that to your example (I'm on Windows XP):
    #!perl -w
    use strict;

    my $record = "         none          lt2dpmnt";
    print "\$record = [$record]\n";
    my @FIELDS = unpack('a9a4a10a8', $record);
    foreach (@FIELDS) {
        print "field=[$_]\n";
    }
    Produces this output:
    C:\DOCUME~1\hmerrill.000\TEST_P~1>test_unpack.pl
    $record = [         none          lt2dpmnt]
    field=[         ]
    field=[none]
    field=[          ]
    field=[lt2dpmnt]
    Again, this only works if you know that every record is the same length, and each field in the record occupies the same columns. See "perldoc -f pack" and "perldoc -f unpack" for more information.

    HTH.

      Greetings all,
      Just an FYI: you can use an 'x' in your unpack template to skip over the spaces (perldoc -f pack lists 'x' as 'A null byte'; in unpack it skips a byte forward), that is, unless you want the spaces.
      so
      my @FIELDS = unpack('a9a4a10a8', $record);

      becomes
      my @FIELDS = unpack('x9a4x10a8', $record);

      Given your example code above the output would be:
      $record = [         none          lt2dpmnt]
      field=[none]
      field=[lt2dpmnt]
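
      As a rough sketch of how the 'x' template could be applied to a whole set of records: the two sample records below are made up, and the column layout (9 blanks, 4 columns, 10 blanks, 8 columns) is assumed from the example in this thread. It uses 'A' rather than 'a', which also strips trailing spaces from a field that is shorter than its column width.

```perl
use strict;
use warnings;

# Hypothetical fixed-width records: 9 blanks, a 4-column field,
# 10 blanks, an 8-column field (layout assumed from the example above).
my @records = (
    sprintf('%9s%-4s%10s%-8s', '', 'none', '', 'lt2dpmnt'),
    sprintf('%9s%-4s%10s%-8s', '', 'abc',  '', 'xyz'),
);

my @table;
for my $record (@records) {
    # x9 and x10 skip the blank columns; A (unlike a) also strips
    # trailing spaces from each extracted field.
    my ($h1, $h2) = unpack('x9 A4 x10 A8', $record);
    push @table, [$h1, $h2];
}

print "[$_->[0]] [$_->[1]]\n" for @table;
```

      Note that unpack dies if an 'x' count would skip past the end of the string, so this only suits records that really are padded to the full width.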


      -injunjoel
      "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
Re: regular expressions query
by sweetblood (Prior) on Jun 30, 2004 at 17:55 UTC
    my ($meanH1, $meanH2) = split /\s+/

    check out perldoc -f split

    HTH

    Sweetblood

      This will not get rid of the leading whitespace on those lines (it returns a null field as the first field). But if you use ' ' instead it should work fine. That is:
      my ($meanH1,$meanH2) = split ' ';
      as per the documentation:

      A split on /\s+/ is like a split(' ') except that any leading whitespace produces a null first field.
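
      A small sketch of that difference, using a line rebuilt from the example in the question (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $line = (' ' x 9) . 'none' . (' ' x 10) . 'lt2dpmnt';

# With /\s+/ the leading whitespace yields an empty first field.
my @by_regex = split /\s+/, $line;

# With ' ' leading whitespace is skipped before splitting.
my @by_space = split ' ', $line;

print scalar(@by_regex), " fields: ", join('|', @by_regex), "\n";
print scalar(@by_space), " fields: ", join('|', @by_space), "\n";
```

      With the /\s+/ version, a two-variable assignment would put the empty string in the first variable and 'none' in the second.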

      -enlil

Re: regular expressions query
by Enlil (Parson) on Jun 30, 2004 at 18:00 UTC
    You might want to look over perlretut and perlre. In order to use the $1,$2,$3 ... variables you have to have a matching regular expression and you need capturing parens. Anyhow if all lines are in that format you can use:
    ($var1, $var2) = ($1,$2) if /(\S+)\s+(\S+)/;
    If the lines are not the same throughout the file and you need to be more specific:
    ($var1,$var2) = ($1,$2) if /^\s{9}(\S+)\s{10}(\S+)/;

    -enlil

Re: regular expressions query
by davido (Cardinal) on Jun 30, 2004 at 18:02 UTC
    my ( $meanH1, $meanH2 );
    ( $meanH1, $meanH2 ) = ( $1, $2 ) if $line =~ m/^\s{9}(\S+)\s{10}(\S+)/;

    You're correct to be checking the success of your matching. I don't like solutions that skip past this important step.

    The preceding example will look for (and skip past) the leading nine whitespaces. It will then capture all contiguous non-whitespace. It will then look for and skip past the next ten whitespaces. It will then capture all remaining contiguous non-whitespace. If there's anything else on the line (like a trailing newline) it will be ignored.
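
    Since the question mentions collecting every instance into a table, here is one way that check could look inside a loop. This is only a sketch: the sample lines are invented, and in a real script they would come from reading the data file line by line.

```perl
use strict;
use warnings;

# Hypothetical sample data; only the indented two-value lines should match.
my @lines = (
    "SOME HEADER TEXT\n",
    (' ' x 9) . 'none' . (' ' x 10) . "lt2dpmnt\n",
    "another unrelated line\n",
    (' ' x 9) . 'some' . (' ' x 10) . "bong\n",
);

my @rows;
for my $line (@lines) {
    # Only touch $1 and $2 when the match succeeded; otherwise they
    # would still hold values left over from an earlier match.
    next unless $line =~ m/^\s{9}(\S+)\s{10}(\S+)/;
    push @rows, [$1, $2];
}

print "$_->[0]\t$_->[1]\n" for @rows;
```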


    Dave

Re: regular expressions query
by apocalyptica (Acolyte) on Jun 30, 2004 at 20:15 UTC
    Hmmm... These are all excellent ideas, but none of them seem to be quite working for me. Another way I was thinking about doing this is to look at the end of the line before this one in the data file: each line before the one where I want to cull data from ends with the text "VALUES FOR". I tried this:
    ($meanH1, $meanH2) = ($1, $2) if VALUES FOR$\s+(\S+)\s+(\S+)/;
    But it doesn't seem to work. From my understanding, the \s+ should also match for newline feeds in addition to whitespace, correct? Any suggestions?
      I just tried the following:
      #!/usr/local/perl
      $test = " foo bar";
      ($var1, $var2) = ($1, $2) if ($test =~ /\s+(\S+)\s+(\S+)/);
      print "Var1: $var1\nVar2: $var2";
      exit;

      ...and it grabbed the text out and printed fine, so I'm not sure what you mean when you say none of the suggestions are working for you. Can you be more specific?

      If you're ever lost and need directions, ask the guy on the motorcycle.

      Post your code and a couple of lines of data. The examples given _do_ work. Perhaps we are missing a part of the problem?

      #!/usr/bin/perl
      while ( <DATA>) {
          ($a,$b) = split ' ';
          print "split '$a','$b'\n";
          my ($a,$b) = $_ =~ /(\S+)\s+(\S+)/;
          print "match '$a','$b'\n";
      }
      __DATA__
               none          bing
               some          bong
               any          bang

      output:

      split 'none','bing'
      match 'none','bing'
      split 'some','bong'
      match 'some','bong'
      split 'any','bang'
      match 'any','bang'

      If you are matching across newlines, are you not reading line by line? If the data is in one big string, perhaps you want something more like:

      #!/usr/bin/perl
      my $txt = '
               none          bing
               some          bong
               any          bang
      ';
      while ( $txt =~ /^ {9}(\S+) {10}(\S+)\s*$/mg ) {
          print "'$1' '$2'\n";
      }
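
      If you are reading line by line instead, a single match never sees two lines at once, so one option is to remember whether the previous line ended in "VALUES FOR". The following is only a sketch under that assumption; the sample lines and the $expect_values flag are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical data: each wanted line follows one ending in "VALUES FOR".
my @lines = (
    "MEAN VALUES FOR\n",
    (' ' x 9) . 'none' . (' ' x 10) . "lt2dpmnt\n",
    "   not   wanted\n",
    "OTHER VALUES FOR\n",
    (' ' x 9) . 'some' . (' ' x 10) . "bong\n",
);

my @pairs;
my $expect_values = 0;    # true when the previous line ended in "VALUES FOR"
for my $line (@lines) {
    if ($expect_values && $line =~ /^\s+(\S+)\s+(\S+)/) {
        push @pairs, [$1, $2];
    }
    # Remember for the next iteration whether this line is a marker line.
    $expect_values = $line =~ /VALUES FOR\s*$/ ? 1 : 0;
}

print "'$_->[0]' '$_->[1]'\n" for @pairs;
```

      The indented line that does not follow a marker line is skipped, so only the two wanted pairs end up in @pairs.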

      qq

Re: regular expressions query
by ercparker (Hermit) on Jun 30, 2004 at 22:48 UTC
    apocalyptica wrote: "But it doesn't seem to work. From my understanding, the \s+ should also match for newline feeds in addition to whitespace, correct? Any suggestions?"

    Regarding your question as to what \s will match: it matches whitespace, including spaces, tabs, carriage returns, newlines and form feeds.
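
    A quick sketch to illustrate, assuming both lines are held in one string (a plain line-by-line read hands the regex only one line at a time, so \s+ never gets to see the newline between them). The sample text is invented.

```perl
use strict;
use warnings;

# Hypothetical two-line chunk held in a single string.
my $text = "MEAN VALUES FOR\n" . (' ' x 9) . 'none' . (' ' x 10) . "lt2dpmnt\n";

# No $ after FOR: \s+ itself consumes the newline and the leading spaces.
my ($h1, $h2);
($h1, $h2) = ($1, $2) if $text =~ /VALUES FOR\s+(\S+)\s+(\S+)/;

# \s also matches tab, carriage return and form feed.
my $count = () = "a\tb\rc\fd\ne f" =~ /\s/g;

print "'$h1' '$h2'\n";
print "whitespace characters matched: $count\n";
```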
Re: regular expressions query
by rupesh (Hermit) on Jul 01, 2004 at 06:16 UTC