Weisshaupt has asked for the wisdom of the Perl Monks concerning the following question:

I have a string that looks like this:
122     Genesis Chamber             Mark Tedin                A     U
There are five elements, delimited by an arbitrary number of spaces. The first element is a series of digits. The second element consists of either one word or several words separated by spaces. The words may also contain non-word characters, like apostrophes, exclamation marks, etc. The same goes for the third element. The fourth and the fifth are always a single character long.
What I want to do here is extract the second element of the string, using a regexp. I can do it fine if the second element consists of just one whole word, no spaces. But I can't quite figure out how to do it with an arbitrary number of words.

Replies are listed 'Best First'.
Re: Abritrary multiple spaces as delimiter
by BUU (Prior) on Mar 15, 2004 at 08:20 UTC
    Well, assuming the words in the second and third elements are separated by only one space but the elements themselves are separated by two or more, it looks like split /\s\s+/ would do basically what you want. If that's not the case, you basically have no way that is guaranteed to work. If I *had* to solve that problem, I'd probably try to extract the first element and the last two elements, then guess a lot for the middle ones.
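A minimal sketch of the split suggested above, assuming fields are separated by at least two spaces and never contain two consecutive spaces themselves. The sample line is rebuilt with sprintf so the padding is explicit; the column widths (8, 28, 26, 6) are read off the sample in the question and are an assumption, not something the original poster stated.

```perl
use strict;
use warnings;

# Rebuild the sample line with explicit padding.
# The widths 8/28/26/6 are an assumption taken from the sample line.
my $line = sprintf '%-8s%-28s%-26s%-6s%s',
    122, 'Genesis Chamber', 'Mark Tedin', 'A', 'U';

# Runs of two or more whitespace characters separate the fields;
# single spaces inside a field are left alone.
my @fields = split /\s{2,}/, $line;

print "second element: $fields[1]\n";
print scalar(@fields), " fields\n";
```

If a line ever has only one space between two fields, this silently merges them, so checking that exactly five fields came back is a cheap safeguard.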
      I decided to go with this solution. The columns are fixed width, most likely tabs converted into spaces, but since the file I'm parsing has no examples of two fields being separated by fewer than two spaces, I went with splitting. Also, this worked better, since it retains all the funny characters that a regexp might miss, like 'Æ' and such.
      A link to the complete file I'm parsing.
      Just out of curiosity. Say I have several thousand of these files to parse, which would be faster, splitting, regexp or the unpack solution?

      Was it ok to post my reply here? I'm not up on perlmonk posting etiquette. Moderators moderate.

        Probably unpack. Only a benchmark will tell for sure, though. Even so, unless you have to do this so many times that it actually matters, you shouldn't care. Readability and maintainability come first; programmer time is much more expensive than computer time.

        In your case I'd pick the unpack solution simply because it's the most clearly self documenting. The split solution does not convey all the assumptions about your input, even though it works.

        Makeshifts last the longest.
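For anyone who wants to run the benchmark mentioned above, here is a rough sketch using the core Benchmark module. The three extractors, the iteration count, and the column widths are all assumptions made for illustration; results will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample line rebuilt with explicit padding; widths are an assumption
# read off the sample in the question.
my $line = sprintf '%-8s%-28s%-26s%-6s%s',
    122, 'Genesis Chamber', 'Mark Tedin', 'A', 'U';

# Three ways of pulling out the second field.
my %extract = (
    'split'  => sub { (split /\s{2,}/, $line)[1] },
    'regex'  => sub { ($line =~ /\s{2,}(.+?)\s{2,}/)[0] },
    'unpack' => sub { (unpack 'A7 x1 A27', $line)[1] },
);

# Sanity check: all three must agree before timing them.
for my $name (sort keys %extract) {
    my $title = $extract{$name}->();
    die "$name returned '$title'" unless $title eq 'Genesis Chamber';
}

# Run each extractor 500_000 times and print a comparison table.
cmpthese(500_000, \%extract);
```

This only times the extraction itself; with several thousand files, reading them from disk may well dominate.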

Re: Abritrary multiple spaces as delimiter
by Corion (Patriarch) on Mar 15, 2004 at 08:20 UTC

    A simple regex won't work, as it would have to know the difference between "two spaces between words" and "two spaces at the end of the first part". You could claim that "two or more spaces" delimit the items, but that will fall down as soon as one item has the maximum allowed length. Alternatively, you could use a limited-length match like the following:

    # 122     Genesis Chamber             Mark Tedin                A     U
    $string =~ /^(\d+)\s{2,}(.{28})(.{26})(.)\s+(.)$/;

    But that is a very tedious way of constructing and using a regular expression when unpack can do the same:

    my ($num,$title,$author,$flag1,$flag2) = unpack "A8A27A26A7A",$str;
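To make the "maximum allowed length" failure concrete, here is a small sketch with a made-up title that fills its whole column. The widths (8, 28, 26, 6, 1) are read off the sample line as it appears in the question and are an assumption; the real file may differ.

```perl
use strict;
use warnings;

# Column widths 8/28/26/6/1 are an assumption read off the sample line.
my $format = '%-8s%-28s%-26s%-6s%s';

# A made-up 28-character title that leaves no padding after it.
my $title = 'X' x 28;
my $line  = sprintf $format, 122, $title, 'Mark Tedin', 'A', 'U';

# split sees no gap between title and author and merges them ...
my @by_split = split /\s{2,}/, $line;

# ... while unpack cuts at fixed offsets regardless of content.
my @by_unpack = unpack 'A8 A28 A26 A6 A1', $line;

print scalar(@by_split),  " fields from split\n";
print scalar(@by_unpack), " fields from unpack\n";
```

The split version comes back one field short, with the title and author glued together, which is exactly the case a delimiter-based approach cannot detect on its own.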

      You're slurping the padding into the values.. do you really want to? Also, a few more spaces help make things pleasing to the eye:

      my ($num, $title, $author, $flag1, $flag2) = unpack "A7 x1 A27 x1 A25 x1 A1 x5 A1", $str;
      Actually I'd probably write a simple subroutine such that I could say
      unpack_fields(
          $str,
          \my $num    => 7,
          padding     => 1,
          \my $title  => 27,
          padding     => 1,
          \my $author => 25,
          padding     => 1,
          \my $flag1  => 1,
          padding     => 5,
          \my $flag2  => 1,
      );

      This also clearly documents the data format.

      Update: see Yet another unpack wrapper: flatfile databases with fixed width fields.

      Makeshifts last the longest.
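One possible shape for the unpack_fields call shown above, as a sketch only. The post does not give an implementation, so everything inside the subroutine is an assumption: it treats each scalar reference as a target for an "A" field and anything else (such as the word padding) as bytes to skip with "x".

```perl
use strict;
use warnings;

# Hypothetical helper, not the implementation from the linked node.
# Arguments after the string come in pairs: (\$target => width) stores a
# field, (padding => width) skips that many bytes.
sub unpack_fields {
    my ($str, @spec) = @_;
    my (@targets, @template);
    while (my ($what, $width) = splice @spec, 0, 2) {
        if (ref($what) eq 'SCALAR') {
            push @targets,  $what;
            push @template, "A$width";   # A strips trailing spaces
        }
        else {
            push @template, "x$width";   # x skips bytes
        }
    }
    my @values = unpack join(' ', @template), $str;
    ${ $targets[$_] } = $values[$_] for 0 .. $#targets;
    return scalar @values;
}

# Sample line rebuilt with explicit padding (widths are an assumption).
my $str = sprintf '%-8s%-28s%-26s%-6s%s',
    122, 'Genesis Chamber', 'Mark Tedin', 'A', 'U';

unpack_fields(
    $str,
    \my $num    => 7,  padding => 1,
    \my $title  => 27, padding => 1,
    \my $author => 25, padding => 1,
    \my $flag1  => 1,  padding => 5,
    \my $flag2  => 1,
);

print "$num: $title by $author ($flag1/$flag2)\n";
```

Note that unpack dies if the template runs past the end of the string, so short or truncated lines would need a length check first.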

Re: Abritrary multiple spaces as delimiter
by mirod (Canon) on Mar 15, 2004 at 08:56 UTC

    What do you mean exactly by "arbitrary"? If it means a constant, known at run-time, number of spaces, then it's easy, just do a split on that number of spaces. If you mean random, unknown and different from line to line, then the answer is easy: with the information you give, you can't. In your example, there is no way to know where to split Genesis Chamber Mark Tedin into 2 strings.

    Now all is not lost, you can try various heuristics to figure out what to do, but remember #11953 (from MJD): "Of course, this is a heuristic, which is a fancy way of saying that it doesn't work."

    #!/usr/bin/perl -w
    use strict;

    while( <DATA>) {
        if( m{^\d+     # initial digits (1rst field)
              \s+      # separating space(s)
              (.*?)    # the 2cd and 3rd field
              \s+      # separating space(s)
              \S       # 1 (non-space) character (4th field)
              \s+      # separating space(s)
              \S       # 1 (non-space) character (5th field)
              \s*      # you might or might not want to allow extra spaces at the end of the line
              $}x
          ) {
            my $fields= $1;
            my @fields;
            # @fields= heuristic1( $fields) || heuristic2( $fields);
            # does not work as the || seems to put the first function call in scalar mode, weird
            @fields= heuristic1( $fields);
            unless( @fields) { @fields= heuristic2( $fields); }
            if( @fields) {
                print "field 2: '$fields[0]' - field 3: '$fields[1]'\n";
            }
            else {
                warn "cannot extract field 2/3 of line $. Fields are: '$fields'\n";
            }
        }
        else {
            warn "can't parse line $.";
        }
    }

    # only 2 words, that's easy
    sub heuristic1 {
        my $fields= shift;
        my @fields= split /\s+/, $fields;
        if( @fields == 2) { return @fields } else { return; }
    }

    # more than one space separates the 2 fields
    sub heuristic2 {
        my $fields= shift;
        my @fields= split /\s\s+/, $fields;
        if( @fields == 2) { return @fields } else { return; }
    }

    __DATA__
    122     Genesis Chamber             Mark Tedin                A     U
    123 f2w f3w 4 5
    123 f2w1 f2w2 f3w 4 5
    123 f2w1 f2w2 f3w1 f3w2 4 5
    123 f2w f3w 4 99
    123 f2w1 f2w2 f3w1 f3w2 4 5

Re: Abritrary multiple spaces as delimiter
by davido (Cardinal) on Mar 15, 2004 at 08:24 UTC
    If your "arbitrary number of spaces" is at least two, and your second element never contains two consecutive spaces, you're well on your way.

    my $string = "122     Genesis Chamber             Mark Tedin                A     U ";
    my ( $second ) = $string =~ m/\s{2,}(.+?)\s{2,}/;
    print "$second\n";

    That works if, as I stated, you are sure that the second field is surrounded by at least two whitespace characters.

    Update: I want to point out that I specifically avoided the intense temptation to use unpack, for two reasons. First, though the thought crossed my mind that this might be a dataset with aligned columns (fixed-width fields), the OP didn't specify that to be the case. Since I had to make some assumption about the data, I chose the two-or-more-space delimiter assumption rather than the fixed-width field assumption. Second, the OP asked to solve the problem with a regexp; unpack wasn't on the table.

    However, if it turns out that we are dealing with fixed-width fields, the regexp solution is simply the wrong tool for the job, and unpack is the right tool. My advice to the OP is to use the unpack solution if the data is in a fixed-width field format, or to consider one of the regexp solutions provided if the data's format is non-fixed-width.


    Dave

Re: Abritrary multiple spaces as delimiter
by BrowserUk (Patriarch) on Mar 15, 2004 at 08:36 UTC

    This will work *if* you can guarantee that each of your two variable-length fields will only contain single spaces.

    my $s = '122     Genesis Chamber             Mark Tedin                A     U';
    print join '|', $s =~ m[^(\S+)\s+\b(.*)\b\s{2,}\b(.*)\b\s{2,}(\S)\s+(\S)$];

    122|Genesis Chamber|Mark Tedin|A|U

    m[^             # With the $, match the whole line
        (\S+)       # The first field contains no spaces
        \s+         # and is separated from the next by at least 1
        \b(.*)\b    # 2nd field starts/stops on a word boundary
                    # it can contain anything,
        \s{2,}      # but only single consecutive spaces.
        \b(.*)\b    # 3rd field is similarly defined
        \s{2,}      # Again, 2 or more spaces define the end of field
        (\S)        # 4th Single non-space char
        \s+         # 1 or more spaces
        (\S)        # 5th Single char.
    $]x

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: Abritrary multiple spaces as delimiter
by Abigail-II (Bishop) on Mar 15, 2004 at 09:50 UTC
    So, how do you determine the second element is "Genesis Chamber" and not "Genesis", with the third element being "Chamber Mark Tedin"? Or perhaps the second element is "Genesis Chamber Mark", and the third is "Tedin"?

    Now, if elements are separated by at least two spaces, and between words belonging to the same element there's just a single space, you could split on /\s{2,}/. But if the delimiter is "arbitrary spaces", then the problem is not uniquely solvable.

    Abigail
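The ambiguity described above can be shown by brute force. This sketch assumes the worst case, where the padding between the second and third elements has collapsed to a single space, and lists every way the four words could be divided into two non-empty fields.

```perl
use strict;
use warnings;

# The middle of the sample line with padding collapsed to single spaces
# (an assumed worst case, not the actual file contents).
my @words = split ' ', 'Genesis Chamber Mark Tedin';

# Every cut point between two words gives a valid (second, third) pair.
my @candidates;
for my $cut (1 .. $#words) {
    my $second = join ' ', @words[0 .. $cut - 1];
    my $third  = join ' ', @words[$cut .. $#words];
    push @candidates, [$second, $third];
}

printf "%-22s | %s\n", @$_ for @candidates;
```

All three candidates satisfy the format as described, so nothing in the string alone can pick the right one; only a wider delimiter or fixed columns settle it.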

Re: Abritrary multiple spaces as delimiter
by guha (Priest) on Mar 15, 2004 at 08:20 UTC

    How would you know where the boundary between the second and the third element is?

    Or is it a fixed (positional) format?