Re: Arbitrary multiple spaces as delimiter
by BUU (Prior) on Mar 15, 2004 at 08:20 UTC
Well, assuming the words in the second and third elements are separated by only one space, but the elements themselves are separated by two or more, it looks like split /\s\s+/ would do basically what you want. If that's not the case, you basically have no way that is guaranteed to work. If I *had* to solve that problem, I'd probably try to extract the first element and the last two elements, then guess a lot for the middle ones.
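A minimal sketch of that split, assuming the fields really are separated by two or more spaces. The spacing in the sample line is made up for illustration, since the real column widths are not shown here:

```perl
use strict;
use warnings;

# Hypothetical sample line: two or more spaces between fields,
# exactly one space between words inside a field.
my $line = "122    Genesis Chamber       Mark Tedin       A     U";

# Split wherever two or more whitespace characters occur in a row.
my @fields = split /\s\s+/, $line;

print join('|', @fields), "\n";    # prints 122|Genesis Chamber|Mark Tedin|A|U
```

A single space inside "Genesis Chamber" is never a split point, which is the whole trick.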
|
|
I decided to go with this solution. The columns are fixed width, most likely tabs converted into spaces, but since the file I'm parsing has no examples of two fields being separated by fewer than two spaces, I went with splitting. This also worked better because it retains all the funny characters that a regexp might miss, like 'Æ' and such.
A link to the complete file I'm parsing.
Just out of curiosity: say I have several thousand of these files to parse. Which would be faster: splitting, a regexp, or the unpack solution?
Was it ok to post my reply here? I'm not up on perlmonk posting etiquette. Moderators moderate.
|
|
Probably unpack. Only a benchmark will tell for sure, though. Even so, unless you have to do this so many times that it actually matters, you shouldn't care. Readability and maintainability come first; programmer time is much more expensive than computer time.
In your case I'd pick the unpack solution simply because it's the most clearly self-documenting. The split solution does not convey all the assumptions about your input, even though it works.
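If you do want numbers, a rough benchmark could look like the sketch below. It uses the core Benchmark module; the sample record, the regexp and the iteration count are assumptions picked for illustration, and the result will depend on your data and your perl:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical fixed-width record, built with the column widths
# from the unpack template posted elsewhere in this thread.
my $line = sprintf "%-8s%-27s%-26s%-7s%s",
    122, 'Genesis Chamber', 'Mark Tedin', 'A', 'U';

# Three ways of pulling the same five fields out of the record.
my %parser = (
    split  => sub { [ split /\s{2,}/, $line ] },
    regex  => sub { [ $line =~ /^(\S+)\s{2,}(.+?)\s{2,}(.+?)\s{2,}(\S)\s+(\S)$/ ] },
    unpack => sub { [ unpack "A8A27A26A7A", $line ] },
);

# Run each parser 200_000 times and print a comparison table.
cmpthese( 200_000, \%parser );
```

Run it against a few real lines from your files before drawing any conclusion.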
Makeshifts last the longest.
Re: Arbitrary multiple spaces as delimiter
by Corion (Patriarch) on Mar 15, 2004 at 08:20 UTC
A simple regex won't work, as it would have to know the difference between "two spaces between words" and "two spaces at the end of the first part". You could claim that "two or more spaces" delimit the items, but that will fall down as soon as one item has the maximum allowed length. You could use a limited-length match like the following:
# 122 Genesis Chamber Mark Tedin A U
$string =~ /^(\d+)\s{2,}(.{28})(.{26})(.)\s+(.)$/;
But that is a very tedious way of constructing and using a regular expression when unpack can do the same:
my ($num,$title,$author,$flag1,$flag2) = unpack "A8A27A26A7A",$str;
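To make the "maximum allowed length" problem concrete, here is a small sketch. The record is hypothetical: its title is 26 characters long, so only one space is left before the author inside the 27-column field. The column widths are the ones from the unpack template above:

```perl
use strict;
use warnings;

# Hypothetical record whose title almost fills its 27-column field.
my $str = sprintf "%-8s%-27s%-26s%-7s%s",
    122, 'My Long Hypothetical Title', 'Mark Tedin', 'A', 'U';

my @by_split  = split /\s{2,}/, $str;          # "two or more spaces" rule
my @by_unpack = unpack "A8A27A26A7A", $str;    # fixed column widths

print scalar(@by_split),  ": ", join('|', @by_split),  "\n";
print scalar(@by_unpack), ": ", join('|', @by_unpack), "\n";
# prints:
# 4: 122|My Long Hypothetical Title Mark Tedin|A|U
# 5: 122|My Long Hypothetical Title|Mark Tedin|A|U
```

The split version silently merges title and author, while unpack does not care how full a column is.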
|
|
my ($num, $title, $author, $flag1, $flag2) =
    unpack "A7 x1 A27 x1 A25 x1 A1 x5 A1", $str;
Actually I'd probably write a simple subroutine such that I could say
unpack_fields(
    $str,
    \my $num    => 7,
    padding     => 1,
    \my $title  => 27,
    padding     => 1,
    \my $author => 25,
    padding     => 1,
    \my $flag1  => 1,
    padding     => 5,
    \my $flag2  => 1,
);
This also clearly documents the data format.
Update: see Yet another unpack wrapper: flatfile databases with fixed width fields.
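For illustration only, here is a rough sketch of how such an unpack_fields could be written. This is not the code from the linked node; the body is an assumption that simply turns the (target => width) pairs into an unpack template, and the sample record is made up to fit the widths:

```perl
use strict;
use warnings;

# Hypothetical implementation: the spec is a list of (target => width) pairs,
# where target is either a scalar reference to fill or the string 'padding'.
sub unpack_fields {
    my ($str, @spec) = @_;
    my (@template, @targets);

    while ( my ($target, $width) = splice @spec, 0, 2 ) {
        if ( ref($target) eq 'SCALAR' ) {
            push @template, "A$width";    # space-padded text field
            push @targets,  $target;
        }
        else {
            push @template, "x$width";    # skip padding bytes
        }
    }

    my @values = unpack( join(' ', @template), $str );
    ${ $targets[$_] } = $values[$_] for 0 .. $#targets;
    return scalar @values;
}

# Made-up record laid out as 7/1/27/1/25/1/1/5/1 columns.
my $str = sprintf "%-7s %-27s %-25s %s     %s",
    122, 'Genesis Chamber', 'Mark Tedin', 'A', 'U';

unpack_fields(
    $str,
    \my $num    => 7,
    padding     => 1,
    \my $title  => 27,
    padding     => 1,
    \my $author => 25,
    padding     => 1,
    \my $flag1  => 1,
    padding     => 5,
    \my $flag2  => 1,
);

print "$num|$title|$author|$flag1|$flag2\n";    # prints 122|Genesis Chamber|Mark Tedin|A|U
```

Note that unpack dies on an "x" that runs past the end of the string, so short lines would need to be padded or checked first.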
Makeshifts last the longest.
Re: Arbitrary multiple spaces as delimiter
by mirod (Canon) on Mar 15, 2004 at 08:56 UTC
What do you mean exactly by "arbitrary"? If it means a constant number of spaces, known at run-time, then it's easy: just split on that number of spaces. If you mean random, unknown and different from line to line, then the answer is easy: with the information you give, you can't. In your example, there is no way to know where to split Genesis Chamber Mark Tedin into 2 strings.
Now all is not lost, you can try various heuristics to figure out what to do, but remember #11953: "Of course, this is a heuristic, which is a fancy way of saying that it doesn't work" (from MJD):
#!/usr/bin/perl -w
use strict;

while( <DATA>)
  { if( m{^\d+     # initial digits (1st field)
          \s+      # separating space(s)
          (.*?)    # the 2nd and 3rd fields
          \s+      # separating space(s)
          \S       # 1 (non-space) character (4th field)
          \s+      # separating space(s)
          \S       # 1 (non-space) character (5th field)
          \s*      # you might or might not want to allow extra spaces at the end of the line
         $}x
      )
      { my $fields= $1;
        my @fields;
        # @fields= heuristic1( $fields) || heuristic2( $fields);
        # does not work, as the || puts the first function call in scalar context
        @fields= heuristic1( $fields);
        unless( @fields) { @fields= heuristic2( $fields); }
        if( @fields)
          { print "field 2: '$fields[0]' - field 3: '$fields[1]'\n"; }
        else
          { warn "cannot extract field 2/3 of line $. Fields are: '$fields'\n"; }
      }
    else
      { warn "can't parse line $."; }
  }

# only 2 words, that's easy
sub heuristic1
  { my $fields= shift;
    my @fields= split /\s+/, $fields;
    if( @fields == 2) { return @fields }
    else              { return; }
  }

# more than one space separates the 2 fields
sub heuristic2
  { my $fields= shift;
    my @fields= split /\s\s+/, $fields;
    if( @fields == 2) { return @fields }
    else              { return; }
  }

__DATA__
122 Genesis Chamber Mark Tedin A U
123 f2w f3w 4 5
123 f2w1 f2w2 f3w 4 5
123 f2w1 f2w2 f3w1 f3w2 4 5
123 f2w f3w 4 99
123 f2w1 f2w2 f3w1 f3w2 4 5
Re: Arbitrary multiple spaces as delimiter
by davido (Cardinal) on Mar 15, 2004 at 08:24 UTC
If your "arbitrary number of spaces" is at least two, and your second element never contains two consecutive spaces, you're well on your way.
my $string = "122  Genesis Chamber  Mark Tedin  A  U  ";
my ( $second ) = $string =~ m/\s{2,}(.+?)\s{2,}/;
print "$second\n";
That works if, as I stated, you are sure that the second field is surrounded by at least two whitespace characters.
Update: I want to point out that I specifically avoided the intense temptation to use unpack for two reasons. First, though the thought crossed my mind that this might be a dataset with aligned columns (fixed-width fields), the OP didn't specify that to be the case, and so since I had to make some assumption about the data, I chose the two-or-more space delimiter assumption rather than the fixed-width field assumption. My second reason was that the OP asked to solve the problem with a regexp; unpack wasn't on the table.
However, if it turns out that we are dealing with fixed-width fields, the regexp solution is simply the wrong tool for the job, and unpack is the right tool. My advice to the OP is to use the unpack solution if the data is in a fixed-width field format, or to consider one of the regexp solutions provided if the data's format is non-fixed-width.
Re: Arbitrary multiple spaces as delimiter
by BrowserUk (Patriarch) on Mar 15, 2004 at 08:36 UTC
This will work *if* you can guarantee that each of your two variable-length fields will only contain single spaces.
my $s = '122     Genesis Chamber             Mark Tedin                 A      U';
print join '|', $s =~ m[^(\S+)\s+\b(.*)\b\s{2,}\b(.*)\b\s{2,}(\S)\s+(\S)$];
122|Genesis Chamber|Mark Tedin|A|U
m[^             # With the $, match the whole line
    (\S+)       # The first field contains no spaces
    \s+         # and is separated from the next by at least 1
    \b(.*)\b    # 2nd field starts/stops on a word boundary
                # it can contain anything,
    \s{2,}      # but only single consecutive spaces.
    \b(.*)\b    # 3rd field is similarly defined
    \s{2,}      # Again, 2 or more spaces define the end of field
    (\S)        # 4th: single non-space char
    \s+         # 1 or more spaces
    (\S)        # 5th: single char.
$]x
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: Arbitrary multiple spaces as delimiter
by Abigail-II (Bishop) on Mar 15, 2004 at 09:50 UTC
So, how do you determine the second element is "Genesis Chamber" and not "Genesis", with the third element being
"Chamber Mark Tedin"? Or perhaps the second element is
"Genesis Chamber Mark", and the third is "Tedin"?
Now, if elements are separated by at least two spaces, and
between words belonging to the same element there's just a
single space, you could split on /\s{2,}/. But if
the delimiter is "arbitrary spaces", then the problem is not
uniquely solvable.
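A small sketch of why it is not uniquely solvable: with only single spaces to go on, every word boundary is an equally valid place to cut. The variable names are just for illustration:

```perl
use strict;
use warnings;

# The ambiguous middle part of the line, as a list of words.
my @words = split ' ', 'Genesis Chamber Mark Tedin';

# Every position between two words is a candidate field boundary.
my @candidates;
for my $cut ( 1 .. $#words ) {
    my $second = join ' ', @words[ 0 .. $cut - 1 ];
    my $third  = join ' ', @words[ $cut .. $#words ];
    push @candidates, "'$second' / '$third'";
}

print "$_\n" for @candidates;
# prints:
# 'Genesis' / 'Chamber Mark Tedin'
# 'Genesis Chamber' / 'Mark Tedin'
# 'Genesis Chamber Mark' / 'Tedin'
```

Nothing in the text itself says which of the three is right.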
Abigail
Re: Arbitrary multiple spaces as delimiter
by guha (Priest) on Mar 15, 2004 at 08:20 UTC
How would you know where the boundary between the second and the third element is?
Or is it a fixed (positional) format?