RE: Probably silly regex question / not so silly regex question
by ferrency (Deacon) on Aug 01, 2000 at 23:15 UTC
|
As for your second question:
> Will a regex of the form .*c ever match differently
> from [^c]*c, and if not, why doesn't the regex parser
> optimize it to the latter form? What if the first was .*?c
The answer is Yes, those regexes absolutely match different things.
1. m/(.*c)/ && print $1;
2. m/([^c]*c)/ && print $1;
3. m/(.*?c)/ && print $1;
Take the string "abcabc". 1 will print "abcabc" because
.* is greedy and grabs as much as it can. 2 will print "abc"
because it matches anything except a c, followed by a c. 3 will
also print "abc" because .*? is non-greedy; it stops at the
first c it reaches instead of the last.
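The three results above can be checked by running each pattern against the same string. This is only a minimal sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $str = "abcabc";

# Run each pattern against the same string and capture what it matched.
my ($greedy)    = $str =~ /(.*c)/;      # pattern 1
my ($negated)   = $str =~ /([^c]*c)/;   # pattern 2
my ($nongreedy) = $str =~ /(.*?c)/;     # pattern 3

print "1: $greedy\n";      # abcabc
print "2: $negated\n";     # abc
print "3: $nongreedy\n";   # abc
```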
How they match is also different. 1 will match everything
until the end of the string, then backtrack, one character
at a time, until the last character in the match is a "c"
(a summarized explanation). 2 will only gobble up until
it finds a c, because the c won't satisfy [^c]; then it
matches the c. No backtracking. The non-greedy version
(3) will match zero characters, and then look for a c;
then match 1 character, and then look for a c; then
match 2 characters, and look for a c; and so on.
This difference in how the matching is done has a big
influence on performance. Since .* will
always gobble up the whole string, and then backtrack,
if you have a Really Long String which starts with "abc",
it'll take a Really Long Time (relatively) for pattern 1
to match, compared to pattern 2 or pattern 3.
As for your first question... I'm not entirely sure why it's not working. I'll try it out and update if I come up with anything.
Alan
Update: I'm not willing to say that Under All Circumstances, patterns 2 and 3 above will match the same thing. I can't think of a counterexample off the top of my head, but that doesn't mean one doesn't exist. I'm also not willing to say that one is faster than the other without benchmarking.
For all I Really Know, they could match the same thing, and perl could already be optimizing one into the other for the sake of performance. But, you're right, it does certainly look like they'll both always match the same thing. Not necessarily, if part of a larger regex, though.
|
|
> m/([^c]*c)/ will only gobble up until it finds a c, because the c won't
> satisfy [^c]; then it matches the c. No backtracking. The non-greedy version (m/(.*?c)/) will
> match zero characters, and then look for a c; then match 1 character, and then look
> for a c; then match 2 characters, and look for a c; and so on.
This seems to indicate that m/(.*?c)/ and
m/([^c]*c)/ will always match the same
strings, but the second will be far faster (in _all_ cases,
that I can see at least). So why isn't the first
optimized into the second?
I'm probably not seeing something here.
Thanks,
James Mastros,
Just Another Perl Initiate
|
|
m/(.*?c)/ will not match newline if the /s modifier is not used.
|
|
I don't think that '.' matches a "\n" by default.
One difference at least.
Re: Probably silly regex question / not so silly regex question
by tye (Sage) on Aug 02, 2000 at 00:17 UTC
|
Well, not all file names have extensions, so you might want one of these:
( $base, $ext )= $file =~ /^(.*?)([.][^.]*)?$/s;
$ext= ( $base= $file ) =~ s/([.][^.]*)$// ? $1 : "";
$base= substr( $ext= $file, 0,
0xffff & rindex($file,"."), "" );
/.*?c/s and /[^c]*c/ will always (I'm pretty sure) match the same thing (note the addition of "s" to the first one) as they stand. But you were using these as part of a larger regex, and there they don't always match the same thing. For example, /.*?cd/s and /[^c]*cd/ (adding a "d" to each one) don't always match the same things. On the string "abcabcd", the first will match "abcabcd" while the second will only match "abcd".
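The "abcabcd" case can be reproduced with a short script. This is a sketch only; capture groups are added here just to see what each pattern matched.

```perl
use strict;
use warnings;

my $str = "abcabcd";

# Wrap each pattern in a capture group to see the matched text.
my ($lazy)    = $str =~ /(.*?cd)/s;
my ($negated) = $str =~ /([^c]*cd)/;

print "lazy:    $lazy\n";      # abcabcd
print "negated: $negated\n";   # abcd
```

The lazy version starts at the first character and is allowed to cross the first c to reach "cd". The negated class can never cross a c, so that match has to start later in the string.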
|
|
Ah, thank you (for your second point); that finally clears
it up. (All the other answers didn't parse the question
as I meant it.)
As for the second part, all these files have a name
of the form [a-z]+-[0-9]+[a-z0-9-]*.jpg, so I just
used a simple split /\./. (They're all jpegs now, but
that might change; I wanted it to be at least a little more
generic.) Your point is, of course, well
taken, though there's probably a large problem space where
you do know that there is an extension.
Thanks also to all the other people that had solutions
to the $id.$ext problem as well; your answers were almost
all enlightening.
I still wonder which is more efficient. I'll probably
do some benchmarking at some later point.
Thanks,
James Mastros,
Just Another Perlmonks Neonate
The (split) answer
by infinityandbeyond (Sexton) on Aug 01, 2000 at 23:56 UTC
|
You suggested that split wouldn't work if there was a dot in the extension. It just might, in fact, still work using split...and adding a join.
@dissection = split /\./, $filename;
$ext = pop(@dissection);
$id = join(".",@dissection);
- Infinityandbeyond
|
|
While we are doing high level manipulations, you can also do:
($name,$ext) = reverse map {scalar reverse} (split /\./, (reverse $filename), 2);
Ciao,
Gryn
|
|
my( $name, $ext );
{
my @chunks = split/\./, $filename;
$ext = pop @chunks;
$name = join '.', @chunks;
}
If you don't like reverse for some reason.
Actually I really liked gryng's answer... I was just trying to think of another way to do it using split.
For them that like one-liners: my( $ext, $name )=(@_=split/\./, $filename and pop @_, join '.', @_);
I'm not good at regex's
by gryng (Hermit) on Aug 01, 2000 at 23:07 UTC
|
I'm not very good at regex's theorbtwo, so instead of saying something vague and probably wrong, I'm just going to offer up real quick the regex I'd use (and hope it works):
/(.*)\.([^\.]+)/
|
|
gryng wrote:
I'm not very good at regex's theorbtwo, so instead of saying
something vague and probably wrong, I'm just going to offer up real
quick the regex I'd use (and hope it works):
/(.*)\.([^\.]+)/
The funny thing is, I just learned this afternoon that you don't
need to escape the dot in a character class, so the backslash
above is unnecessary (though harmless), assuming that you meant
what you typed (obviously you did). A dot is just a dot inside
the brackets. (Camel 2nd Ed., page 71, in the middle of the code
on the page.)
Intrepid
|
|
I sold my Camels to afford thinkgeek.com's perl book set, and then it went on back order :( . So I'm Camel-less and can't use your nifty reference :( :( . I do believe you, though; I think my code still works, it's just that the escape wasn't needed. Thank you though :) .... :( but I'm still sad about my missing Camels :(
Gnight,
Gryn
|
|
Stupid mistake
by theorbtwo (Prior) on Aug 01, 2000 at 23:16 UTC
|
I just found out the "probably figure out on my own in due time" -- $filename wasn't getting assigned the value I thought it was.
I had written
$filename = (undef, undef, $images[$n]->{imagename}) = (File::Spec->splitpath( $_, 0 ));
Which causes $filename to always get the value 3. The (more) correct code is:
(undef, undef, $filename = $images[$n]->{imagename}) = (File::Spec->splitpath( $_, 0 ));
(There's probably a better way of doing it in both cases, but I didn't feel like delving too deep into the depths of File::Spec).
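The "always 3" result comes from a list assignment being evaluated in scalar context, where it yields the number of elements on its right-hand side. Here is a minimal sketch of that effect. It is a simplified variant, not the original statement, and the path and variable names are made up for illustration.

```perl
use strict;
use warnings;
use File::Spec;

my $path = '/ftp/dest/id.ext';   # hypothetical path, for illustration only

# A list assignment evaluated in scalar context yields the number of
# elements on its right-hand side; splitpath returns three of them
# (volume, directories, file).
my $count = ( my ( $volume, $dirs, $file ) = File::Spec->splitpath($path) );

print "count: $count\n";   # 3
print "file:  $file\n";    # id.ext
```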
Thanks,
James Mastros,
Just Another Perl Initiate
Re: Probably silly regex question / not so silly regex question
by Anonymous Monk on Aug 02, 2000 at 00:02 UTC
|
Lemme just try to provide something I think will work.
My feeble attempt: match backwards to the first dot on the right; that's your extension. The rest is the ID.
$filename =~ /(.*?)\.(.*?)$/;
$id = $1;
$ext = $2;
Does that work?
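One way to check is to try a filename with more than one dot. In this sketch (the sample filename is made up), the lazy first group stops at the first dot, so everything after it lands in the second group. A greedy first group with a dot-free class for the extension splits at the last dot instead.

```perl
use strict;
use warnings;

my $filename = "photo.2000.jpg";   # hypothetical filename with two dots

# The pattern from the post: (.*?) is lazy, so it stops at the FIRST dot.
my ( $id, $ext ) = $filename =~ /(.*?)\.(.*?)$/;
print "$id | $ext\n";      # photo | 2000.jpg

# Greedy prefix plus a dot-free extension splits at the LAST dot.
my ( $id2, $ext2 ) = $filename =~ /^(.*)\.([^.]*)$/;
print "$id2 | $ext2\n";    # photo.2000 | jpg
```

For names with exactly one dot, both patterns give the same answer.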
RE: Probably silly regex question / not so silly regex question
by BlaisePascal (Monk) on Aug 01, 2000 at 23:20 UTC
|
I'm gonna think on the several parts of "question 1", but "the other question" is easy...
The regex /.*c/ will match the longest prefix that ends in c. So if fed "acrobatic cockatoos can fly", it would match "acrobatic cockatoos c". The regex /[^c]*c/ would match the first prefix that ends in c, or "ac".
As far as /.*?c/ goes... I think that is roughly equivalent to /[^c]*c/, but I'm not positive.
Re: Probably silly regex question / not so silly regex question
by Anonymous Monk on Aug 02, 2000 at 07:12 UTC
|
Is it just me or isn't .*? a bit redundant? I mean * is satisfied matching 0 times or more, so why make it optional? Also, as far as I know filenames need to have something before the extension, so .+ might be more accurate. Just my take on it.
#Moo
|
|
.*? means match the smallest number -- a non-greedy match. Whereas .* means match the largest number -- a greedy match. Changing the * to a + only changes the minimum match amount (0 to 1). As a side note you can say .{0,} as the equivalent to * and {1,} as + (and, for example, .{3,5} would match 3 to 5 characters).
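The quantifier behaviour described above can be seen by comparing match lengths on a run of identical characters. This is just an illustrative sketch with a made-up test string.

```perl
use strict;
use warnings;

my $str = "aaaaaaa";   # seven a's

my ($greedy)  = $str =~ /(a*)/;       # as many as possible
my ($lazy)    = $str =~ /(a*?)/;      # as few as possible: zero
my ($plus)    = $str =~ /(a+?)/;      # minimum raised from 0 to 1
my ($braces)  = $str =~ /(a{0,})/;    # same as a*
my ($bounded) = $str =~ /(a{3,5})/;   # between 3 and 5, greedy

# Print the length of each match.
printf "%d %d %d %d %d\n",
    map { length $_ } $greedy, $lazy, $plus, $braces, $bounded;   # 7 0 1 7 5
```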
Gnight,
Gryn
Re: Probably silly regex question / not so silly regex question
by princepawn (Parson) on Aug 02, 2000 at 12:28 UTC
|
As I look at all 10 or so posts above this one, I am staggered to see that no one has mentioned File::Basename as an architecture-independent way to get either the basename or extension or path of any filename:
use File::Basename;
$basename = basename($filename);
Should work... but there is no Perl interpreter in a window for me here at Perl Monks (hint).
|
|
I was just going to write that...
I use File::Basename all the time for filename splitting. Hasn't failed me yet, and here's how you do your bit.
use strict;
use File::Basename;
my $file = '/ftp/dest/id.ext';
my ($name, $path, $suffix);
my @suffixes = ('.ext'); # Extra extensions can be added to the list.
($name, $path, $suffix) = fileparse($file, @suffixes);
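When the extension isn't known ahead of time, fileparse also accepts a regex as a suffix pattern. The sketch below assumes a Unix-style path, and the filename is made up; the output comments reflect that assumption.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

my $file = '/ftp/dest/id.2000.jpg';   # hypothetical path

# A regex suffix pattern matches whatever extension follows the last dot.
my ( $name, $path, $suffix ) = fileparse( $file, qr/\.[^.]*/ );

print "name:   $name\n";     # id.2000
print "path:   $path\n";     # /ftp/dest/
print "suffix: $suffix\n";   # .jpg
```

Note that the suffix comes back with its leading dot, and is an empty string when the filename has no dot at all.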
Coffee, KitKat, and a new script to write. It's gonna be a good day....
|
|
Thanks,
James Mastros,
Just Another Perl Initiate
RE: Probably silly regex question / not so silly regex question
by eLore (Hermit) on Aug 01, 2000 at 23:18 UTC
|
Re: Probably silly regex question / not so silly regex question
by lindex (Friar) on Aug 02, 2000 at 20:57 UTC
|
/^(.*?)\.(\w+)$/;
maybe?
lindex
/****************************/
jason@gost.net, wh@ckz.org
http://jason.gost.net
/*****************************/
|
|
As for the first question (posted over a year ago... whew), this is a pretty easy way to do it:
$file = $filename;
$ext = $filename;
$file =~ s/^.*[\\\/]//;             # strip everything up to the last \ or /
($ext) = $filename =~ /\.(\w+)$/;   # capture the extension, if there is one
That should give you the file's name in $file and its extension in, well, $ext. ;)
- Dave Baughman