(Ovid) Re: Help with reg expressions
by Ovid (Cardinal) on May 03, 2001 at 20:27 UTC
|
You'll need to be more specific. First of all, you stated that the common name was in the form "(something)-disclose.gif". Do you really have parentheses in the common name? If so, you'll need to escape those in the regular expression. Further, what's allowed for the "something"? Is it all letters, are numbers allowed? Is punctuation allowed?
First advice: ignore what the Anonymous Monk said about using the dot star combination. It's slow and virtually guaranteed to break your code. See Death to Dot Star! for an explanation.
Here's a rough stab at your regular expression, assuming that the "something" cannot contain a dash:
# also assumes that the parentheses are not in the filename
/<img # match the '<img'
\s+ # must have one or more whitespace characters
src # match the 'src'
\s* # may have zero or more whitespace characters
= # match the =
\s* # may have zero or more whitespace characters
[^-]+ # one or more non-dash characters
-disclose\.gif # rest of .gif name
[^>]* # zero or more non-closing angle brackets
>/ix # closing angle bracket
Cheers,
Ovid
Update: To whomever the Anonymous Monk is who criticized my regex: sign on and join us! That was a great catch and we could use more programmers like you!
Join the Perlmonks Setiathome Group or just click on the link and check out our stats. | [reply] [d/l] |
|
>>To whomever the Anonymous Monk is who criticized my regex: sign on and join us! That was a great catch and we could use more programmers like you!
Thank you, I very much appreciate that. As you can see, I have now signed on.
The 15 year old, freshman programmer,
Stephen Rawls
| [reply] |
Re: Help with reg expressions
by chipmunk (Parson) on May 03, 2001 at 20:21 UTC
|
This will probably do what you want, but I'm making assumptions that your HTML doesn't contain unusual things like >s in the attribute values:
s/<img [^>]*src=[^>]+-disclose\.gif[^>]*>//ig;
You should be careful about using regular expressions to parse HTML, however. Using a proper parser, such as HTML::Parser, will make your script much more robust. | [reply] [d/l] |
Re: Help with reg expressions
by astanley (Beadle) on May 03, 2001 at 20:23 UTC
|
Well, if (something) is all the pattern you can provide for us, then really the only regex match I can think of is .*. However, based on your description I can't tell if you have parentheses in the filenames or not. I'll give you a regex for each situation.
if (/\<img src=\((.*)\)-disclose\.gif.*\>/i) { $_ = "" }
That will work if the filenames have parentheses. The following will work if they do not:
if (/\<img src=(.*)-disclose\.gif.*\>/i) { $_ = "" }
After the regular expression, the variable $1 will contain the name matched by (something). (i.e., print "$1-disclose.gif\n"; will give you the filename)
WARNING: the regexps are untested but should give you the idea! (In a regex, putting part of the pattern between () assigns the matched text to the variables $1, $2, $3...)
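To illustrate the $1 capture, here is a rough sketch. It assumes the src value is double-quoted and that the name contains no dashes, quotes, or parentheses; neither assumption was confirmed by the original question, and the sample tag and variable names are made up.

```perl
use strict;
use warnings;

# Hypothetical sample tag, for illustration only.
my $tag = '<img src="header-disclose.gif" alt="">';

my $name;
# The unescaped parentheses capture the "something" part into $1.
if ($tag =~ /<img\s+src\s*=\s*"([^-">]+)-disclose\.gif"[^>]*>/i) {
    $name = $1;
    print "$name-disclose.gif\n";
}
```

Copying $1 into a named variable right away is a defensive habit, since the next successful match will overwrite it.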
-Adam Stanley
Nethosters, Inc. | [reply] [d/l] [select] |
Re: Help with reg expressions
by Anonymous Monk on May 03, 2001 at 23:47 UTC
|
Ovid offered some good advice, except for one thing.
Please, DO NOT USE THIS LINE FROM OVID'S REGEX:
[^-]+
Take the following line, for example:
<img src="this.gif"><img src="something-disclose.gif">
What do you think his regex will match?
Think carefully: it matches the whole line. I recommend reading the book Mastering Regular Expressions by Jeffrey Friedl. In it, he stresses saying what you really mean. You should replace this line:
[^-]+
with this:
[^->]+
The reason is that [^-] means anything but a dash. The > sign isn't a dash, so his regex will match right past it, all the way to the first -, which could be in another image tag on the same line.
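The difference is easy to check on the sample line. This is only a sketch, assuming the name has no dash before -disclose.gif; the variable names are invented for the example.

```perl
use strict;
use warnings;

# Sample line from above: two tags, only the second one should match.
my $line = '<img src="this.gif"><img src="something-disclose.gif">';

# [^-]+ is free to run across the first tag's closing '>'.
my ($greedy) = $line =~ /(<img\s+src\s*=\s*[^-]+-disclose\.gif[^>]*>)/i;

# [^->]+ stops at '>', so the match has to stay inside a single tag.
my ($bounded) = $line =~ /(<img\s+src\s*=\s*[^->]+-disclose\.gif[^>]*>)/i;

print "first pattern matched:  $greedy\n";
print "second pattern matched: $bounded\n";
```

The first pattern captures the entire line, while the second captures only the second img tag.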
The 15 year old, freshman programmer,
Stephen Rawls | [reply] [d/l] [select] |
Re: Help with reg expressions
by DeusVult (Scribe) on May 04, 2001 at 01:23 UTC
|
Ok, folks, I think some people are getting a little paranoid with the questions. Now, nearly every post asked for a clarification of the "(something)-disclose.gif". Now, we all know what happens when we assume, but really...this is a filename! I don't know what the hell kind of crazy operating systems you people are using, but have you ever seen any that allowed parentheses in a filename? Methinks y'all are just getting a wee bit persnickety. In any case, I think we can assume that the (???) can be safely replaced by
/\<img src=[a-zA-Z0-9\.\_\-]...
Of course, anything can happen. One guy at work somehow ended up with a file named * on his Solaris machine. Yes, that's right, the file is named *. Of course, it's a useless file, but he sure as hell doesn't want to delete it...
And somehow another guy ended up with a file named backspace. No, not the word backspace, the name of the file is actually the single character backspace. He hasn't tried to get rid of that one either.
But if you have any files like that, I suggest you don't trust it to an automated script to get rid of.
If you have any trouble sounding condescending, find a Unix user to show you how it's done.
- Scott Adams | [reply] [d/l] |
|
I don't know what the hell kind of crazy operating systems you people are using, but have you ever seen any that allowed parentheses in a filename?
If you have any trouble sounding condescending, find a Unix user to show you how it's done.
- Scott Adams
Does that answer your question? ;)
Oh, and under MacOS, it's even easier to use filenames with parentheses in them, because you don't have to worry about escaping them.
| [reply] |
|
Agreed.
Just look for any tag containing disclose.gif and remove the entire tag. If you want to get more in depth, then write a regex that grabs anything between src= and disclose.gif.
Sheesh, really not that bad.
| [reply] |
Re: Help with reg expressions
by srawls (Friar) on May 04, 2001 at 00:31 UTC
|
Hi, I replied earlier with the ^-> negated character class. I just noticed your code at top, you might want to consider replacing this:
if (/\<img src=(???)-disclose.gif(.*)\>/i) {
$_ = "";
with this: (note: the improved regex is inserted below)
$_ = lc; #Much more efficient than having the regex do it
s/\<img\s*src\s*=[^->]-disclose.gif[^>]*>//g;
#more efficient to use a substitution
Please note that I used the lc function because having the regex perform case-insensitive searching adds a lot of overhead. I also changed your $_ = "" to a substitution, which is also more efficient. Also, if you had more than one img tag on a line before, the whole line would have been deleted if even one of the tags had the -disclose.gif text in it. I also changed the end of the regex. Instead of matching the rest of the string with
(.*)\>
I used the negated character class [^>]. Again, the .* is greedy, so it could have matched other img tags on the same line; for that matter, it could have matched anything on the same line.
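One side effect of lowercasing first is worth seeing before choosing between lc and /i: lc changes the case of everything on the line, not just the tag being matched. A small sketch, using a made-up sample line and the character class with a + quantifier:

```perl
use strict;
use warnings;

# Hypothetical mixed-case line, for illustration only.
my $line = '<IMG SRC="Logo-disclose.gif"><A HREF="Index.html">Home</A>';

# Option 1: lowercase a copy first, then substitute without /i.
(my $lowered = lc $line) =~ s/<img\s+src\s*=\s*[^->]+-disclose\.gif[^>]*>//g;

# Option 2: leave the text alone and let /i ignore case.
(my $kept = $line) =~ s/<img\s+src\s*=\s*[^->]+-disclose\.gif[^>]*>//gi;

print "$lowered\n";
print "$kept\n";
```

Both versions remove the tag, but only the /i version leaves the remaining HTML (including the Index.html link) in its original case. Whether any speed difference matters is something to measure on real data rather than assume.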
The 15 year old, freshman programmer
Stephen Rawls | [reply] [d/l] [select] |
|
Whoops, major typo. In this line I submitted:
s/\<img\s*src\s*=[^->]-disclose.gif[^>]*>//g;
You need to change this:
[^->]
to this:
[^->]+
Sorry about that.
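For what it's worth, here is a sketch of the corrected substitution applied to a few lines. The sample data and variable names are made up, and it uses \s+ after img and an escaped dot, which are small liberties beyond the line quoted above.

```perl
use strict;
use warnings;

# Hypothetical input lines, for illustration only.
my @lines = (
    '<img src="a.gif"><img src="x-disclose.gif"><img src="b.gif">',
    '<p>no images here</p>',
    '<img src="y-disclose.gif" width="10"><img src="z-disclose.gif">',
);

my $removed = 0;
for my $line (@lines) {
    # s///g returns the number of substitutions made, or a false value if none.
    $removed += ($line =~ s/<img\s+src\s*=\s*[^->]+-disclose\.gif[^>]*>//gi) || 0;
}

print "$_\n" for @lines;
print "removed $removed tag(s)\n";
```

Only the -disclose.gif tags are removed; the other img tags on the same line survive, which is the point of using a substitution instead of blanking the whole line.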
The 15 year old, freshman programmer
Stephen Rawls | [reply] [d/l] [select] |