I'm trying to clean up my regexp so that it outputs only the URL I'm after from the text I am parsing. Currently it grabs the whole line containing my query.
$counter = 0;
$textfile = "rfl.txt";
open (TEXT, "$textfile") || die "Can't open $textfile";
@text = <TEXT>;
close(TEXT);
$glob = @text;

for ($i = 0; $i < $glob; $i++) {
    $_ = $text[$i];
    if (/eid=/) {
        $counter = 1 - $counter;
        print $_ if $counter;
    }
}
Example Text:

    'junk'
    website: http://www.jxxxxx.com/
    > Additional links on www.jxxxxx.com:
    > http://www.jxxxxx.com/
    > http://www.jxxxddd.com/
    > http://www.jtrsss.com/redir.asp?id=7446&email=murad@webtv.net
    > http://www.jaxost.com/unsubscribe.asp?eid=3701&customer=murrad@webtv.net
    > http://www.jaceee.com/redir.asp?id=7445&email=murad@webtv.net
    Fri, 20 Jul 2001
    'more junk'

Example Output:

    f=http://www.jxxxx.com/bp_unsubscribe.asp?eid=3711&customer=x>
    <A href=http://www.jadssss.com/unsubscribe.asp?eid=3716&customer=x>here</a>
    If you would rather not receive these messages, please unsubscribe at:<br><

Wished it was:

    http://www.jxdsl.com/bp_unsubscribe.asp?
    http://www.jxxxx.com/bp_unsubscribe.asp?
    http://www.jxrexx.com/bp_unsubscribe.asp?

Re: Cleaning Regexp
by hillard (Acolyte) on Jul 31, 2001 at 00:03 UTC
    I am kind of new to Perl, but I think you can do a split() on the line you have, using split('character'), where 'character' is whatever you want to split on. Then just loop over the new array searching for your string again. That should give you a matching string that has been stripped at whatever character you chose to split on.
      Oh, it would look like this I think...
      @new_array = split(' '); #inside the if block (splitting on a space)
      then... write a loop like your first one to find the entry in the new array that matches your pattern
        I am making myself look pretty silly here by not giving complete thoughts. If you just split on '=' in a line you have already matched, then the array element after 'eid' will be your data... here ends my dismal attempt at posting something useful...
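        A rough sketch of that split-on-'=' idea, using a hypothetical sample line (none of the names or URLs below are amearse's real data):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical line of the kind amearse is already matching on /eid=/.
        my $line = 'http://www.example.com/unsubscribe.asp?eid=3701&customer=someone@example.net';

        if ($line =~ /eid=/) {
            my @parts = split /=/, $line;
            # The element after the piece ending in 'eid' holds the data;
            # for this sample it prints "3701&customer".
            for my $i (0 .. $#parts - 1) {
                print "$parts[$i + 1]\n" if $parts[$i] =~ /eid$/;
            }
        }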
Re: Cleaning Regexp
by Agermain (Scribe) on Jul 31, 2001 at 01:20 UTC
    This is untested, but I think this should do the trick:
    for ($i = 0; $i < $glob; $i++) {
        $_ = $text[$i];
        if (/href="?([^">]*)\?eid=/is) {
            $counter = 1 - $counter;
            print $1 if $counter;
        }
    }
    Explanation:
    "?
    The quote mark is optional
    ([^">]*)
    The parens say to snag everything. The [^">]* quantifies 'everything' as being any character that isn't a quote or an end-bracket.
    \?eid=
    This is the key phrase
    I'd include more code, but it's prone to errors. The href="? marks the beginning of your search pattern, and the \?eid= marks the end. If the line does have an eid, everything captured inside the parens ends up in the special regex variable $1.
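    A quick way to see the capture in action; the line below is a made-up example in the spirit of amearse's data, not real output:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical sample line, similar to the ones in the example text.
    my $line = '<A href=http://www.example.com/bp_unsubscribe.asp?eid=3711&customer=x>here</a>';

    if ($line =~ /href="?([^">]*)\?eid=/is) {
        print "$1\n";   # prints: http://www.example.com/bp_unsubscribe.asp
    }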

    Update: I added the "is" options to the regex to allow uppercase URLs also. No biggie.

    Update 2: Forgot that CODE tags don't need entity substitution, so I had &gt; in one part instead of just >. I fixed it; sorry if I confused anyone.

    agermain
    "I don't want the world. I just want your half."

      Hrm. You might want to add \s and probably ? (at least, I think amearse doesn't want the query string included...) to your character class, since both of these would also delimit the end of the URL.

      In fact, you might just be better off using an affirmative class instead of a negative one, since the list of allowable characters is only [\w/:$-_.+!*'(),%@] (though I've often seen ~ unescaped, and there might be others that are commonly not escaped properly...) (and, again, this is only for the non-query-string part of the URL).

      Then again, there's a module on CPAN to do all of this (and more) already...
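
      For what it's worth, a rough, untested sketch of that affirmative-class idea; the character set is only an approximation of the list above (with '$' escaped and '-' moved to the end so the class doesn't form a range), and the sample line is made up:

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Hypothetical input line; the class below approximates the allowable
      # characters and would need tuning for real data.
      my $line = 'unsubscribe at <A href=http://www.example.com/bp_unsubscribe.asp?eid=3716&customer=x>here</a>';

      if ($line =~ m{href="?([\w/:\$_.+!*'(),%\@-]*)\?eid=}i) {
          print "$1\n";   # prints: http://www.example.com/bp_unsubscribe.asp
      }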

      bbfu
      Seasons don't fear The Reaper.
      Nor do the wind, the sun, and the rain.
      We can be like they are.

(bbfu) (URI::Find) Re: Cleaning Regexp
by bbfu (Curate) on Aug 01, 2001 at 22:15 UTC

    You might want to check out URI::Find. That way, you can use it to search your text for links, then grep the resulting list of valid URLs for the eid= part, printing out any that match.

    #!/usr/bin/perl
    use strict;
    use warnings;

    use URI::Find;

    our $textfile = "rfl.txt";
    our ($text, @urls, @eidurls, $finder);

    # Create the finder object with a callback to add
    # URLs to our list.  Note that URI::Find expects
    # the callback to return the (possibly modified)
    # URL to be reinserted into the text (hence the
    # trailing shift).
    $finder = URI::Find->new( sub { push @urls, $_[0]; shift } );

    # Read the file into $text, using "slurp mode"
    # (ie, the whole file is read in at once
    # instead of a line at a time).
    {
        open my $textfh, "< $textfile" or die "Can't open file: $!\n";
        local $/;            # slurp mode
        $text = <$textfh>;
        close $textfh;
    }

    $finder->find(\$text);            # Find 'em.  They'll be in @urls now.
    @eidurls = grep /eid=/, @urls;    # Find the ones we're interested in.

    print "The URLs are:\n";
    print " $_\n" for @eidurls[1..$#eidurls];   # skip the first one

    bbfu
    Seasons don't fear The Reaper.
    Nor do the wind, the sun, and the rain.
    We can be like they are.