pavan474 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I have a requirement to search for some URLs in files (XML files, text files, etc.) and replace them with their equivalents.

The URLs are like:

http://www.abc.com

Equivalent :: http://www.test.com

http://www.abc.com/test

http://www.abc.com/test2

Right now I put all these URLs and their equivalents in a hash, with the URL as the key and the equivalent URL as the value, then check each file for each URL and replace it.

The issue is that the files contain other URLs that also match the above, for example: http://www.abc.com/perl/test/

When I search and replace the URL http://www.abc.com, it also replaces part of that other one (i.e., http://www.abc.com/perl/test/), which shouldn't happen.

Can you guys please suggest a better approach for this?

Replies are listed 'Best First'.
Re: URL search and replace in the files (updated)
by haukex (Archbishop) on Dec 09, 2016 at 12:40 UTC

    Hi pavan474,

    It sounds to me like your regexp might be something like s/www\.abc\.com/www.test.com/g (please show your code!), which is indeed a bit too simple. Please see Re^5: Grab input from the user and Open the file, where I showed how to use Regexp::Common to match full URLs and URI to parse them, and in your case you could use that same module to exchange the hostname.

    Update: Upon rereading your post it sounds like you're not just exchanging hostnames in the URLs, but the same idea applies - use Regexp::Common to search for the full URLs and replace with whatever you like. Also, as I mention in the post I linked to, note that Regexp::Common does not match the #fragment part of URLs, so if you've got those you may need the alternate solution I presented.
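
    A minimal sketch of that idea, assuming Regexp::Common is installed from CPAN. The %equivalent hash and the replace_urls helper are illustrative names, and the targets for the two /test URLs are guesses, since the question only gives the equivalent for http://www.abc.com:

    ```perl
    use strict;
    use warnings;
    use Regexp::Common qw(URI);

    # Hypothetical mapping; only the first pair comes from the question.
    my %equivalent = (
        'http://www.abc.com'       => 'http://www.test.com',
        'http://www.abc.com/test'  => 'http://www.test.com/test',
        'http://www.abc.com/test2' => 'http://www.test.com/test2',
    );

    # Compile the Regexp::Common HTTP URL pattern once.
    my $url_re = qr/$RE{URI}{HTTP}/;

    sub replace_urls {
        my ($text) = @_;
        # Match each complete URL, then look the whole URL up in the hash.
        # URLs that are not keys of the hash are put back unchanged.
        $text =~ s/($url_re)/exists $equivalent{$1} ? $equivalent{$1} : $1/ge;
        return $text;
    }

    print replace_urls("see http://www.abc.com and http://www.abc.com/perl/test/ here\n");
    ```

    Because the lookup key is the complete matched URL, http://www.abc.com/perl/test/ is never treated as "http://www.abc.com plus something". URLs with a #fragment would need extra handling, as noted above.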

    Update 2: You mention XML files. Please note that a simple search/replace regex on an XML file may not be a good idea, as you could potentially modify parts of the file you don't want changed, or break its syntax. When dealing with XML/HTML, it's usually better to use an appropriate parser - see here and here for examples.
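
    As a rough illustration of the parser route, here is a sketch that assumes XML::LibXML is installed from CPAN. The %equivalent hash and the replace_in_xml helper are made-up names, and the sketch only handles the simple case where a text node or attribute value is exactly one of the URLs:

    ```perl
    use strict;
    use warnings;
    use XML::LibXML;

    # Hypothetical mapping; only the first pair comes from the question.
    my %equivalent = (
        'http://www.abc.com'      => 'http://www.test.com',
        'http://www.abc.com/test' => 'http://www.test.com/test',
    );

    sub replace_in_xml {
        my ($xml) = @_;
        my $doc = XML::LibXML->load_xml( string => $xml );
        # Select text nodes and attributes only, so element names, comments
        # and processing instructions are never touched.
        for my $node ( $doc->findnodes('//text() | //@*') ) {
            my $is_attr = $node->isa('XML::LibXML::Attr');
            my $value   = $is_attr ? $node->getValue : $node->data;
            next unless exists $equivalent{$value};
            if ($is_attr) { $node->setValue( $equivalent{$value} ) }
            else          { $node->setData( $equivalent{$value} ) }
        }
        return $doc->toString;
    }

    print replace_in_xml('<r><a href="http://www.abc.com/test">http://www.abc.com</a></r>');
    ```

    For URLs embedded in longer text, the exact-match lookup inside the loop could be swapped for a URL-aware substitution on $value; the point is only that the parser decides which parts of the file are eligible for change.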

    Hope this helps,
    -- Hauke D

    Minor Update: Clarified wordings.

Re: URL search and replace in the files
by FreeBeerReekingMonk (Deacon) on Dec 09, 2016 at 20:34 UTC
    Well, clearly, as you already write

    $lines =~ s{http://www.abc.com}{http://www.test.com} does not work because it is too generic

    options:

    1. You pause at each substitution and ask if it should be replaced. You cache the answer, so you only ask once per URL. Takes a while.

    2. You dump all found URLs into a single, sorted file, then peruse it. Find things that need to stay the same (blacklist), and things that should be changed (whitelist). For whatever falls in between, you use the $ans=<STDIN> trick to decide interactively.
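
    The first step of option 2 could look roughly like this. It is a sketch using only core Perl; the URL pattern is a deliberately loose assumption (scheme, then everything up to whitespace, a quote or an angle bracket), and collect_urls is an illustrative name:

    ```perl
    use strict;
    use warnings;

    # Assumed, loose URL pattern; tighten it to suit the real files.
    my $URL_RE = qr{https?://[^\s"'<>]+};

    # Count every URL occurrence in the given strings.
    sub collect_urls {
        my (@texts) = @_;
        my %count;
        for my $text (@texts) {
            $count{$1}++ while $text =~ /($URL_RE)/g;
        }
        return \%count;
    }

    # Slurp each file named on the command line.
    my @texts;
    for my $filename (@ARGV) {
        open my $fh, '<', $filename or die "Cannot open '$filename': $!";
        local $/;    # slurp mode, limited to this loop body
        push @texts, scalar <$fh>;
        close $fh;
    }

    # Print the unique URLs in sorted order, each with its occurrence count.
    my $count = collect_urls(@texts);
    printf "%6d  %s\n", $count->{$_}, $_ for sort keys %$count;
    ```

    Redirecting the output to a file gives the sorted list to peruse; the counts make it easier to spot which URLs are worth a whitelist or blacklist entry.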

    Sample code for 1:

        #!/usr/bin/perl
        my %YES;
        my %NO;
        $a = 'pat http://www.abc.com/test.gif ma http://www.abc.com/hello.html http://www.abc.com/test.gif ';
        # hand every URL found to ask(); /e evaluates the replacement as code
        $a =~ s{(http://[\w\.\-\?\&\;\#\/]+)}{&ask($1)}gexi;
        print $a, "\n";

        sub ask {
            my ($url) = @_;
            # index() returns -1 when the host is absent, so test for < 0
            return $url if index($url, 'www.abc.com') < 0;
            # add more "return $url if condition;" here (blacklist)
            if ($YES{$url}) {
                $url =~ s/www\.abc\.com/www.test.com/;
                return $url;
            } elsif ($NO{$url}) {
                return $url;
            } else {
                print "substitute $url ?";
                my $ans = <STDIN>;
                if ($ans =~ m/y/i) { ++$YES{$url}; } else { ++$NO{$url}; }
                # the answer is cached now, so this call returns immediately
                return ask($url);
            }
        }

    3. You already know what you will replace, and it does not match other things:

        use File::Slurp;
        use warnings;
        use strict;

        my %PATTERNS = (
            'http://www.abc.com/test\b'         => 'http://www.test.com/twist',
            'http://www.abc.com/(?:test[\d])\b' => 'http://www.test.com/',
        );
        # patterns to regexps (case-insensitive, like the substitution below)
        my @REGEXPS = map { qr/$_/i } keys %PATTERNS;

        # read from commandline
        die "usage: $0 <filenames> ...\n" unless @ARGV;
        for my $filename (@ARGV) {
            die "NOT A FILE! '$filename' " unless -f $filename;
            die "NOT READABLE! '$filename' " unless -r $filename;
            # read the whole file into a single string
            my $lines = read_file( $filename );
            my $changes = 0;
            for my $r (@REGEXPS) {
                if ($lines =~ $r) { $changes++; last; }
            }
            if ($changes == 0) {
                print "no changes for $filename\n";
                next;    # move on to the next file instead of exiting
            }
            # keep a backup of the original
            rename $filename, $filename . ".bak";
            my ($r, $s);
            for $r (keys %PATTERNS) {
                $s = $PATTERNS{$r};
                $lines =~ s/$r/$s/gei;
            }
            # write out a whole file
            write_file( $filename, $lines );
            print "Modified $filename\n";
        }
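
    If the hash keys are plain URLs rather than hand-written patterns, the same "match only what you mean" effect can be had with one generated regex. This is a sketch under assumptions: %equivalent and replace_exact are illustrative names, the two /test targets are guesses, and the set of "URL characters" in the lookahead is my own choice:

    ```perl
    use strict;
    use warnings;

    # Hypothetical mapping; only the first pair comes from the question.
    my %equivalent = (
        'http://www.abc.com'       => 'http://www.test.com',
        'http://www.abc.com/test'  => 'http://www.test.com/test',
        'http://www.abc.com/test2' => 'http://www.test.com/test2',
    );

    # Longest keys first so the alternation prefers the most specific URL;
    # quotemeta makes dots and slashes match literally.
    my $alternation = join '|',
        map { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %equivalent;

    # The lookahead rejects a match that is followed by more URL characters,
    # which is what leaves http://www.abc.com/perl/test/ untouched.
    my $exact_url = qr{($alternation)(?![\w/.?&;#=%-])};

    sub replace_exact {
        my ($text) = @_;
        $text =~ s/$exact_url/$equivalent{$1}/g;
        return $text;
    }

    print replace_exact("http://www.abc.com http://www.abc.com/perl/test/\n");
    ```

    The trade-off is that a known URL followed directly by a dot or a slash (end of a sentence, or a trailing "/") is also left alone, so the lookahead set may need adjusting for the real data.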

    4. You take the url, get the new page, if it exists, it needs to be renamed. (curl -I fetches only the headers, and not the content, there you search for the "200 OK")

        $result = `curl -I "$url"`;
        if ($result =~ m{HTTP/1.1 200 OK}) {
            # proceed to rename
        }
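
    The same check can be done without shelling out, which also avoids passing an untrusted $url to the shell. A sketch assuming HTTP::Tiny (in the Perl core since 5.14); url_exists is an illustrative name and the 10-second timeout is arbitrary:

    ```perl
    use strict;
    use warnings;
    use HTTP::Tiny;

    # Return true if a HEAD request for $url ends in a 2xx response.
    sub url_exists {
        my ($url) = @_;
        my $response = HTTP::Tiny->new( timeout => 10 )->head($url);
        return $response->{success} ? 1 : 0;
    }

    # Usage (needs network access):
    # print "proceed to rename\n" if url_exists('http://www.test.com/test');
    ```

    Note that HTTP::Tiny follows redirects by default, so this is slightly more lenient than matching the literal "200 OK" line in the curl output.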

    5. Lots more options, tired now.

Re: URL search and replace in the files
by 1nickt (Canon) on Dec 09, 2016 at 11:18 UTC

    Hi there pavan474,

    You will need to post a short sample of code that demonstrates your problem, along with the output it gives, then someone will be able to help you.

    See Posting on PerlMonks in the FAQ.


    The way forward always starts with a minimal test.