aplonis has asked for the wisdom of the Perl Monks concerning the following question:

On Win32 you can search an entire directory for files containing text 'foo'.

Is there a Perl-ish way to do likewise on Unix?

I'd like to search and replace every email foo@there.com with bar@here.us for every file in /foo/bar/ (and all its subdirs) matching /.*html/

There are a lot of them.

Re: multi-file search-and-replace
by saskaqueer (Friar) on Jun 16, 2004 at 23:59 UTC

    update: tachyon makes a good point in his reply to this reply. If you think your data set may contain such a catch, make sure you follow his advice :)

    I really shouldn't have written the whole thing for you, but whatever, I'm in a good and helpful mood. This could most likely be done in less code using the 'find' command in combination with perl's in-place editing, but I don't know enough about such things. Note that I copped out by simply slurping each file into memory one at a time. You could rescript it to do line-by-line editing if you wish, though you'd have to add usage of a temporary file.

    #!perl -w
    use strict;
    use File::Find;

    find( \&handle_it, '/foo/bar' );

    sub handle_it {
        return unless m!\.html\z!;
        my $name = $File::Find::name;
        open( my $fh, '+<', $name )
            or (warn("open on $name failed: $!\n") and return);
        my $content = do { local $/; <$fh> };
        $content =~ s!foo\@there\.com!bar\@here\.us!g;
        seek($fh, 0, 0) or die("seek on $name failed: $!\n");
        truncate($fh, 0) or die("truncate on $name failed: $!\n");
        print $fh $content;
        close($fh) or die("close on $name failed: $!\n");
    }
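
    The line-by-line variant mentioned above could look roughly like the following. This is only a sketch, not a tested replacement for the script: the helper name replace_in_file is made up for illustration, and it assumes the temporary file can be created in the same directory as the original so that rename() stays on one filesystem.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: rewrite one file line by line through a temporary
# file in the same directory, then rename the temporary file over the
# original. Returns the number of lines that changed.
sub replace_in_file {
    my ($path, $old, $new) = @_;

    open(my $in, '<', $path) or die "open on $path failed: $!\n";
    my ($out, $tmp) = tempfile(DIR => dirname($path), UNLINK => 0);

    my $changed = 0;
    while (my $line = <$in>) {
        # \Q...\E quotes the dots in the old address
        $changed++ if $line =~ s/\Q$old\E/$new/g;
        print {$out} $line;
    }
    close($in);
    close($out) or die "close on $tmp failed: $!\n";

    if ($changed) {
        # tempfile() creates the file with restrictive permissions,
        # so copy the original mode before swapping the files
        my $mode = (stat $path)[2] & 07777;
        chmod($mode, $tmp);
        rename($tmp, $path) or die "rename $tmp to $path failed: $!\n";
    }
    else {
        unlink($tmp);
    }
    return $changed;
}

# Only walk a tree when a directory is given on the command line.
if (@ARGV) {
    find(
        {
            no_chdir => 1,    # $_ then holds the full path
            wanted   => sub {
                return unless -f $_ && m!\.html\z!;
                replace_in_file($_, 'foo@there.com', 'bar@here.us');
            },
        },
        @ARGV
    );
}
```

    Run it as "perl script.pl /foo/bar". Only one line is held in memory at a time, and the original file is left untouched until its replacement has been written completely.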

      One important thing to remember is that sweeping s/this/that/ functions can have unintended results. For example john.foo@there.com will become john.bar@here.com, which may not be the desired result. I highly recommend doing a SEARCH first, printing lines where matches are found (along with the intended changes), and only after an eyeball scan to make sure we are not going to get bitten do the REPLACE. You really need to identify what will change, because in big data sets you may be forgetting something rather critical. Worse, it may take weeks to discover the error, by which time other changes often make resorting to backup (you did make a BACKUP didn't you?) impractical. This advice is based on having been bitten before. Consider:

      while(<DATA>){
          my $orig = $_;
          if ( s/foo\@there\.com/bar\@here\.com/ ) {
              print "< $orig> $_\n";
          }
      }
      __DATA__
      foo@there.com
      john.foo@there.com
      foo@there.com.au
      <a href="mailto:foo@there.com">Mr Foo</a>

      Which generates:

      < foo@there.com
      > bar@here.com

      < john.foo@there.com
      > john.bar@here.com

      < foo@there.com.au
      > bar@here.com.au

      < <a href="mailto:foo@there.com">Mr Foo</a>
      > <a href="mailto:bar@here.com">Mr Foo</a>
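
      The same kind of preview can be pointed at the real files before anything is written. A rough sketch, assuming the tree to scan is passed on the command line; the helper name preview_file is invented for this example, and the addresses are the ones from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my ($old, $new) = ('foo@there.com', 'bar@here.us');

# Hypothetical helper: return one report per line that the plain
# substitution would change. Nothing is written back to disk.
sub preview_file {
    my ($path, $from, $to) = @_;
    my @report;

    open(my $fh, '<', $path) or do {
        warn "open on $path failed: $!\n";
        return;
    };
    while (my $line = <$fh>) {
        chomp $line;
        (my $after = $line) =~ s/\Q$from\E/$to/g;
        next if $after eq $line;
        # $. is the current line number of the file being read
        push @report, "$path:$.\n< $line\n> $after\n";
    }
    close($fh);
    return @report;
}

# Only walk a tree when a directory is given on the command line.
if (@ARGV) {
    find(
        {
            no_chdir => 1,    # $_ then holds the full path
            wanted   => sub {
                return unless -f $_ && m!\.html\z!;
                my @hits = preview_file($_, $old, $new);
                print @hits;
            },
        },
        @ARGV
    );
}
```

      Redirect the output to a file and read through it. Any line like the john.foo or .com.au cases will show up here with its file name and line number, before a single file has been touched.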

      Often it can be as simple as adding a boundary condition \b, but in this case it does not work (carefully chosen examples :-). I suggest this sort of approach:

      s{([a-zA-Z_\.]+\@[a-zA-Z\.\-]+)} {$1 eq $old ? $new : $1}ge

      In this approach we find email-address-like tokens and see if they are what we want to change. Note this *still* leaves you with the issue of Mr Foo now linking to presumably Mr Bar's address. This may or may not be an issue. The initial scan phase lets you *see* if it is an issue.
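
      To see the token approach at work on the sample lines, here is a small self-contained sketch. The sub name replace_token and the values given to $old and $new are assumptions made for the example; the character classes are the same deliberately narrow ones as in the substitution above (no digits, for instance), so treat it as an illustration rather than a general email matcher.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $old = 'foo@there.com';
my $new = 'bar@here.us';

# Hypothetical helper: swap in $to only where a whole email-like token
# is exactly equal to $from; every other token is put back unchanged.
sub replace_token {
    my ($text, $from, $to) = @_;
    $text =~ s{([a-zA-Z_.]+\@[a-zA-Z.\-]+)}{$1 eq $from ? $to : $1}ge;
    return $text;
}

my @samples = (
    'foo@there.com',
    'john.foo@there.com',
    'foo@there.com.au',
    '<a href="mailto:foo@there.com">Mr Foo</a>',
);

for my $line (@samples) {
    my $result = replace_token($line, $old, $new);
    print "< $line\n> $result\n\n";
}
```

      The first and last samples are rewritten; john.foo@there.com and foo@there.com.au come back unchanged because the whole token differs from $old. One limitation to keep in mind: an address followed directly by a full stop at the end of a sentence is captured with the stop included, so it will not compare equal and will be skipped.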

      As an aside, having email addresses on your web pages is like asking for spam. In this day and age it is worth considering an alternative approach. I would suggest linking to a mail form CGI. The user clicks on an email link like <a href="/cgi-bin/mailer?to=sales">Contact Sales</a> and your CGI provides the email form. This has several advantages: people can email you without needing a functional email client, it protects your email addresses, and it abstracts the link from sales to sales@mydomain.com into a single config location, removing the problem you currently have forever. NMS formmail is a reasonably secure canned solution for the back end. An alternative is to use JavaScript - there are several techniques.

      cheers

      tachyon

Re: multi-file search-and-replace
by pzbagel (Chaplain) on Jun 17, 2004 at 00:03 UTC

    The easy way:

    find . -name \*.html -type f | xargs perl -pi -e 's/foo\@there\.com/bar\@here.us/g'

    Later

Re: multi-file search-and-replace
by eXile (Priest) on Jun 17, 2004 at 00:19 UTC

    I hope you are aware of the consequences of using 'real' email-addresses on Internet accessible webpages and I hope the pages you are changing are on an intranet.

    The main consequence is that spammers can 'harvest' email addresses from Internet-accessible pages, so within a few weeks the email accounts you listed on these pages are plagued with spam. A simple fix for that is to do a s/foo\@there.com/bar \(at\) here.us/ so the email address will be harder to recognize for a spam-email-harvester and still be recognizable by humans.

      Yes, I am aware. Funny thing...on our family home page there are three emails which have been posted for more than two years. One of them, mine, gets lots of spam (which Thunderbird filters none too badly). The other two, my wife's and my son's, get hardly any.

      The difference? I actually send some email, they almost never do. I have to wonder if the emails are not being harvested for addresses en route over the internet. If I were a spammer, that's what I'd do...set me up a good-sized node on the web and just filter feed from email header packets as they passed by. You'd only ever get live ones that way. Always fresh. Bet someone's doing that, probably more than just a few someones. Probably not for themselves to use. Bet they sell them...

Re: multi-file search-and-replace
by phenom (Chaplain) on Jun 17, 2004 at 00:00 UTC
Re: multi-file search-and-replace
by aplonis (Pilgrim) on Jun 18, 2004 at 00:33 UTC

    Okay folks, thank you all very much.

    I'll meditate on these a spell then perhaps add some Tk on top. In due course I'll post it for all to enjoy.

    Thanks again,
    Gan