Thalamus has asked for the wisdom of the Perl Monks concerning the following question:

Hi all ! First time posting ... so - please be gentle :)

I want to delete a section of text from a *.html file that contains the text below. I think I have to treat it as multi-line input, or whatever it is called.



<ADDRESS>
someone@some.domain.co.uk
</ADDRESS>
</BODY>
</HTML>

I feel I've tried everything ... but obviously not, since I haven't figured it out yet. I want to remove everything between the start and the end of the <ADDRESS> tag. If I try to remove only the <ADDRESS> it works, and the regular expression for the email also works on its own, but once I try to do both at the same time I fail miserably.

perl -i.bak -ne 'if(s!<ADDRESS>.(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})!!mgis) {next;} print;' index.html

Re: one line regular expression - help needed
by Corion (Patriarch) on Jul 29, 2010 at 08:18 UTC

    Hello and welcome!

    I think your "problem" is that <ADDRESS> and </ADDRESS> are on different lines, and -n goes through your file line by line.

    Conveniently, Perl can work with line-oriented stuff quite well:

    perl -i.bak -ne "print unless /<ADDRESS>/ .. m!</ADDRESS>!" index.html

    This approach only works if there is only one <ADDRESS> sequence and you want to remove that.

    Your regular expression does not seem to allow for (much) whitespace between <ADDRESS> and the email address starting. If you change the following dot to \s*, you will have more success in matching. You still need to slurp the whole input at once. I think -0777 will activate slurp mode.

      In case there are several ADDRESS-sections and you want to get rid of them all in one go you can do it like this:

      perl -i.bak -0777 -pe 's|<ADDRESS>.*?</ADDRESS>||gs' <your file>

      Thanks for the response guys. You saved my day.
Re: one line regular expression - help needed
by marto (Cardinal) on Jul 29, 2010 at 08:27 UTC

    Welcome to the Monastery. I'd advise against using a regex to manipulate HTML/XML; use one of the available parser modules instead. For example, read the HTML::TokeParser Tutorial.

      Thanks ... will have a look at it.