buc99 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Everyone-

I am trying to write a one-liner that uses regexp to extract certain text from html files. Basically what I have so far is this:

perl -e 's/.*?((<a\sname\s\ssometag.*?)<br><br><\sbody>)/$1/sgi' -p -i.bak 1.html

Basically, I want the program to search through the html files for a specific tag, then cut that tag and all of the text leading up to the <br><br></body> tags. The cut text should then replace all of the original text.

But when I run this command it replaces the original file with an exact copy of the original file instead of the text that it is supposed to cut. The re matches correctly, but it will not span the multiple lines of text even though I end with a /s. This same exact re works in other programming languages with a 'dotall' to span multiple lines, but I can't seem to get perl to do it the same way. I would really rather use a perl 'one-liner' instead of writing a large script that would do the same thing.

Any ideas on how to get this to work?

Thanks.
SA
:)

Replies are listed 'Best First'.
Re: 'one-liner' help
by dws (Chancellor) on Apr 16, 2003 at 17:20 UTC
    Any ideas on how to get this to work?

    Look carefully at what you're matching. Will the two <br> tags always be followed by a

      < body>
       ^
    tag? I suspect you meant to write '\/' instead of '\s' there.

    If that doesn't help, post a fragment of the HTML that you're trying to rewrite.

      Wouldn't <br><br><\sbody> match <br><br></body> since \s matches any non-whitespace? Granted it would also match any <?body> where ? is any non-whitespace character, but that should not happen often in an html file I write. So the question is: why does it not match the tag and text, cut them, and do the replacement?

      Here is the script:
      perl -pe 's#.*(<div class="Content.*)</div></div></body>#$1#sgi' -i.bak test.html

      Here is the test html file:
      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">
      <html lang="en">
      <head>
      <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
      <title>Test</title>
      <meta name="generator" content="BBEdit 6.5.2">
      </head>
      <body>
      <div class="Header">
      <img src="images/xlogo.gif" alt="" width="612" height="108" border="2" align="middle">
      </div>
      <div class="Navigation">
      <div class="navbox"><a class="nav" href="OSXTips2.html">Home</a>&nbsp;</div>
      <div class="navbox"><a class="nav">Tips:&nbsp;</a><br>
      &nbsp;<a class="nav2" href="bash_shell.html"> &#8226; Bash</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="beta_tools.html"> &#8226; Beta Tools</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="http://www.savagetranscendental.com/OSX.html.htm"> &#8226; Color LS</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="spl.html"> &#8226; Lost Password</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="ppp.html"> &#8226; PPP Setup</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="spl.html"> &#8226; Password Lock</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="TCSH.html"> &#8226; TCSH Setup</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="gimp.html"> &#8226; The Gimp</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="vnc.html"> &#8226; VNC &amp; Xfree86</a>&nbsp;<br>
      </div>
      <div class="navbox"><a class="nav" href="construction.html">Links:&nbsp;</a><br>
      &nbsp;<a class="nav2" href="http://www.darwinfo.org/"> &#8226; Darwinfo</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="downloads.html"> &#8226; Downloads</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="construction.html"> &#8226; Dev. Tools</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="http:/www.apple.com/macosx/"> &#8226; MacOSX</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="http://www.savagetranscendental.com/OSX.html"> &#8226; More OSX Tips</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="http://www.osxfaq.com/"> &#8226; OSX Faq</a>&nbsp;<br>
      </div>
      <div class="navbox"><a class="nav" href="construction.html">PDFs:&nbsp;</a><br>
      &nbsp;<a class="nav2" href="osxpdf.html"> &#8226; OSX</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="netpdf.html"> &#8226; Networking</a>&nbsp;<br>
      &nbsp;<a class="nav2" href="unixpdf.html"> &#8226; Unix Tips</a>&nbsp;<br>
      </div>
      </div>
      <div class="Content">
      11.03.02
      <p><img src="images/consbar.gif" width="464" height="41"></p>
      <p><b><font size="2">?</font></b>Still working on this page. If anyone has links for me to add, please email me at the address below.</p>
      <p>Thanx,</p>
      <p>SA</p>
      <p>11.03.02</p>
      <p><a href="http://member.bcentral.com/cgi-bin/fc/fastcounter-login?2164123"><img src="http://fastcounter.bcentral.com/fastcounter?2164123+4328253" alt="fastcounter" border="0" width="90" height="16"></a><font size="2"><br>
      </font><a href="http://www.bcentral.com/fastcounter/"><font face="Arial, helvetica" size="1">FastCounter by bCentral</font></a></p>
      <div class="box">
      <B>[<U> <a href="http://www.apple.com" title="Apple">Apple</a></U> ] [<U> <a href="http://www.apple.com/developer" title="Apple Developer">AppleDeveloper</a></U> ] [<U><a href="downloads.html" title="Downloads"> Downloads</a></U> ]</B>&nbsp;&nbsp;&nbsp;&nbsp;<img src="images/emailp.gif" alt="email" width="44" height="51">
      &nbsp;&nbsp;
      <span style="font-size: 14pt; ">Send all mail To:<span style="mso-spacerun: yes"></span></span><a href="mailto:t"> </a><BR>
      </div>
      </div>
      </body>
      </html>


      Thanks.
      SA
      :)
        Wouldn't <br><br><\sbody> match <br><br></body> since \s matches any non-whitespace?

        \s matches whitespace.
        \S matches non-whitespace.

        Better to escape the slash with a backslash, if necessary, to match it explicitly.

Re: 'one-liner' help
by BrowserUk (Patriarch) on Apr 16, 2003 at 22:38 UTC

    If I understand you correctly, the problem is that (by default) -p and -n cause the file(s) to be processed line-by-line. Therefore your regex is being applied line-by-line and can never match a pattern that stretches across more than one line as it never sees more than one line at any one time. Phew! What a mouthful:)

    To work around this problem, you would need to cause the file to be processed as a single long line using 'slurp mode'. Try adding -0777 to your command line. See perlrun for details.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
      Nope. Does not seem to work.

      would not the /s in the /sgi cause it to read multilines?

      Thanks.
      SA
      :)
        would not the /s in the /sgi cause it to read multilines?

        No. The modifiers change how the regex behaves, but the regex has nothing to do with how input is handled. If input is read a line at a time, /s won't help you.

(jeffa) Re: 'one-liner' help
by jeffa (Bishop) on Apr 17, 2003 at 02:37 UTC
    Well, as you can see from the responses so far (caveat: i could be corrected ...), this kind of HTML parsing is tough when your tool is a regex. Why not use a parser instead? I know it's not what you want, but here is some code that uses HTML::TokeParser::Simple to extract just the 'Content' <div> section. Isn't that what you are really trying to do - extract that <div> and everything it contains?
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $parser = HTML::TokeParser::Simple->new('file.html');
    my $print  = 0; # so we'll know when to start printing
    my $count  = 0; # need a 'stack' to keep track of div tags

    while (my $token = $parser->get_token()) {
        if ($token->is_start_tag('div')) {
            $print = 1 if $token->return_attr()->{class} eq 'Content';
            $count++;
        }
        print $token->as_is() if $print;
        if ($token->is_end_tag('div')) {
            $count--;
            last if $count == 0 and $print == 1;
        }
    }
    If you want to use this to modify some HTML files, i am afraid that you will have to save copies instead of doing in-place editing. I recommend saving the new files in a separate directory, then you can just move the lot up a level and clobber the originals. ;)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)