'one-liner' help

buc99 has asked for the wisdom of the Perl Monks concerning the following question:

Re: 'one-liner' help by dws (Chancellor) on Apr 16, 2003 at 17:20 UTC
"Any ideas on how to get this to work?" Look carefully at what you're matching. Will the two <br> tags always be followed by a `< body>` tag (note the whitespace after the `<`)? I suspect you meant to write '\/' instead of '\s' there. If that doesn't help, post a fragment of the HTML that you're trying to rewrite.
Re: Re: 'one-liner' help by buc99 (Initiate) on Apr 16, 2003 at 18:44 UTC
Wouldn't `<br><br><\sbody>` match `<br><br></body>` since `\s` matches any non-whitespace? Granted it would also match any `<?body>` where ? is any non-whitespace character, but that should not happen often in a html file I write. So the question is why it does not want to match some tag/text cut it and replace? Here is the script: `perl -pe 's#.(<div class="Content.)</div></div></body>#$1#sgi' -i.bak test.html` Here is the test html file: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd"> <html lang="en"> <head> <meta http-equiv="content-type" content="text/html; charset=iso-88 +59-1"> <title>Test</title> <meta name="generator" content="BBEdit 6.5.2"> </head> <body> <div class="Header"> <img src="images/xlogo.gif" alt="" width="612" height="108" border="2" + align="middle"> </div> <div class="Navigation"> <div class="navbox"><a class="nav" href="OSXTips2.html">Home</a>  +</div> <div class="navbox"><a class="nav">Tips: </a><br>  <a class="nav2" href="bash_shell.html"> • Bash</a> <b +r>  <a class="nav2" href="beta_tools.html"> • Beta Tools</a>&n +bsp;<br>  <a class="nav2" href="http://www.savagetranscendental.com/OSX.ht +ml.htm"> • Color LS</a> <br>  <a class="nav2" href="spl.html"> • Lost Password</a>  +<br>  <a class="nav2" href="ppp.html"> • PPP Setup</a> <br>  <a class="nav2" href="spl.html"> • Password Lock</a>  +<br>  <a class="nav2" href="TCSH.html"> • TCSH Setup</a> <b +r>  <a class="nav2" href="gimp.html"> • The Gimp</a> <br>  <a class="nav2" href="vnc.html"> • VNC & Xfree86</a>&n +bsp;<br> </div> <div class="navbox"><a class="nav" href="construction.html">Links:&nbs +p;</a><br>  <a class="nav2" href="http://www.darwinfo.org/"> • Darwinf +o</a> <br>  <a class="nav2" href="downloads.html"> • Downloads</a>&nbs +p;<br>  <a class="nav2" href="construction.html"> • Dev. 
Tools</a> + <br>  <a class="nav2" href="http:/www.apple.com/macosx/"> • Mac +OSX</a> <br>  <a class="nav2" href="http://www.savagetranscendental.com/OSX.ht +ml"> • More OSX Tips</a> <br>  <a class="nav2" href="http://www.osxfaq.com/"> • OSX Faq</ +a> <br> </div> <div class="navbox"><a class="nav" href="construction.html">PDFs:&nbsp +;</a><br>  <a class="nav2" href="osxpdf.html"> • OSX</a> <br>  <a class="nav2" href="netpdf.html"> • Networking</a>  +<br>  <a class="nav2" href="unixpdf.html"> • Unix Tips</a>  +<br> </div> </div> <div class="Content"> 11.03.02 <p><img src="images/consbar.gif" width="464" height="41"></p> <p><b><font size="2">?</font></b>Still working on this page. If anyone has links for me to add, please email me at the address bel +ow.</p> <p>Thanx,</p> <p>SA</p> <p>11.03.02</p> <p><a href="http://member.bcentral.com/cgi-bin/fc/fastcounter-login?21 +64123"><img src="http://fastcounter.bcentral.com/fastcounter?2164123+ +4328253" alt="fastcounter" border="0" width="90" height="16"></a><fon +t size="2"><br> </font><a href="http://www.bcentral.com/fastcounter/"><font face="Aria +l, helvetica" size="1">FastCounter by bCentral</font></a></p> <div class="box"> <B>[<U> <a href="http://www.apple.com" title="Apple">Apple</a></U> ] [ +<U> <a href="http://www.apple.com/developer" title="Apple Developer"> +AppleDeveloper</a></U> ] [<U><a href="downloads.html" title="Download +s"> Downloads</a></U> ]</B>    <img src="images/e +mailp.gif" alt="email" width="44" height="51">    <span style="font-size: 14pt; ">Send all mail To:<span style="mso-spac +erun: yes"></span></span><a href="mailto:t"> </a><BR> </div> </div> </body> </html> [download] Thanks. SA :)	[reply] [d/l] [select]
Re: Re: Re: 'one-liner' help by dws (Chancellor) on Apr 16, 2003 at 18:47 UTC
"Wouldn't `<br><br><\sbody>` match `<br><br></body>` since `\s` matches any non-whitespace?" No: `\s` matches whitespace; `\S` matches non-whitespace. Better to escape the slash (`\/`), if necessary, to match it explicitly.
Re: Re: Re: Re: 'one-liner' help by buc99 (Initiate) on Apr 16, 2003 at 22:16 UTC
Re^5: 'one-liner' help by dws (Chancellor) on Apr 16, 2003 at 22:25 UTC
Re: 'one-liner' help by BrowserUk (Patriarch) on Apr 16, 2003 at 22:38 UTC
If I understand you correctly, the problem is that (by default) -p and -n cause the file(s) to be processed line-by-line. Therefore your regex is being applied line-by-line and can never match a pattern that stretches across more than one line, as it never sees more than one line at any one time. Phew! What a mouthful :)

To work around this problem, you would need to cause the file to be processed as a single long line using 'slurp mode'. Try adding -0777 to your command line. See perlrun for details.

Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.
Re: Re: 'one-liner' help by buc99 (Initiate) on Apr 16, 2003 at 23:01 UTC
Nope. Does not seem to work. Would not the /s in the /sgi cause it to read multiple lines? Thanks. SA :)
Re: Re: Re: 'one-liner' help by dws (Chancellor) on Apr 16, 2003 at 23:03 UTC
"would not the /s in the /sgi cause it to read multilines?" No. The modifiers change how the regex behaves, but the regex has nothing to do with how input is handled. If input arrives a line at a time, /s won't help you.
(jeffa) Re: 'one-liner' help by jeffa (Bishop) on Apr 17, 2003 at 02:37 UTC
Well, as you can see from the responses so far (caveat: i could be corrected ...), this kind of HTML parsing is tough when your tool is a regex. Why not use a parser instead? I know it's not what you want, but here is some code that uses HTML::TokeParser::Simple to extract just the 'Content' `<div>` section. Isn't that what you are really trying to do: extract that `<div>` and everything it contains?

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new('file.html');

my $print = 0;   # so we'll know when to start printing
my $count = 0;   # need a 'stack' to keep track of div tags

while (my $token = $parser->get_token()) {
    if ($token->is_start_tag('div')) {
        # default to '' so a div without a class attribute doesn't trigger a warning
        my $class = $token->return_attr()->{class} || '';
        $print = 1 if $class eq 'Content';
        $count++;
    }

    print $token->as_is() if $print;

    if ($token->is_end_tag('div')) {
        $count--;
        # depth is back to zero: the 'Content' div has just been closed
        last if $count == 0 and $print == 1;
    }
}
```

If you want to use this to modify some HTML files, i am afraid that you will have to save copies instead of doing in-place editing. I recommend saving the new files in a separate directory, then you can just move the lot up a level and clobber the originals. ;)

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)