Search and replace in html

by Anonymous Monk
on May 08, 2003 at 23:20 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to write a perl script which outputs the contents of the html tags to a separate file. Unfortunately, most of the time the closing </html> tag is on a separate line. I tried using the chomp command to remove all line returns, but my code still returns no results. Can anybody help?

while (<INPUT>) {
    my $TheLine = $_;
    chomp $TheLine;
    if ($TheLine =~ /<html>[^\n]*<\/html>/i) {
        $TheLine =~ s/<html>([^\n]*)<\/html>/$1/i;
        print OUTPUT "$1\n";
    }
}
close INPUT;
close OUTPUT;

Replies are listed 'Best First'.
Re: Search and replace in html
by tall_man (Parson) on May 08, 2003 at 23:48 UTC
Re: Search and replace in html
by Limbic~Region (Chancellor) on May 09, 2003 at 00:06 UTC
    Anonymous Monk,
    There is a plethora of modules on CPAN that could help you; I would suggest looking at the following search. Roll-your-own solutions with unknown data sources are likely to fail. With that said, let's assume your HTML is perfectly formatted and you want everything between the start and end HTML tags, including other HTML tags.
    #!/usr/bin/perl -w
    use strict;

    open (INPUT,"file") or die "Unable to open input : $!";
    open (OUTPUT,">output") or die "Unable to open output : $!";
    select OUTPUT;
    $\ = "\n";

    my $foundstart;
    while (<INPUT>) {
        chomp;
        next unless ($foundstart || /<html *>/i);
        if (/<html *>/i && ! $foundstart) {
            $_ =~ s/^.*?<html *>(.*)$/$1/i;
            $foundstart++;
            next unless($_);
        }
        if ($_ =~ m|</html *>|i) {
            $_ =~ s|^(.*?)</html *>.*$|$1|i;
            print if($_);
            last;
        }
        print;
    }
    close INPUT;
    close OUTPUT;

    Cheers - L~R

      Thanks guys, that has started to get me on my way. I have now found that the files contain multiple html references. Would it simply be a matter of changing the if to a while?
        Anonymous Monk,
        No - you should not try to roll your own unless you are 100% sure of your data. That is what I was trying to point out. Follow tall_man's advice or find a module you like using the search I provided.

        Cheers - L~R

Re: Search and replace in html
by kilinrax (Deacon) on May 08, 2003 at 23:32 UTC

    Sounds like you want to set the input record separator ('$/') to undef; then you'll pull the file in as one huge chunk rather than line by line.

    Try either of the following lines:

    undef $/;
    local $/;

    The first line sets '$/' to undef for the rest of the program; the second does so only for the enclosing scope (arguably better).
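
To make the scoping difference concrete, here is a minimal sketch. It assumes a small in-memory string in place of a real file, and the helper name count_records is made up for the illustration; it simply counts how many records a read loop sees under whatever '$/' is currently in effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory "file" so the sketch needs no external input.
my $text = "line one\nline two\nline three\n";

# Count how many records a read loop sees under the current value of $/.
sub count_records {
    my ($string) = @_;
    open(my $fh, '<', \$string) or die "Unable to open in-memory handle: $!";
    my $count = 0;
    while (defined(my $record = <$fh>)) {
        $count++;
    }
    close $fh;
    return $count;
}

my $inside;
{
    local $/;                              # undef only until this block ends
    $inside = count_records($text);        # whole string arrives as one record
}
my $after_local = count_records($text);    # $/ is "\n" again: three records

undef $/;                                  # stays undef from here on
my $after_undef = count_records($text);    # one record again

print "inside local block: $inside\n";
print "after local block:  $after_local\n";
print "after undef \$/:     $after_undef\n";
```

Run as written, this should report 1, 3 and 1: the local change disappears at the closing brace, while the plain undef keeps affecting every later read.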

    You might want to try reading perlvar to get a better idea what '$/' is, and maybe pick a better value to set it to.
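
As a rough sketch of the slurp approach, assuming well-formed input of the kind the original question describes: the whole input is read as one string, and a regex with the /s modifier lets the match span line breaks. The names slurp and extract_html_bodies and the sample data are hypothetical, and this is still a roll-your-own pattern match, not a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the text between each <html ...> / </html> pair in a string.
sub extract_html_bodies {
    my ($content) = @_;
    # /s lets . match newlines, /i ignores case, /g finds every pair;
    # .*? is non-greedy so each match stops at the nearest closing tag.
    return $content =~ m{<html\b[^>]*>(.*?)</html\s*>}gis;
}

# Read everything from a file handle in one go by undefining $/ locally.
sub slurp {
    my ($fh) = @_;
    local $/;
    return scalar <$fh>;
}

# Hypothetical sample input, held in memory instead of a real file.
my $sample = "junk\n<HTML>\n<body>one</body>\n</HTML>\nmore junk\n<html>two\n</html>\n";
open(my $in, '<', \$sample) or die "Unable to open sample: $!";
my @bodies = extract_html_bodies(slurp($in));
close $in;

print "[$_]\n" for @bodies;
```

Because the match runs in list context with /g, the same call also covers input that contains more than one html block; each captured body comes back as a separate list element.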

Re: Search and replace in html
by LameNerd (Hermit) on May 08, 2003 at 23:31 UTC
    Do you want something like this?
    #!/usr/bin/perl -w
    use strict;
    while(<DATA>) {
        next if /<HTML.*?>/gi;
        next if /<\/HTML.*?>/gi;
        print;
    }
    __DATA__
    <HTML>
    <HEAD><TITLE>Homepage</TITLE></HEAD>
    <BODY>
    <a href='blah.html'> man blah.pl</a><BR>
    <a href='blah.html'> man blablablah.sh </a><BR>
    <a href='blah.html'> man blablablablah.sh </a><BR>
    </BODY>
    </HTML>
    update ... or maybe ...
    #!/usr/bin/perl -w
    use strict;
    while(<DATA>) {
        s/<HTML.*?>//gi;
        s/<\/HTML.*?>//gi;
        print;
    }
    __DATA__
    <HTML><HEAD><TITLE>Homepage</TITLE></HEAD>
    <BODY>
    <a href='blah.html'> man blah.pl</a><BR>
    <a href='blah.html'> man blablablah.sh </a><BR>
    <a href='blah.html'> man blablablablah.sh </a><BR>
    </BODY>
    </HTML>
      LameNerd,
      Try your code with:
      __DATA__ asdfasdf asdfasdf asdfasdf<htMl >asdfasdf blah </htmlasdf> foo bar </html >asdfasdf asdfasdf
      I am not saying that Anonymous Monk should even be attempting to do this as a roll your own solution (go CPAN) - just thought I would point out a weakness or two.

      Cheers - L~R
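
One thing the test data above appears to probe is that a pattern like <\/HTML.*?> also matches </htmlasdf>, which is not an html tag. A minimal sketch of a slightly stricter substitution follows; the helper name strip_html_tags and the sample string are assumptions made for illustration, and this remains a regex workaround rather than a substitute for a CPAN parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove only genuine <html> / </html> tags from a string.
sub strip_html_tags {
    my ($text) = @_;
    # \b after "html" stops the pattern from matching <htmlasdf>;
    # [^>]* allows attributes or stray spaces before the closing >.
    $text =~ s{</?html\b[^>]*>}{}gi;
    return $text;
}

my $input = "asdfasdf<htMl >asdfasdf blah </htmlasdf> foo bar </html >asdfasdf";
print strip_html_tags($input), "\n";
```

With this input the real tags <htMl > and </html > are removed, while </htmlasdf> is left in place, so the output is "asdfasdfasdfasdf blah </htmlasdf> foo bar asdfasdf".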

        The output is ...
        asdfasdf asdfasdf asdfasdfasdfasdf blah foo bar asdfasdf asdfasdf
        What's wrong with that? It got rid of the html tags.
        I think that is all Anonymous Monk wanted to accomplish.
        That is also why I stated in my original post ...

        Do you want something like this?
