Speedfreak has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing an app that pulls data from a remote server, and initially I was using FTP to get a text file. However, the FTP service is too unreliable, but they have an HTTP service which seems to be more robust.

Trouble is, it's all been HTML'd, so the data I want is hidden away inside the markup.

Does anyone have any good examples of stripping HTML away to just leave the text inside the tags or replacing br and p tags with newlines?

Pointers, examples, or URLs to places that show how to handle HTML would be nice.

- Jed

Re: Parsing/Extracting Data from HTML.
by btrott (Parson) on Mar 22, 2000 at 22:46 UTC
    Don't know if this is exactly what you want, but you should be able to tweak it as necessary:
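    use HTML::TreeBuilder;
    use HTML::FormatText;
    use LWP::Simple;

    my $url  = "http://www.foo.com/";
    my $cols = 70;

    # Fetch the page; get() returns undef on failure.
    my $content = get($url);
    die "Couldn't fetch $url\n" unless defined $content;

    # Build a parse tree from the HTML.
    my $html = HTML::TreeBuilder->new;
    $html->parse($content);
    $html->eof;

    # Render the tree as plain text wrapped at $cols columns.
    my $f = HTML::FormatText->new(leftmargin => 0, rightmargin => $cols);
    print $f->format($html);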
Re: Parsing/Extracting Data from HTML.
by Anonymous Monk on Mar 23, 2000 at 06:52 UTC
    You could also try sending the HTML through the program 'Lynx', a text based Web browser which is installed on most Unix systems. Sometimes you have to write the data to a file and then have lynx open that file with the -dump option, i.e.
    $plaintext = `/path/to/lynx -dump $url`;
    
    I think this will work. Make sure to validate $url with a regex if it can be entered by an unknown user, because the backticks hand the whole string to the shell. Lynx does a really nice conversion from HTML to plain text.
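
One way the validation advice above might look, as a rough sketch rather than a vetted filter. The names is_safe_url and lynx_dump are made up for illustration, and the allowed character set is an assumption you would want to adjust for your own URLs. The list form of a piped open is used instead of backticks so that no shell ever sees the URL.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only plain http/https URLs built from a
# conservative character set (an assumption, not a full URL grammar).
sub is_safe_url {
    my ($url) = @_;
    return 0 unless defined $url;
    return $url =~ m{\A https?:// [A-Za-z0-9.-]+ (?: :\d+ )? (?: / [A-Za-z0-9._~%/?&=+-]* )? \z}x ? 1 : 0;
}

# Hypothetical helper: run "lynx -dump URL" without going through a shell.
# The lynx path is a placeholder, as in the post above.
sub lynx_dump {
    my ($url, $lynx) = @_;
    $lynx = '/path/to/lynx' unless defined $lynx;
    die "Refusing suspicious URL\n" unless is_safe_url($url);

    # List-form pipe open: arguments are passed to lynx directly.
    open(my $fh, '-|', $lynx, '-dump', $url)
        or die "Can't run $lynx: $!\n";
    local $/;                 # slurp all output at once
    my $text = <$fh>;
    close $fh;
    return $text;
}
```

Usage would be something like my $plaintext = lynx_dump($url); once the lynx path is filled in.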
Re: Parsing/Extracting Data from HTML.
by juahonen (Novice) on Mar 23, 2000 at 18:34 UTC
    Perl can convert HTML to text too...

    $htmltext =~ s/<(.*)>//g;

    ...will replace all tags with emptiness.

    If you wish to convert br's and p's to newlines before they are stripped, add:
    $htmltext =~ s/<(br|p)>/\n\n/ig;
    before the first command.

    Of course, you'll lose all formatting. This method is not guaranteed to strip comments properly.

      No, don't do that. It's too greedy:
      my $string = "<first><second>blahblah<third>\n";
      $string =~ s/<(.*)>//g;
      print $string;
      Result: (Hey, it's blank!)

      If you really want to do it this way, use: $string =~ s/<[^>]*?>//g; The question mark makes the asterisk non-greedy, so it stops at the first closing angle bracket instead of slurping up every character, angle brackets included, to the end of the line and then backtracking to pick up that last angle bracket. Of course, the negated character class already does the same job on its own, because [^>] can never match past a closing bracket. Just be more specific.
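
To make the difference concrete, here is a small sketch that runs both patterns on the same string and then combines the newline conversion with the tag stripping. The strip_tags name is invented for this example, and like any regex approach it assumes simple markup: comments, scripts, and a ">" inside an attribute value will still trip it up.

```perl
use strict;
use warnings;

my $string = "<first><second>blahblah<third>\n";

# Greedy: .* runs from the first "<" to the last ">" on the line.
(my $greedy = $string) =~ s/<(.*)>//g;

# Negated class: each match ends at the nearest ">".
(my $specific = $string) =~ s/<[^>]*>//g;

# Hypothetical helper: turn <br> and <p> into newlines, then strip the rest.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<br\b[^>]*>/\n/gi;       # <br>, <BR>, <br /> become one newline
    $html =~ s/<\/?p\b[^>]*>/\n\n/gi;   # <p> and </p> become a paragraph break
    $html =~ s/<[^>]*>//g;              # drop whatever tags remain
    return $html;
}

print "greedy:   [$greedy]\n";
print "specific: [$specific]\n";
print strip_tags("Line one<br>Line two<p>New <b>para</b></p>");
```

The \b after br and p keeps the patterns from matching longer tag names such as pre.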