in reply to Re^3: Tokeparser Textify Command
in thread Tokeparser Textify Command

fair enough, i didn't articulate my problem thoroughly. here's an example of the HTML i'm reading into the $text variable:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">Steve,</SPAN></FONT></DIV>
<DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000"></SPAN></FONT>&nbsp;</DIV>
<DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">The picture was one that you pointed me at in the paper a couple of weeks ago. I don't have any pictures of mine yet.</SPAN></FONT></DIV>
<DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000"></SPAN></FONT>&nbsp;</DIV>
<DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">Tom</SPAN></FONT></DIV>
<BLOCKQUOTE>
<DIV align="left" class="OutlookMessageHeader" dir="ltr"><FONT face="Tahoma" size="2">-----Message-----<BR><B>From:</B> eat@joes.com [mailto:eat@joes.com]<BR><B>Sent:</B> Monday, June 10, 2005 3:50 PM<BR><B>To:</B> google.com<BR><B>Subject:</B> Re: [Test] another test<BR><BR></DIV></FONT><TT>Tom wrote:<BR>&gt;OK, I finally figured out that you can post online at the website or just<BR>&gt;send an e-mail.<BR><BR>Oh and the pic... it looks like it was shot during an<BR>earthquake.&nbsp; :-)<BR><BR>Steve<BR></TT><TT>To unsubscribe from this group, send an email to:<BR>listmod@google.com<BR><BR></TT><BR></BLOCKQUOTE>
<br><br>
</div>
</td></tr></table>

to do that, i use the following line in my script:

my $text = $stream->get_text ("/table");

this returns the following printed later in the script:

Steve, The picture was one that you pointed me at in the paper a couple of weeks ago. I don't have any pictures of mine yet. Tom -----Message-----From:eat@joes.com [mailto:eat@joes.com] Sent:Monday, June 10, 2005 3:50 PM To: google.com Subject: Re: [Test] another test Tom wrote: OK, I finally figured out that you can post online at the website or just send an e-mail. Oh and the pic... it looks like it was shot during an earthquake. :-) Steve To unsubscribe from this group, send an email to: listmod@xxxxx.com

all of the HTML is stripped by nature of the operation, and that's great. i'm looking to keep the BR tags, however, so when i reimport the data elsewhere, it retains the formatting of the original (notice how the text at the bottom is all smashed together with no line breaks).

so when i said "to no avail" earlier, i meant i was still getting the text all squashed together as above.

hope that made a little more sense.

Re^5: Tokeparser Textify Command
by PodMaster (Abbot) on Nov 10, 2005 at 09:09 UTC
    If that's the output you're getting, you're not using textify. Aristotle confused $text with $stream from your code, but you shouldn't.
    use strict;
    use warnings;

    my $html =<<'__BOB__';
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">Steve,</SPAN></FONT></DIV>
    <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000"></SPAN></FONT>&nbsp;</DIV>
    <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">The picture was one that you pointed me at in the paper a couple of weeks ago. I don't have any pictures of mine yet.</SPAN></FONT></DIV>
    <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000"></SPAN></FONT>&nbsp;</DIV>
    <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">Tom</SPAN></FONT></DIV>
    <BLOCKQUOTE>
    <DIV align="left" class="OutlookMessageHeader" dir="ltr"><FONT face="Tahoma" size="2">-----Message-----<BR><B>From:</B> eat@joes.com [mailto:eat@joes.com]<BR><B>Sent:</B> Monday, June 10, 2005 3:50 PM<BR><B>To:</B> google.com<BR><B>Subject:</B> Re: [Test] another test<BR><BR></DIV></FONT><TT>Tom wrote:<BR>&gt;OK, I finally figured out that you can post online at the website or just<BR>&gt;send an e-mail.<BR><BR>Oh and the pic... it looks like it was shot during an<BR>earthquake.&nbsp; :-)<BR><BR>Steve<BR></TT><TT>To unsubscribe from this group, send an email to:<BR>listmod@google.com<BR><BR></TT><BR></BLOCKQUOTE>
    <br><br>
    </div>
    </td></tr></table>
    __BOB__

    use HTML::TokeParser;

    {
        my $stream = HTML::TokeParser->new( \$html );
        $stream->{textify} = { br => '' };
        my $text = $stream->get_text ("/table");
        warn $text;
    }

    {
        my $stream = HTML::TokeParser->new( \$html );
        $stream->{textify} = {
            br => sub {
                my $t = \@_;
                if( $t->[0] eq 'S' and $t->[1] eq 'br') {
                    return '<br>';
                }
                return;
            }
        };
        my $text = $stream->get_text ("/table");
        warn $text;
    }
    __END__

    Steve,   The picture was one that you pointed me at in the paper a couple of weeks ago. I don't have any pictures of mine yet.   Tom -----Message-----[BR]From: eat@joes.com [mailto:eat@joes.com][BR]Sent: Monday, June 10, 2005 3:50 PM[BR]To: google.com[BR]Subject: Re: [Test] another test[BR][BR] Tom wrote:[BR]>OK, I finally figured out that you can post online at the website or just[BR]>send an e-mail.[BR][BR]Oh and the pic... it looks like it was shot during an[BR]earthquake.  :-)[BR][BR]Steve[BR]To unsubscribe from this group, send an email to:[BR]listmod@google.com[BR][BR][BR] [BR][BR] at html.tokeparser.textify.pl line 27.

    Steve,   The picture was one that you pointed me at in the paper a couple of weeks ago. I don't have any pictures of mine yet.   Tom -----Message-----<br>From: eat@joes.com [mailto:eat@joes.com]<br>Sent: Monday, June 10, 2005 3:50 PM<br>To: google.com<br>Subject: Re: [Test] another test<br><br> Tom wrote:<br>>OK, I finally figured out that you can post online at the website or just<br>>send an e-mail.<br><br>Oh and the pic... it looks like it was shot during an<br>earthquake.  :-)<br><br>Steve<br>To unsubscribe from this group, send an email to:<br>listmod@google.com<br><br><br> <br><br> at html.tokeparser.textify.pl line 46.
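
As far as I can tell from the module source, the callback stored under the br key in textify is only consulted for br start tags, so the type and name checks inside the callback are probably redundant. Below is a minimal sketch of that idea, assuming HTML::TokeParser is installed. The helper name text_keeping_br and its $marker parameter are made up for illustration and are not part of the module.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of HTML::TokeParser): return the text up to
# $stop_tag, emitting $marker wherever the source has a <br> start tag.
sub text_keeping_br {
    my ($html, $stop_tag, $marker) = @_;
    $marker = '<br>' unless defined $marker;

    my $stream = HTML::TokeParser->new( \$html )
        or die "could not create parser";

    # Assumption: this callback only runs for 'br' start tags, so it can
    # return the marker unconditionally.
    $stream->{textify} = { br => sub { return $marker } };

    return $stream->get_text($stop_tag);
}

my $sample = '<table><tr><td><tt>Tom wrote:<BR>&gt;OK<BR><BR>Steve</tt>'
           . '</td></tr></table>';

# Keep literal <br> tags in the extracted text.
print text_keeping_br($sample, "/table"), "\n";

# Or turn each <br> into a real newline instead.
print text_keeping_br($sample, "/table", "\n"), "\n";
```

Passing "\n" as the marker gives real line breaks rather than literal tags, which may be easier to work with if the place you reimport the text into is not HTML.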
