
wouldn't $text->{textify} = { br => '' }; swap br tags with a blank space? i've tried that as well as a few variations (like putting br in the single quotes or just having br alone in the curly brackets), all to no avail.

Re^3: Tokeparser Textify Command
by Aristotle (Chancellor) on Nov 10, 2005 at 06:15 UTC

    The documentation you just quoted clearly states that the value of the key specifies which attribute the replacement text should be taken from (so by default, it replaces <img> tags by the text in their alt attribute). Since you want no replacement text, an empty string should be the appropriate choice. (Not that it matters, since <br> has no attributes to pick replacement text out of.)
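To make the attribute lookup concrete, here is a minimal sketch. The sample markup and the `extract` helper are made up for illustration; only `new`, the `textify` hash and `get_text` come from HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup invented for this illustration.
my $html = '<p>Logo: <img src="logo.png" alt="ACME" title="ACME Corp"> end</p>';

# extract() is a hypothetical helper: parse $html, optionally replace the
# textify map, and return the text up to the closing </p>.
sub extract {
    my ($textify) = @_;
    my $stream = HTML::TokeParser->new( \$html ) or die "cannot parse";
    $stream->{textify} = $textify if defined $textify;
    return $stream->get_text('/p');
}

my $by_alt   = extract();                      # default map: img => 'alt'
my $by_title = extract( { img => 'title' } );  # take the text from title="" instead

print "$by_alt\n";
print "$by_title\n";
```

Note that assigning a new hash replaces the default map wholesale, so any tag you still want textified has to be listed again.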

    You could be more specific about “no avail” – what is happening and how does it contradict your expectations?


      fair enough, i didn't articulate my problem thoroughly. here's an example of the HTML i'm reading into the $text variable:

      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
      <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">Steve,</SPAN></FONT></DIV>
      <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000"></SPAN></FONT>&nbsp;</DIV>
      <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">The picture was one that you pointed me at in the paper a couple of weeks ago. I don't have any pictures of mine yet.</SPAN></FONT></DIV>
      <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000"></SPAN></FONT>&nbsp;</DIV>
      <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">Tom</SPAN></FONT></DIV>
      <BLOCKQUOTE>
      <DIV align="left" class="OutlookMessageHeader" dir="ltr"><FONT face="Tahoma" size="2">-----Message-----<BR><B>From:</B> eat@joes.com [mailto:eat@joes.com]<BR><B>Sent:</B> Monday, June 10, 2005 3:50 PM<BR><B>To:</B> google.com<BR><B>Subject:</B> Re: [Test] another test<BR><BR></DIV></FONT><TT>Tom wrote:<BR>&gt;OK, I finally figured out that you can post online at the website or just<BR>&gt;send an e-mail.<BR><BR>Oh and the pic... it looks like it was shot during an<BR>earthquake.&nbsp; :-)<BR><BR>Steve<BR></TT><TT>To unsubscribe from this group, send an email to:<BR>listmod@google.com<BR><BR></TT><BR></BLOCKQUOTE>
      <br><br>
      </div>
      </td></tr></table>

      to do that, i use the following line in my script:

      my $text = $stream->get_text ("/table");

      this returns the following printed later in the script:

      Steve, The picture was one that you pointed me at in the paper a couple of weeks ago. I don't have any pictures of mine yet. Tom -----Message-----From:eat@joes.com [mailto:eat@joes.com] Sent:Monday, June 10, 2005 3:50 PM To: google.com Subject: Re: [Test] another test Tom wrote: OK, I finally figured out that you can post online at the website or just send an e-mail. Oh and the pic... it looks like it was shot during an earthquake. :-) Steve To unsubscribe from this group, send an email to: listmod@xxxxx.com

      all of the HTML is stripped by nature of the operation, and that's great. i'm looking to keep the BR tags, however, so when i reimport the data elsewhere, it retains the formatting of the original (notice how the text at the bottom is all smashed together with no line breaks).

      so by "no avail" i meant that i was still getting the text all squashed together, as above.

      hope that made a little more sense.

        If that's the output you're getting, you're not using textify. Aristotle confused $text with $stream from your code, but you shouldn't: textify belongs on the parser object, $stream, not on $text, the string that get_text returns.
        use strict;
        use warnings;

        my $html =<<'__BOB__';
        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
        <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">Steve,</SPAN></FONT></DIV>
        <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000"></SPAN></FONT>&nbsp;</DIV>
        <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">The picture was one that you pointed me at in the paper a couple of weeks ago. I don't have any pictures of mine yet.</SPAN></FONT></DIV>
        <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000"></SPAN></FONT>&nbsp;</DIV>
        <DIV><FONT color="#0000ff" face="Arial" size="2"><SPAN class="540051301-30062000">Tom</SPAN></FONT></DIV>
        <BLOCKQUOTE>
        <DIV align="left" class="OutlookMessageHeader" dir="ltr"><FONT face="Tahoma" size="2">-----Message-----<BR><B>From:</B> eat@joes.com [mailto:eat@joes.com]<BR><B>Sent:</B> Monday, June 10, 2005 3:50 PM<BR><B>To:</B> google.com<BR><B>Subject:</B> Re: [Test] another test<BR><BR></DIV></FONT><TT>Tom wrote:<BR>&gt;OK, I finally figured out that you can post online at the website or just<BR>&gt;send an e-mail.<BR><BR>Oh and the pic... it looks like it was shot during an<BR>earthquake.&nbsp; :-)<BR><BR>Steve<BR></TT><TT>To unsubscribe from this group, send an email to:<BR>listmod@google.com<BR><BR></TT><BR></BLOCKQUOTE>
        <br><br>
        </div>
        </td></tr></table>
        __BOB__

        use HTML::TokeParser;

        # First run: a plain string value for br (prints [BR], see below).
        {
            my $stream = HTML::TokeParser->new( \$html );
            $stream->{textify} = { br => '' };
            my $text = $stream->get_text("/table");
            warn $text;
        }

        # Second run: a code ref, which receives the token's elements in @_
        # and returns the replacement text.
        {
            my $stream = HTML::TokeParser->new( \$html );
            $stream->{textify} = {
                br => sub {
                    my $t = \@_;
                    if ( $t->[0] eq 'S' and $t->[1] eq 'br' ) {
                        return '<br>';
                    }
                    return;
                }
            };
            my $text = $stream->get_text("/table");
            warn $text;
        }
        __END__
        Steve,   The picture was one that you pointed me at in the paper a couple of weeks ago. I don't have any pictures of mine yet.   Tom -----Message-----[BR]From: eat@joes.com [mailto:eat@joes.com][BR]Sent: Monday, June 10, 2005 3:50 PM[BR]To: google.com[BR]Subject: Re: [Test] another test[BR][BR] Tom wrote:[BR]>OK, I finally figured out that you can post online at the website or just[BR]>send an e-mail.[BR][BR]Oh and the pic... it looks like it was shot during an[BR]earthquake.  :-)[BR][BR]Steve[BR]To unsubscribe from this group, send an email to:[BR]listmod@google.com[BR][BR][BR] [BR][BR] at html.tokeparser.textify.pl line 27.
        Steve,   The picture was one that you pointed me at in the paper a couple of weeks ago. I don't have any pictures of mine yet.   Tom -----Message-----<br>From: eat@joes.com [mailto:eat@joes.com]<br>Sent: Monday, June 10, 2005 3:50 PM<br>To: google.com<br>Subject: Re: [Test] another test<br><br> Tom wrote:<br>>OK, I finally figured out that you can post online at the website or just<br>>send an e-mail.<br><br>Oh and the pic... it looks like it was shot during an<br>earthquake.  :-)<br><br>Steve<br>To unsubscribe from this group, send an email to:<br>listmod@google.com<br><br><br> <br><br> at html.tokeparser.textify.pl line 46.
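The [BR] markers in the first run and the <br> in the second come down to what kind of value sits in the textify hash. As far as I can tell, a string value (even an empty one) is treated as an attribute name, with alt as the fallback, and a tag that lacks that attribute is rendered as its upper-cased name in brackets; a code ref's return value is used as-is. A minimal sketch under those assumptions, with invented markup and a hypothetical `extract` helper, which also shows a newline variant that may suit re-importing as plain text:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup invented for this illustration.
my $html = '<p>line one<BR>line two</p>';

# extract() is a hypothetical helper: parse $html with the given textify
# map and return the text up to the closing </p>.
sub extract {
    my ($textify) = @_;
    my $stream = HTML::TokeParser->new( \$html ) or die "cannot parse";
    $stream->{textify} = $textify;
    return $stream->get_text('/p');
}

my $dropped = extract( {} );                      # br not listed: tag is skipped
my $by_name = extract( { br => '' } );            # string: attribute lookup, then placeholder
my $by_code = extract( { br => sub { "\n" } } );  # code ref: its return value is inserted

print "$dropped\n";
print "$by_name\n";
print "$by_code\n";
```

Tag names reach the textify lookup in lower case, which is why `br` matches `<BR>` here.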
