Truncating real text

cog has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Truncating real text by graff (Chancellor) on Mar 17, 2005 at 02:43 UTC
-- What if the $n'th word has punctuation after it? Why should that matter? Are you asking this just because your code is sensitive to this (when it shouldn't be), or does your concept of "truncation" include the idea of "stipping final punctuation, if any"? -- Suppose if the text contains more text between parentheses and you want to truncate before or after those, but not in the middle. Well, in this case, you're not talking about truncation; you're talking about parsing the input text and then basing your return value on some set of rules (not yet fully stated) that refer to parsed segments -- parenthesized blocks and space-separated word tokens between them -- and that respect segment boundaries. If you were to state your intended rules, and if they were simple enough -- e.g. "when truncation leaves an open-paren with no close-paren, extend the truncation (shorten the return string) such that the open-paren is not returned" -- then maybe you can get by without actually parsing for parens. (But I doubt it would really be that simple -- e.g. what if the text is fully enclosed in a paren pair? Return the whole string? or an empty string?) Well, let's leave parens and parsing aside for now. Why not use split? `sub trunc { my ( $last_word, $text ) = @_; return '' unless ( $last_word =~ /^\d+$/ and $text =~ /\S/ ); @tokens = split /\s+/, $text; return join " ", @tokens[0..$last_word-1]; }` [download] That has the side-effect of normalizing the white-space between words to a single space per word boundary; leading white-space will be preserved, but not trailing white-space. If you'd rather preserve white-space faithfully, use a capturing split, and build the return string a little differently: `@tokens = split /(\s+)/, $text, -1; # keep all white-space my $trunc_str = ''; for ( @tokens ) { $trunc_str .= $_; last if ( /\S/ and --$last_word == 0 ); } return $trunc_str; }` [download]	[reply] [d/l] [select]
Re: Truncating real text by Roy Johnson (Monsignor) on Mar 16, 2005 at 18:47 UTC
`sub trunc { @_ == 2 or return; my ($max_words, $text) = @_; $text =~ /(\W(?:\w+\W){1,$max_words})/ and return $1; } my $str = "there once was a man from Nantucket who kept all his cash i +n a bucket"; print trunc(5, $str), "\n";` [download] Update: this will grab any punctuation (non-word, non-whitespace) attached to words, and strip leading whitespace (though trailing whitespace stays attached): `/((?:[^\w\s]\w+[^\w\s]\s){1,$max_words})/` [download] Caution:* Contents may have been coded under pressure.	[reply] [d/l] [select]
Re: Truncating real text by tlm (Prior) on Mar 16, 2005 at 18:01 UTC
`use strict; my $wr = qr/[-\w.,;:'"()]+/; # needs tuning sub trunc { my $max = shift or return ''; my $text = ' ' . shift or return ''; $text =~ s/^((?:\s+$wr){0,$max}).$/$1/; substr($text, 0, 1) = ''; $text =~ s/\([^)]$//; return $text; } DB<1> p trunc(5, " 'Twas (brillig), and the (slithy toves) did gyre a +nd gimble in the wabe:" 'Twas (brillig), and the` [download] The problem is too ill-defined (how is "word"defined? how is "punctuation" defined?, can the input contain leading spaces?, etc.) to warrant golfing it. The hack of prepending the space and then lopping it off simplifies the regexp. The word regexp `$wr` probably can use some tweaking. the lowliest monk	[reply] [d/l] [select]
Re^2: Truncating real text by gam3 (Curate) on Mar 16, 2005 at 18:45 UTC
Solving the parentheses problem can't be done correctly using only regular expressions as this example demostrates: `print trunc(6, "Are we not men (or (not) woman), we are devo."), "\n";` [download] But embedded parentheses are not common, so you might just ignore this problem. A picture is worth a thousand words, but takes 200K.	[reply] [d/l]
Re: Truncating real text by bcole23 (Scribe) on Mar 16, 2005 at 22:55 UTC
While I wont be offering code, I will offer some advice. I would set up some options that you could use to define how what kinds of punctuation that you'd allow. Something along the lines of: 1. foo's 2. what in the (foo) did you say? 3. I'm going to "foo" you. 4. What I think of foo: good, bad, ugly 5. Man/Woman, which is it? 6. paragraph 1: -- what's up? -- Not much --well foo want's to get a hold of you As you can see, there's many options for punctuation and where you want to break, like do you want to traverse across paragraphs. One question to ask is what it's for.. It almost seems like it'd be easier to just grab a certain number of characters from the text and then continue to the end of whatever word you're on. For instance, in this sentence(s), just grabbing the first 25 characters, which puts you in the middle of the word 'sentence' and going to the end of that word, including puctuation etc. So one issue is whether or not getting the exact number of words is nessecary or if it's just getting so much of the text and then not munging the end on punctuation. A little clarification on what it's needed for will help you out here.	[reply]
Re^2: Truncating real text by bcole23 (Scribe) on Mar 17, 2005 at 00:58 UTC
Hey, for once I'm going to actually put some code on one of these darn things!! `use warnings; use strict; my $text = "I'd like a foobar(s)! What about you?"; print "Pretext: $text\n"; my $num_of_chars = 13; $text =~ s/(.{$num_of_chars}[\w\|$.$\w])./$1/; print $text;` [download] OUTPUT: Pretext: I'd like a foobar(s)! What about you? I'd like a foobar(s) Now, all you'd really need to do is expand on the last part to correctly handle which punctation you want to handle for the last part. here's my second try if you want word counts `use warnings; use strict; my $text = " I'd like a foobar(s)! What about you?"; print "Pretext: $text\n"; my $num_of_words = 6; my @text; for (my $i=1; $i<=$num_of_words; $i++) { $text =~ s/^\s([\w\|$\|$\|'\|:\|!\|\.\|\,\|;]\s)(.*)/$2/; push (@text,$1); } for (@text) { print $_; }` [download] OUTPUT: Pretext: I'd like a foobar(s)! What about you? I'd like a foobar(s)! What about This is just taking the first word, regardless of puctuation, which may cause some grief, and the next space(s), and putting it in an array. A good check to put in there would be that if $text becomes undef to stop looping. Also, if you're reading in from a file line by line, what I'd do is get the first few lines and concatenate them together to form your text to get the data from if you're going over say, 50 or so words. Please note that I'm nowhere near the level of others here and the code above is laughable by perl monk standards, but it should help you get along. :) UPDATE: Updated code a bit.	[reply] [d/l] [select]
Re: Truncating real text by bageler (Hermit) on Mar 17, 2005 at 15:41 UTC
my goal here was to split the body of an article into the first paragraph and the rest, so an advert can be inserted. Since these are html articles, a paragraph is defined by a <p> or <br> tag, and since it's news the end punctuation will either be a period or a double quote. `my $len = length($story->{body}); $len = $len > 800 ? 400 : $len < 150 ? 0 : 150; ($para1,undef,$story) = $story->{body} =~ /(.{$len,}?\.["]?)\W<(p\|br\ +/?)>(.)/s;` [download]	[reply] [d/l]