in reply to Truncating real text

While I won't be offering code, I will offer some advice.

I would set up some options that define which kinds of punctuation you'd allow. Something along the lines of:
1. foo's
2. what in the (foo) did you say?
3. I'm going to "foo" you.
4. What I think of foo: good, bad, ugly
5. Man/Woman, which is it?
6. paragraph 1:

-- what's up?

-- Not much

--well foo want's to get a hold of you

As you can see, there are many options for punctuation and for where you want to break. For example, do you want to be able to run across paragraph boundaries?

One question to ask is what it's for. It almost seems like it'd be easier to just grab a certain number of characters from the text and then continue to the end of whatever word you're on.

For instance, in this sentence(s), just grabbing the first 25 characters puts you in the middle of the word 'sentence'; you would then go on to the end of that word, including punctuation etc. So one issue is whether getting the exact number of words is necessary, or whether it's enough to get roughly so much of the text without munging the end on punctuation. A little clarification on what it's needed for will help you out here.

Re^2: Truncating real text
by bcole23 (Scribe) on Mar 17, 2005 at 00:58 UTC
    Hey, for once I'm going to actually put some code on one of these darn things!!

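    use warnings;
    use strict;

    my $text = "I'd like a foobar(s)! What about you?";
    print "Pretext: $text\n";

    my $num_of_chars = 13;
    # Keep the first $num_of_chars characters, then run on to the end of the
    # current word. Inside [...] the |, ., and * are literal characters, so
    # the class matches word characters plus ( ) | . and *.
    $text =~ s/(.{$num_of_chars}[\w|\(.*\)\w]*).*/$1/;
    print $text;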


    OUTPUT:
    Pretext: I'd like a foobar(s)! What about you?
    I'd like a foobar(s)

    Now, all you'd really need to do is expand the character class at the end of the regex so that it covers whichever punctuation you want to keep after the last word.
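
    As a rough sketch of that idea, the punctuation to keep could be passed in as a plain string and escaped with quotemeta before it goes into the character class. The helper name truncate_at_word, its third argument, and the default of '()' are all made up for illustration; this is one way it might look, not a finished solution.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first $min_chars characters, then run on to
# the end of the current word plus any characters listed in $trailing.
sub truncate_at_word {
    my ($text, $min_chars, $trailing) = @_;
    $trailing = '()' unless defined $trailing;    # assumed default
    my $class = quotemeta $trailing;              # make every character literal
    # If the text is shorter than $min_chars, hand it back unchanged.
    return $text unless $text =~ /^(.{$min_chars}[\w${class}]*)/s;
    return $1;
}

my $text = "I'd like a foobar(s)! What about you?";
print truncate_at_word($text, 13), "\n";            # keeps "(s)", stops before "!"
print truncate_at_word($text, 13, '()!?'), "\n";    # also keeps the "!"
```

    Passing the punctuation as data means the regex itself never has to change when you decide to allow another character.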

    Here's my second try, if you want word counts:

    use warnings;
    use strict;

    my $text = " I'd like a foobar(s)! What about you?";
    print "Pretext: $text\n";

    my $num_of_words = 6;
    my @text;
    for (my $i = 1; $i <= $num_of_words; $i++) {
        # Strip one word (plus the listed punctuation) and the whitespace
        # after it from the front of $text; $1 holds what was removed.
        $text =~ s/^\s*([\w|\(|\)|'|:|!|\.|\,|;]*\s*)(.*)/$2/;
        push(@text, $1);
    }
    for (@text) {
        print $_;
    }


    OUTPUT:
    Pretext: I'd like a foobar(s)! What about you?
    I'd like a foobar(s)! What about

    This just takes the first word, regardless of punctuation (which may cause some grief), plus the next space(s), and puts it in an array. A good check to put in there would be to stop looping once $text becomes empty. Also, if you're reading in from a file line by line and you're going over, say, 50 or so words, what I'd do is get the first few lines and concatenate them together to form the text you pull the data from.
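
    For what it's worth, here is a rough sketch of a version that stops by itself when the text runs out. Note that it swaps the punctuation list for a plain "anything that isn't whitespace" match (\S+), so it is a different technique from the code above, and it joins the words back together with single spaces. The name first_words is just something I made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first $num_of_words whitespace-separated
# words of $text. Punctuation attached to a word stays with that word.
sub first_words {
    my ($text, $num_of_words) = @_;
    my @words;
    # \G continues from where the previous match ended; the loop ends when
    # enough words have been collected or when no further word can be found.
    while (@words < $num_of_words && $text =~ /\G\s*(\S+)/gc) {
        push @words, $1;
    }
    return join ' ', @words;
}

print first_words(" I'd like a foobar(s)! What about you?", 6), "\n";
print first_words("Too short?", 6), "\n";    # fewer words than asked for
```

    Because the loop condition checks the match itself, a text with fewer words than requested simply comes back whole instead of looping on an empty string.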

    Please note that I'm nowhere near the level of others here and the code above is laughable by perl monk standards, but it should help you get along. :)

    UPDATE: Updated code a bit.