Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am a novice in Perl. I am trying to split some data into a summary based on the number of words (or sentences or letters).

$data = "How do you take paragraph or large amount of text and break it into sentences (perferably using Ruby) taking into account cases such as Mr. and Dr. and U.S.A? (Assuming you just put the sentences into an array of arrays) UPDATE: One possible solution I thought of involves using a parts-of-speech tagger (POST) and a classifier to determine the end of a sentence: Getting data from Mr. Jones felt the warm sun on his face as he stepped out onto the balcony of his summer home in Italy. He was happy to be alive.";

I want to show the summary in another variable $data_summary, so,

$data_summary = "How do you take paragraph or large amount of text and break it into sentences (perferably using Ruby) taking into account cases such as Mr. and Dr. and U.S.A? (Assuming you just put the sentences into..."

Can anyone help me get $data_summary as above by splitting based on the number of words, sentences or letters? (I prefer splitting based on the number of letters.)

Thank you in advance.

Replies are listed 'Best First'.
Re: Split a paragraph based on the number of letters
by 2teez (Vicar) on Feb 04, 2014 at 14:02 UTC

    Hi Anonymous Monk,
    I am a novice in Perl. I am trying to split some data into a summary based on the number of words (or sentences or letters)...

    Welcome to programming in Perl; every expert here started out as a novice. It would be wonderful to see some effort on your part toward solving the problem in your post.
    However, I will give you a heads-up.
    Since you are trying to split a "dataset" based on "..number of words, sentences or letters (I prefer based on number of letters)", check the usage of split and then substr.
    You might also want to check How do I post a question effectively?
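
    A minimal sketch of the substr route, assuming the goal is to keep at most a given number of characters without cutting a word in half. The summarize helper and the 20-character limit are made up for illustration and are not from the original post:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: keep at most $max characters of $text,
    # back up to the last complete word, and append "...".
    sub summarize {
        my ($text, $max) = @_;
        return $text if length($text) <= $max;

        # Take one extra character so a word ending exactly at $max survives.
        my $cut = substr($text, 0, $max + 1);

        # Drop the trailing partial word; if there is no whitespace at all,
        # fall back to a hard cut at $max characters.
        $cut = substr($text, 0, $max) unless $cut =~ s/\s+\S*\z//;

        return $cut . '...';
    }

    # prints: How do you take...
    print summarize('How do you take paragraph or large amount of text', 20), "\n";
    ```

    For a limit in words rather than letters, split ' ', $text gives the list of words to count instead.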

    If you tell me, I'll forget.
    If you show me, I'll remember.
    If you involve me, I'll understand.
    --- Author unknown to me
Re: Split a paragraph based on the number of letters
by kennethk (Abbot) on Feb 04, 2014 at 15:19 UTC

    Please read How do I post a question effectively? In particular, note that you should be providing desired output as well as some code that didn't work for you. I honestly have no idea what you mean by "splitting based on number of words, sentences or letters". If you can't write it in code, write it in pseudo-code and be explicit about your algorithm. The more specificity you can provide, the more inclined people will be to help and the better the help will be.

    The general challenge you describe is not easily solved, since English is chock full of idioms and peculiarities. Given the assigned spec, I would probably split on one or more whitespace characters that are preceded by periods, question marks or exclamation points but not preceded by a title (Mr., Dr., Mrs., Ms., esq., ...). This is by no means comprehensive, but it should get you through this task. Read perlretut and see if you can translate the above spec into a regular expression. Of particular interest should be Looking ahead and looking behind. Alternatively, you could simply split with /\.\s+/ and then stitch entries back together if there's a trailing title.
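
    One way the spec above might translate into a regular expression, as a rough sketch only. The split_sentences name, the sample text and the short title list are assumptions for illustration, and abbreviations such as U.S.A. or Ph.D. are not handled:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: split on whitespace that follows ., ? or !
    # unless the text just before it is one of a few known titles.
    # Each title gets its own negative lookbehind so that every
    # lookbehind has a fixed width.
    sub split_sentences {
        my ($text) = @_;
        return split /(?<=[.?!])(?<!Mr\.)(?<!Dr\.)(?<!Ms\.)(?<!Mrs\.)\s+/, $text;
    }

    my $text = 'Mr. Jones stepped out onto the balcony. He was happy to be alive! Was Dr. Smith there?';

    # Prints three lines, one sentence per line.
    print "$_\n" for split_sentences($text);
    ```

    The lookbehinds keep the punctuation attached to each sentence, which a plain split on /\.\s+/ would throw away.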

    How do you take paragraph or large amount of text and break it into sentences (perferably using Ruby)...
    I think perhaps you've come to the wrong community. You should stay anyway, though, since we're pretty cool and generally helpful.

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      I agree that the OP is not very clear. However, if the intention is actually to split a paragraph into sentences, I would strongly recommend using a module rather than trying to roll one's own parser.

      Here's an example using Lingua::EN::Sentence:

      use strict;
      use warnings;
      use feature 'say';    # say is not enabled by default
      use Lingua::EN::Sentence qw( get_sentences );

      my $text      = 'Is Mr. Hyde in? A. J. Smith Ph.D. said "Drop dead!"';
      my $sentences = get_sentences($text);    # array reference
      say for @$sentences;

      Output:

      Is Mr. Hyde in?
      A. J. Smith Ph.D. said "Drop dead!"

      Update: Minor wording changes; added output.

        I whole-heartedly agree. I also think the post had all the hallmarks of homework, and I suspect the professor would not accept a practical solution.


        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.