in reply to Truncate Data from MySQL

TIMTOWTDI (clumsier, but only slightly different):

#!/usr/bin/perl
use strict;
use warnings;
# 777834

my (@copy, $copy, $i);
@copy = split (/\s/, <DATA>, 16);
for (0..14) {
    print $copy[$_] . " ";
}

__DATA__
Pull out only the first 15 words from the pubText field. This is where I need suggestions, the code below does not work.

Re^2: Truncate Data from MySQL
by mzedeler (Pilgrim) on Jul 07, 2009 at 18:51 UTC

    /\s/ should be /\s+/ unless the empty string between two spaces counts as a word.

      That certainly is the right way to go... and cheap at the price. ++!

      Some minor quibbles though:

      1. OP offers no indication of actually having double spaces between sentences, but that is a not uncommon occurrence, which is why your observation is so valuable: put two spaces rather than one in "...field. This..." in my __DATA__ and my split pattern does NOT DWIM, whereas yours does.
      2. The sample I used, from the OP, has no doubled spaces.
      3. Whether or not the db's text field has doubled spaces depends on how it was created. If it was simply scraped from a webpage, odds are that it has none, since browsers (and, I believe, browser-substitutes) render only one space for any run of literal whitespace (character entities are, of course, a different matter).
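
To make the doubled-space case concrete, here is a minimal sketch comparing the two patterns. The helper first_words and the two-space variant of the sample text are assumptions made for illustration; neither comes from the OP.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the first $n fields
# obtained by splitting $text on $pattern.
sub first_words {
    my ($pattern, $text, $n) = @_;
    # Ask for one extra field so the unsplit remainder lands in the last slot.
    my @fields = split $pattern, $text, $n + 1;
    pop @fields if @fields > $n;    # discard the remainder
    return @fields;
}

# The OP's sample, altered to have TWO spaces after "field."
my $text = "Pull out only the first 15 words from the pubText field.  "
         . "This is where I need suggestions, the code below does not work.";

my @single = first_words(qr/\s/,  $text, 15);   # one whitespace char per separator
my @plus   = first_words(qr/\s+/, $text, 15);   # a whole whitespace run per separator

# With /\s/ the gap between the two spaces becomes an empty field,
# so one of the 15 slots is wasted; with /\s+/ it is not.
print join('|', @single), "\n";
print join('|', @plus),   "\n";
```

With /\s/ the fifteenth field is "where" and one field is empty; with /\s+/ the fifteenth field is "I" and no field is empty.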

      For some reason, your "...unless the empty string between two spaces counts as a word." does not parse to anything plausible (possible blind spot?) for me. FMI, is there a way to persuade split to treat the empty string between two spaces as a word boundary (\b) or a not_word boundary (\B)?

      Update: Oversight addendum: "the empty string between two spaces" is a position (but cf. perldoc -f split at "As a special case for "split", using the empty pattern "//"....")

        "The empty string between two spaces" is a funny wording. All I mean is that between any two neighbouring chars, you can say there is any number of zero-length strings ($a = '1'; $b = '2'; $empty = ''; $c = "$a$empty$b"; then $c eq "$a$b" and $c eq "$a$empty$b" and $c eq "$a$empty$empty$b" ...). I am aware that when using perl to extract zero length character sequences using split or regular expressions, it returns undef.