Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a list (this is not the real list, but one I can show you to demonstrate what I'm working with) that has comma separated values plus a space. It's actually not even a list yet, it's a string I'll make into a list.
hi, there, world, how, are you, today?
I need to split that apart at each comma-separated word/phrase and put the pieces into a hash. From there I will count unique words; that's the whole object of this.

The question, I guess, is how to make a $string break apart and place each word or phrase into an array. You'd need to use a foreach, but I can't seem to figure out how to set it up. If someone can help with this, I could do the rest.

please do NOT solve this entire thing for me, do NOT post the source code that will split the string and add it to a hash and then count like words. Just help me get started please.

Replies are listed 'Best First'.
Re: Comma separated list into a hash
by davido (Cardinal) on Apr 26, 2004 at 03:24 UTC
    please do NOT solve this entire thing for me...

    Bravo!

    There are several issues to consider.

    First, is it possible that each of your substrings will contain commas themselves? From the looks of your list, the answer is no. Second, do you want punctuation (such as the trailing '?' question mark) to be included as part of that word?

    The simplest solution will be to have a look at split, and use it with a comma followed by zero-or-more spaces as the delimiter.

    Then, if you desire to strip away punctuation such as that trailing question mark, you can use a substitution operator (s///). See perlrequick and perlretut for details there. As a matter of fact, for simple character-by-character stripping of non-word characters, you can also use tr///. That's also covered in perlrequick if I'm not mistaken. Both operators are also covered in perlop and perlre, though those documents are more difficult reads.
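As a rough illustration of the two stripping operators mentioned above, here is a minimal sketch. The variable names are made up for the example, and it assumes "punctuation" means anything that is not a letter, digit, or underscore.

```perl
use strict;
use warnings;

my $word = 'today?';

# s/// with a character class: remove every non-word character.
( my $via_subst = $word ) =~ s/\W+//g;

# tr/// with the c (complement) and d (delete) flags: delete every
# character that is not a letter, a digit, or an underscore.
( my $via_tr = $word ) =~ tr/a-zA-Z0-9_//cd;

print "$via_subst $via_tr\n";
```

Note that tr/// works on literal characters and ranges only, so it cannot use regex classes such as \W; the complement flag does that job here.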

    But if you're dealing, instead, with a more complex dataset, where commas may be included within the text so long as the individual text element is itself quoted, then you had better look at Text::CSV where it's done right. That's an easy thing to goof up if you try to do it all with home-rolled regexps.

    I hope this gets you going in the right direction without giving away the solution. The biggest step you can take in the right direction while learning Perl is learning to use the POD (Plain Old Documentation) that comes with Perl.

    I'd like to see what you come up with, so if you get stuck, or once you complete it, post a followup to this thread so we can have a look at what you were able to put together. Then if you want, we can critique it for you to enhance your learning experience. :)


    Dave

      Thanks to you and everyone else who didn't post full solutions as I sometimes get when I post here, better learning experience when you do most of it yourself.

      I looked at split originally and that's where I got confused. It looks like you need to set up variables prior to splitting, so you'd need to know how many pieces there will be before you can set them up.

      How could I use a foreach and split a $string? This concept is either way above me or so trivial that I'm thinking too hard. lol

      Thanks everyone!

        The input to split is (usually) a Regex, followed by a string to split. And the output (the return value) is a list of each individual item split out.

        So the idea is that you will put your string to be 'parsed' into a scalar variable, say, for example, $string. You will then assign the output of split to a list.

        If I were splitting a string containing numbers separated by whitespace, I might do this:

        my $string = "1 3 5 7 11 13";
        my @numbers = split /\s+/, $string;

        Now @numbers is an array holding six elements, each element containing one number.

        You don't have to split into an array though. You can split such that the return value is used to populate 'foreach' with data to iterate over. My previous example might look like this:

        my $string = "1 3 5 7 11 13";
        foreach my $number ( split /\s+/, $string ) {
            # Do some stuff with $number.
            # $number will contain one number from the list each time
            # through the loop.
        }

        Hope this helps...


        Dave

        Note that split returns a list of however many pieces it finds, so you don't need to know "how many splits" there are in order to use it. Assign the result to an array and the array grows to fit.
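To make that point concrete, here is a small sketch using the sample string from the question. The names @items and $count are only illustrative. The array takes whatever split hands back, and the count can be read off afterwards.

```perl
use strict;
use warnings;

my $string = 'hi, there, world, how, are you, today?';
my @items  = split /,\s*/, $string;

# An array in scalar context yields its element count.
my $count = @items;
print "$count items\n";
```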
Re: Comma separated list into a hash
by tkil (Monk) on Apr 26, 2004 at 05:50 UTC
    The question, I guess, is how to make a $string break apart and place each word or phrase into an array. You'd need to use a foreach, but I can't seem to figure out how to set it up. If someone can help with this, I could do the rest.

    Well, you already know you can use a foreach. Taking a look at perlsyn, we see that foreach expects a LIST; conveniently enough, that is exactly what is returned by split:

    foreach my $word ( split /,\s*/, $string ) {
        # do stuff with $word here
    }

    Hope that wasn't too much of a spoiler! Remember: LISTs are syntactic constructions (things that show up in program syntax); arrays are things that exist in memory. You can assign into an array from a LIST, and you can feed arrays into things that expect LISTs. But the two are, in fact, different things.
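The LIST/array distinction above can be sketched roughly as follows. This is an illustration with made-up variable names, not anything taken from perlsyn.

```perl
use strict;
use warnings;

my $string = 'hi, there, world';

# split returns a LIST; assigning it to an array stores it in memory.
my @words = split /,\s*/, $string;

# The same LIST can be assigned to a list of scalars instead.
my ( $first, $second ) = split /,\s*/, $string;

# An array, in turn, can be handed to anything that expects a LIST.
my $joined = join '|', @words;

print "$first $second $joined\n";
```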

Re: Comma separated list into a hash (SPOILER)
by davido (Cardinal) on Apr 26, 2004 at 07:45 UTC
    In my initial response to your question I mentioned that for robustness in getting through comma delimited text, the Text::CSV module is a good idea. That said, I don't expect that you'll be using it, because it sounds like your problem is homework-related, and thus, unless you really understand the module, it's probably best to stick to the coursework and not start introducing things that you haven't covered in class yet.

    Within your problem, there is also the issue of what constitutes a word. I'm going to ignore the fact that a word cannot contain two hyphens next to each other, or two apostrophes, etc. For one thing, once I start down that road, the next thing you know, I'll be looking for spelling errors, and that's just beyond the scope of actual need. For the purposes of my example, I'll just strip anything that doesn't belong in a word out of a word, including punctuation, and assume that what's left is a word.

    I decided to interpret your question as saying that you have a set of comma delimited strings, and that each substring might contain multiple words, but that you want to get a total word-count. I realize that you might want phrase-counts instead of word counts, but this is my spoiler, so I'll pick word-counts because doing so adds an extra level of fun.

    I took the additional liberty of lower-casing all words, so that "ApPleS", "apples" and "APPLES" (but not "oranges") all count as the same word.

    In this example, I also made sure that lexical variables all fall out of their narrow scope as early as possible. That's the sole reason for the outermost { ... } block. ...It's really not necessary, but I was just fiddling and it came out this way.

    If you're ready for the spoiler, read on. If you're not ready for it, don't:
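The code from this post is not reproduced here. What follows is only a minimal sketch of the approach described above, not Dave's code. As an assumption made so it runs without extra modules, it swaps a plain split in for Text::CSV, and it treats letters, digits, apostrophes, and hyphens as the characters that belong in a word.

```perl
use strict;
use warnings;

my $string = 'hi, there, world, how, are you, today?';
my %count;

foreach my $phrase ( split /,\s*/, $string ) {
    # A phrase such as "are you" may hold several words.
    foreach my $word ( split ' ', lc $phrase ) {
        # Keep letters, digits, apostrophes, and hyphens; delete the rest.
        $word =~ tr/a-z0-9'-//cd;
        next unless length $word;
        $count{$word}++;
    }
}

print "$_: $count{$_}\n" for sort keys %count;
```

Because it relies on split, this sketch would miscount if a quoted field contained a comma, which is the case Text::CSV exists to handle.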

    Enjoy! Thanks for the fun question. Finally I found a reason to install Text::CSV.


    Dave

Re: Comma separated list into a hash
by dragonchild (Archbishop) on Apr 26, 2004 at 11:32 UTC
    An alternative to Text::CSV is Text::xSV, which is written in pure Perl and very easy to understand (once you get past a few advanced constructs). Text::CSV also has a pure Perl version, but it's not as easy to understand. YMMV

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

Re: Comma separated list into a hash
by gsiems (Deacon) on Apr 26, 2004 at 03:33 UTC

    Now I'm in a quandary: how much hint is too much? If I understand the problem correctly, the solution would be little longer than the hint(s).

    Is this a small enough hint (in order of appearance)?

       split
       foreach
       ++
    
      I'd think you could do
      keys split
      And a bunch of (missing :) ) punctuation..
        Now you've done it. You do realize that I now feel compelled to figure out how this is done. :^)
Re: Comma separated list into a hash
by bart (Canon) on Apr 26, 2004 at 11:26 UTC
    I'll give you another possible approach besides split, which is what most people here pointed you to. In fact it's its total opposite: split has you specify what is between the things you want to grab, while in my approach you specify what you want to grab. A simple code example is:
    my @words = /\w+/g;
    Parens are optional in this case; it acts the same as this:
    my @words = /(\w+)/g;

    Now, as an exercise, how does it work? :)

    p.s. The code above expects the string in $_. If you want to use a different variable, use the syntax:

    my @words = $string =~ /(\w+)/g;
      Your approach grabs words. The issue the OP is attempting to solve is to grab stuff between commas. A better approach would have been to do something like:

      But that suffers from the same problems that split does when dealing with CSV data, specifically how to handle commas that belong in the data value. This is why regexen and split are poor choices for dealing with CSV data. Parsers are appropriate.
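The snippet this reply refers to is not shown. One way the "grab stuff between commas" idea might look is sketched below; the pattern is an assumption for illustration, not the poster's code. It captures each run of non-comma characters and skips leading whitespace.

```perl
use strict;
use warnings;

my $string = 'hi, there, world, how, are you, today?';

# Grab each run of non-comma characters, skipping leading whitespace.
my @fields = $string =~ /\s*([^,]+)/g;

print join( '|', @fields ), "\n";
```

As the reply says, this has the same weakness as split: a comma inside a data value still ends the field.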

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        how to handle commas that belong in the data value

        If quotes are used to disambiguate, it's fairly easy to parse with a regex:

        Of course, that suffers from other problems, such as going all wonky with unbalanced quotes. But it's a fairly simple way to parse well-formed data, and I thought it might be helpful for some to look at.
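The regex from this reply is not shown either. A rough sketch of the idea, assuming double quotes as the disambiguating quote character and well-formed input, might look like this (the pattern and names are illustrative only):

```perl
use strict;
use warnings;

my $line = 'one, "two, three", four';

# Either a double-quoted run (which may contain commas) or a run of
# non-comma characters, with leading whitespace skipped.
my @fields = $line =~ /\s*("[^"]*"|[^,]+)/g;

# Drop the surrounding quotes from quoted fields.
s/^"|"$//g for @fields;

print join( '|', @fields ), "\n";
```

It makes no attempt to cope with unbalanced or escaped quotes, which is where a real parser such as Text::CSV earns its keep.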

Re: Comma separated list into a hash
by blue_cowdawg (Monsignor) on Apr 26, 2004 at 03:24 UTC

        Just help me get started please

    First clue: perldoc -f split

Re: Comma separated list into a hash
by Anomynous Monk (Scribe) on Apr 26, 2004 at 15:34 UTC
    One alternative to an explicit loop is to use a hash slice to set multiple keys at once. If you are not using the values, only using a hash to count unique keys, you can even assign an empty list. Looks like this:
    # yes, the %hash gets a @ in front to show that we are going to use multiple keys;
    # perl knows it's really %hash being used instead of @hash because of the {}
    @hash{KEYLIST} = ();
    print "unique keys: ", scalar keys %hash;
    where KEYLIST can be any expression returning a list, e.g. split ...
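For instance, with split supplying the KEYLIST, the hash slice might be used as in this sketch (the sample string is invented for the example):

```perl
use strict;
use warnings;

my $string = 'apple, pear, apple, fig, pear';

my %hash;
# Hash slice: every element split returns becomes a key (value undef).
@hash{ split( /,\s*/, $string ) } = ();

print "unique keys: ", scalar( keys %hash ), "\n";
```

This tells you how many distinct items there are, but not how often each one occurred; for per-item counts you would still need the loop with ++.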