Stefany has asked for the wisdom of the Perl Monks concerning the following question:

Greetings wise brothers and sisters!
I am writing a CSV extractor program in Perl (I know about the CSV module, but I wanna write my own program to learn) and I am trying to match the header columns.

So far I have decided that the first line that has the delimiter between two words multiple times must be the header columns. So will you please tell me how to match a:
1. comma between two words like this - "Entries","Entry Time","Visit Length","Browser"
2. Match is multiple times a comma I tried many different regexes with no luck, please help!

Re: Match a comma between two words
by davido (Cardinal) on Jul 02, 2014 at 19:55 UTC

    my @headings = split /(?<=")\s*,\s*(?=")/, $line;

    But that makes no attempt at tokenizing and parsing the line for balanced quotes, escaped quotes in a field, or quoted commas. If you need to deal with those issues, you've got more work ahead of you, the complexity of which is why people tend to recommend and use Text::CSV.


    Dave
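
    For illustration, here is that split applied to the header line from the question, as a minimal sketch. The variable names and the extra step that strips the outer quotes are assumptions, not part of the reply above, and the caveats about escaped quotes and quoted commas still apply.

    ```perl
    use strict;
    use warnings;

    my $line = '"Entries","Entry Time","Visit Length","Browser"';

    # Split only on commas that sit between a closing and an opening quote.
    my @headings = split /(?<=")\s*,\s*(?=")/, $line;

    # Assumed extra step: drop the surrounding quotes from each field.
    s/^"|"$//g for @headings;

    print join('|', @headings), "\n";   # Entries|Entry Time|Visit Length|Browser
    ```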

Re: Match a comma between two words
by AppleFritter (Vicar) on Jul 02, 2014 at 20:38 UTC

    1. comma between two words like this - "Entries","Entry Time","Visit Length","Browser"

    So this would be complicated by the fact that commas inside quoted text should not count, right? In other words, "Name, First","Name, Last","Birthday" should be split into "Name, First", then "Name, Last", and then "Birthday"?

    I think that's starting to become a bit complicated for a single regex, though you could do it using match-time code evaluation and the \G assertion ("picking up where you left off"), perhaps, keeping track of whether you're in a quoted string or not as you go along.

    A parser based on a formal grammar might be a better idea; take a look at Parse::RecDescent if you'd like to go that route.

    Yet another option would be to use a loop and process the line in chunks: read from the beginning of (the current remainder of) the line up to the first comma or quote, keep track of whether you're in a quote right now etc., and use that information to either add the newly-read chunk to your current extracted column name, or increase your column counter and start a new column name with that chunk.

    Since you want to learn how to do this on your own, I won't provide any code. :)

    Now, all that said...

    If commas are guaranteed not to appear in your quoted strings, everything becomes much easier, of course: \G alone should be enough to get the job done, though you could save yourself the trouble and simply split the line.

    Hmm. I wonder if you could use split and then pull some clever tricks to glue the right bits together again in the general case, too. Well, you'll find out, if you decide to go down this route!
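
    One way the split-and-glue idea might look, as a rough sketch rather than a finished parser. It assumes the quotes on the line are balanced; the variable names are made up for the example. A piece with an odd number of quote characters must have been cut inside a quoted string, so the next piece is glued back on.

    ```perl
    use strict;
    use warnings;

    my $line = '"Name, First","Name, Last","Birthday"';

    # Naive split: this also cuts at the commas inside the quotes.
    my @pieces = split /,/, $line, -1;

    my @fields;
    while (@pieces) {
        my $field = shift @pieces;
        # An odd number of quote characters means we are still inside a
        # quoted string, so glue the next piece back on with its comma.
        while (($field =~ tr/"//) % 2 && @pieces) {
            $field .= ',' . shift @pieces;
        }
        push @fields, $field;
    }

    print "$_\n" for @fields;   # three lines, quotes still attached
    ```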

    2. Match is multiple times a comma I tried many different regexes with no luck, please help!

    I'm sorry, I don't know what you mean there.
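
    For readers who want to compare notes afterwards, the \G idea from this reply could be sketched roughly as follows. This version uses \G with the /gc modifiers only (no match-time code evaluation), and it assumes that fields contain no escaped quotes; the variable names are illustrative.

    ```perl
    use strict;
    use warnings;

    my $line = '"Name, First","Name, Last","Birthday"';

    my @fields;
    pos($line) = 0;
    while (pos($line) < length $line) {
        if ($line =~ /\G"([^"]*)"/gc) {      # quoted field: commas allowed inside
            push @fields, $1;
        }
        elsif ($line =~ /\G([^,"]*)/gc) {    # bare field: runs up to the next comma
            push @fields, $1;
        }
        # After a field, expect a comma; anything else ends the scan.
        last unless $line =~ /\G,/gc;
    }

    print "$_\n" for @fields;
    ```

    The /c modifier keeps the match position in place when a pattern fails, which is what lets each \G pick up exactly where the previous match left off.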

Re: Match a comma between two words
by Laurent_R (Canon) on Jul 02, 2014 at 20:41 UTC
    Very simple CSVs can be parsed with the split function. But yours has separators (commas) AND delimiters (quote marks); that already makes it not a simple CSV, and it would amply justify a module such as Text::CSV. However, looking at your specific example, if you really want to do it yourself, you could first remove quote marks and then split on commas:
    $line =~ s/"//g; my @fields = split /,/, $line;
    (It could also be done the other way around, first splitting on commas and then removing quote marks.) But now, either way, think about what will happen when your input line is:
    "New York, NY","Entries","Entry Time","Visit Length","Browser"
    Both ways, it will break with an unwanted extra field, and you'll be out of luck. We quite commonly use split to process simple CSV files (although our separator is usually a semicolon rather than a comma), but only (well, almost only) on CSV files that we have created beforehand, whose content we fully control, so we know we are not gonna have a bad surprise.

    On files coming from an outside organization over which you have no control, don't do it. In your case, the simple fact that there are both separators and delimiters is a sign that your process is likely to break on the first special case that comes up. The Text::CSV module is really a much more robust alternative in any case where you don't control everything yourself.
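
    To make the failure concrete, here is the strip-then-split approach run on the "New York, NY" line above (a small sketch; the $stripped variable is just for the example):

    ```perl
    use strict;
    use warnings;

    my $line = '"New York, NY","Entries","Entry Time","Visit Length","Browser"';

    (my $stripped = $line) =~ s/"//g;      # remove the quote marks first
    my @fields = split /,/, $stripped;     # then split on every comma

    printf "%d fields\n", scalar @fields;  # 6 fields, although the line holds 5
    print "[$_]\n" for @fields;            # "New York" and " NY" come out separately
    ```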

Re: Match a comma between two words
by Anonymous Monk on Jul 02, 2014 at 22:51 UTC

    I know about the CSV module, but I wanna write my own program to learn ... So will you please tell me how to match a:

    Use Super Search: search for CSV and for Text::CSV, and you'll find the hundreds of nodes on this exact thing.

    Then pick through them :)

Re: Match a comma between two words
by GotToBTru (Prior) on Jul 03, 2014 at 05:03 UTC

    Here is something similar to what you are looking for. Perhaps this will help you work out your particular solution. (I have a similar situation, dealing with csv files and not having the very useful modules available to me.) This code replaces commas inside of quotation marks with spaces so I could subsequently use split.

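    my $olds = 'one,two,"three,four,five","six",seven,"eight,nine"';
    my $i = 0;
    # split /"/ leaves the quoted text in the odd-numbered chunks;
    # only there are commas replaced with spaces (/r needs Perl 5.14+).
    my $news = join '', map { $i++ % 2 ? s/,/ /gr : $_ } split /"/, $olds;
    print "$news\n";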

    Output:

    one,two,three four five,six,seven,eight nine

    Update

    Original version would delete single word inside quotation marks.

    1 Peter 4:10