in reply to CSV regex with hash/array program plan

campus1plb:

  1. I'd create the columns during runtime: that way, if someone adds a new subject, you don't need to change your program.
  2. Either data structure is fine. I'd personally go with a hash table, but only because that would feel more natural for me. If you're comfortable with an array, go with it.
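As a rough illustration of point 1, here is a minimal sketch in which the columns are whatever subjects turn up in the data. It assumes the subjects and grades have already been extracted into triples. The `@found` data and the `build_table` helper are hypothetical names made up for this example, not anything from the original program.

```perl
use strict;
use warnings;

# Hypothetical, already-extracted data: [course id, subject, grade].
my @found = (
    [ 'C1', 'Maths',   'A' ],
    [ 'C1', 'Physics', 'B' ],
    [ 'C2', 'Design',  'B' ],
);

# Returns (\@columns, \%table). The column list is simply the set of
# subjects that appear in the data, so a new subject needs no code change.
sub build_table {
    my @triples = @_;
    my (%table, %subjects);
    for my $t (@triples) {
        my ($id, $subject, $grade) = @$t;
        $table{$id}{$subject} = $grade;
        $subjects{$subject}   = 1;
    }
    return ([ sort keys %subjects ], \%table);
}

my ($columns, $table) = build_table(@found);
print join('|', 'id', @$columns), "\n";
for my $id (sort keys %$table) {
    # Empty cell when a course does not mention the subject.
    print join('|', $id, map { $table->{$id}{$_} // '' } @$columns), "\n";
}
```

A hash of hashes is used here only because it makes the "unknown columns" case easy; an array-based layout would work as well if the subject list were fixed.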

Your third question is the fun part. I don't find that writing the regexes is that difficult. The problem is more 'are you sure you're getting everything correctly?'.

I generally attack it like this: write regexes for the first five or ten lines of data. Then make a program to match and delete all the requirements that it can. Then look at the next few lines of what's left, and add new regexes and/or alter existing ones. After a few iterations, you'll have regexes that can handle most of the data. You may have a few stragglers (misspellings, etc.) that need a bit of playing with. You might, for example, first repair misspellings before matching requirements.
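The match-and-delete loop described above might look something like the sketch below. The rule list, the sample lines, and the `strip_known` helper are all hypothetical and only meant to show the shape of one pass; the real regexes would grow over several iterations.

```perl
use strict;
use warnings;

# Hypothetical starter rules: [label, regex]. Extend these after each pass.
my @rules = (
    [ grades  => qr/\b[A-E]{3}\b/ ],
    [ maths   => qr/\bmaths?\b/i ],
    [ physics => qr/\bphysics\b/i ],
);

# Deletes everything the rules recognise from $text.
# Returns (\%hits, $leftover) so the leftover can be inspected by eye.
sub strip_known {
    my ($text, @rules) = @_;
    my %hits;
    for my $rule (@rules) {
        my ($label, $re) = @$rule;
        # /e runs the replacement as code: record the match, then delete it.
        $text =~ s/($re)/push @{ $hits{$label} }, $1; ''/ge;
    }
    $text =~ s/\s+/ /g;     # collapse the gaps left behind
    $text =~ s/^ | $//g;    # trim one leading and one trailing space
    return (\%hits, $text);
}

for my $line ('ABB including Maths and Physics', 'AAB with Chemestry') {
    my ($hits, $left) = strip_known($line, @rules);
    print "left over: '$left'\n";
}
```

Whatever is printed as left over is the material for the next round of regexes, or a candidate for a spelling repair step before matching.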

Have fun with it!

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^2: CSV regex with hash/array program plan
by campus1plb (Initiate) on Nov 23, 2014 at 18:55 UTC
    Thanks for the replies folks,

    The sort of thing i had in mind is

    use feature "switch";
    for (to be determined) {
        when ((s/(\b[A-E]{3}\b)/XXX/g)) {@array[i,1] = $1}
        when (s/((Science)|(science))/XXX/g) {@array[i,2] = $1}
        when (s/((Math)|(math))/XXX/g) {@array[i,3] = $1}
        when - for all remaining subjects...
        when (s/((first|First)\W+(?:\w+\W+){1,10}?([A-E]{3})) #or some similar REGEX returning only the first year entry grades
        default {}
    }

    but i'm concerned that i'm thinking very "C" in my iteration loop for the array

    To answer Anonymous monk,

    To start with, i'd be very happy with an output that looks like below:

    (subject)..|ABB|Maths|Physics|Design&Technology|Engineering

    If i can get the program to do the above, then i'll try and refine it to pull out specific grades

    eg ([A-E]{1} in Design|design) returns the grade preceding that subject
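One possible way to write that kind of "grade preceding a subject" match is sketched below. It assumes the text uses the wording "B in Design"; the `grade_for` helper and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Returns the single grade letter that precedes "in <subject>", or undef.
# Only the subject name is matched case-insensitively, so a lowercase
# "a in design" is not mistaken for a grade.
sub grade_for {
    my ($text, $subject) = @_;
    return $text =~ /\b([A-E])\s+in\s+(?i:\Q$subject\E)\b/ ? $1 : undef;
}

my $grade = grade_for('ABB including B in Design', 'design');
print "Design grade: ", (defined $grade ? $grade : 'none'), "\n";
```

Using an inline `(?i:...)` group instead of a global `/i` keeps the grade letters restricted to upper case.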

    Roboticus, thanks for that. i may start with a defined array (there are ~200 A level possibilities, but they change very infrequently) and then develop it into one which adds subjects when detected once i get the hang of it.

    You have also hit the nail on the head regarding the REGEXes, i was thinking about having a list produced of ones that the regex struggled with, or ones which didn't get any hits to see how i'm missing things too. I'd not thought about multiple passes however, nor repairing misspellings!

    really appreciate the input i just need to plough through some text books and remind myself (or learn new things) appropriate to the task in hand

    best wishes, Phil

      when ((s/(\b[A-E]{3}\b)/XXX/g)) {@array[i,1] = $1}

      Note that $1 is only set when the pattern contains a capture group. Without one, it is undefined after a successful s/// substitution:

      c:\@Work\Perl>perl -wMstrict -le
      "my $s = 'xxx aBc yyy dE zzz';
       print qq{\$s: '$s'};
       ;;
       print qq{\$1: '$1'} if $s =~ s{ [AaBbCcDdEe]+ }{XXX}xmsg;
       print qq{\$s: '$s'};
      "
      $s: 'xxx aBc yyy dE zzz'
      Use of uninitialized value $1 in concatenation (.) or string at -e line 1.
      $1: ''
      $s: 'xxx XXX yyy XXX zzz'
      Capture groups work in a potentially surprising way in  s/// substitution and  m// matching:
      c:\@Work\Perl>perl -wMstrict -le
      "my $s = 'xxx aBc yyy dE zzz';
       print qq{\$s: '$s'};
       ;;
       print qq{\$1: '$1'} if $s =~ s{ ([AaBbCcDdEe]+) }{XXX}xmsg;
       print qq{\$s: '$s'};
      "
      $s: 'xxx aBc yyy dE zzz'
      $1: 'dE'
      $s: 'xxx XXX yyy XXX zzz'
      (note that only the last group matched is captured). Please see perlre, perlrequick, and perlretut.
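If every match is wanted rather than just the last one, one option (a sketch, not the only way) is to collect the captures with m//g in list context and do the substitution separately. The `extract_all` helper is a hypothetical name used only for this example.

```perl
use strict;
use warnings;

# Returns (\@found, $replaced): every run of the letters A-E (any case)
# found in $s, and a copy of $s with each run replaced by XXX.
sub extract_all {
    my ($s) = @_;
    # In list context, m//g returns the capture from every match.
    my @found = $s =~ m{ ([AaBbCcDdEe]+) }xmsg;
    # Substitute on a copy so the caller's string is left alone.
    (my $replaced = $s) =~ s{ [AaBbCcDdEe]+ }{XXX}xmsg;
    return (\@found, $replaced);
}

my ($found, $replaced) = extract_all('xxx aBc yyy dE zzz');
print "found: @$found\n";
print "after: $replaced\n";
```

An alternative would be a single s///ge whose replacement code pushes $1 onto an array before returning 'XXX'.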
      Also: I don't think  @array[i,1] = $1 is going to work the way you think it will whatever the value of  $1 may be (but I'm not sure just what you expect from this expression). Please see Slices in perldata. (Update: Something like this works for hashes: see  $; in perlvar. There's a more complete discussion of this old trick somewhere, but I can't locate it right now — anyone know where it is? (Update: Anonymonk informs me this is Multi-dimensional array emulation in perldata. This section was apparently added with Perl version 5.16.0 or 5.16.1. I only had 5.14 available locally and so missed it.))
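For comparison, here is a small sketch of the three forms touched on above: an array-of-arrays element, an array slice, and the hash-based emulation that uses $; as the key separator. The variable names are made up for illustration, and which form fits best depends on what @array[i,1] was meant to do. (Under strict, the bareword i would also be rejected; an index variable such as $i is assumed here.)

```perl
use strict;
use warnings;

my $i = 0;

# Array of arrays: one element at row $i, column 1.
my @grid;
$grid[$i][1] = 'ABB';

# An array slice assigns a list to several elements of ONE array at once.
my @row;
@row[0, 1] = ('ABB', 'Maths');

# Multi-dimensional emulation with a hash: the key is join($;, $i, 1).
my %cell;
$cell{$i, 1} = 'ABB';

print "grid: $grid[$i][1]\n";
print "row:  @row\n";
print "cell: ", $cell{ join $;, $i, 1 }, "\n";
```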

        Something like this works for hashes: see $; in perlvar. There's a more complete discussion of this old trick somewhere, but I can't locate it right now — anyone know where it is?
        It's in perldata... 'Multi-dimensional array emulation'.