in reply to Re: Code Critique
in thread Code Critique

That example is tame.What I'm writing now is running into 700 lines and, yep, you guessed it, understanding what the heck I was thinking is now becoming the problem - hence the need to get some rhyme and reason into what I'm doing.

And re it working - course it works!! It's a funny data set in this format

20/08/2007,Erythrocyte sedimentation rate,,3 mm/h 20/08/2008,Total white blood count,,6.7 10*9/L and sometimes 04/04/2007,Haemoglobin estimation 12.9 g/dL,,12.9 g/dL And a myriad other units.
The task is to strip away the units and leave me with Date, Test, Data Being a pragmatist, getting things working is the challenge. But before packing my bags on this, I need to make it understandable if I ever need to revisit it. A dataset is around half a million lines, and the diferent types of units legion.

Replies are listed 'Best First'.
Re^3: Code Critique
by Marshall (Canon) on Oct 05, 2010 at 00:06 UTC
    Ah, your programs are becoming complex enough that the realization that there are always 2 people who need to understand the code is starting to sink in. Those 2 people are: (a) you when you write it and (b) you when you have to extend it or fix at some time in the future! And if it is a useful program, one of these 2 things and probably both will happen!

    Some simplifications to the code will be possible, based upon your first example, I recoded the first part with some straightforward techniques and with just borrowing pretty much your regex'es. Your actual record is more complex. However, I hope that the lesson about spacing and indenting is becoming more clear - it helps a LOT.

    #!/usr/bin/perl -w use strict; my $patient_id; #patient record for the CSV file my $nhs; my $dob; my $sex; my $postcode; my $total_patients=0; while (<DATA>) { s/^\s*//; #remove leading space s/\s*$//; #remove trailing space $patient_id = $1 if m/^Patient ID .*?(\d+)$/; $nhs = $1 if m/^NAD .*?(\d+)$/; $dob = $1 if m/^Date Of Birth .*?([\/\d]+)$/; $sex = $1 if m/^Sex .*?(\w+)$/; #record ends with postal code -> output to the CSV file if ( ($postcode )= m/^Postcode .*?(\w+)$/ ) { print "$patient_id,$nhs,$postcode,$dob,$sex\n"; #CSV line $total_patients++; # this is here instead of with $patient_id # because it is here that we have an # "official" patient record # this starts variables "over again". A runtime warning will # happen if for some reason one of these winds up still # being undefined when the record is printed to the CSV file # # note: doing this for $postcode is not necessary, but I # treat this variable same as the others because this lessens # the chance of a mistake when modifying the program later. $patient_id = $nhs = $dob = $sex = $postcode = undef; + } } print "Total Patients: $total_patients\n"; =prints 1234,5678,BH78,01/01/1965,yes 4321,999,XYZ98,10/10/2007,no Total Patients: 2 =cut __DATA__ Patient ID 1234 NAD 5678 Date Of Birth : 01/01/1965 Sex : yes Postcode : BH78 Patient ID 4321 NAD 999 Date Of Birth : 10/10/2007 Sex : no Postcode : XYZ98
      Simply moving the { makes such a difference and xxx = $1 again clarifies thinggs so much from an if loop. Thanks again.
Re^3: Code Critique
by Marshall (Canon) on Oct 05, 2010 at 01:02 UTC
    I'm not sure what you want in this second programming problem. Maybe you could explain it a bit more?

    This is a very simple CSV file (no quote characters or embedded quotes within quotes), so just split on the comma's and then do another split on space to get the basic value and measurement units from the line. I used what is called a list slice to assign directly to variables without any numeric subscript stuff.

    The "formula" for this sort of thing is: open files, read and separate the input data into one "record" (a "quanta" if you will that makes sense for processing), either save that for processing in a collection of those "records" or process the records as you go.

    In your first example, I don't think there is a need to save anything past the "current record". For this one, I don't know.. explain more and show more...

    #!/usr/bin/perl use strict; use warnings; # the same as #!/usr/bin/perl -w in first line while (<DATA>) { next if /^\s*$/; # skip blank lines # sometimes a last blank line # causes problems! s/\s*$//; # or chomp; fine also my ($date, $description, $measurement) = (split (/,/,$_))[0,1,3]; my ($value, $units) = split (/\s+/,$measurement); print "date = $date\n", "descript = $description\n", "value = $value\n", "units = $units\n\n"; # instead of print, do something here ... } =prints date = 20/08/2007 descript = Erythrocyte sedimentation rate value = 3 units = mm/h date = 20/08/2008 descript = Total white blood count value = 6.7 units = 10*9/L date = 04/04/2007 descript = Haemoglobin estimation 12.9 g/dL value = 12.9 units = g/dL =cut __DATA__ 20/08/2007,Erythrocyte sedimentation rate,,3 mm/h 20/08/2008,Total white blood count,,6.7 10*9/L 04/04/2007,Haemoglobin estimation 12.9 g/dL,,12.9 g/dL
    If a new problem is being discussed, please start a new node.
      It isn't a problem. What I was trying to do was give examples of the underlying data structures that was giving me the initial headache (but I forgot to show what comes after the headers that you cleverly deduced) The data structure is in the format
      Patient ID 1234 NAD 5678 Date Of Birth : 01/01/1965 Sex : yes Postcode : BH78 (as you guessed) Then goes on to the gubbins 20/08/2007,Erythrocyte sedimentation rate,,3 mm/h 20/08/2008,Total white blood count,,6.7 10*9/L 04/04/2007,Haemoglobin estimation 12.9 g/dL,,12.9 g/dL
      From that I split the data to 3 files One is a csv of basic information (anonomysed, all done within an encrypted face and DPA registered) and a second file of identifier, valuename, value A third file gives a simple list of every test that the script has come accross( that's the keys bit of the code) So I need to save the identifier as I go along, and add it to the lines that start with a date. Sorry for not being clear in the first place. But the next step will be to restructre all of this anong the lines that you've taught me. Thanks once again.
        Great!

        I did write some more code for you to demonstrate subroutines and I hope amplify my point about indenting.

        At some point, you may want to sort these test records by date. You actually have a very good basic date format to work from as there are leading zero's for the month and date.

        Below I show some code that has 2 subroutines, one to convert the file's date format into something that can be easily sorted as a simple string and one to convert it back. VERY important here is that it is easy to understand what code belongs to what subroutine!

        Clarity of thought and the grouping of those thoughts into logical, readable "units" is the single most important thing that you can do towards writing "great code".

        Go for the improvements and report back with progress! I understand more about your record format in the input file. And Perl has some very cool ways of dealing with record processing like this. But at this point, you need to do more work before things can proceed further. And I'm sure you will do that.

        #!/usr/bin/perl use strict; use warnings; # the same as #!/usr/bin/perl -w in first line while (<DATA>) { next if /^\s*$/; # skip blank lines chomp; # removes trailing \n but not spaces my ($date) = split (/[,]/,$_); print "date_original = $date\n", "date_sortable = ", year_first($date), "\n", "original date back = ", day_first(year_first($date)), "\n\n"; } sub year_first # convert 20/08/2007 to 2007-08-20 { my ($date) = @_; # one way to get the sub's input value my @tokens = split (/\//, $date); @tokens = reverse @tokens; my $new_date = join('-',@tokens); return $new_date; } sub day_first # convert 2007-08-20 to 20/08/2007 { my $date = shift; # another way to get a single value my @tokens = split (/-/, $date); @tokens = reverse @tokens; my $normal_date = join('/',@tokens); return $normal_date; } =prints date_original = 20/08/2007 date_sortable = 2007-08-20 original date back = 20/08/2007 date_original = 20/08/2008 date_sortable = 2008-08-20 original date back = 20/08/2008 date_original = 04/04/2007 date_sortable = 2007-04-04 original date back = 04/04/2007 =cut __DATA__ 20/08/2007,Erythrocyte sedimentation rate,,3 mm/h 20/08/2008,Total white blood count,,6.7 10*9/L 04/04/2007,Haemoglobin estimation 12.9 g/dL,,12.9 g/dL