in reply to Code Critique

The lack of any systematic indenting and spacing of the code is a very serious flaw. Trying to read your code is like trying to read a book which has no chapters, paragraphs or even punctuation marks! This is so serious that this code would not get more than a "D" in a college class even if it did work. I'm not saying that to be mean or condescending, you may not be aware of just how important this.

There is more than one way to do the indenting. As a beginner at this, I would suggest the "line up the braces method", like the example below. Use 3 or 4 extra spaces between levels. Alternately, you can use the diagonal brace method - its just as valid, but I suggest trying this method for awhile and see how you like it. Judiciously applied whitespace is one the most important things that you can do to improve readability!

while (condition) { .... if () { .... } .... }
Another systemic flaw is that your code doesn't have locality of functionality - related things are in widely separated areas. Some examples... I would open all of the files at the beginning of the code. BTW, I suspect some better names than "OUTPUT3" or "OUTPUT2"can be thought of... like maybe CSV or SORTED? I see print OUTFILE3 $lookup;. Don't defer output til later. Do right after you calculate it! I don't think that you even need this $lookup variable...print the lines as you go. print CSV "$ID,$nhs,$postcode,$dob,$sex\n"; Likewise, print OUTFILE2 @sorted;. Put that right after the sort.

A whole bunch of these chomp; statements appear to unnecessary. Try to factor out common code. Perhaps;

while (my $line = <INFILE>) { $line =~ s/^\s*//; # remove leading whitespace $line =~ s/\s*$//; # remove ending whitespace (includes \n) # chomp is not necessary ... }
Anyway think about the suggestions you've gotten and take another stab at this. Get it running with "use strict; use warnings;" and clean-up formatting.

Replies are listed 'Best First'.
Re^2: Code Critique
by rhiridflaidd (Novice) on Oct 04, 2010 at 23:32 UTC

    That example is tame.What I'm writing now is running into 700 lines and, yep, you guessed it, understanding what the heck I was thinking is now becoming the problem - hence the need to get some rhyme and reason into what I'm doing.

    And re it working - course it works!! It's a funny data set in this format

    20/08/2007,Erythrocyte sedimentation rate,,3 mm/h 20/08/2008,Total white blood count,,6.7 10*9/L and sometimes 04/04/2007,Haemoglobin estimation 12.9 g/dL,,12.9 g/dL And a myriad other units.
    The task is to strip away the units and leave me with Date, Test, Data Being a pragmatist, getting things working is the challenge. But before packing my bags on this, I need to make it understandable if I ever need to revisit it. A dataset is around half a million lines, and the diferent types of units legion.

      Ah, your programs are becoming complex enough that the realization that there are always 2 people who need to understand the code is starting to sink in. Those 2 people are: (a) you when you write it and (b) you when you have to extend it or fix at some time in the future! And if it is a useful program, one of these 2 things and probably both will happen!

      Some simplifications to the code will be possible, based upon your first example, I recoded the first part with some straightforward techniques and with just borrowing pretty much your regex'es. Your actual record is more complex. However, I hope that the lesson about spacing and indenting is becoming more clear - it helps a LOT.

      #!/usr/bin/perl -w use strict; my $patient_id; #patient record for the CSV file my $nhs; my $dob; my $sex; my $postcode; my $total_patients=0; while (<DATA>) { s/^\s*//; #remove leading space s/\s*$//; #remove trailing space $patient_id = $1 if m/^Patient ID .*?(\d+)$/; $nhs = $1 if m/^NAD .*?(\d+)$/; $dob = $1 if m/^Date Of Birth .*?([\/\d]+)$/; $sex = $1 if m/^Sex .*?(\w+)$/; #record ends with postal code -> output to the CSV file if ( ($postcode )= m/^Postcode .*?(\w+)$/ ) { print "$patient_id,$nhs,$postcode,$dob,$sex\n"; #CSV line $total_patients++; # this is here instead of with $patient_id # because it is here that we have an # "official" patient record # this starts variables "over again". A runtime warning will # happen if for some reason one of these winds up still # being undefined when the record is printed to the CSV file # # note: doing this for $postcode is not necessary, but I # treat this variable same as the others because this lessens # the chance of a mistake when modifying the program later. $patient_id = $nhs = $dob = $sex = $postcode = undef; + } } print "Total Patients: $total_patients\n"; =prints 1234,5678,BH78,01/01/1965,yes 4321,999,XYZ98,10/10/2007,no Total Patients: 2 =cut __DATA__ Patient ID 1234 NAD 5678 Date Of Birth : 01/01/1965 Sex : yes Postcode : BH78 Patient ID 4321 NAD 999 Date Of Birth : 10/10/2007 Sex : no Postcode : XYZ98
        Simply moving the { makes such a difference and xxx = $1 again clarifies thinggs so much from an if loop. Thanks again.
      I'm not sure what you want in this second programming problem. Maybe you could explain it a bit more?

      This is a very simple CSV file (no quote characters or embedded quotes within quotes), so just split on the comma's and then do another split on space to get the basic value and measurement units from the line. I used what is called a list slice to assign directly to variables without any numeric subscript stuff.

      The "formula" for this sort of thing is: open files, read and separate the input data into one "record" (a "quanta" if you will that makes sense for processing), either save that for processing in a collection of those "records" or process the records as you go.

      In your first example, I don't think there is a need to save anything past the "current record". For this one, I don't know.. explain more and show more...

      #!/usr/bin/perl use strict; use warnings; # the same as #!/usr/bin/perl -w in first line while (<DATA>) { next if /^\s*$/; # skip blank lines # sometimes a last blank line # causes problems! s/\s*$//; # or chomp; fine also my ($date, $description, $measurement) = (split (/,/,$_))[0,1,3]; my ($value, $units) = split (/\s+/,$measurement); print "date = $date\n", "descript = $description\n", "value = $value\n", "units = $units\n\n"; # instead of print, do something here ... } =prints date = 20/08/2007 descript = Erythrocyte sedimentation rate value = 3 units = mm/h date = 20/08/2008 descript = Total white blood count value = 6.7 units = 10*9/L date = 04/04/2007 descript = Haemoglobin estimation 12.9 g/dL value = 12.9 units = g/dL =cut __DATA__ 20/08/2007,Erythrocyte sedimentation rate,,3 mm/h 20/08/2008,Total white blood count,,6.7 10*9/L 04/04/2007,Haemoglobin estimation 12.9 g/dL,,12.9 g/dL
      If a new problem is being discussed, please start a new node.
        It isn't a problem. What I was trying to do was give examples of the underlying data structures that was giving me the initial headache (but I forgot to show what comes after the headers that you cleverly deduced) The data structure is in the format
        Patient ID 1234 NAD 5678 Date Of Birth : 01/01/1965 Sex : yes Postcode : BH78 (as you guessed) Then goes on to the gubbins 20/08/2007,Erythrocyte sedimentation rate,,3 mm/h 20/08/2008,Total white blood count,,6.7 10*9/L 04/04/2007,Haemoglobin estimation 12.9 g/dL,,12.9 g/dL
        From that I split the data to 3 files One is a csv of basic information (anonomysed, all done within an encrypted face and DPA registered) and a second file of identifier, valuename, value A third file gives a simple list of every test that the script has come accross( that's the keys bit of the code) So I need to save the identifier as I go along, and add it to the lines that start with a date. Sorry for not being clear in the first place. But the next step will be to restructre all of this anong the lines that you've taught me. Thanks once again.