Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file that has numerous entries that look like the following:
Drive at Tray 0, Slot 5
Raw capacity: 68.366 GB
Usable capacity: 67.866 GB
Current data rate: 2 Gbps
Product ID: ST373453FC
Mode: Unassigned
What I want is the average raw capacity (68.366 in this example) for all unassigned drives. The problem is that there are both assigned and unassigned drives. Is there any way to search on Unassigned and get the raw capacity for just the unassigned drives? Thanks for any help.

Replies are listed 'Best First'.
Re: Get Unassigned drive average
by BrowserUk (Patriarch) on Dec 30, 2005 at 14:20 UTC

    Something like this?

    #! perl -slw
    use strict;

    # Read one record at a time: each record ends where the next 'Drive' begins.
    $/ = 'Drive';

    while( <DATA> ) {
        print "$1 : $2"
            if m[(Tray .*?)\nRaw capacity: (.*?)\n.*Unassigned\s+]s;
    }

    __DATA__
    Drive at Tray 0, Slot 1
    Raw capacity: 68.366 GB
    Usable capacity: 67.866 GB
    Current data rate: 2 Gbps
    Product ID: ST373453FC
    Mode: Assigned
    Drive at Tray 0, Slot 2
    Raw capacity: 68.366 GB
    Usable capacity: 67.866 GB
    Current data rate: 2 Gbps
    Product ID: ST373453FC
    Mode: Unassigned
    Drive at Tray 0, Slot 3
    Raw capacity: 68.366 GB
    Usable capacity: 67.866 GB
    Current data rate: 2 Gbps
    Product ID: ST373453FC
    Mode: Assigned
    Drive at Tray 0, Slot 4
    Raw capacity: 68.366 GB
    Usable capacity: 67.866 GB
    Current data rate: 2 Gbps
    Product ID: ST373453FC
    Mode: Unassigned
    Drive at Tray 0, Slot 5
    Raw capacity: 68.366 GB
    Usable capacity: 67.866 GB
    Current data rate: 2 Gbps
    Product ID: ST373453FC
    Mode: Assigned

    Output

    P:\test>junk3
    Tray 0, Slot 2 : 68.366 GB
    Tray 0, Slot 4 : 68.366 GB
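
    Since the question asks for an average rather than a listing, here is a minimal sketch of how the same record-at-a-time idea might be carried through to a single figure. The helper name average_unassigned and the sample data are made up for illustration, and the sketch assumes every capacity is reported in GB.

```perl
use strict;
use warnings;

# Hypothetical helper: average raw capacity (in GB) of unassigned drives.
# Assumes each record starts with the word 'Drive' at the start of a line.
sub average_unassigned {
    my ($text) = @_;
    my ($total, $count) = (0, 0);

    # Split just before each line that begins with 'Drive'.
    for my $record (split /^(?=Drive\b)/m, $text) {
        next unless $record =~ /\bUnassigned\b/;
        next unless $record =~ /Raw capacity:\s*([\d.]+)/;
        $total += $1;
        $count++;
    }

    # undef signals "no unassigned drives", avoiding a division by zero.
    return $count ? $total / $count : undef;
}

my $sample = <<'END';
Drive at Tray 0, Slot 1
Raw capacity: 68.366 GB
Mode: Assigned
Drive at Tray 0, Slot 2
Raw capacity: 68.366 GB
Mode: Unassigned
Drive at Tray 0, Slot 4
Raw capacity: 88.366 GB
Mode: Unassigned
END

my $avg = average_unassigned($sample);
printf "Average raw capacity of unassigned drives: %.3f GB\n", $avg;
```

    Because each record is examined as a whole, a capacity can never leak from one drive into the next.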

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Get Unassigned drive average
by TedPride (Priest) on Dec 30, 2005 at 19:48 UTC
    Yes, but what if the format turns out to not be exactly the same as shown? It's safer to record every raw capacity in turn, then if you run across Unassigned, add the raw capacity to the total:
    use strict;
    use warnings;

    my ($raw, $capacity, $unassigned);

    while (<DATA>) {
        # Remember the most recent raw capacity seen.
        $raw = $1 if m/Raw capacity: (\d+\.\d+) GB/;

        # On an Unassigned line, add the remembered capacity to the total.
        if (m/Mode: Unassigned/) {
            $capacity += $raw;
            $unassigned++;
        }
    }

    # Print the average, rounded to three decimal places.
    print int ($capacity / $unassigned * 1000 + .5) / 1000;

    __DATA__
    Drive at Tray 0, Slot 5
    Raw capacity: 68.366 GB
    Usable capacity: 67.866 GB
    Current data rate: 2 Gbps
    Product ID: ST373453FC
    Mode: Unassigned
    Drive at Tray 0, Slot 5
    Raw capacity: 48.366 GB
    Usable capacity: 67.866 GB
    Current data rate: 2 Gbps
    Product ID: ST373453FC
    Mode: Assigned
    Drive at Tray 0, Slot 5
    Raw capacity: 88.366 GB
    Usable capacity: 67.866 GB
    Current data rate: 2 Gbps
    Product ID: ST373453FC
    Mode: Unassigned

      I'm not quite sure why the output from the utility producing this file would suddenly start to vary, but you have a point about resilience.

      However, your code provides very little (if any) extra resilience compared to mine, and exposes several extra weaknesses:

      1. If a record has the 'Unassigned' line, but no 'Raw capacity' line, then you will wrongly use the capacity from the preceding record.
      2. If the record correctly has both the required lines, but they are in reverse order, you would again wrongly use the preceding record's capacity.
      3. You've an extra dependency that the string 'Mode: Unassigned' be present, formatted exactly as specified, and correctly spelt.
      4. You've added the constraint that the drive capacity be specified with at least one decimal place on the figure, and that it be reported in GB.

        What happens if the drive has exactly '80 GB' or is reported in 'MB' or 'TB' or 'GiB' or 'Gigabytes' or...?
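
      Point 1 can be reproduced with a small sketch. The sub scan_lines and the sample data are hypothetical, written only to mimic the line-by-line approach; the optional reset at each 'Drive' line is one possible mitigation, not something either post proposes.

```perl
use strict;
use warnings;

# Line-by-line scan: remember the most recent raw capacity and collect it
# when an 'Unassigned' line turns up. With $reset true, the remembered
# value is cleared at the start of every record.
sub scan_lines {
    my ($text, $reset) = @_;
    my ($raw, @seen);
    for my $line (split /\n/, $text) {
        undef $raw if $reset && $line =~ /^Drive\b/;
        $raw = $1 if $line =~ /Raw capacity: (\d+\.\d+) GB/;
        push @seen, $raw if $line =~ /Mode: Unassigned/ && defined $raw;
    }
    return @seen;
}

# Made-up sample: the second record has lost its 'Raw capacity' line.
my $sample = <<'END';
Drive at Tray 0, Slot 1
Raw capacity: 48.366 GB
Mode: Assigned
Drive at Tray 0, Slot 2
Mode: Unassigned
END

my @naive = scan_lines($sample, 0);   # wrongly picks up 48.366 from Slot 1
my @safer = scan_lines($sample, 1);   # finds nothing to count

print "naive: @naive\n";
print "safer: ", scalar(@safer), " value(s)\n";
```

      Without the reset, the assigned drive in Slot 1 silently contributes its capacity to the unassigned total.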

      The general rule with regexes (one I have followed since someone here suggested it to me way back) is to specify the regex as loosely as possible, commensurate with obtaining the information required.
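
      As an illustration of that rule applied to point 4, here is a rough sketch of a looser capture that accepts whole numbers and more than one unit. The helper raw_capacity_gb and the %to_gb table are assumptions made for the example; in particular, the decimal multipliers are a guess at what a reporting tool might mean.

```perl
use strict;
use warnings;

# Assumed unit multipliers (decimal); adjust if the tool reports binary units.
my %to_gb = (MB => 1e-3, GB => 1, TB => 1e3);

# Hypothetical helper: return the raw capacity in GB, or undef if the line
# is missing or uses a unit we do not know how to convert.
sub raw_capacity_gb {
    my ($record) = @_;
    # Loose match: any whitespace, optional decimals, any unit word.
    return unless $record =~ /Raw capacity:\s*(\d+(?:\.\d+)?)\s*(\w+)/i;
    my ($value, $unit) = ($1, uc $2);
    return unless exists $to_gb{$unit};
    return $value * $to_gb{$unit};
}

for my $line ("Raw capacity: 68.366 GB", "Raw capacity: 80 GB", "Raw capacity:  2 TB") {
    my $gb = raw_capacity_gb($line);
    print "$line => ", (defined $gb ? "$gb GB" : "unrecognised"), "\n";
}
```

      An unknown unit comes back as undef rather than being silently skipped, so the caller can decide whether to warn or ignore it.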

      I'd also suggest that processing multi-line records line by line is a dangerous practice if there is any scope for variability in the number, or ordering, of the lines that make up those records.

      All that said, you have a point regarding resilience, and here is a technique that allows for some considerable resilience in ordering of elements, whether single or multi-line, whilst avoiding most of the traps:

      Which produces:

      P:\test>junk3
      0, Slot 2 : 68.366 GB
      0, Slot 3 : 68.366 GB
      0, Slot 4 : 68.366 GB
      0, Slot 5 : 68.366 GB
      Badly formatted record:
      ----------------------------------------
       at Tray 0, Slot 6
      Raw capocity: 68.366 GB
      Usable capacity: 67.866 GB
      Current data rate: 2 Gbps
      Product ID: ST373453FC
      Mode: Unassigned
      ----------------------------------------

      The basic idea is to place the captures within zero-length assertions so that the ordering of the elements captured can vary completely, but the match and captures will still be made if all the required elements are present. It also ensures that the same elements will appear in the same capture vars ($1, $2 etc.) regardless of their ordering in the record, which avoids the problem of knowing what has been captured to where.
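
      A minimal sketch of that idea follows. It is a reconstruction for illustration only, not the code from this reply; the sub parse_record and the two sample records are made up.

```perl
use strict;
use warnings;

# Each capture sits inside its own lookahead anchored at the start of the
# record, so the fields may appear in any order but always land in the
# same capture variable. The s flag lets '.' cross line boundaries.
my $re = qr{
    \A
    (?= .*? Tray \s+ ( [^\n]+ ) )                 # capture 1: tray and slot
    (?= .*? Raw \s+ capacity: \s* ( [^\n]+ ) )    # capture 2: raw capacity
    (?= .*? Unassigned )                          # record must be unassigned
}xs;

# Hypothetical helper: returns (location, capacity) or an empty list.
sub parse_record {
    my ($record) = @_;
    return $record =~ $re ? ($1, $2) : ();
}

my $usual    = "Drive at Tray 0, Slot 2\nRaw capacity: 68.366 GB\nMode: Unassigned\n";
my $shuffled = "Mode: Unassigned\nRaw capacity: 68.366 GB\nDrive at Tray 0, Slot 2\n";

for my $record ($usual, $shuffled) {
    my ($where, $raw) = parse_record($record);
    print defined $where ? "$where : $raw\n" : "Badly formatted record\n";
}
```

      Both records yield the same pair of values, and a record with a misspelt or missing field fails the match as a whole, which gives a natural place to report it.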

      An extension of this technique is that it allows you to specify all the elements to be captured in a different order (in the regex) to the order in which they will appear in the data. This is extremely useful when some elements are optional, as you can arrange for the non-optional elements to be returned first and so avoid the game of deciding what got captured into each of the capture vars.
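
      To make the optional-element case concrete, here is a rough sketch in which the two required fields are listed first in the regex and an optional one last, whatever their order in the data. Treating 'Product ID' as optional is purely an assumption for the example, as is the sub name parse_drive.

```perl
use strict;
use warnings;

# Required fields come first in the regex, the optional one last. The
# empty alternative in the final lookahead lets it succeed even when the
# field is absent, leaving capture 3 undefined.
my $re = qr{
    \A
    (?= .*? Raw \s+ capacity: \s* ( [^\n]+ ) )    # capture 1: required
    (?= .*? Mode: \s* ( \w+ ) )                   # capture 2: required
    (?= .*? Product \s+ ID: \s* ( \S+ ) | )       # capture 3: optional
}xs;

# Hypothetical helper: returns (capacity, mode, product) or an empty list.
sub parse_drive {
    my ($record) = @_;
    return unless $record =~ $re;
    return ($1, $2, $3);    # third element is undef when Product ID is absent
}

my @records = (
    "Product ID: ST373453FC\nMode: Unassigned\nRaw capacity: 68.366 GB\n",
    "Mode: Assigned\nRaw capacity: 80 GB\n",
);

for my $record (@records) {
    my @fields = parse_drive($record) or next;
    my ($raw, $mode, $product) = @fields;
    printf "%s, %s, %s\n", $raw, $mode, $product // 'no product ID';
}
```

      The caller always finds the capacity and mode in the first two positions and only has to test the third for definedness.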

