rje has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow perlmonks,

There is a text-based data format used in the Traveller role-playing game that describes the general characteristics of a star system. The format has a few varieties, but Perl is well-suited to parse it. My question to you is: what methods would you suggest to parse it?

First, the data assumes the following forms:
    Old Style       0101 A123456-7 B Lo Po De A G
    Modern Style    0101 A123456-7 B Lo Po De A 321 Im G5 III
    Extended Style  0101 A123456-7 B Lo Po De A 321 Im G5 III :0102,0103,0104
                    ^^^^^^^^^^^^^^

The "under-careted" fields above are the anchor, the fields which always retain a specific format and width.

All three formats have the same initial fields:
  • The first group of columns, usually (but not always) 15 characters long, contains the system name. This name can contain spaces and non-word (\W) characters.
  • A 4-digit coordinate number is next (\d{4}).
  • A 9-character data string is next (\w{7}-\w).
  • After 1-2 spaces, there may be an optional character.
  • After this, a set of codes, all 2+ chars long. These codes can safely be ignored.
  • Then an optional "travel advisory code", A=Amber, R=Red.

    In the "Old Style", a trailing 'G' denotes the presence of a gas giant, while its absence means there ain't one.

    In the other styles, there are instead three digits which represent yet more data, followed by a 2-character allegiance code and one or more star classifications.

    Finally, in the extended style, there may be a colon followed by a comma-separated list of trade route indicators.
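
The three styles described above can be told apart by looking only at what follows the anchor. The sketch below is one possible way to do that, not an established routine: `guess_style` is a hypothetical helper name, and it assumes the tail contains a colon only when a route list is present and a lone three-digit field only in the newer styles. An extended-style line with no route list is indistinguishable from a modern one and is reported as 'modern'.

```perl
use strict;
use warnings;

# Hypothetical helper: guess which of the three styles a line uses.
sub guess_style {
    my ($line) = @_;

    # Keep everything after the anchor (4-digit coordinate + 9-char data string).
    my ($tail) = $line =~ /\d{4}\s+\w{7}-\w(.*)$/
        or return;    # no anchor found: not a data line

    return 'extended' if $tail =~ /\s:\S/;             # colon-led route list
    return 'modern'   if $tail =~ /\s\d{3}\s+\S\S\s/;  # 3 digits, then allegiance
    return 'old';
}

for my $line (
    'Old Style       0101 A123456-7 B Lo Po De A G',
    'Modern Style    0101 A123456-7 B Lo Po De A 321 Im G5 III',
    'Extended Style  0101 A123456-7 B Lo Po De A 321 Im G5 III :0102,0103,0104',
) {
    printf "%-8s <- %s\n", guess_style($line), $line;
}
```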

    In the past, I have used a pair of gargantuan regular expressions to rip out the data; however, I've been thinking there are more elegant ways to deal with it.
    For instance, I can take a lot of data out with a match and a subsequent split:
    foreach (@row) {
        # capture the whole 15-character name field, not just its last character
        my ($name, $rest) = /^(.{15})(.*)$/;
        my @data = split( ' ', $rest, 3 );
    }
    But really it's better to just do an initial match:
    my ($name, $loc, $upp, $rest) = /^(\w.*\w)\s*(\d{4})\s*(\S{9}) (.*)$/;

    Then the 'rest' can be determined and the data parsed out as necessary:
    my ($code, $pbg, $allegiance, $more) = $rest =~ /(A|R)?\s+(\d\d\d)\s+(..)\s+(.*)$/;
    or
    my $gg = $rest =~ /G/;

    It's an interesting problem to me, maybe a golf problem, and I'm interested in seeing what folks can come up with.

    rje
  • Replies are listed 'Best First'.
    Re: Traveller Parsing
    by rje (Deacon) on Dec 03, 2003 at 18:18 UTC
      Well, it looks like a flexible way for dealing with this data is to first rip out the 'mandatory' or 'anchored' elements, and then tokenize the rest of the line and examine it word-by-word. Thus I'm not tied to whitespace conventions at all, which is nice.
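
A minimal sketch of that anchor-then-tokenize idea follows. The function name `parse_line` and the hash keys are made up for illustration, and the classification rules are assumptions drawn from the field description in the question: a lone first token is taken as the optional character, a later lone A or R as the travel advisory code, a lone G as the old-style gas giant flag, and anything after the three-digit field and allegiance as star data.

```perl
use strict;
use warnings;

# Sketch: pull out the anchored fields, then classify the rest word by word.
sub parse_line {
    my ($line) = @_;

    my ($name, $loc, $upp, $tail) =
        $line =~ /^(.*?)\s*(\d{4})\s+(\w{7}-\w)\s*(.*?)\s*$/
        or return;    # no anchor: not a data line

    my %sys = ( name => $name, loc => $loc, upp => $upp,
                codes => [], stars => [], routes => [] );

    my @tok = split ' ', $tail;

    # Assumption: a single character right after the data string is the
    # optional character mentioned in the format description.
    $sys{base} = shift @tok if @tok && length($tok[0]) == 1;

    while (@tok) {
        my $t = shift @tok;
        if ($t =~ /^:(.*)/) {
            # Route list: rejoin any remaining tokens, then split on commas.
            my $list = join '', $1, @tok;
            @tok = ();
            $sys{routes} = [ split /,/, $list ];
        }
        elsif ($t =~ /^\d{3}$/)   { $sys{pbg} = $t; $sys{allegiance} = shift @tok }
        elsif (defined $sys{pbg}) { push @{ $sys{stars} }, $t }   # after allegiance
        elsif ($t =~ /^[AR]$/)    { $sys{zone} = $t }             # Amber / Red
        elsif ($t eq 'G')         { $sys{gas_giant} = 1 }         # old style only
        else                      { push @{ $sys{codes} }, $t }   # 2+ char codes
    }
    return \%sys;
}

my $sys = parse_line('Modern Style    0101 A123456-7 B Lo Po De A 321 Im G5 III');
print "$sys->{name} at $sys->{loc}: stars = @{ $sys->{stars} }\n";
```

Because the tail is split on whitespace, column alignment no longer matters. The price is ambiguity when the optional character is absent and the first token happens to be a lone A, R or G, which this sketch does not try to resolve.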

      NOW I have to decide how to package the 'parser'. There are a number of things to think about, including what data should be classed and what data should just exist.

      I figure there will have to be a class to represent "The Universe": a top-level container to hold discrete chunks of space, which represent a parsec of data each.

      Next, the 'chunk of space' itself must be a class, as it contains zero or more bodies in space. In Traveller parlance this is called a 'hex', though it could more generally be called a 'parsec'.

      Next, the bodies themselves must be classes: asteroids, moons, planets, gas giants, and stars. Some of these objects can 'contain' objects in an orbit, and some can contain objects on their surface.
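
To make the containment idea concrete, here is a bare-bones sketch of the three layers. All package and method names (`Universe`, `Hex`, `Body`, `add_orbit`, and so on) are placeholders invented for this example, and a single `Body` class with a `kind` field stands in for separate star, planet and moon classes.

```perl
use strict;
use warnings;

package Body;        # a star, gas giant, planet, moon, or asteroid
sub new {
    my ($class, %args) = @_;
    my $self = {
        kind    => $args{kind},
        name    => $args{name},
        orbits  => [],    # bodies orbiting this one
        surface => [],    # objects on this body's surface
    };
    return bless $self, $class;
}
sub add_orbit { my ($self, $body) = @_; push @{ $self->{orbits} }, $body; return $body }
sub orbits    { my ($self) = @_; return @{ $self->{orbits} } }

package Hex;         # one parsec of space holding zero or more bodies
sub new    { my ($class, $loc) = @_; return bless { loc => $loc, bodies => [] }, $class }
sub add    { my ($self, $body) = @_; push @{ $self->{bodies} }, $body; return $body }
sub bodies { my ($self) = @_; return @{ $self->{bodies} } }

package Universe;    # top-level container, keyed by 4-digit coordinate
sub new { my ($class) = @_; return bless { hexes => {} }, $class }
sub hex {
    my ($self, $loc) = @_;
    # Create the hex on first access so empty parsecs cost nothing.
    return $self->{hexes}{$loc} ||= Hex->new($loc);
}

package main;

my $u    = Universe->new;
my $star = $u->hex('0101')->add( Body->new( kind => 'star', name => 'G5 III' ) );
$star->add_orbit( Body->new( kind => 'gas giant', name => 'unnamed' ) );

printf "hex 0101 holds %d body, with %d in orbit\n",
    scalar( $u->hex('0101')->bodies ), scalar( $star->orbits );
```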

      At this point, it might be nice to see if CPAN has some astronomy packages available that I can use instead of writing a half-baked variety of my own.

      ...well, there are astro modules there, but they only look sort of generically useful -- functions and interfaces for real-life data, etc. Anyone have any suggestions?