Joes has asked for the wisdom of the Perl Monks concerning the following question:

Could you please consider the writing of a Perl script which would enable me (and many others) to convert a non standard GEDCOM file to a standard format acceptable for registration by the Personal Ancestral File system?

I use REUNION as my genealogy program, but it uses Windows based characters, whereas the standard format for genealogy data (GEDCOM) transfer uses ASCII, and a strict format that REUNION does not conform with, I have just recently found out.

GEDCOM is a simple text file, with a specific data structure. Every line has a header, consisting of a number, a space and a 4 characte standard identifier. Each set of data commences with the number 0 (zero)
The main amendments to be done are for every occurrence:
HEAL is changed to NOTE Medical:
HIST is changed to NOTE Biography:
EDUC is changed to NOTE Educated:
RESI is changed to NOTE Resided:
OCCU is changed to NOTE Occupation:

The main problem is in the following:

After each Source identifier 0 @Snnn@ SOUR, each line of the REUNION GEDCOM source data is to be modified as follows:

The first line 1 NAME data......... has to be changed to 1 TITL;

The second 1 NAME data......... has to be changed to 1 TEXT;

The third and subsequent 1 NAME data......... has to be changed to 2 CONT ( extra lines only required if more than 2 lines before the next 0 @Snnn@ SOUR;

The text file has something like 30,000 lines, a bit too many to do manually. There are 400 sources, each starting with 0 SOUR @Snnn@;

If you can assist, I would greatly appreciate it, but I will understand if you cannot.

I can send a sample GEDCOM file to examine.
My kind regards,
Joe

Townsville, Australia

Edit: chipmunk 2001-07-07

Replies are listed 'Best First'.
Re: Converting GEDCOM files
by chipmunk (Parson) on Jul 07, 2001 at 18:23 UTC
    Perl is very well-suited to doing text manipulations. A script to do this conversion might look something like this:
    #!/usr/local/bin/perl -w use strict; $/ = 'SOUR'; # read in one record at a time $^I = '.bak'; # modify input files in place, save originals with .bak my %convert = ( HEAL => 'Medical', HIST => 'Biography', EDUC => 'Educated', RESI => 'Resided', OCCU => 'Occupation', ); my $convert_re = join '|', keys %convert; $convert_re = qr/\b($convert_re)\b/; while (<>) { s/$convert_re/NOTE $convert{$1}:/g; s/1 NAME data/1 TITLE/; s/1 NAME data/1 TEXT/; s/1 NAME data/2 CONT/g; } continue { print; }
    I expect I haven't gotten that quite right. (Does '1 NAME data' appear in the file literally, or is that what certain strings in the file look like?) It would be very helpful to see a snippet of the input file. :) Anyway, that should get you started.
      Thanks Chipmonk -your offer is very much appreciated. I am brand new to this Monastery, so how do I contact you to send a sample .ged file, and one showing the expected outcome? Joe Email: slaven01@ozemail.com.au
There's already a GEDCOM module on CPAN
by petdance (Parson) on Jul 07, 2001 at 20:01 UTC
    There's already a GEDCOM.pm module on CPAN. It works pretty well, although it's not as robust as one might like. I'm considering redoing it w/Parse::RecDescent in my Copious Free Time (tm).

    xoxo,
    Andy
    --
    <megaphone> Throw down the gun and tiara and come out of the float! </megaphone>

      Thanks Andy. As a raw beginner (initiate) to the monastery, what is CPAN, and how do I find the module you refer to? My thanks for responding, Joe.

        CPAN is the Comprehensive Perl Archive Network. It contains a vast array of modules. A module is a library of code that does some related task like parse CGI data for web programs. I had a quick look at Gedcom.pm and while it could help do what you want it was not designed for the task at hand. It was more designed to manipulate Gedcom files in a way that would let you write your own programs.

        For a guide to all this stuff see this tutorial New Monks. It was written to help answer all the sort of questions you probably have.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Converting GEDCOM files
by lshatzer (Friar) on Jul 07, 2001 at 17:58 UTC

    Just doing a quick search, I found a few things.

    1. Reunion appears to be software for the mac, so this might cause some of the problems when importing into a PC (into PAF).
    2. Check out this link, it is on ReunionTalk's message board, about a similar problem.
    3. On another message board, I found this which also might be of some help.
    4. Doing some searches on google for GEDCOM and REUNION with some other keywords such as conversion, etc. might be of some help.
      Thanks for your research. Yuo were my first ever response on this PM site
      It opened up a few new areas for me.
      Reunion was originally develeoped for both Mac and for Windows, but a large software corporation bought out the Windows system, and re-released it under Generations.

      My kind regards,

      Joe, townsville

Re: Converting GEDCOM files
by tachyon (Chancellor) on Jul 07, 2001 at 19:39 UTC

    I can't quite get the jist of what you want with the SOUR part. As you are a fellow Aussie I will post a solution for you if you post: A sample of the SOURCE file with a corresponding example of how the OUTPUT should look. Add any notes before, after but not in the SOURCE and OUTPUT.

    Here is some code to do the first bit you want. You will need perl to use this. For details on getting perl see New Monks part 7. Do you want the latest version of Perl?

    #!/usr/bin/perl -w use strict; my @data; # define a hash of changes to be made my %change = ( HEAL => 'NOTE Medical:', HIST => 'NOTE Biography:', EDUC => 'NOTE Educated:', RESI => 'NOTE Resided:', OCCU => 'NOTE Occupation:', ); # accept input via three mechanisms # if no file specified here: my $file = ''; # then look for a command line argument here: $file = shift @ARGV unless $file; # if we still have no file to process # enter interactive mode here: unless ($file) { print "Please enter a file to process: "; chomp($file = <>); } # get the data open (FILE, "<$file") or die "Unable to open $file: $!"; @data = <FILE>; close FILE; # make the changes for (@data) { for my $key (keys %change) { s/$key/$change{$key}/g; } } # print the changes to a new file with the same name # as the original plus a .new suffix. open (FILE, ">$file.new") or die "Unable to write $file.new: $!"; print FILE @data; close FILE; print "Munged $file no worries!\n";

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Thanks Tachyon -your offer is very much appreciated. I am brand new to this Monastery, so how do I contact you to send a sample .ged file? PS liked your hang glider - flying for 20 years? and still around to tell the tale??

        Yes still here. </BEGIN_RANT> Stats are that you are about twice a likely to kill yourself in a hang glider as in a car. However it is about 5 times safer than riding a motor bike. Twice as safe as horse riding..... Life is a terminal disease you know </END_RANT>

        Anyway I have added an email link to my home node at the bottom so if you don't want to post details here (which would be considered usual) then send it to me there. It is generally considered proper netiquette to continue this discussion in this forum rather than in private but as everyone knows we Aussies have no manners :-) Some of the advantages of doing things in a open forum include allowing others to learn from the discussion, getting second, third,.... opinions on the quality of the code and advice presented. Allowing Monks to show off their skills to their peers is also a big part of how this site works.

        To post files here you use insert the <code> and </code> tags into your text. You just put the file between these tags where the 'and' is and then we can hit a button and download the lot.

        When munging a text file from one format to another the key thing is to know what it looks like now (the source) and what you want it to look like (the output). If you supply that along with the explanation already posted then it should only take a few minutes to complete the program, says he laughing in the face of the God of unforseen consequences!

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Converting GEDCOM files
by chipmunk (Parson) on Jul 09, 2001 at 19:36 UTC
    After examining the sample input and output files that Joes emailed to me, as well as some clarified rules, I think that I was actually quite close with my original script. Here's the final final (see below) version:
    #!/usr/local/bin/perl -w use strict; $/ = "\n0"; # read in one record at a time $^I = '.bak'; # modify input files in place, save originals with .bak my %convert = ( HEAL => 'Medical', HIST => 'Biography', EDUC => 'Educated', RESI => 'Resided', OCCU => 'Occupation', ); my $convert_re = join '|', keys %convert; $convert_re = qr/\b($convert_re)\b/; while (<>) { s/$convert_re/NOTE $convert{$1}:/g; if (/^.*SOUR/) { s/^1 NAME/1 TITL/m; s/^1 NAME/1 TEXT/m; s/^1 NAME/2 CONT/mg; } } continue { print; }
    I made two changes. The input record separator is now "\n0", because each section begins with a line starting with 0. (Although this will have each "\n0" read in as the end of the previous record, that doesn't make a difference for our purposes.) In the substitutions for NAME, I added a beginning of line anchor and removed the word 'data' (turns out that bit was just a placeholder for actual data). Also I fixed my typo of TITLE to be TITL (everything is four letters in this format).

    Update: Make that four changes. The change of NAME to TITL is only supposed to occur for sections that start 0 @Snnn@ SOUR, not for all sections. And I stupidly forgot the /m modifier on those regexes when I added the anchors. Thanks for the notice of the problems, tachyon! (Notwithstanding the suggestion that I'm reading in one line at a time; I'm actually reading in a block of lines ending with a newline followed by a zero, as I intended.)

      Hi chipmunk this does not work :o(

      Update

      Seems fixed now though. Really clever solution. I really like the "\n0" now that I understand it ;-) - a very neat way to chop the file into records, much nicer than the method I used.

      I have learned a very good use for $/ in a practical example - thanks.

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        Hi tachyon.
        1. In the clever little script you did for me, I find that I should have asked for the GEDCOM line 1 CHAR IBM DOS to be changed to 1 CHAR ASCII
        Can you please show me how to insert an amendment in your script?

        2. The standard for the GEDCOM names after each 0 @Inn@ INDI is in the next line:
        1 NAME First_name Middle_name/SURNAME/
        The first name syntax is as shown, with the first character in Upper case.
        The SURNAME between the forward slashes is in UPPERCASE, as shown.
        Unfortunately, in merging many GEDCOM files from various genealogists, we end up with sample names like:
        JOE/SLAVEN/
        Joe McDonald/Slaven/
        Joe McDONALD 'Macca'/SLAVEN/
        /SLAVEN/
        Joe
        Joe/SLAV.../ (this last one when the surname is illegible)
        Any chance of you please having a go at looking at each 1 NAME Firstname Middlename/SURNAME/ line and converting it to the standard syntax?

        3. In selecting the name of the output file, how do I go about naming it as the original input file name filename.ged, but with PAF added, as in filenamePAF.ged My best wishes
        Joe, Townsville, Australia

      Thanks to chipmonk and tachyon. You guys are great - you make this site an really outstanding and professional Monastery, My appreciation and kind regards to you both. I has been a pleasure seeing tow professionals in action. Joe Townsville, Australia
Re: Converting GEDCOM files
by tachyon (Chancellor) on Jul 09, 2001 at 21:01 UTC

    Here is the updated code based on the sample files sent which I have posted at the end for those who are interested. The key word "SOUR" actually appears in several contexts - the 1 NAME bit is not changed unless the line is /^0 @.+@ SOUR$/ which adds a little hook.

    The script has been updated to inlude the substitution of '1 CHAR IBM DOS' to '0 CHAR ASCII' and modify the 1 NAME fields into the standard format.

    Further updates have been added. The script will now handle hypenated last names and multiple last names correctly, it also handles the Mc/Mac problem reasonably well. The output file is written to 'filePAF.ged' where the imput file is 'file.ged'. Overwrite protection has been added with an overwrite prompt if the output file already exists. Files that do not end in 'ged' will be output to 'file.new' ie the '.new' suffix will be added.

    #!/usr/bin/perl -w use strict; my $file = ''; $file = shift @ARGV unless $file; unless ($file) { print "Please input a filename: "; chomp ($file = <>); } my %change = ( HEAL => 'NOTE Medical:', HIST => 'NOTE Biography:', EDUC => 'NOTE Educated:', RESI => 'NOTE Resided:', OCCU => 'NOTE Occupation:'); open (FILE, "<$file") or die "Unable to open $file: $!"; my @data = <FILE>; close FILE; my $change_this = join "|", keys %change; $change_this = qr/($change_this)/; my $found_sour = 0; my $last_line = ''; for (@data) { # make the IBM DOS to ASCII substitution s/1 CHAR IBM DOS/O CHAR ASCII/; s/$change_this/$change{$1}/g; $found_sour = 0 if m/^0/; $found_sour = 1 if m/^0.+SOUR\s*$/; if ($found_sour == 1) { $found_sour++ if s/1 NAME/1 TITL/; } elsif ($found_sour) { s/1 NAME/2 CONT/; } # fix the name problem if ($last_line =~ m/\bINDI\s*$/) { if (m/1\s*NAME/) { s|([\w.]+)|($1 eq "NAME")? $1 :ucfirst lc $1|eg; # the MacDonald fix, the \b (boundary) condition # and the quote check should limit appropriately # Names like Mace and Mack are treated as special # cases as this is the only way to deal with them s # substitute regex / # opening delimiter for regex (?<!["'`]) # look back for *not* a quote char \b # only match at the boundary of a word and + non word (Mc|Mac) # match Mc or Mac and capture in $1 ([\w-]+) # match the rest of the name in $2 / # finish the match part ot this regex # this is what we will use to substitute # we execute this code to get our result # to avoid special cases MacE and MacK # but still allow MacEvoy and MacKinley # we us a conditional ($1.$2 eq 'Mace' or $1.$2 eq 'Mack') ? # if this condition is true $1.$2 # we substitute what we just found back in + with no change : # else $1.ucfirst $2 # we substitute uppercased first letter of + second bit /egx; # the /e is for execute 2nd half # the /g is all cases # the /x allows all these comments! # here it is on one line, uncomment this and delete # the long one if you want # s/(?<!["'`])\b(Mc|Mac)([\w-]+)/($1.$2 eq 'Mace' or $1.$2 + eq 'Mack') ? $1.$2 : $1.ucfirst $2/egx; # uppercase surname if present, we look for pattern # /\w+/ To allow for commas, dots and spaces we # use a character class (the bit in the [ ] ) which # will match alphanumerics, spaces, commas, and dots # and hypens incases like SLEVIN-WINDSOR :-) s|(?<=/)([\w\s,.-]+)(?=/)|uc $1|eg; } } $last_line = $_; # remember the last line for INDI } # change the filename. if this substitution fails we warn that # the file name will be $file.new the /i makes it case insensitive # which is important for DOS based systems unless ($file =~ s/.ged$/PAF.ged/i) { warn "$file does not end in '.ged', saving as $file.new\n"; $file .= '.new'; } # let's add a test to avoid accidentaly overwriting files if (-e $file) { print "$file exists, overwrite (Y/N)? "; my $input = <>; unless ($input =~ m/y/i) { print "Aborting!\n"; exit; } } open (FILE, ">$file") or die "Unable to create $file: $!"; print FILE @data; close FILE; print "Data written to $file OK!";

    To see the commented code and the sample files