Re: Converting GEDCOM files
by chipmunk (Parson) on Jul 07, 2001 at 18:23 UTC
|
Perl is very well-suited to doing text manipulations. A script to do this conversion might look something like this:
#!/usr/local/bin/perl -w
use strict;
$/ = 'SOUR'; # read in one record at a time
$^I = '.bak'; # modify input files in place, save originals with .bak
my %convert = (
HEAL => 'Medical',
HIST => 'Biography',
EDUC => 'Educated',
RESI => 'Resided',
OCCU => 'Occupation',
);
my $convert_re = join '|', keys %convert;
$convert_re = qr/\b($convert_re)\b/;
while (<>) {
s/$convert_re/NOTE $convert{$1}:/g;
s/1 NAME data/1 TITLE/;
s/1 NAME data/1 TEXT/;
s/1 NAME data/2 CONT/g;
} continue {
print;
}
I expect I haven't gotten that quite right. (Does '1 NAME data' appear in the file literally, or is that what certain strings in the file look like?) It would be very helpful to see a snippet of the input file. :) Anyway, that should get you started. | [reply] [d/l] |
|
|
Thanks Chipmonk -your offer is very much appreciated. I am brand new to this Monastery, so
how do I contact you to send a sample .ged file, and one showing the expected outcome?
Joe
Email: slaven01@ozemail.com.au
| [reply] |
There's already a GEDCOM module on CPAN
by petdance (Parson) on Jul 07, 2001 at 20:01 UTC
|
There's already a GEDCOM.pm module on CPAN.
It works pretty well, although it's not as robust as one might like. I'm considering redoing it w/Parse::RecDescent in my Copious Free Time (tm).
xoxo,
Andy
--
<megaphone>
Throw down the gun and tiara and come out of the float!
</megaphone>
| [reply] |
|
|
Thanks Andy. As a raw beginner (initiate) to the monastery, what is CPAN, and how do I find the module you refer to?
My thanks for responding,
Joe.
| [reply] |
|
|
CPAN is the Comprehensive Perl Archive Network. It contains a vast array of modules.
A module is a library of code that does some related task like parse CGI data for web programs. I had
a quick look at Gedcom.pm and while it could help do what you want it was not designed for the task
at hand. It was more designed to manipulate Gedcom files in a way that would let you write your own programs.
For a guide to all this stuff see this tutorial New Monks. It was written to help answer all the sort
of questions you probably have.
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
| [reply] |
Re: Converting GEDCOM files
by lshatzer (Friar) on Jul 07, 2001 at 17:58 UTC
|
Just doing a quick search, I found a few things.
- Reunion appears to be software for the mac, so this might cause some of the problems when importing into a PC (into PAF).
- Check out this link, it is on ReunionTalk's message board, about a similar problem.
- On another message board, I found this which also might be of some help.
- Doing some searches on google for GEDCOM and REUNION with some other keywords such as conversion, etc. might be of some help.
| [reply] |
|
|
Thanks for your research. Yuo were my first ever response on this PM site It opened up a few new areas for me.
Reunion was originally develeoped for both Mac and for Windows, but a large software corporation bought out the Windows system, and re-released it under Generations.
My kind regards,
Joe, townsville
| [reply] |
Re: Converting GEDCOM files
by tachyon (Chancellor) on Jul 07, 2001 at 19:39 UTC
|
I can't quite get the jist of what you want with the SOUR part. As you are a fellow Aussie I will post a solution for you if you post: A sample of the SOURCE file with a corresponding example of how the OUTPUT should look. Add any notes before, after but not in the SOURCE and OUTPUT.
Here is some code to do the first bit you want. You will need perl to use this. For details on getting perl see New Monks part 7. Do you want the latest version of Perl?
#!/usr/bin/perl -w
use strict;
my @data;
# define a hash of changes to be made
my %change = ( HEAL => 'NOTE Medical:',
HIST => 'NOTE Biography:',
EDUC => 'NOTE Educated:',
RESI => 'NOTE Resided:',
OCCU => 'NOTE Occupation:',
);
# accept input via three mechanisms
# if no file specified here:
my $file = '';
# then look for a command line argument here:
$file = shift @ARGV unless $file;
# if we still have no file to process
# enter interactive mode here:
unless ($file) {
print "Please enter a file to process: ";
chomp($file = <>);
}
# get the data
open (FILE, "<$file") or die "Unable to open $file: $!";
@data = <FILE>;
close FILE;
# make the changes
for (@data) {
for my $key (keys %change) {
s/$key/$change{$key}/g;
}
}
# print the changes to a new file with the same name
# as the original plus a .new suffix.
open (FILE, ">$file.new") or die "Unable to write $file.new: $!";
print FILE @data;
close FILE;
print "Munged $file no worries!\n";
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
| [reply] [d/l] |
|
|
Thanks Tachyon -your offer is very much appreciated.
I am brand new to this Monastery, so how do I contact you to send a sample .ged file?
PS liked your hang glider - flying for 20 years? and still around to tell the tale??
| [reply] |
|
|
Yes still here. </BEGIN_RANT> Stats are that you are about twice a likely to kill yourself in a hang glider as in a car.
However it is about 5 times safer than riding a motor bike. Twice as safe as horse riding..... Life is a terminal disease you know </END_RANT>
Anyway I have added an email link to my home node at the bottom so if you don't
want to post details here (which would be considered usual) then send it to me there. It is generally considered proper netiquette
to continue this discussion in this forum rather than in private but as everyone knows we Aussies have no manners :-) Some of the
advantages of doing things in a open forum include allowing others to learn from the discussion, getting second, third,.... opinions on
the quality of the code and advice presented. Allowing Monks to show off their skills to their peers is also a big part of how this site works.
To post files here you use insert the <code> and </code> tags into your text. You just put the file
between these tags where the 'and' is and then we can hit a button and download the lot.
When munging a text file from one format to another the key thing is to know what it
looks like now (the source) and what you want it to look like (the output). If you supply that along with the explanation
already posted then it should only take a few minutes to complete the program, says he laughing in the face of the God of
unforseen consequences!
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
| [reply] |
|
|
|
|
Re: Converting GEDCOM files
by chipmunk (Parson) on Jul 09, 2001 at 19:36 UTC
|
After examining the sample input and output files that Joes emailed to me, as well as some clarified rules, I think that I was actually quite close with my original script. Here's the final final (see below) version:
#!/usr/local/bin/perl -w
use strict;
$/ = "\n0"; # read in one record at a time
$^I = '.bak'; # modify input files in place, save originals with .bak
my %convert = (
HEAL => 'Medical',
HIST => 'Biography',
EDUC => 'Educated',
RESI => 'Resided',
OCCU => 'Occupation',
);
my $convert_re = join '|', keys %convert;
$convert_re = qr/\b($convert_re)\b/;
while (<>) {
s/$convert_re/NOTE $convert{$1}:/g;
if (/^.*SOUR/) {
s/^1 NAME/1 TITL/m;
s/^1 NAME/1 TEXT/m;
s/^1 NAME/2 CONT/mg;
}
} continue {
print;
}
I made two changes. The input record separator is now "\n0", because each section begins with a line starting with 0. (Although this will have each "\n0" read in as the end of the previous record, that doesn't make a difference for our purposes.) In the substitutions for NAME, I added a beginning of line anchor and removed the word 'data' (turns out that bit was just a placeholder for actual data). Also I fixed my typo of TITLE to be TITL (everything is four letters in this format).
Update: Make that four changes. The change of NAME to TITL is only supposed to occur for sections that start 0 @Snnn@ SOUR, not for all sections. And I stupidly forgot the /m modifier on those regexes when I added the anchors. Thanks for the notice of the problems, tachyon! (Notwithstanding the suggestion that I'm reading in one line at a time; I'm actually reading in a block of lines ending with a newline followed by a zero, as I intended.) | [reply] [d/l] |
|
|
Hi chipmunk this does not work :o(
Update
Seems fixed now though. Really clever solution. I really like the "\n0" now that I understand it ;-) - a very neat way to chop the file into records, much nicer than the method I used.
I have learned a very good use for $/ in a practical example - thanks.
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
| [reply] |
|
|
Hi tachyon. 1. In the clever little script you did for me, I find that I should have asked for the GEDCOM line 1 CHAR IBM DOS to be changed to 1 CHAR ASCII
Can you please show me how to insert an amendment in your script?
2. The standard for the GEDCOM names after each 0 @Inn@ INDI is in the next line: 1 NAME First_name Middle_name/SURNAME/ The first name syntax is as shown, with the first character in Upper case. The SURNAME between the forward slashes is in UPPERCASE, as shown.
Unfortunately, in merging many GEDCOM files from various genealogists, we end up with sample names like:
JOE/SLAVEN/
Joe McDonald/Slaven/
Joe McDONALD 'Macca'/SLAVEN/
/SLAVEN/
Joe
Joe/SLAV.../ (this last one when the surname is illegible)
Any chance of you please having a go at looking at each 1 NAME Firstname Middlename/SURNAME/ line and converting it to the standard syntax?
3. In selecting the name of the output file, how do I go about naming it as the original input file name filename.ged, but with PAF added, as in filenamePAF.ged
My best wishes
Joe, Townsville, Australia
| [reply] |
|
|
|
|
Thanks to chipmonk and tachyon. You guys are great - you make this site an really outstanding and professional Monastery,
My appreciation and kind regards to you both. I has been a pleasure seeing tow professionals in action.
Joe
Townsville, Australia
| [reply] |
Re: Converting GEDCOM files
by tachyon (Chancellor) on Jul 09, 2001 at 21:01 UTC
|
Here is the updated code based on the sample files sent which I have posted at the end for those who are interested. The key word "SOUR" actually appears in several contexts - the 1 NAME bit is not changed unless the line is /^0 @.+@ SOUR$/ which adds a little hook.
The script has been updated to inlude the substitution of '1 CHAR IBM DOS' to '0 CHAR ASCII' and modify the 1 NAME fields into the standard format.
Further updates have been added. The script will now handle hypenated last names and multiple last names correctly, it also handles the Mc/Mac problem reasonably well. The output file is written to 'filePAF.ged' where the imput file is 'file.ged'. Overwrite protection has been added with an overwrite prompt if the output file already exists. Files that do not end in 'ged' will be output to 'file.new' ie the '.new' suffix will be added.
#!/usr/bin/perl -w
use strict;
my $file = '';
$file = shift @ARGV unless $file;
unless ($file) {
print "Please input a filename: ";
chomp ($file = <>);
}
my %change = (
HEAL => 'NOTE Medical:',
HIST => 'NOTE Biography:',
EDUC => 'NOTE Educated:',
RESI => 'NOTE Resided:',
OCCU => 'NOTE Occupation:');
open (FILE, "<$file") or die "Unable to open $file: $!";
my @data = <FILE>;
close FILE;
my $change_this = join "|", keys %change;
$change_this = qr/($change_this)/;
my $found_sour = 0;
my $last_line = '';
for (@data) {
# make the IBM DOS to ASCII substitution
s/1 CHAR IBM DOS/O CHAR ASCII/;
s/$change_this/$change{$1}/g;
$found_sour = 0 if m/^0/;
$found_sour = 1 if m/^0.+SOUR\s*$/;
if ($found_sour == 1) {
$found_sour++ if s/1 NAME/1 TITL/;
}
elsif ($found_sour) {
s/1 NAME/2 CONT/;
}
# fix the name problem
if ($last_line =~ m/\bINDI\s*$/) {
if (m/1\s*NAME/) {
s|([\w.]+)|($1 eq "NAME")? $1 :ucfirst lc $1|eg;
# the MacDonald fix, the \b (boundary) condition
# and the quote check should limit appropriately
# Names like Mace and Mack are treated as special
# cases as this is the only way to deal with them
s # substitute regex
/ # opening delimiter for regex
(?<!["'`]) # look back for *not* a quote char
\b # only match at the boundary of a word and
+ non word
(Mc|Mac) # match Mc or Mac and capture in $1
([\w-]+) # match the rest of the name in $2
/ # finish the match part ot this regex
# this is what we will use to substitute
# we execute this code to get our result
# to avoid special cases MacE and MacK
# but still allow MacEvoy and MacKinley
# we us a conditional
($1.$2 eq 'Mace' or $1.$2 eq 'Mack')
? # if this condition is true
$1.$2 # we substitute what we just found back in
+ with no change
: # else
$1.ucfirst $2 # we substitute uppercased first letter of
+ second bit
/egx; # the /e is for execute 2nd half
# the /g is all cases
# the /x allows all these comments!
# here it is on one line, uncomment this and delete
# the long one if you want
# s/(?<!["'`])\b(Mc|Mac)([\w-]+)/($1.$2 eq 'Mace' or $1.$2
+ eq 'Mack') ? $1.$2 : $1.ucfirst $2/egx;
# uppercase surname if present, we look for pattern
# /\w+/ To allow for commas, dots and spaces we
# use a character class (the bit in the [ ] ) which
# will match alphanumerics, spaces, commas, and dots
# and hypens incases like SLEVIN-WINDSOR :-)
s|(?<=/)([\w\s,.-]+)(?=/)|uc $1|eg;
}
}
$last_line = $_; # remember the last line for INDI
}
# change the filename. if this substitution fails we warn that
# the file name will be $file.new the /i makes it case insensitive
# which is important for DOS based systems
unless ($file =~ s/.ged$/PAF.ged/i) {
warn "$file does not end in '.ged', saving as $file.new\n";
$file .= '.new';
}
# let's add a test to avoid accidentaly overwriting files
if (-e $file) {
print "$file exists, overwrite (Y/N)? ";
my $input = <>;
unless ($input =~ m/y/i) {
print "Aborting!\n";
exit;
}
}
open (FILE, ">$file") or die "Unable to create $file: $!";
print FILE @data;
close FILE;
print "Data written to $file OK!";
To see the commented code and the sample files
| [reply] [d/l] [select] |