Multiple Line Data from text file to SQL

pie2re has asked for the wisdom of the Perl Monks concerning the following question:

Dears,

I'm currently trying to load some data coming from a text file into a database.

Here is a sample of data as it can be found in the file

27509.
The Sun's declination is:
A) the Sun's position relative to the plane of the
Equator. 
B) the distance between the Sun and the horizon.
C) the angular distance between the Sun and the celestial
North Pole.
D) the Sun's position relative to the ecliptic.

The problem is the following:

I need to be able to handle the question id (the number on the first line), the question, and each answer (A, B, C, D). As you can see in the sample above, the data is not very well formatted, and sometimes an answer is spread over several lines. The file contains up to 2500 questions and answers.

So, basically my question is:

How could I parse the file to isolate every question with its answers even if one answer is spread on several lines?

Thanks a million for your help.


Replies are listed 'Best First'.
Re: Multiple choice from text file to SQL by kennethk (Abbot) on Mar 03, 2009 at 15:32 UTC
My reading of your sample data suggests you have the format:

    (Unique integer).
    (Question)"\n"
    A)(Answer A)"\n"
    B)(Answer B)"\n"
    C)(Answer C)"\n"
    D)(Answer D)"\n"

A similar question was asked yesterday. Likely the easiest solution is to assume that a line starting with a letter and a close parenthesis is the start of the next answer. You might find regular expressions useful for writing that test. I suspect the bigger challenge will be differentiating the end of D) from the id number of the next question. If you haven't thought about it yet, a hash of hashes is probably the best structure for storing the data you read. Good luck, and please post code if you run into development headaches.
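The line-by-line approach described above can be sketched as follows. This is only a minimal illustration, not code from the thread: the helper name `parse_questions` and the hash keys (`question`, `A` to `D`) are assumptions, and it assumes a question id always sits alone on its line as digits followed by a period.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes a list of lines and
# returns a hash of hashes keyed by question id, for example
#   $q{27509} = { question => '...', A => '...', B => '...', ... }
sub parse_questions {
    my @lines = @_;
    my (%questions, $id, $field);

    for my $line (@lines) {
        $line =~ s/\s+\z//;         # strip the line ending and trailing spaces
        next if $line eq '';        # skip blank lines

        if ($line =~ /^(\d+)\.$/) {
            # A line holding only "digits." starts a new question.
            ($id, $field) = ($1, 'question');
            $questions{$id} = { question => '' };
        }
        elsif (!defined $id) {
            next;                   # ignore anything before the first id
        }
        elsif ($line =~ /^([A-D])\)\s*(.*)$/) {
            # A letter plus close parenthesis starts a new answer.
            $field = $1;
            $questions{$id}{$field} = $2;
        }
        else {
            # Any other line continues the current question or answer.
            my $so_far = $questions{$id}{$field};
            $questions{$id}{$field} = $so_far eq '' ? $line : "$so_far $line";
        }
    }
    return %questions;
}
```

Calling `my %q = parse_questions(<$fh>);` on an open filehandle would then give access to `$q{27509}{A}` and so on. An answer whose wrapped continuation line consists only of digits and a period would still be mistaken for a new question id.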
Re^2: Multiple choice from text file to SQL by pie2re (Initiate) on Mar 03, 2009 at 20:43 UTC
Hi,

Thanks for your advice, and sorry for posting something already asked. The post you referenced in your reply was very helpful, but some darkness remains in my DBA's brain. Here is the code I've written:

    #!/usr/bin/perl
    $file = "data.txt";
    open (INPUT, "< $file");
    undef $/;
    my $string = <INPUT>;
    $string =~ s/\n//g;
    while ($string =~ /A\)(.*?)B\)/g) {
        print "<a>".$1."</a>\n";
    }
    close (INPUT);

This prints out all the occurrences of the A)... as expected.

Thanks again for your precious knowledge.

P.
Re^3: Multiple choice from text file to SQL by kennethk (Abbot) on Mar 03, 2009 at 21:24 UTC
Why limit your code to one match at a time? As long as you are capturing things, you can use one regex to extract all the details of a question at once. If your questions are separated by a blank line (i.e. the end is marked by "\n\n"), you could try something like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = "data.txt";
    open (INPUT, "< $file");
    undef $/;
    my $string = <INPUT>;
    while ($string =~ /(\d+)\.\n(.+?)A\)(.+?)B\)(.+?)C\)(.+?)D\)(.+?)(\n\n|$)/sg) {
        my @captures = ($1, $2, $3, $4, $5, $6);
        foreach (@captures) {
            s/\n//g;
        }
        print "<QN>$captures[0]</QN>\n";
        print "<QQ>$captures[1]</QQ>\n";
        print "<Aa>$captures[2]</Aa>\n";
        print "<Ab>$captures[3]</Ab>\n";
        print "<Ac>$captures[4]</Ac>\n";
        print "<Ad>$captures[5]</Ad>\n";
    }
    close (INPUT);

Some side notes: you should `use strict; use warnings` in your scripts, since they'll help prevent typos from causing you incredible misery. You should also use 3-argument open in place of the 2-argument form you used: the 3-argument form keeps the mode separate from the file name, which has some important security implications.
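As a rough sketch of the 3-argument open mentioned in the side notes, the slurp step could look like the following. The helper name `slurp_file` is hypothetical and not from the thread; the lexical filehandle and the `or die` checks are additions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read a whole file into one string.
sub slurp_file {
    my ($path) = @_;

    # 3-argument open: the mode is passed separately from the file name,
    # so a name starting with '<', '>' or '|' is never read as a mode.
    open(my $fh, '<', $path) or die "Cannot open '$path': $!";

    # local restores $/ when the subroutine returns; with $/ undefined,
    # a single read returns the entire file.
    local $/;
    my $text = <$fh>;

    close($fh) or die "Cannot close '$path': $!";
    return $text;
}
```

With this in place, `my $string = slurp_file("data.txt");` could replace the open, `undef $/`, read, and close lines, and the rest of the program would still see the default line separator.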
Re: Multiple Line Data from text file to SQL by clueless newbie (Curate) on Mar 03, 2009 at 21:08 UTC
    # Slurp the file
    my $Text=do { local $/; <DATA> };
    # Assuming that a question begins with a number followed by a period:
    # Then a question is a question number followed by lines which are not question numbers.
    #while ($Text =~ m{(\d+\.\n([^\n\r\f][a-zA-Z]+[^\n\r\f]\n)+)}s) {
    while ($Text =~ m{((\d+)\.\n(([^\n\r\f][a-zA-Z]+[^\n\r\f]\n)+)(A\)([^\n\r\f][a-zA-Z]+[^\n\r\f]\n)+?)(B\)([^\n\r\f][a-zA-Z]+[^\n\r\f]\n)+?)(C\)([^\n\r\f][a-zA-Z]+[^\n\r\f]\n)+?)(D\)([^\n\r\f][a-zA-Z]+[^\n\r\f]\n)+))}gs) {
        ### question number :$2
        ### question :$3
        ### A) :$5
        ### B) :$7
        ### C) :$9
        ### D) :$11
        #using g so
        #$Text=$';
    };

This works, but it assumes that a question number of the form \d+\. begins each question, that the answers are A, B, C and D and of the form x) ..., and that none of the answers is split so that the trailing portion resembles a question number.
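Under the same assumptions, a single regex can be made easier to read with the `/x` modifier, and a lookahead can end answer D at the next question id or at the end of the input. The sketch below is one possible variant, not the code posted above; the helper name `extract_questions` and the record keys are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): pull every question out of a
# slurped string and return a list of hash references.
sub extract_questions {
    my ($text) = @_;
    my @found;

    while ($text =~ m{
        ^(\d+)\.[ \t]*\n            # question id alone on its line
        (.+?)                       # question text
        ^A\)(.+?)                   # answer A, up to the line starting with B
        ^B\)(.+?)                   # answer B
        ^C\)(.+?)                   # answer C
        ^D\)(.+?)                   # answer D
        (?= ^\d+\.[ \t]*\n | \z )   # stop before the next id line or end of input
    }xmsg) {
        my @fields = ($1, $2, $3, $4, $5, $6);
        for (@fields) {
            s/\s+/ /g;              # join wrapped lines with single spaces
            s/^ | $//g;             # trim both ends
        }
        my %record;
        @record{qw(id question A B C D)} = @fields;
        push @found, \%record;
    }
    return @found;
}
```

Each returned record then holds the values that would go into one database row. As with the code above, an answer whose continuation line looks like a question id would still be split in the wrong place.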