Digital Document Capture

eric256 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Digital Document Capture by sgifford (Prior) on Aug 13, 2003 at 16:25 UTC
It would probably be easier to encode the information in some way other than a bar code, such as a comment near the top or bottom of the PDF file. Bar codes are a useful way to get digital information out of a physical document, but they're generally not the best way to transmit digital information between two computers.
Re: Re: Digital Document Capture by eric256 (Parson) on Aug 13, 2003 at 17:00 UTC
The PDF is created automatically by the scanner. I never have a chance to encode it with anything. :-( The bar code would be printed as a separate process. I was also thinking of just having one barcode per location. Then there would be just a standard cover sheet they could use to send the documents. Update: I can get the document scanned as TIFF files, one page per file. I don't know if maybe TIFF is easier to scan. ___________ Eric Hodges
Re: Re: Re: Digital Document Capture by dtr (Scribe) on Aug 13, 2003 at 20:21 UTC
The first thing I would ask myself if I were in your shoes is why I was introducing an extra PC, some code, and an extra step for the sorting room people, when the system they have at the moment probably works OK as it is.

Assuming that you have a good answer to this, you may be interested to know that PDF is derived from PostScript. PostScript is a fully fledged programming language, which can be written in text format. It also supports comments :) To see what I mean, paste the following into a file called "test.ps", and open it in your favourite PostScript viewer.

%!PS-Adobe-3.0 EPSF-1.2
%%Title: (test1.ps)
%%LanguageLevel: 1
%%Creator: DTR
%%CreationDate: Sun Dec 8 18:25:51 2002
%%For: Console
%%DocumentMedia: A4 595.27559 841.88976 0 ( ) ( )
%%Orientation: Portrait
%%Pages: 1
%%BoundingBox: 0 0 595 841
%%EndComments
%%BeginProlog
%%BeginResource: PostScript::Simple
/u {} def
/STARTDIFFENC { mark } bind def
/ENDDIFFENC {
  % /NewEnc BaseEnc STARTDIFFENC number or glyphname ... ENDDIFFENC -
  counttomark 2 add -1 roll 256 array copy
  /TempEncode exch def
  % pointer for sequential encodings
  /EncodePointer 0 def
  {
    % Get the bottom object
    counttomark -1 roll
    % Is it a mark?
    dup type dup /marktype eq {
      % End of encoding
      pop pop exit
    } {
      /nametype eq {
        % Insert the name at EncodePointer
        % and increment the pointer.
        TempEncode EncodePointer 3 -1 roll put
        /EncodePointer EncodePointer 1 add def
      } {
        % Set the EncodePointer to the number
        /EncodePointer exch def
      } ifelse
    } ifelse
  } loop
  TempEncode def
} bind def
% Define ISO Latin1 encoding if it doesnt exist
/ISOLatin1Encoding where {
  % (ISOLatin1 exists!) =
  pop
} {
  (ISOLatin1 does not exist, creating...) =
  /ISOLatin1Encoding StandardEncoding STARTDIFFENC
    144 /dotlessi /grave /acute /circumflex /tilde
    /macron /breve /dotaccent /dieresis /.notdef /ring
    /cedilla /.notdef /hungarumlaut /ogonek /caron /space
    /exclamdown /cent /sterling /currency /yen /brokenbar
    /section /dieresis /copyright /ordfeminine
    /guillemotleft /logicalnot /hyphen /registered
    /macron /degree /plusminus /twosuperior
    /threesuperior /acute /mu /paragraph /periodcentered
    /cedilla /onesuperior /ordmasculine /guillemotright
    /onequarter /onehalf /threequarters /questiondown
    /Agrave /Aacute /Acircumflex /Atilde /Adieresis
    /Aring /AE /Ccedilla /Egrave /Eacute /Ecircumflex
    /Edieresis /Igrave /Iacute /Icircumflex /Idieresis
    /Eth /Ntilde /Ograve /Oacute /Ocircumflex /Otilde
    /Odieresis /multiply /Oslash /Ugrave /Uacute
    /Ucircumflex /Udieresis /Yacute /Thorn /germandbls
    /agrave /aacute /acircumflex /atilde /adieresis
    /aring /ae /ccedilla /egrave /eacute /ecircumflex
    /edieresis /igrave /iacute /icircumflex /idieresis
    /eth /ntilde /ograve /oacute /ocircumflex /otilde
    /odieresis /divide /oslash /ugrave /uacute
    /ucircumflex /udieresis /yacute /thorn /ydieresis
  ENDDIFFENC
} ifelse
% Name: Re-encode Font
% Description: Creates a new font using the named encoding.
/REENCODEFONT { % /Newfont NewEncoding /Oldfont
  findfont dup length 4 add dict
  begin
    { % forall
      1 index /FID ne
      2 index /UniqueID ne and
      2 index /XUID ne and
      { def } { pop pop } ifelse
    } forall
    /Encoding exch def
    % defs for DPS
    /BitmapWidths false def
    /ExactSize 0 def
    /InBetweenSize 0 def
    /TransformedChar 0 def
    currentdict
  end
  definefont pop
} bind def
% Reencode the std fonts:
/Courier-iso ISOLatin1Encoding /Courier REENCODEFONT
/Courier-Bold-iso ISOLatin1Encoding /Courier-Bold REENCODEFONT
/Courier-BoldOblique-iso ISOLatin1Encoding /Courier-BoldOblique REENCODEFONT
/Courier-Oblique-iso ISOLatin1Encoding /Courier-Oblique REENCODEFONT
/Helvetica-iso ISOLatin1Encoding /Helvetica REENCODEFONT
/Helvetica-Bold-iso ISOLatin1Encoding /Helvetica-Bold REENCODEFONT
/Helvetica-BoldOblique-iso ISOLatin1Encoding /Helvetica-BoldOblique REENCODEFONT
/Helvetica-Oblique-iso ISOLatin1Encoding /Helvetica-Oblique REENCODEFONT
/Times-Roman-iso ISOLatin1Encoding /Times-Roman REENCODEFONT
/Times-Bold-iso ISOLatin1Encoding /Times-Bold REENCODEFONT
/Times-BoldItalic-iso ISOLatin1Encoding /Times-BoldItalic REENCODEFONT
/Times-Italic-iso ISOLatin1Encoding /Times-Italic REENCODEFONT
/Symbol-iso ISOLatin1Encoding /Symbol REENCODEFONT
/box {
  newpath 3 copy pop exch 4 copy pop pop
  8 copy pop pop pop pop exch pop exch
  3 copy pop pop exch moveto lineto
  lineto lineto pop pop pop pop closepath
} bind def
/circle {newpath 0 360 arc closepath} bind def
%%EndResource
%%EndProlog
% TRY CHANGING SOME OF THESE VALUES TO GET A FEEL
% FOR WHAT HAPPENS
0.6 0 0 setrgbcolor %red
/Arial findfont 20 scalefont setfont
newpath
1 u 450 u moveto
(Hello world!) show stroke
newpath
1 u 200 u moveto
(This \(stuff\) was generated entirely from Perl) show stroke
0 0.8 0 setrgbcolor
50 u 50 u 150 u 150 u box stroke
1 1 0.2 setrgbcolor
100 u 100 u 50 u circle fill
0 0 0.8 setrgbcolor
newpath
100 u 100 u moveto
(This is a test) dup stringwidth pop 2 div neg 0 rmoveto show
% A simple arc
newpath
300 300 50 0 90 arc
closepath
stroke
% A arc between 2 lines
% giving the appearance of a rounded corner
newpath
400 400 moveto
400 410 lineto
400 420 410 420 10 arct
420 420 lineto
stroke
% An example of a box in a dropout colour
newpath
500 500 moveto
0 20 rlineto
20 0 rlineto
0 -20 rlineto
closepath
1 1 1 setrgbcolor
fill
/Arial findfont 20 scalefont setfont
510 500 moveto
0.6 0.6 0.4 setrgbcolor
0 0.8 0 setrgbcolor
0 u 180 u 20 u 160 u box stroke
22 u 180 u 42 u 160 u box stroke
44 u 180 u 64 u 160 u box stroke
/Arial findfont 18 scalefont setfont
0 0 0.8 setrgbcolor
newpath
10 u 163 u moveto
(B) dup stringwidth pop 2 div neg 0 rmoveto show
newpath
32 u 163 u moveto
(O) dup stringwidth pop 2 div neg 0 rmoveto show
newpath
54 u 163 u moveto
(X) dup stringwidth pop 2 div neg 0 rmoveto show
/Arial findfont 12 scalefont setfont
0.8 0 0 setrgbcolor
newpath
10 u 350 u moveto
(My address is:) show stroke
newpath
10 u 330 u moveto
(HELLO) show stroke
%%EOF

Credit is due to the PostScript::Simple module on CPAN for getting me started with this. Also, a disclaimer: I know just enough PostScript to draw the circles, boxes, and text I drew on that page, and no more. Anyway, that should be enough to get you started :). You should be able to insert a few harmless comments of your own at the top of a PostScript file (you may also be able to do it with PDF), and use these to keep track of where the document should go.
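The comment idea above could be sketched roughly as follows. This is only an illustration in Python, not anything from dtr's post: the `%%Route-To:` key is a made-up comment name (it is not a standard DSC comment), and `add_route` and `find_route` are hypothetical helpers. It assumes the file is a text PostScript document whose header ends with `%%EndComments`.

```python
import re

# Hypothetical routing key; "%%Route-To:" is an assumption, not a standard DSC comment.
ROUTE_RE = re.compile(r"^%%Route-To:\s*(.+?)\s*$")


def add_route(ps_text, destination):
    """Insert a routing comment directly after the first (%!) line."""
    first, _, rest = ps_text.partition("\n")
    return f"{first}\n%%Route-To: {destination}\n{rest}"


def find_route(ps_text):
    """Return the routing value from the header comments, or None if absent."""
    for line in ps_text.splitlines():
        if line.startswith("%%EndComments"):
            break  # only trust comments in the header block
        match = ROUTE_RE.match(line)
        if match:
            return match.group(1)
    return None


sample = "%!PS-Adobe-3.0\n%%Title: (test1.ps)\n%%EndComments\n"
tagged = add_route(sample, "billing-office")
print(find_route(tagged))
```

Because the interpreter ignores comment lines, the tagged file should render the same as the untagged one. Whether a given scanner or mail gateway preserves such comments is a separate question.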
Re: Digital Document Capture by benn (Vicar) on Aug 13, 2003 at 17:50 UTC
A search for 'Barcode' on CPAN reveals plenty of modules for creating barcodes, but virtually nothing for reading them in again...it presumably is possible, but a lot would depend upon how your scanner creates them - either as a series of vectors or as a bitmap, I would presume. If the latter, you might be able to read in the image with GD and do some pixel-munching, but it sounds like a task-and-a-half. I don't know anything about your ScanToEmail machine, but would there be any way of spitting out the actual barcode number as part of the scanning process? Say, put the number into the Subject header or something? Cheers, Ben.
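To give a feel for what the "pixel-munching" might involve, here is a minimal Python sketch of only the first step: turning one horizontal row of grayscale pixels into a list of bar and space widths. The function names and the threshold of 128 are assumptions for illustration. Loading the bitmap (with GD or any other image library) and mapping the widths to characters for a particular symbology are not shown, and that is where most of the real work would be.

```python
from itertools import groupby


def bar_runs(row, threshold=128):
    """Collapse one row of grayscale pixels (0-255) into (is_dark, width) runs."""
    bits = [pixel < threshold for pixel in row]
    return [(dark, len(list(group))) for dark, group in groupby(bits)]


def module_widths(runs):
    """Express each run as a whole multiple of the narrowest run."""
    narrow = min(width for _, width in runs)
    return [round(width / narrow) for _, width in runs]


# A made-up scan line: white margin, narrow bar, narrow space, wide bar, white margin.
row = [255] * 4 + [0] * 2 + [255] * 2 + [0] * 6 + [255] * 4
runs = bar_runs(row)
print(runs)
print(module_widths(runs))
```

A real scan would be noisier than this, so several rows would probably need to be sampled and compared before trusting the widths.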
Re: Digital Document Capture by waswas-fng (Curate) on Aug 13, 2003 at 18:30 UTC
Welp, no need for barcodes: use one of the PDF modules to create your cover page with routing info included and insert it as the front page. On the other side, use a PDF module to get that information out. Bar codes are great as physical indexers because most of the time the barcode scanner reads a short index number or sequence that allows you to pull up related electronic data. They are the wrong tool for a purely digital transaction. -Waswas
Re: Digital Document Capture by derby (Abbot) on Aug 13, 2003 at 20:01 UTC
I'm confused (nothing new). It appears there is no need for barcoding at all. Instead of a "special email address", you could have "special email addresses" - one for each office. Then you would "scantoemail" distinct packages so that each individual email received is then also a distinct deliverable. A generic script (procmail filter) would take the incoming email, file it away in the filesystem based on recipient (top-level directory) and some unique number (or datetime) as a subdirectory holding the actual contents, and update the db with the filesystem path and other metadata (maybe the scantoemail'ers can put a subject line on the email). Seems pretty basic and rote to me (but boy I'd hate to be the operator). -derby
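The filing step described above might look something like this minimal Python sketch, using only the standard library. The function name `file_message`, the directory layout, and the idea of passing the timestamp in as a string are all assumptions for illustration. The database update is left out, and a real filter would also need to guard against duplicate or hostile file names.

```python
import email
import os
from email import policy
from email.utils import parseaddr


def file_message(raw_bytes, root, stamp):
    """Save each attachment under root/<recipient local part>/<stamp>/.

    Returns the list of paths written. `stamp` is whatever unique
    subdirectory name the caller chooses (for example a datetime string).
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    _, addr = parseaddr(str(msg.get("To", "")))
    office = addr.split("@")[0] or "unknown"
    target = os.path.join(root, office, stamp)
    os.makedirs(target, exist_ok=True)

    saved = []
    for part in msg.iter_attachments():
        # basename() drops any directory components in the supplied file name
        name = os.path.basename(part.get_filename() or "unnamed.bin")
        path = os.path.join(target, name)
        with open(path, "wb") as handle:
            handle.write(part.get_payload(decode=True))
        saved.append(path)
    return saved
```

Called from a mail filter with the raw message on standard input, this would give each office its own top-level directory and each received email its own subdirectory.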
Re: Digital Document Capture by eric256 (Parson) on Aug 13, 2003 at 21:14 UTC
Thanks for all the replies, but I appear to have confused you all somewhere. The documents all start in paper form, not digital. The cover sheet would also need to be in paper form, not digital, so that they could join it with whichever set of documents they want to transfer. So I start with all paper, and scan it to email (using an HP 4100mfp). There is no computer attached to the MFP to attach any form of digital document. I've been searching and have found some barcode recognition software that I may try, since barcodes still seem to be the only safe way to put the info I need on a piece of paper and have it read by the scanner. Currently they do send each batch to its own email address, but that has its own limitations and downfalls. Thanks again! ___________ Eric Hodges
Re: Re: Digital Document Capture by Zero_Flop (Pilgrim) on Aug 14, 2003 at 06:17 UTC
I'm not sure if this would work for you, and I may be way off base, but how about this. Your current system is sort->scan->send, and you are trying to simplify the send step. This appears to be the simplest step. Now if you really want to get crazy, how about this: scan->PDF->Bayesian filters->send. Here, as the paper docs arrive they are immediately scanned as PDF. Then, using one of the PDF modules, you pull the text out and send it through a series of Bayesian filters to determine where they should go. The PDF file is then attached to an email or loaded into a db based on the characteristics in the file. (The text version is only used for the Bayesian filter, and is discarded as soon as you deal with the PDF.) This reduces the workload of the sorters to just scanning the files.
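As a rough picture of the Bayesian filter step, here is a minimal multinomial naive Bayes sketch in Python. It is a toy under stated assumptions: the class name `NaiveBayesRouter`, the destination labels, and the training phrases are all invented, and it assumes readable text has already been pulled out of the PDF, which is the hard part for scanned pages.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesRouter:
    """Toy multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # destination -> word frequencies
        self.doc_counts = Counter()              # destination -> training documents
        self.vocab = set()

    def train(self, destination, text):
        words = text.lower().split()
        self.word_counts[destination].update(words)
        self.doc_counts[destination] += 1
        self.vocab.update(words)

    def route(self, text):
        """Return the destination with the highest log-probability score."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for dest, n_docs in self.doc_counts.items():
            counts = self.word_counts[dest]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best, best_score = dest, score
        return best


router = NaiveBayesRouter()
router.train("billing", "invoice payment due amount")
router.train("hr", "vacation request leave employee")
print(router.route("payment invoice overdue"))
```

With only a handful of training documents the scores mean very little, so a real deployment would need a decent pile of correctly routed examples per destination.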
Re: Re: Re: Digital Document Capture by eric256 (Parson) on Aug 14, 2003 at 15:04 UTC
Would be great. Unfortunately the sort part of the process includes more than just sorting. Often there are human actions (phone calls, scheduling, etc.) that need to take place, and most of the documents are hardly legible (handwriting, doodled on, faxed many times, etc.) to humans, let alone computers :-( ___________ Eric Hodges