http://qs1969.pair.com?node_id=11118593


in reply to Re^5: Calculated position incorrect when using regex in text file that also contains binary info
in thread Calculated position incorrect when using regex in text file that also contains binary info

Hi haukex,

I have to make a really big confession... After all, we're in a monastery. With monks. So it should be possible, allowed and doable, right? ;-)

I made a horrible mistake: I checked the calculated offset against what my text editor (NPP) showed instead of comparing it with the XREF table that is part of the PDF itself. If I compare against the XREF table, everything works perfectly. I feel a bit dumb now...

Since I'm so pissed off (at myself, I mean), I stubbornly refused to give up and went ahead with making my SSCCE (as promised), even if it's only to "prove" to myself that I'm able to do this... :-)

And here's the result (in case it might be useful for someone else in the future that doesn't make such a silly mistake as I did...)

I've tried to assemble an example that should (hopefully) be SSCCE-compliant. Fingers crossed I got it right this time.

First, the original, uncompressed PDF file. It's a simple and very small PDF document called example_uncompressed.pdf. It's been made with LibreOffice and saved as PDF. It contains only one line of text: "A small PDF.".

Here goes the content of the uncompressed file. I've put it between the "readmore" and "code" tags as advised by you and others, and I do hope it works (running the risk of being keelhauled if it doesn't...):

%PDF-1.5 % 1 0 obj << /Group << /CS /DeviceRGB /I true /S /Transparency >> /Parent 2 0 R /MediaBox [0 0 595.303937007874 841.889763779528] /Resources 3 0 R /pdftk_PageNum 1 /Contents 4 0 R /Type /Page >> endobj 4 0 obj << /Length 150 >> stream 0.1 w q 0 0.028 595.275 841.861 re W* n q 0 0 0 rg BT 42.6 788.189 Td /F1 12 Tf[<01>55<0203>-2<04>2<05>1<06>-5<06>2<0207>-1< +08>-2<09>81<0A>]TJ ET Q Q endstream endobj 2 0 obj << /MediaBox [0 0 595 841] /Resources 3 0 R /Kids [1 0 R] /Count 1 /Type /Pages >> endobj 5 0 obj << /Length 9744 /Length1 9744 >> stream true  @cmap"  cvt =O;  fpgm\  glyfA h head=/ + ( 6hheaQy ` $hmtx<*  ,locatr  maxp*@  name ?H  vpostd $ "` prepLb " +     + + +   H=  p=  +  L +  ] F P i u + i P Z Z P ` + P P m { 1 o  + 1 M  f f  ›‹ J f + / ^ t F F < } S h v = + } J A l  T / Hj g a A + U )% 4 2$ U 4 Kan_= m {dN @G[ZYXUTSRQPONMLKJIHGFEDCBA@?>=<;:9876510/.-,('&%$#" +!  , `E% Fa#E#aH-, EhD-,E#F` a F`&#HH-,E#F#a ` & +a a&#HH-,E#F`@a f`&#HH-,E#F#a@` &a@a&#HH-, < <-, E# D +# ZQX# D#Y QX# MD#Y &QX# D#Y!!-, EhD ` EFvhE`D-, C#Ce -, C#C -, (#p(>(#p(E:  -, E%EadPQXED!!Y-,I#D-, E C`D-,CCe -, i@a ‹ , b`+ d#da\XaY-,E+)#D)z-,Ee,#DE+#D-,KR +XED!!Y-,KQXED!!Y-,%# `#-,%# a#-,% -, +CRX!!!!!F#F`F# F`ab# # pE` PXa‹FY`h:Y-, E +%FRKQ[X%F ha%%?#!8!Y-, E%FPX%F ha%%?#!8!Y-, C +C -,!! d#d‹@ b-,!QX d#d‹ b @/+Y`-,!QX d#d‹Ub /+Y` +-, d#d‹@ b`#!-,KSX%Id#Ei@‹ab aj#D#!# 9/Y-,KSX %Id +i &%Id#ab aj#D&#D#D& 9# 9//Y-,E#E`#E +`#E`#vhb -,H+-, E TX@D E@aD!!Y-,E0/E#Ea``iD-,KQX/#p#B! +!Y-,KQX %EiSXD!!Y!!Y-,EC `c`iD-,/ED-,E# E`D-,E#E`D-,K#QX +34 3 4 YDD-,CX&EXdf`d `f X!@YaY#XeY)#D#)!!!!!Y +-,CTXKS#KQZX8!!Y!!!!Y-,CX%Ed `f X!@Ya#XeY)#D%% X +Y%% F%#B<%%%% F%`#B< X Y%%)) EeD% +%)%% XY%%CH%%%%`CH!Y!!!!!!!-,% F%#B +%%EH!!!!-,% %%CH!!!-,E# E P X#e#Y#h @PX!@Y#XeY`D-, +KS#KQZX E`D!!Y-,KTX E`D!!Y-,KS#KQZX8!!Y-, !KTX8!!Y-,CTXF+! 
+!!!Y-,CTXG+!!!Y-,CTXH+!!!!Y-,CTXI+!!!Y-, #KSKQZX#8!!Y +-, %I SX @8!Y-,F#F`#Fa#  Fab@@pE`h:-, #Id#SX<!Y-,KR +X}zY-, KKTB-, B#Q@SZX TXC`BY$QX @TXC` +B$TX C`B KKRXC`BY@ TXC`BY@ c TXC`BY@  + c TXC`BY&QX@  c TX@C`BY@  c TXC`BYYYYYY +CTX@ @@ @  CTX@   CRX@ @@ @Y@ U@  c UZX  YYYBBBBB-,Eh#KQX# E d@PX|Yh`YD-, %%#> #> #eB #B#? #? #eB#B-,CPCT[X!# Y-,Y+-, +-   H  _@   `Y _Y tdTD4$ +tdTD4$9tTD4$ +pP@0 pPO]]]]]] +]]qqqqqqqqqq_qqrrrrrrrrrrrrrr^]]]]]]]]]]]]]]]]qqqqqqqqqqqqqqqq ?3+ 3 +33?339/3+ .3939939910%!573!57!!Gɾ۪ɴ +‹55555}hu  T ( Y@4!  )* !PY PY*p*`*P* **_*O*]]]qqqqq ?3+ ?3+ 99993 +33310#"&'533254/.54632#'&#"ӱF0-1Kx™Ye\2g›/*5r +QUMNZ?#Dz4!DcF|m/PD9N2.CV  +  1@ ! ((-- +! 321.PY1(! -+-PY+ RY ' %RYv3V363$33 +3333333b3P3D303$333h3333t3@343$3 +3333333d3P3D343$333333333k3;3 3 +338333`3T3@343333333t@+3T3@343333 +333p3P333^]]]]]]_]]qqqqqqqqqqqrrrrrrrr^]]]]]]]]]]]qqqqqq +qqqqqqqrrrrrrrrr^]]]]]]_]]]]]]]qqqqqq ?+ 33?+ 33?33+ 33333?+ +93333310>32>32!574#"!574&#"!57'5!FK +@EuMDyUEE?B‹UUXVww`+:49+B--X 6A--XSY-- -  Hq  % m@?%% '&%" "QYPY   PY PY '_'@'']qrr ?+ ?+ ?39/_^]+ 9+ 39333310 +2!'#"4>?54&#"#563267њurIGJdS"8 _Dc2 +~-^r^{Aa\/u#^n  )  @p PY PY      t d $   9   p `  +P @    P      p ` P @ ]]]]]]]qqqqqqrrrr_rrr^]] +]]]]]]]]] ?+ 3?+910%!57'5!oFF---  ; !=   @ +   `Y `Y_Y  _Y o/oO?/o_8/ +?^]]]]]]qqqqrr^]]]]]]qqqqqqqrrrrr ?+ 3?++ 9 +/+99333104&+326!57'5! #ZbhN˟ +B555u  ;u=  L@/  _Y`Y_Y`Y ?@ p]]]qr^] ?++ ?++993310 + !#32  !%#57'5xsf""{" +55  ; )=  @  `Y_o-   + `Y _Y _Ytdt`T +DtdT@0 9q +_qrrrrr^]]]]]]]]]]]]]]qqqqqqqqrrrrrrr ?+ 3?++ 9/_^]_]]+933 +3310!57'5!#'&+!73#'B p‹==Z555Ѡ +d  y @   ›[ ?+9310%#"&54632yE44EF33F\1HH13FF   0]O_< + D    !E W  + 9    T9 + H9 )s ; ;s ; @d< +  2    / \    n  V    V  +   f    m            z  +  /  C  nQ    .  5   >   X        $   +  62    h             +    (         8   \    j j    4 Digitized data copyright (c) 2010 Google Corp +oration. 
Copyright (c) 2012 Red Hat, Inc.Liberation SerifRegularAscender - Libe +ration SerifLiberation SerifVersion 2.00.3LiberationSerifLiberation i +s a trademark of Red Hat, Inc. registered in U.S. Patent and Trademar +k Office and certain other jurisdictions.Ascender CorporationSteve Ma +ttesonBased on Tinos, which was designed by Steve Matteson as an inno +vative, refreshing serif design that is metrically compatible with Ti +mes New Roman. Tinos offers improved on-screen readability character +istics and the pan-European WGL character set and solves the needs of + developers looking for width-compatible fonts to address document po +rtability across platforms.http://www.ascendercorp.com/http://www.asc +endercorp.com/typedesigners.htmlLicensed under the SIL Open Font Lice +nse, Version 1.1http://scripts.sil.org/OFL D i g i t i z e d d a t +a c o p y r i g h t ( c ) 2 0 1 0 G o o g l e C o r p o r a + t i o n . C o p y r i g h t ( c ) 2 0 1 2 R e d H a t , I n c . L i b + e r a t i o n S e r i f R e g u l a r A s c e n d e r - L i b +e r a t i o n S e r i f L i b e r a t i o n S e r i f V e r s i o + n 2 . 0 0 . 3 L i b e r a t i o n S e r i f L i b e r a t i o n +i s a t r a d e m a r k o f R e d H a t , I n c . r e g + i s t e r e d i n U . S . P a t e n t a n d T r a d e m a +r k O f f i c e a n d c e r t a i n o t h e r j u r i s d i + c t i o n s . A s c e n d e r C o r p o r a t i o n S t e v e M +a t t e s o n B a s e d o n T i n o s , w h i c h w a s d e + s i g n e d b y S t e v e M a t t e s o n a s a n i n n +o v a t i v e , r e f r e s h i n g s e r i f d e s i g n t h + a t i s m e t r i c a l l y c o m p a t i b l e w i t h T +i m e s N e w R o m a n!" . 
T i n o s o f f e r s i m p r o + v e d o n - s c r e e n r e a d a b i l i t y c h a r a c t e +r i s t i c s a n d t h e p a n - E u r o p e a n W G L c h + a r a c t e r s e t a n d s o l v e s t h e n e e d s o +f d e v e l o p e r s l o o k i n g f o r w i d t h - c o m p + a t i b l e f o n t s t o a d d r e s s d o c u m e n t p +o r t a b i l i t y a c r o s s p l a t f o r m s . h t t p : / / + w w w . a s c e n d e r c o r p . c o m / h t t p : / / w w w . a s +c e n d e r c o r p . c o m / t y p e d e s i g n e r s . h t m l L i + c e n s e d u n d e r t h e S I L O p e n F o n t L i c +e n s e , V e r s i o n 1 . 1 h t t p : / / s c r i p t s . s i l + . o r g / O F L  ! d PAbe hg  +` N_ UA =@ U@ B UC =B U.= = > U< =; U ; ?; O; + ; ; > U0 =/ U/ > U- =, U , , > U? => UJ H U +G H UF =E UE H UI =H U `  ?  @ P +)O_0 P`p  8=U=U < 0P݀ݰU 0p/ +O`P›`›/.G' FOL = NAM  M M /M OM oM M  L L /L@8_ ++{‹pvvsP)on+nG*3U3U@Ib%(F`_@_P)[Z +0ZG)3UU3U?OoRPQPPP@P FOO/O@eK! +(F`JpJJIF)HG8GG/GGGG_GGFFF@F)/F@F!FHU3UU3U + U3 U/_ ? 
TS++KRKP[%S@QZ UZ[ +XY BK2SX`YKdSX@YKSX BYsst++++++++sstu++s +u+t++ +^s+++++ ++++++++ +sss sssst+++s+++ss ssss s+ss ++s^s+++^s^s ^s +^ssss+s ssssssss +++++++s++++s++++++++s^ endstream endobj 6 0 obj << /FontName /BAAAAA+LiberationSerif /StemV 80 /FontFile2 5 0 R /Ascent 891 /Flags 6 /Descent -216 /ItalicAngle 0 /FontBBox [-543 -303 1278 982] /Type /FontDescriptor /CapHeight 981 >> endobj 7 0 obj << /Length 438 >> stream /CIDInit/ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo<< /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName/Adobe-Identity-UCS def /CMapType 2 def 1 begincodespacerange <00> <FF> endcodespacerange 10 beginbfchar <01> <0041> <02> <0020> <03> <0073> <04> <006D> <05> <0061> <06> <006C> <07> <0050> <08> <0044> <09> <0046> <0A> <002E> endbfchar endcmap CMapName currentdict /CMap defineresource pop end end endstream endobj 8 0 obj << /LastChar 10 /BaseFont /BAAAAA+LiberationSerif /Subtype /TrueType /ToUnicode 7 0 R /FontDescriptor 6 0 R /Widths [777 722 250 389 777 443 277 556 722 556 250] /Type /Font /FirstChar 0 >> endobj 9 0 obj << /F1 8 0 R >> endobj 3 0 obj << /Font 9 0 R /ProcSet [/PDF /Text] >> endobj 10 0 obj << /Lang (nl-BE) /OpenAction [1 0 R /XYZ null null 0] /Pages 2 0 R /ViewerPreferences << /DisplayDocTitle true >> /Type /Catalog >> endobj 11 0 obj << /Creator <feff005700720069007400650072> /Title <feff00440065006600610075006c0074002000470056004300200074006500 +6d0070006c006100740065002d0055004b> /Producer <feff004c0069006200720065004f0066006600690063006500200036002 +e0034> /CreationDate (D:20200626184830+02'00') >> endobj xref 0 12 0000000000 65535 f 0000000015 00000 n 0000000422 00000 n 0000011269 00000 n 0000000218 00000 n 0000000522 00000 n 0000010335 00000 n 0000010537 00000 n 0000011029 00000 n 0000011236 00000 n 0000011326 00000 n 0000011477 00000 n trailer << /Info 11 0 R /Root 10 0 R /Size 12 /ID [<e96dcc1e51afc88fdd4e014ad07cc03b> <e96dcc1e51afc88fdd4e014ad07cc +03b>] >> startxref 11763 %%EOF


I've added it so that you can see the correct start offsets of the different objects. The list of object offsets (the XREF table) can be found near the end of the above document (copying and pasting it into an editor should reveal this...) and is also given below for clarity. There are apparently 11 COS objects (the first line must be ignored), and e.g. the 3rd COS object has an offset of 11269 bytes from the beginning of the file (so it's located after a section that has binary content in it).

0000000000 65535 f
0000000015 00000 n
0000000422 00000 n
0000011269 00000 n
0000000218 00000 n
0000000522 00000 n
0000010335 00000 n
0000010537 00000 n
0000011029 00000 n
0000011236 00000 n
0000011326 00000 n
0000011477 00000 n
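
For anyone who wants to pull those offsets out programmatically, here is a minimal sketch of how the entries in the table above could be parsed. It assumes a classic (uncompressed) XREF section with a single subsection starting at object 0, as in this file; the sub name parse_xref_entries and the hash keys are my own, not part of any module.

```perl
use strict;
use warnings;

# Parse classic XREF entries: a 10-digit byte offset, a 5-digit
# generation number, and a flag: 'n' (in use) or 'f' (free).
# Hypothetical helper, written for this example only.
sub parse_xref_entries {
    my ($text) = @_;
    my @entries;
    while ($text =~ /(\d{10}) (\d{5}) ([nf])/g) {
        push @entries, {
            offset => $1 + 0,      # numify: "0000000015" becomes 15
            gen    => $2 + 0,
            in_use => $3 eq 'n',
        };
    }
    return @entries;
}

# First four entries of the table above, as a sample.
my $xref = '0000000000 65535 f 0000000015 00000 n '
         . '0000000422 00000 n 0000011269 00000 n';

my @entries = parse_xref_entries($xref);
for my $num (0 .. $#entries) {
    my $entry = $entries[$num];
    next unless $entry->{in_use};    # skip the free entry (object 0)
    printf "object %d starts at byte offset %d\n", $num, $entry->{offset};
}
```

Because the subsection starts at object 0, the array index doubles as the object number. A file with several XREF subsections would need the subsection headers parsed as well.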


Now for the dumped version of the above: I've used your script and sent the output to a file. This is the result (same here, I used "readmore" and "code" to avoid showing it directly):

my $data = do { require MIME::Base64; MIME::Base64::decode("JVBERi0xLjUKJcOkw7zDtsOfCjIgMCBvYmoKPDwvTGVuZ3 +RoIDMgMCBSL0ZpbHRlci9GbGF0ZURlY29kZT4+CnN0cmVhbQp4nCWKsQrCQBBE+/2KqYW +cu2s22YNjQUELu8CBhdip6QTT+PtelIHHMPM4CT70BoMTq8OyJR0N3kvyQbA86LLB62+0 +LDMdKvWaBozuSTyj3rE9CURRn9fCEmaFlXfRaeE+Giyk8BCdrdT1HKNrk/+UHN76Pm71T +MdKE034AkzVH9wKZW5kc3RyZWFtCmVuZG9iagoKMyAwIG9iagoxMjkKZW5kb2JqCgo1ID +Agb2JqCjw8L0xlbmd0aCA2IDAgUi9GaWx0ZXIvRmxhdGVEZWNvZGUvTGVuZ3RoMSA5NzQ +0Pj4Kc3RyZWFtCnic5ThtUFvXlee8JwmBAElEYIRs68kytjEfAh527MQYGZDABhvZQCo5 +iZFAAikBSZEEjpPNhm2+PDiu3TSb5ms2bif1Jl3v5BG7XaeTjUk3aZvppkm2aafd1I1nm +m6303jjpm6mmxSx5149MFAnne7sv73Sfe98n3PPOfe+J2VS4xEohEkQwT00FkquKTKUAM +C/AmDJ0ERGau4pvZ7gCwDCvw0nR8Ye/6ebLgNozgDknRkZPTRsPTm5A6AwCpA/E42Ews3 +lx1wAZRvIxuYoEfZmD+URHiR8bXQsc/s28bWNhN9LeM1oYiiUMaUKCFcILx8L3Z7cqHEL +hH+PcCkeGotILZpOwv8TwDCRTKQzYVg7B7CG2ZOSqUiy+/HBVwmfBBCPEw3pw0YhgTqGC +6JGq8vT5xfA/8+hPQql0KltBiMk+XXJEE+BFR4DmHufYVeu2e65j/8vo9Dnbo/CSTgDR+ +GncLPK8IIPYjBOlMXjZXiLqGz4YD98HaY+xewpOEv8nFwQjrGVXHX44MtwGr67xIsPxuB +OiuUb8FNsgNeoVRLwIerhb+BVsvoh0XZfzZRQTJdhDg4vor4DTwhHYJfwHiGPMY7gEkzw +CjyJB8hyhtZ5dGHF2/7E6ANwF117IQoTBPOhbf7jv0P+3O9oVXfBLvg87IDRRRov4lMit +bTYB09RTl/mNNc8M69TvEX4piDMfomQL8IIzRDS2oWj4o5PydBfPMR+KMIqsRLyr8YVms +CY/VhonLssroUC6J+7NE+b65r7nRjKxjUDmpXaZs33P8uH7ouaMdKGuV9m78yGtXu0J6l +azwC4O27cH/D39/Xu2+vr2bO7u2vXzs4Or6e9rXWHu2V787brr9u65drNmxrqXXW1NRvW +r6tc61zjsJdbzCZjcZGhIF+fp9NqRAGhRlIw6FHESsnsDTk9zlBnbY3kKY+219Z4nN6gI +oUkhW6adc7OTk5yhhQpKCnr6BZaRA4qbpIcXibpzkm6FyTRJG2DbcyFU1Jeb3dKZ3H/Xj +/BR9udAUm5yOHdHNas40gRIQ4HafCoWLSSR/FORKc8QYoRpw0Fbc62SEFtDUwXGAg0EKR +scCanccN25ICwwXPdtAD6IuaWVuoJhRXfXr+n3eZwBGprdirFznbOgjZuUtG1KXncpBRj +ocMRabpmZurBsyYYDFYXhp3h0E1+RQyR7pTomZp6QDFXK1XOdqXqjvfKaeURpcbZ7lGqm +dWufQt+uq64REVbaXJKU78HWo7z4vtLKSGVoqs0/R4YqAhtCu7zO9iweSnXU1Nep+SdCk +6Fzs5NDjolk3NqurBwKumhdIPPTybOzn3riE3xPhhQTMEoXhdQl+7d16Vcs/dGvyJUeqV +oiCj0bXE6ttgc5gUZ36exgdJCyaEMOxwsDUfOumGQEGVyrz+HSzBoex7cruqAIgQZZ2ae 
+U9rPOJPznAX1oJNq29Xrn1I0lTvDTg9l/EhImRyk7rqFFcZpUoo/sjmcUyVmaasrwGUli +mpnOCYp2nWUJNJarEB9w1SmTBwp/ih3u2gjB+vMJdJWJ5lhdjxOT1D9TkTLyYBEie6szj +VCn19xtxPgDqkV80zXu0gjFKSCxdp5MRWXM6lYnK0L1WVheWK9fq6iqimWNgWCQ6qW4vL +wfSV5poLtuRCYLede/wsgz12YbpJsp2VogkA7Ey5roy5b55nyh4cVe9AWpn03LPltDsUd +oAoHnP5IgLUdZajqgo03R4D3Sp+/q9fZtXe/f4saSI7BzGkqPcvMOP22nBlqQEVfqZf8g +k0MkKCJCJKXAGfrNroqeZV6miZKOKeyxm3dJvnRBvPSFIZSJXki7aocw5cY1bJ2auuct6 +ZjKNlp67Q5Ao7cqK0RiC2pjklDz5LaOc+iY4oYeurPtk5OYrksZ00v+Z0RZ8AZlRS3z8/ +WxtLDs6wmg+dcrVXfEmxRsihN4CD2PMKSqXirbYuTq3RwfAHtXMbeOc+WpvTOrt4pZtyp +GgSKfKcCrIXdW8w2fhawDe2ks1cy0ZbmG3pq2u1mmzl6HTPi3Bmecvb6t3FpOk/ust3Bf +JVAF3b1tdbW0NHWOu3Ew3un3Xi4d7//BRO9Fx7u8z8voNAWbA1MryWe/wWJHhqcKjAqIz +JEYgiztI8QPZe3veAGmORcDSdwfOgsAqfp52kIQ2eFHM2Uc7SOO3KDQBxNjuOel9YQTZ+ +jTXIaH9PAUuYu0Lr17nx3oVAk2KaRkZ4nyrfoPTYf4XQhFqFtmrT2cfJZnJzOd9tyEpMk +4c5FeLj/iuv+/f7ThfR0tvErOWplg9qlPErFpseKRwqzRvmrQHQqGGCbjd6/hUr6ooLO7 +VQm53YKRFeoFDgjrYrB2croLYzekqPrGD2PWhTLkNQnqfY+BVkH3Oh30JaUKl6zTZkusk +oF6FCZMv2yloKz0FvNWW0nvYOW4ID7Q3Ox0agpKTIVFublmTTiNZaiYnNxMFBiNqOJns+ +FeRojGgcCBVhy2YLvWfBtC75iwTMWfNqCD1vwXgtmLBi2YJ8F2y3YZMG1FrRYUGPBv1R+ +62coLJbWcJkZCwqKBU9Y8LgFJy2YtKDPgm4L1ltQsqDJghe40DKBHgverI7bFsbAbbell +owDNy8bty0b0CJXm0GW5fIWWS7Z6pKroU42l+CKrWa6bWXXrVsb6itLHZuuRRlXsLvoEF +F04OvZjkfxtZfwna/PvnbmvtlLD+CR/8Afbtq0yab5wyd6G93xnuxdmujsOHv5Quibe1/ +4ofgqbICAu8mRZ6koohJWbSxyiCtWrPYFbCtMosEXyBPLJjdiciMGN6JvI0ob8bmNOLAR +ezbifPgsZihngZcARViCLEyZvg31aNE516xbv0leUSY3bmpyYZ2wqWmz3Lii1Ll+nXONr +tRStmK1KPxw+h+9z9bXNnTd/u3HApGbGp89PvKEa+Om1N7+3Xu+tL/FifoHj68q+dU97S +fvaFrlaB/y/tUx++tjLl/71j0VjXVtN/D1VNMbpZXW04DPu+fMhbqVKx2wYUNtraNQlBs +b6nyBBuMGx0pzYW11rS9gN1aXWnW6/HzLvkC+aT298IqV+wKiaULGG2TcLONaGctk1Mn4 +kYzvyfi2jN+R8WkZH5FxUEb0ydguYz2Xs8iokTF6aV7wjIwZGd0yNnE28S7L+I6MMzIq3 +Ma9MoZl1UROxjQv9qaMr8j4DzIe52K3yni9jNK8jy05BydkDMrYN+/DwjXf45oPyzhJ7t +3Vi/g2rvseD0BQuECSuyevRhn1alMOLOnGq3Xskk5d3tMDS4SghbWxmdp4/i6rjS1zmtr +R6ocaZTvKjWUr2NWa62pzk3NNsZBXVmpWUeqXPBVev8mB3q5n3J7xVbvfaL90KNv/4IkK 
+j6el1Hw023qkv99/z9HsDQcP4jVisPq6pq3VrdnfzD5ira21Cv5T+oIizeYd82hvYNWsl +YGiZK3lP22gc+598TbxZbBBJYy5W8z6ykqNVFho1Yj0U2NNwZq9gfJSs3mlL2A0281CoW +g2g76gLE9De6UUSn0BME2ux4H16F6PBNxMiaJk0P7ge4S29MCBm/lO4Ym5kgeWhMbcatf +TpjE3bccW3MRWbUTnps2YV4ylFrlx87X41uNfHM9mr0lN/3bniUePduwK967Z8lWEe+4f +ONY+1Ci+/Nefn73PWnsgheUH7twhar4Uusk1/rozu1qjPRBX7OVsr1TRQk3ao/Tr7qvup +JZ+M+l8AfqlpxW1voBY+rYBXzHgGQM+bcCHDXivATMGDBtwrQEtBtQY6ETlEscNKCQNGD +Sgz4BuA84YUDHgCY6aDAgGvMRRklsstqRRWAcNLOuzXPPk2oQddleKnvhK1nriBHq9rGR +aoZw9d64H0Di0zaCHcphxT0KptqDAWGqssObrgoH8/KKSEpEeO6aBQIlYYCyiR05RybEK +vLsCExXoqkBjBb5bgecq8ClO6anAFk6f4/Q3OHGAi23JyZ3jyjnN57ja3VzHzinzuym3t +gNLNg7fLguL401Q7qpWD0x1HzStq0YzlVl7pdUlM52i4v7HvzkYffYr2T0/mv3+U6fwY3 +z/v38tKl/7wux9j1/OtrLjXfN3FZuy4z/4McvJ3B+145STfFgB3W6X1gJFlqJy64rSgcA +KTTCwQjRZBgKmvGDAVAJWbHFbUbLiBSuesGLSyg92Cp0i5X17JUKKzwQOp5kO8xKUwExI +pZMHqXk6+1b2V2du/9pHv579A6ZxOPv32Weza06dOiU8g1Zc88mdelwjvpr9RvZMVsme1 +OSi5ec2q18VxboCLrtPlpWUWMyIOp3FIFrLzRAMDJgTZqHWjCK5Mwv5WrOZTm0TVTZvIJ +Avok6jGwhoSs5Y8WkrPmzFSStmrBi2osaKl6z4nhXf5nQiBq3YZ8V2K75pxVesuKBy77w +KcSkT9TwZFm5h62VuIidH+IyVTlAr0oP9ao/zZUfj8krTxqdDcunZRxm9UmgzPRSp+o5r +ZYLwmXdnX37qlPhfrVLy7XfwiL252S7sn/1oodLnflo8+9aJbPirlMMH5n6Bh+BHYIByt +wF0usIiMf+JG8VrILeHmJ9K/ijmT98yPORpavJ4Zdl7U0NnZ4Ps9fL/VQWA+pt7TAPGbb +8X7Ln/9L7X/uYPrvxjM/cLnVXL/unSL5BIL8+R9cDnFlOWDINuK50s3wWLwP5XWgV9hFe +LR6FT2ApVmjRcT7zr2V0A9d+xn8BP0C2EhR+L28TvaFZqXtJc4FYN0Mhi5JGawAU3EfAv +4ndA5NzVGF/wfcNCHEiSN6iwAHkwrMIine9jKqwhmcMqrIVieFSFdfQue1KF8+AO+IYK6 +8GCLhXOh2JsU+ECjONeFTbASuHcwr/VdcI7KlwEm8R8FS6GCnE7i17D/mU7JfpVGEHSaF +RYgGLNWhUWYbOmUYU1JBNVYS2s1BxWYR2s1jytwnlwWfNtFdbDBu03VTgfVmp/rsIFws+ +0H6uwAbbof6zChXBTfrEKF8Et+beocDE05b/dHhuJZWJ3RMJSOJQJSUOJ5KFUbCSakTYM +VUmN9Q31UkciMTIakdoSqWQiFcrEEvG6grblYo3SPjLRGcrUSDvjQ3XdscFITlbqjaRiw +/siI+OjodSO9FAkHo6kpFppucRy/IZIKs2Qxrr6+jr5Cne5cCwthaRMKhSOjIVSt0qJ4a +WBSKnISCydiaSIGItL/XW9dZIvlInEM1IoHpb6FhR7hodjQxFOHIqkMiESTmSiFOot46l +YOhwbYt7SdQsrWJSO3kxkIiLtDmUykXQi3hpKky+KrC8WT6RrpIPR2FBUOhhKS+FIOjYS 
+J+bgIWmpjkTcEK0lHk9MkMmJSA3FPZyKpKOx+IiUZktWtaVMNJRhix6LZFKxodDo6CGq2 +ViStAapSAdjmSg5HoukpT2Rg9K+xFgo/vW6XCiUm2FKqhQbS6YSEzzG2vRQKhKJk7NQOD +QYG41lyFo0lAoNUcYobbGhNM8IJUJKhuK1nvFUIhmhSD/X0X1FkALMZTOdGJ0gz0w6Hom +EmUcKeyIySkrkeDSRuJWtZziRokDDmWjtosiHE/EMqSakUDhMC6dsJYbGx1idKM2Z+eBC +Q6kE8ZKjoQxZGUvXRTOZ5HUu18GDB+tCammGqDJ1ZNn1WbzMoWRErUeKWRkb7abyx1npx +nl92SJ6d3ZLPUnKj5eCk1SBGmm+NRvqGlQXlMZYMpOuS8dG6xKpEVePtxvaIQYjNDM074 +AIhEGiGSI8RNAQJCAJhyDFpaJElejX2xC9zUl0MtZDA00JOkgqQfxR0pegjeAUabFriNt +NQBzqoIBzPttaI0H71Cg6uXYNQTtJf4gsdJPeIHEX25Wgl1NidM4yzREYpzhCRNkBadKK +kEyYS0hQS/PP2fhz/Bs4lF7gNFJc9fSpA/mqun/OcoxsSTzXGc5hsY7x+G8lWoL0PisjE +slFeP3SxIlwLMytMtv9JNHLpXxck+Uiw73FuVTfVTz2kMdh0h/itZyXHOK2WU/kLCcIjq +pZvYUynuIRhLne/NrS5PlPa3D17ujl0U1wn7s5neFpzmslPK2uK5ezPh5FgqgsFwcpEuY +3yuEQz2eYa7Mui6uag9R30mf6kVTdkFqXOPcxoUbJdGrUfA/za5r7jZMPiceXq/JS3xLP +U4hnPVfpMeJmuOwQ0Ufpc0jdZ2OUlZyvQXUnHeT7MqqueIzblWAP3Q/yrkjwusUda3iNr +2Ql1zfDaqdKXDdJcIKvYj6Ptbw2bCURHimDQnzvD5LGKPediy3KuyPEaxtRa53hK5jPV1 +hdKYs6ySm14OF9wXZ8RM3p5+ik6L6qxVwGF/cmq8kojze9yHacRxteWGMu20xqVPWUW/E +oP5FuXajPMO+3XEbD3Frtp+R8mOcmo3pN8IjC9MlVPNdbCdId5/XI7adcN2f+JHMhnt+E +qpfk51JGjWWM748o78AkXEfvli6Kjn3qeB8u3jVD6p6pU2N2/a/1WFxJnsHF+yO1EMsYx +dit7v74wq4bX7R/5yvRS2dQNz8vkmr/eNXMScsssF2z/NRsIH8Ny1aR68YY4RkeT5rnso +6vYYT4PeShm79H534ZOCimq4zpfN+OQYwAYhRH4BqwYxD24AD04w5oRjfd3cRrpXsb4ex +eh80wSXLNRN9O+DaiX0+Hp52uLTR7aB6jqaGZk6gnCRfdXSpeS3gNabxBV+STUVuIyu67 +CO+ke4d69xLdQ3ePiu8knO4QxDx6EW/h13OocZ/GC7P4xixKs3j3J+j7BCc/PP6h8NtLV +fbnLp27JPR8MPDBcx+I9R+g8QPUw0XTRd/F4MXkxRMXdQXG97EQfoPmX1zYYn+3+Xz/z5 +t/1g/naWXn68/7zk+eV85rz6PY/zOxzG6akWbqZ5IzkzNvzlyYuTSjn3zp+EvCP7/osht +ftL8o2E/3nL77tBh8Bo3P2J8RfE8EnxCOP4nGJ+1Pup4UH3+szv5Yx2r7lx9Zb7/wyKVH +hLNzM6cfKTJ7X8Qe7IZmyuGe0+Kc/bkdpbiblmWkq52mi2YPzQTNYzTpdw+J22m6sNu9R +Rz4WzQ8ZHuo+qE7HzrykDZ5/+T9x+8XJ+87fp/w3MS5CSHtq7In4tX2eMdGu1Uu78+TxX +4duSHv7p2DlRu8wQG3fYCEbtxfb9/fUWW/Ri7p19KCNSRoFO1ii9gjJsRj4jkxT7/Pt9q ++l+YF3yWf4PblF3qNPfYeV494du6CO9LlIGu7krsmd4k7vVX2zo4tdmOHvcPV8UbHux0f 
+dOgGOvAp+nqf857zim5vlcvr9q52eFd22vrL5NJ+Mxr7TbKxX0AqtAz9LuOcUTAaB4x3G +0UjtIAwWYZaPIvHp/t6q6u7zubN7etS9L4bFTysVPayq3vvfkV3WIH+/Tf6pxG/ELjv6F +FoXdWlNPb6leCqQJcSJsDNgEkCTKumy6A1kE5nqvnA6mqCx+kK1ePVRDyQzlFhgQ/VaUz +TGZXmSljNBHI40rWa8YjA9JC0D6SBXRizOqfEtNOqOa6cu3Cg/MD/AK8VjkMKZW5kc3Ry +ZWFtCmVuZG9iagoKNiAwIG9iago1NjIzCmVuZG9iagoKNyAwIG9iago8PC9UeXBlL0Zvb +nREZXNjcmlwdG9yL0ZvbnROYW1lL0JBQUFBQStMaWJlcmF0aW9uU2VyaWYKL0ZsYWdzID +YKL0ZvbnRCQm94Wy01NDMgLTMwMyAxMjc4IDk4Ml0vSXRhbGljQW5nbGUgMAovQXNjZW5 +0IDg5MQovRGVzY2VudCAtMjE2Ci9DYXBIZWlnaHQgOTgxCi9TdGVtViA4MAovRm9udEZp +bGUyIDUgMCBSCj4+CmVuZG9iagoKOCAwIG9iago8PC9MZW5ndGggMjcwL0ZpbHRlci9Gb +GF0ZURlY29kZT4+CnN0cmVhbQp4nF2Rz2rEIBDG7z6Fx+1hiclmsy2EwJLtQg79Q9M+gN +FJKjQqxhzy9tVx20IPym+Y+Ybv06ztLp1WPnt1RvTg6ai0dLCY1QmgA0xKk7ygUgl/q/A +WM7ckC9p+WzzMnR5NXZPsLfQW7za6O0szwB3JXpwEp/REdx9tH+p+tfYLZtCeMtI0VMIY +9jxx+8xnyFC172RoK7/tg+Rv4H2zQAus82RFGAmL5QIc1xOQmrGG1tdrQ0DLf72cJckwi +k/uwmgeRhkr8yZwgVywyAfk0yFyiVxdIh8T43yVuI18Qj6i9j7tLCM/JK4in9P+RzR2cx +Atxjf8iU7F6lyIjQ+NeWNSpeH3L6yxUYXnG5CdgocKZW5kc3RyZWFtCmVuZG9iagoKOSA +wIG9iago8PC9UeXBlL0ZvbnQvU3VidHlwZS9UcnVlVHlwZS9CYXNlRm9udC9CQUFBQUEr +TGliZXJhdGlvblNlcmlmCi9GaXJzdENoYXIgMAovTGFzdENoYXIgMTAKL1dpZHRoc1s3N +zcgNzIyIDI1MCAzODkgNzc3IDQ0MyAyNzcgNTU2IDcyMiA1NTYgMjUwIF0KL0ZvbnREZX +NjcmlwdG9yIDcgMCBSCi9Ub1VuaWNvZGUgOCAwIFIKPj4KZW5kb2JqCgoxMCAwIG9iago +8PC9GMSA5IDAgUgo+PgplbmRvYmoKCjExIDAgb2JqCjw8L0ZvbnQgMTAgMCBSCi9Qcm9j +U2V0Wy9QREYvVGV4dF0KPj4KZW5kb2JqCgoxIDAgb2JqCjw8L1R5cGUvUGFnZS9QYXJlb +nQgNCAwIFIvUmVzb3VyY2VzIDExIDAgUi9NZWRpYUJveFswIDAgNTk1LjMwMzkzNzAwNz +g3NCA4NDEuODg5NzYzNzc5NTI4XS9Hcm91cDw8L1MvVHJhbnNwYXJlbmN5L0NTL0Rldml +jZVJHQi9JIHRydWU+Pi9Db250ZW50cyAyIDAgUj4+CmVuZG9iagoKNCAwIG9iago8PC9U +eXBlL1BhZ2VzCi9SZXNvdXJjZXMgMTEgMCBSCi9NZWRpYUJveFsgMCAwIDU5NSA4NDEgX +QovS2lkc1sgMSAwIFIgXQovQ291bnQgMT4+CmVuZG9iagoKMTIgMCBvYmoKPDwvVHlwZS +9DYXRhbG9nL1BhZ2VzIDQgMCBSCi9PcGVuQWN0aW9uWzEgMCBSIC9YWVogbnVsbCBudWx +sIDBdCi9WaWV3ZXJQcmVmZXJlbmNlczw8L0Rpc3BsYXlEb2NUaXRsZSB0cnVlCj4+Ci9M 
+YW5nKG5sLUJFKQo+PgplbmRvYmoKCjEzIDAgb2JqCjw8L1RpdGxlPEZFRkYwMDQ0MDA2N +TAwNjYwMDYxMDA3NTAwNkMwMDc0MDAyMDAwNDcwMDU2MDA0MzAwMjAwMDc0MDA2NTAwNk +QwMDcwMDA2QzAwNjEwMDc0MDA2NTAwMkQwMDU1MDA0Qj4KL0NyZWF0b3I8RkVGRjAwNTc +wMDcyMDA2OTAwNzQwMDY1MDA3Mj4KL1Byb2R1Y2VyPEZFRkYwMDRDMDA2OTAwNjIwMDcy +MDA2NTAwNEYwMDY2MDA2NjAwNjkwMDYzMDA2NTAwMjAwMDM2MDAyRTAwMzQ+Ci9DcmVhd +GlvbkRhdGUoRDoyMDIwMDYyNjE4NDgzMCswMicwMCcpPj4KZW5kb2JqCgp4cmVmCjAgMT +QKMDAwMDAwMDAwMCA2NTUzNSBmIAowMDAwMDA2Nzg4IDAwMDAwIG4gCjAwMDAwMDAwMTk +gMDAwMDAgbiAKMDAwMDAwMDIxOSAwMDAwMCBuIAowMDAwMDA2OTU3IDAwMDAwIG4gCjAw +MDAwMDAyMzkgMDAwMDAgbiAKMDAwMDAwNTk0NiAwMDAwMCBuIAowMDAwMDA1OTY3IDAwM +DAwIG4gCjAwMDAwMDYxNjIgMDAwMDAgbiAKMDAwMDAwNjUwMSAwMDAwMCBuIAowMDAwMD +A2NzAxIDAwMDAwIG4gCjAwMDAwMDY3MzMgMDAwMDAgbiAKMDAwMDAwNzA1NiAwMDAwMCB +uIAowMDAwMDA3MTk4IDAwMDAwIG4gCnRyYWlsZXIKPDwvU2l6ZSAxNC9Sb290IDEyIDAg +UgovSW5mbyAxMyAwIFIKL0lEIFsgPEU5NkRDQzFFNTFBRkM4OEZERDRFMDE0QUQwN0NDM +DNCPgo8RTk2RENDMUU1MUFGQzg4RkRENEUwMTRBRDA3Q0MwM0I+IF0KL0RvY0NoZWNrc3 +VtIC9BRTk0RDREMDE2Mjg2RUQ4MUY5QkFDQ0FGOEVCN0U3OAo+PgpzdGFydHhyZWYKNzQ +3OAolJUVPRgo="); };
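
In case it helps, this is roughly how the decoded $data could be written back to disk byte for byte. It's only a sketch: write_raw and the file name example_from_dump.pdf are names I made up, and a tiny stand-in string takes the place of the big dump above. The ':raw' layer matters, since any newline or encoding translation would shift the byte offsets.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64 encode_base64);

# Write bytes to a file without newline or encoding translation.
# Hypothetical helper; returns the resulting file size in bytes.
sub write_raw {
    my ($file, $bytes) = @_;
    open my $fh, '>:raw', $file or die "Cannot open $file: $!";
    print {$fh} $bytes;
    close $fh or die "Cannot close $file: $!";
    return -s $file;
}

# Tiny stand-in for the large base64 string in the dump above.
my $data = decode_base64(encode_base64("%PDF-1.5\n1 0 obj\n"));

my $size = write_raw('example_from_dump.pdf', $data);
print "Wrote $size bytes\n";
```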


And finally, the code I've used to search for the different locations where x 0 obj occurs. x is a number going from 1 to - in this example - 11.

To use it, first save the above content (full or dumped code) into a file of your choice, then call the script using perl sscce.pl <filename>.

use 5.010;
use strict;
use warnings;
use Path::Tiny;

# Read the file as raw bytes, so that regex positions are byte offsets.
my $data = path($ARGV[0])->slurp_raw;

my $outcome = get_nr_of_cos_objects();
find_obj_blocks($outcome);

# Returns the number of COS items found in an uncompressed PDF file.
# We're looking for "/Size" followed by a number that indicates the
# number of entries in the XREF table.
# Note that we have to subtract 1 from the result, since the first
# entry in the XREF table (object 0) is the head of the free list and
# doesn't refer to an obj block.
sub get_nr_of_cos_objects {
    my ($size) = $data =~ m{/Size\s+(\d+)};
    die "No /Size entry found\n" unless defined $size;
    my $result = $size - 1;
    say "Number of COS: $result.";
    return $result;
}

# Searches for the occurrences of patterns like "^x 0 obj" where x is
# a number from 1 to the number of objects passed as a parameter.
sub find_obj_blocks {
    my ($count) = @_;
    for my $n (1 .. $count) {
        if ($data =~ /^\Q$n\E 0 obj/m) {
            # $-[0] and $+[0] are the start and end offsets of the match.
            printf("Begin of Block [%5d] found at position [%d]\n",   $n, $-[0]);
            printf("End of Block   [%5d] found at position [%d]\n\n", $n, $+[0]);
        }
    }
}
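
The script above runs one search from the start of the file for every object number. As an alternative, here is a sketch of a single pass that collects the offsets of all "N G obj" headers at once. The sub name index_objects is made up for this example, and the caveat is the same as for the script above: a binary stream could, by chance, contain a line that looks like an object header, so the offsets are still best checked against the XREF table.

```perl
use strict;
use warnings;

# Return a hash mapping object number => byte offset of its
# "N G obj" header. Expects the file content as raw bytes.
# Hypothetical helper, written for this example only.
sub index_objects {
    my ($data) = @_;
    my %offset_of;
    while ($data =~ /^(\d+) (\d+) obj\b/mg) {
        $offset_of{$1} //= $-[0];    # keep the first occurrence only
    }
    return %offset_of;
}

# Small in-memory stand-in for a slurped PDF file.
my $sample = "%PDF-1.5\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n";

my %offset_of = index_objects($sample);
printf "object %d at byte offset %d\n", $_, $offset_of{$_}
    for sort { $a <=> $b } keys %offset_of;
```

On a real file the data would come from path($file)->slurp_raw, as in the script above, so that the positions are byte offsets rather than character offsets.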

Since everything works now, I was able to scan my whole PDF file, and compared to the speed I got doing the same in Python, it's blazing fast in Perl!!! What took "forever" in Python - say, half an hour or so? dunno really, since I never had the patience to wait until it finished - takes not even half a minute in Perl...

Indeed, once again it's proven that Perl is much, much better, more powerful and faster for text processing than almost any other language (apart from pure C, I guess)!

Thanks anyway for your patience and willingness to help me. Much, much appreciated! Others too!

Best rgds,
Geert
Re^7: Calculated position incorrect when using regex in text file that also contains binary info
by haukex (Archbishop) on Jun 29, 2020 at 14:52 UTC

    I'm glad you got it figured out, and thanks for taking the time to make the SSCCE! We're here for Rubber duck debugging too ;-)