Author Topic: Any .pdf experts on the forum ?  (Read 14233 times)

Offline awemawson

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8966
  • Country: gb
  • East Sussex, UK
Any .pdf experts on the forum ?
« on: January 03, 2015, 10:00:45 AM »
I have spent some time scanning the rather tatty manuals for my Fanuc Wire Eroder into searchable PDF files, and have got the bulk of the work done, but two aspects have me stumped.

Firstly, the first two pages of one manual are particularly grubby, and were originally printed on a yellow background. I'd like to lift the text from them and drop it onto a clean page. You can search for individual words on the page, so I'm sure it must be possible to grab the text somehow and drop the background - but how? (NB certain lines on these two pages have 'links' on them to other pages in the document; these won't work, as you only have the two pages and not the rest of the document, which is 2 x 36 MB!)

Secondly, the scanning process uses an OCR program to make the pdf searchable, which is excellent, BUT it also decides to rotate some pages into 'landscape' to get the text the right way up. I want them all in 'portrait'. The pages as scanned are rather varied in size, but roughly A4. I can use a utility to rotate the page, but if I then use another utility to make all the pages A4 sized, it inverts these pages. This seems to happen whatever order I do the conversion in.

It's frustrating as it's VERY close to what I want, but not quite there  :bang:

Total job is about 500 pages of scanned and OCR'd A4 which I've done, and gone through cleaning up grubby finger marks and edge effects - it's just these two aspects baulking me at the moment if anyone can help.

The scanner is a Plustek Opticbook 3800
The scanning software is 'Book Pavilion' and the OCR software it uses in the background is 'FineReader Sprint 9', both of which are bundled with the scanner.

The PDF utilities I've been using are: PDFill Editor ($20) and its free PDF Tools

http://www.pdfill.com/

Incidentally I chose this scanner as it can scan up to 2 mm from an edge, so a book can lie on it with half hanging down the front and still scan into the fold without leaving a blank strip that would lose text in some books.

http://plustek.com/usa/products/opticbook-series/opticbook-3800/

Any help would be appreciated.

Andrew Mawson
East Sussex

lordedmond

  • Guest
Re: Any .pdf experts on the forum ?
« Reply #1 on: January 03, 2015, 10:10:33 AM »
Andrew
I am no expert with PDF, but the SIL uses it a lot, scanning with a ScanSnap scanner into the Acrobat program (not cheap), and it's the Acrobat program that should do what you want. He uses it for his business, which is governed by the FSA, so I cannot go into details, but he does run a new Porsche.

Note I do not mean the free Reader but the one that generates proper PDFs, including setting passwords and copy limitations.

Stuart

Give it a try here
https://www.acrobat.com/en_us/free-trial-download.html

Edited for link

Offline woodguy

  • Jr. Member
  • **
  • Posts: 36
Re: Any .pdf experts on the forum ?
« Reply #2 on: January 03, 2015, 10:32:54 AM »
If I understand you correctly, you wish to extract the text and place it on a new page which would then have a background of your choosing.  If you scan the document to jpg or pdf, you can use a number of methods to extract the editable text, then place it on a new page.

See: http://www.wikihow.com/Convert-Images-and-PDF-Files-to-Editable-Text

Offline Brass_Machine

  • Administrator
  • Hero Member
  • *****
  • Posts: 5504
  • Country: us
Re: Any .pdf experts on the forum ?
« Reply #3 on: January 03, 2015, 10:39:10 AM »
Hi Andrew,

I have an expert I can pose these questions to. Will let you know.

Eric
Science is fun.

We're all mad here. I'm mad. You're mad.

Offline vtsteam

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 6466
  • Country: us
  • Republic of Vermont
Re: Any .pdf experts on the forum ?
« Reply #4 on: January 03, 2015, 11:18:40 AM »
Maybe Inkscape, Andrew.

https://inkscape.org/en/

(Re-reading your post, I'm not quite sure what you want to do exactly other than lift text off yellow pages -- didn't quite get the other part. Inkscape will import pdf docs and convert them to editable svg or do many other things, and will also work with text. It will also write pdfs.)
I love it when a Plan B comes together!
Steve
https://www.youtube.com/watch?v=4sDubB0-REg

Offline John Stevenson

  • In Memoriam
  • Hero Member
  • *****
  • Posts: 1643
  • Nottingham, England.
Re: Any .pdf experts on the forum ?
« Reply #5 on: January 03, 2015, 11:47:47 AM »
Andrew,
Can't help on the rotate bit but send me the pdf's or jpgs of the two grubby pages.
John Stevenson

Offline awemawson

Re: Any .pdf experts on the forum ?
« Reply #6 on: January 03, 2015, 11:49:15 AM »
Thanks chaps for the various suggestions, which I will follow up.

Meanwhile I've made a bit of progress, in that I've scanned those two pages to Word 2007, the idea being that if I end up with a good quality printable page, I can then scan it into a searchable pdf and edit it into my original large document.

.... but now I'm having a hissy trying to sort out setting / clearing tabs in Word 2007 to get the referenced page numbers to line up  :bang:

I'd upload a copy but it's not a format supported by the forum  :bang: :bang:
Andrew Mawson
East Sussex

Offline awemawson

Re: Any .pdf experts on the forum ?
« Reply #7 on: January 03, 2015, 11:50:57 AM »
Quote from: John Stevenson
Andrew,
Can't help on the rotate bit but send me the pdf's or jpgs of the two grubby pages.

They are on the way to you John :)
Andrew Mawson
East Sussex

Offline vtsteam

Re: Any .pdf experts on the forum ?
« Reply #8 on: January 03, 2015, 12:02:10 PM »
Quote from: awemawson
I'd upload a copy but it's not a format supported by the forum  :bang: :bang:

You can pretty much upload anything if you zip it first.
I love it when a Plan B comes together!
Steve
https://www.youtube.com/watch?v=4sDubB0-REg

Offline Arbalist

  • Hero Member
  • *****
  • Posts: 673
  • Country: gb
Re: Any .pdf experts on the forum ?
« Reply #9 on: January 03, 2015, 03:02:37 PM »
I used to generate PDF files at work using Adobe Acrobat from PageMaker and later Adobe InDesign. When done in this way the text is selectable and scalable. I notice it's not in your PDF. Other folks at work without Acrobat software used to print Word documents then scan them as PDF. As far as I could tell these weren't proper PDF files with selectable and scalable text etc, but just a graphic file like a JPEG with a PDF extension. There seem to be lots of packages out there to produce PDFs, but many of them don't seem to work very well judging by the results I've seen.

As a point of interest, PDF was built into Apple's OS X software from its inception, so it's extremely easy to produce PDF files without additional software.


Offline awemawson

Re: Any .pdf experts on the forum ?
« Reply #10 on: January 03, 2015, 05:00:03 PM »
The structure of a pdf file is a mystery to me. As I understand it, the "scanned to searchable" pdfs are just an image, not characters as such, but how they can then be searchable I don't know.
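
For reference, the usual arrangement, as far as I understand the PDF format, is that the OCR software keeps the scanned image as the visible page and lays the recognised words over it as invisible text (text rendering mode 3, which neither fills nor strokes the characters). Searching and copying work on that hidden layer. A rough Python sketch of what such a page's drawing instructions might look like follows. The resource names /Im0 and /F1, the coordinates and the helper names are made up for illustration, and a real OCR program will emit something more elaborate:

```python
def pdf_literal(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def searchable_page_stream(width, height, words):
    """Build a content stream: scanned image plus an invisible text layer.

    words is a list of (x, y, font_size, text) tuples in PDF points,
    roughly as an OCR engine might report them (hypothetical layout).
    """
    ops = [
        # Paint the scan (assumed to be registered as /Im0) over the whole page.
        "q %g 0 0 %g 0 0 cm /Im0 Do Q" % (width, height),
        "BT",
        "3 Tr",  # text rendering mode 3 = invisible, but still searchable
    ]
    for x, y, size, text in words:
        ops.append("/F1 %g Tf" % size)            # font (assumed name) and size
        ops.append("1 0 0 1 %g %g Tm" % (x, y))   # absolute text position
        ops.append("(%s) Tj" % pdf_literal(text)) # the hidden word itself
    ops.append("ET")
    return "\n".join(ops)


print(searchable_page_stream(595, 842, [(72, 770, 12, "FANUC (Wire-cut)")]))
```

This would also explain why the text can be pulled out separately from the grubby background: the two are stored as separate layers on the page.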

JS has done an amazing job cleaning up those pdfs but he's not yet revealed to me how he did it.

I've spent some time this evening playing with a scanned and OCR'd copy in Word format - which was actually amazingly good, but with the expected errors where the characters aren't legible. However, the way it has done its formatting and layout is a nightmare to edit - I still can't get all the numbered sub-paragraphs lined up :( It would probably be quicker just to generate a new Word document and then convert it to pdf rather than fathom the program's peculiarities!
Andrew Mawson
East Sussex

Offline John Stevenson

Re: Any .pdf experts on the forum ?
« Reply #11 on: January 03, 2015, 06:06:17 PM »
A bit like those raster to vector drawing programs you get. Far quicker to actually draw over the top.

I took the pdfs Andrew sent, brought them into FineReader V11 Pro and basically OCR'd them again. As you would expect, loads of dross got thrown up. FR11 isn't bad at recognising, but you do have to hold its hand and say delete, ignore, delete etc.

The only problem then is that, like Word but nowhere near as bad, it presents its own formatting problems: the page numbers at the end of index lines don't line up, and putting in a few spaces or .... then totally shags the whole document up, not to put too fine a point on it.

Soooooooooooooo, save as pdf, import into Serif Page Plus V8 which can open native pdf's and using PP desktop publishing features you can get the spacing right etc, etc and save as a PDF.

The reason you can't go into PP straight away is that it saves as imported, and Andrew's pages had a yellow and thumbprint background colour, but FR11 can save as just text and omit the background if you want.
John Stevenson

Offline Swarfing

  • Sr. Member
  • ****
  • Posts: 417
  • Country: gb
Re: Any .pdf experts on the forum ?
« Reply #12 on: January 03, 2015, 06:10:34 PM »
Can I suggest booting a Linux distro and using one of the PDF readers from there? I was able to clean up a pdf with the same problem a few years ago. I can't remember the reader/writer I used, but a few of them come with tools to do it. I was able to remove the watermarks added to the files as well.
Once in hole stop digging.

Offline SwarfnStuff

  • Hero Member
  • *****
  • Posts: 588
  • Country: au
Re: Any .pdf experts on the forum ?
« Reply #13 on: January 04, 2015, 12:35:32 AM »
For what it's worth, from my very limited use of OCR software: if you can scan your page and save it as a BMP, you can then use whatever image software you prefer to clean it up to your liking. Re-save in BMP format and your OCR should recognise it as a clean page. (At least mine did.) Perhaps JPEG would work similarly; I have not tried that.
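
The clean-up step can be as crude as thresholding: anything lighter than some grey level becomes white and everything else becomes black. A minimal Python sketch on plain RGB tuples, with no image library assumed. The threshold of 128, the sample colours and the function names are guesses for illustration and would need tuning on a real scan:

```python
def luminance(rgb):
    """Approximate perceived brightness (0-255) using Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def drop_background(pixels, threshold=128):
    """Map each RGB pixel to pure white or pure black.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    """
    white, black = (255, 255, 255), (0, 0, 0)
    return [[white if luminance(p) >= threshold else black for p in row]
            for row in pixels]


yellowed = (230, 220, 120)   # faded yellow paper (made-up value)
smudge = (180, 170, 100)     # grubby finger mark (made-up value)
ink = (40, 35, 30)           # printed text (made-up value)

page = [[yellowed, ink, smudge]]
print(drop_background(page))
```

With these sample values the paper and the smudge both come out white and the ink comes out black; a darker smudge would need a lower threshold or hand editing.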

This coming from a bloke whose scanner is neither recognised nor supported by Windows 7. Linux, in its Mint version, however, has no problem with it. Why ditch a perfectly good (for my use) scanner just to please Windows?  :Doh:

John B
Converting good metal into swarf sometimes ending up with something useful. ;-)

Offline awemawson

Re: Any .pdf experts on the forum ?
« Reply #14 on: January 04, 2015, 07:25:34 AM »
Well thank you all for your input on this - so now I've solved the 'drop the grubby background' issue and just need to resolve the rotation and cropping to A4 issue on about four pages.

This is the cleaned up pair of pages that I first posted:

Andrew Mawson
East Sussex

Offline Pete.

  • Hero Member
  • *****
  • Posts: 1075
  • Country: gb
Re: Any .pdf experts on the forum ?
« Reply #15 on: January 04, 2015, 07:08:28 PM »
Nice work John. Didn't expect it to come out that clean :bow:

Offline awemawson

Re: Any .pdf experts on the forum ?
« Reply #16 on: January 05, 2015, 03:08:00 AM »
I confess (and hope that I don't offend John) that those are not the ones John cleaned. His were very good, but the page numbers and paragraph indents weren't quite bang on.

In the end I re-scanned the pages into Microsoft Word 2007, then spent a couple of days re-learning how to set and clear tabs and indents, and manually sorted out the formatting. Having done that, I printed out fair copies and again scanned them into searchable PDFs using the original software.
Andrew Mawson
East Sussex

Offline John Stevenson

Re: Any .pdf experts on the forum ?
« Reply #17 on: January 05, 2015, 05:10:42 AM »
No, Andrew's final copies were better than mine.
Mine were clean, but the page numbers were out as regards column position. I've had it many times before: if you do ..........11 it finishes in a different place than ..........99 because of the width of the numbers.
There should be an easier way in programs to centre justify columns.
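
In a fixed-width font, at least, the fix is to pad every line to one total width so that the last digit of the page number always lands in the same column. With proportional fonts the characters differ in width, so the equivalent there is a right-aligned tab stop with a dot leader rather than typed dots. A small Python sketch of the fixed-width case; the index entries and the width of 40 are invented for illustration:

```python
def index_line(title, page, width=40):
    """Return 'title....page' padded with dots to exactly `width` characters."""
    number = str(page)
    dots = width - len(title) - len(number)
    if dots < 1:
        raise ValueError("title too long for the chosen width")
    return title + "." * dots + number


# Hypothetical contents-page entries.
for title, page in [("General", 1), ("Wire threading", 11), ("Maintenance", 99)]:
    print(index_line(title, page))
```

Every line printed is 40 characters long, so a one-digit and a two-digit page number end at the same position.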

Also many programs, Word especially, put in hidden justification characters that then push columns further out of line.

And before anyone kicks off it's not a linux / windows issue but a program issue.
John Stevenson

Offline djc

  • Jr. Member
  • **
  • Posts: 85
Re: Any .pdf experts on the forum ?
« Reply #18 on: January 10, 2015, 12:12:00 AM »
Quote from: awemawson
...just need to resolve the rotation...

Sorry for the late post, but I didn't see it mentioned previously.

Look for a program called PDFsam (PDF Split and Merge). Written in Java, it runs on most platforms and is free. It splits pdfs, extracts individual pages or odd/even pages, rotates pages, reorders documents and joins pdfs. Very highly recommended.

Offline awemawson

Re: Any .pdf experts on the forum ?
« Reply #19 on: January 10, 2015, 03:52:48 PM »
Thanks for that - I'll download it and have a try
Andrew Mawson
East Sussex

Offline Arbalist

Re: Any .pdf experts on the forum ?
« Reply #20 on: January 10, 2015, 07:44:12 PM »
As said earlier, PDF is built into the Mac OS X software, but I didn't know until recently that you could combine several PDFs into one document, very handy!

Offline awemawson

Re: Any .pdf experts on the forum ?
« Reply #21 on: January 11, 2015, 04:15:06 AM »
That's OK if you have a Mac  :clap:

There must be something built into each pdf page denoting orientation AND rotation. Different pdf editors produce different results, but all so far end up getting it wrong when the correctly rotated page is brought back into the body of the document - it's either upside down or still rotated :bang:
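
For reference, the PDF format does record this per page, as far as I understand it: each page has a /MediaBox (the page rectangle, in points) and an optional /Rotate entry (0, 90, 180 or 270 degrees clockwise, applied when the page is displayed). Some tools rotate a page simply by changing /Rotate, while others leave /Rotate alone and turn the page contents instead, which may be why running one utility after another gives upside-down pages. A small Python sketch of how the two values combine; the function names are illustrative and not taken from any PDF library:

```python
A4_PORTRAIT = (0, 0, 595, 842)  # /MediaBox in points (1/72 inch)


def displayed_size(mediabox, rotate=0):
    """Return (width, height) as a viewer would show the page."""
    if rotate % 90 != 0:
        raise ValueError("/Rotate must be a multiple of 90")
    x0, y0, x1, y1 = mediabox
    width, height = abs(x1 - x0), abs(y1 - y0)
    # A quarter turn either way swaps width and height.
    if rotate % 180 == 90:
        return height, width
    return width, height


def orientation(mediabox, rotate=0):
    """Classify the displayed page as 'portrait' or 'landscape'."""
    width, height = displayed_size(mediabox, rotate)
    return "landscape" if width > height else "portrait"


def rotate_page(rotate, degrees):
    """New /Rotate value after turning the page clockwise by `degrees`."""
    return (rotate + degrees) % 360


print(orientation(A4_PORTRAIT, 0))    # page stored and shown upright
print(orientation(A4_PORTRAIT, 90))   # same box, shown on its side
print(rotate_page(270, 180))          # half turn applied to a rotated page
```

So a page can be "A4 portrait" in its /MediaBox and still display as landscape, and a tool that resizes pages by the /MediaBox alone can disagree with one that honours /Rotate.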

I've downloaded PDFsam and am trying to get to grips with it - not the most user friendly program I've ever met  :clap:

I must get to grips with the format of pdfs - it seems it's rather complex.  :bugeye:
Andrew Mawson
East Sussex

Offline vtsteam

Re: Any .pdf experts on the forum ?
« Reply #22 on: January 11, 2015, 10:32:37 AM »
Andrew, here's a list that might have something in it of use:

http://www.ubuntugeek.com/list-of-pdf-editing-tools-for-ubuntu.html

It says Ubuntu, but many of the tools are available for other Linux distributions and Windows.
I love it when a Plan B comes together!
Steve
https://www.youtube.com/watch?v=4sDubB0-REg