Any .pdf experts on the forum ?
awemawson:
The structure of a PDF file is a mystery to me. As I understand it, the "scanned to searchable" PDFs are just an image, not characters as such, so how they can then be searchable I don't know.
JS has done an amazing job cleaning up those PDFs but he's not yet revealed to me how he did it.
I've spent some time this evening playing with a copy that was scanned and OCR'd into Word format, which was actually amazingly good, with the expected errors where the characters aren't legible. However, the way it's done, its formatting and layout are a nightmare to edit - I still can't get all the numbered sub-paragraphs lined up :( It would probably be quicker just to generate a new Word document and then convert it to PDF rather than fathom the program's peculiarities!
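On the "how can an image be searchable" question above: the usual arrangement, as far as I know, is that the OCR program paints the scanned image on the page and then lays the recognised words over it in invisible text, so search and copy work on text you never see. Below is a rough Python sketch of what such a page's drawing instructions might look like. The function name, the image name Im0 and the font name F1 are made up for illustration, and a real file needs a lot more around it (fonts, the image data, the page objects).

```python
def searchable_page_stream(image_name, width_pt, height_pt, words):
    """Build a sketch of a PDF page content stream: scan image plus hidden text.

    words is a list of (text, x, y, font_size) tuples in PDF points.
    image_name and the font name F1 are hypothetical resource names.
    """
    lines = [
        "q",                                    # save graphics state
        f"{width_pt} 0 0 {height_pt} 0 0 cm",   # scale the image to fill the page
        f"/{image_name} Do",                    # paint the scanned image
        "Q",                                    # restore graphics state
        "BT",                                   # begin text
        "3 Tr",                                 # text rendering mode 3: invisible
    ]
    for text, x, y, size in words:
        # Backslashes and parentheses must be escaped inside a PDF string.
        escaped = (
            text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        )
        lines.append(f"/F1 {size} Tf")          # select font and size
        lines.append(f"1 0 0 1 {x} {y} Tm")     # position the word over the image
        lines.append(f"({escaped}) Tj")         # show (invisibly) the word
    lines.append("ET")                          # end text
    return "\n".join(lines)


if __name__ == "__main__":
    print(searchable_page_stream("Im0", 595, 842, [("Index", 72, 700, 12)]))
```

The point is only that the picture and the searchable words are two separate layers, which is also why a re-OCR can replace the text without touching the image, or keep the text and drop the image.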
John Stevenson:
A bit like those raster to vector drawing programs you get. Far quicker to actually draw over the top.
I took the PDFs Andrew sent, brought them into Fine Reader V11 Pro and basically OCR'd them again. As you would expect, loads of dross got thrown up. FR11 isn't bad at recognising, but you do have to hold its hand and say delete, ignore, delete etc.
The only problem then is that, like Word but nowhere near as bad, it presents its own formatting problems: the page numbers at the end of index lines don't line up, and putting in a few spaces or .... then totally shags the whole document up, not to put too fine a point on it.
Soooooooooooooo, save as PDF, import into Serif Page Plus V8, which can open native PDFs, and using PP's desktop publishing features you can get the spacing right etc., then save as a PDF again.
The reason you can't go into PP straight away is that it saves the pages as imported, and Andrew's pages had a yellow, thumbprinted background, whereas FR11 can save just the text and omit the background if you want.
Swarfing:
Can I suggest booting a Linux distro and using one of the PDF readers from there? I was able to clean up a PDF with the same problem a few years ago. I can't remember the reader/writer I used, but a few of them come with tools to do it. I was able to remove the watermarks added to the files as well.
SwarfnStuff:
For what it's worth, from my very limited use of OCR software: if you can scan your page and save it as a BMP, you can then use whatever image software you prefer to clean it up to your liking. Re-save in BMP format and your OCR should recognise it as a clean page. (At least mine did.) Perhaps JPEG would work similarly; I have not tried that.
This is coming from a bloke whose scanner is neither recognised nor supported by Windows 7. Linux Mint, however, has no problem with it. Why ditch a perfectly good (for my use) scanner just to please Windows? :Doh:
John B
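The clean-up step described above is done by hand in an image editor, but one common way to automate the "get rid of the grubby background" part is a simple threshold: anything lighter than a chosen grey becomes pure white, anything darker becomes pure black. A minimal Python sketch follows, assuming the page has already been read into rows of grey values from 0 (black) to 255 (white). The function name and the default threshold of 160 are assumptions for illustration, not anything from a particular program.

```python
def clean_page(pixels, threshold=160):
    """Turn a greyscale page into pure black text on a pure white background.

    pixels: list of rows, each a list of grey values 0 (black) to 255 (white).
    threshold: hypothetical cut-off; values at or above it become white.
    """
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in pixels
    ]


if __name__ == "__main__":
    # A tiny "page": yellowish background (about 190-210) with dark ink (30-40).
    grubby = [
        [200, 190, 40],
        [210, 30, 205],
    ]
    print(clean_page(grubby))  # [[255, 255, 0], [255, 0, 255]]
```

The threshold needs tuning per scan: too high and the stains survive as black specks, too low and thin strokes of the text vanish.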
awemawson:
Well thank you all for your input on this - so now I've solved the 'drop the grubby background' issue and just need to resolve the rotation and cropping to A4 issue on about four pages.
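For the rotation and cropping part, the arithmetic is simple enough to sketch in Python. A4 is 210 x 297 mm, and PDF pages are measured in points at 72 to the inch, so A4 comes out at roughly 595 x 842 points. The function names below are made up for illustration, and the sketch assumes the pages are out by a whole quarter turn; a scan that is only slightly skewed needs deskewing in an image editor instead.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72      # default PDF unit: 1 point = 1/72 inch
A4_MM = (210, 297)        # A4 portrait, width x height


def mm_to_points(mm):
    """Convert millimetres to PDF points."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def rotated_size(width_pt, height_pt, degrees):
    """Return the page size after rotating by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("only quarter turns are handled in this sketch")
    if degrees % 180 == 90:
        return (height_pt, width_pt)   # 90 or 270: width and height swap
    return (width_pt, height_pt)       # 0 or 180: size unchanged


def a4_crop_box(page_width_pt, page_height_pt):
    """Return (left, bottom, right, top) of an A4 portrait box centred on the page."""
    a4_w, a4_h = (mm_to_points(mm) for mm in A4_MM)
    left = (page_width_pt - a4_w) / 2
    bottom = (page_height_pt - a4_h) / 2
    return (left, bottom, left + a4_w, bottom + a4_h)


if __name__ == "__main__":
    # A hypothetical oversize landscape scan, turned upright and cropped to A4.
    width, height = rotated_size(900, 650, 90)
    print(a4_crop_box(width, height))
```

If the page is smaller than A4 in either direction, the left or bottom value comes out negative, which is a handy warning that cropping alone won't do and the page needs scaling or padding.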
This is the cleaned up pair of pages that I first posted: