Goal — Copy Text from PDF Scan
If a PDF is created from a computer file then the text is embedded as part of the file. You can simply copy and paste the text from the PDF. But if the PDF is created from a scanned document, then the text in the PDF is essentially a picture and not text that can be copied and pasted. In my case I receive these PDF scans from missionaries’ prayer letters that need to be turned into blog posts or used in newsletters. I want to copy the text without having to retype the whole letter.
I am using Linux as the OS. The main software doing the heavy lifting is Tesseract OCR. There is a Windows version, and you can probably figure out a way to make most of these tools (or equivalents) work in a Windows environment. But, if you are using Windows, you probably don’t do this geeky kind of stuff. You are probably still retyping any document you need to do something like this on.
Besides Tesseract OCR, I am using ImageMagick to do image conversion. They also have a Windows version of their program.
You need to take the original PDF and convert it into an image file using ImageMagick. But it is not as simple as issuing the convert command. You have to give it a couple of other parameters. One is that the image must have an 8-bit color depth or Tesseract will choke on it. The other is that it needs to be rendered at a sufficient dpi (dots per inch). ImageMagick’s convert command will output a 72 dpi file by default. My scanner scans at 300 dpi by default, so I can easily convert the PDF to a 300 dpi image, which is enough to get a decent OCR output.
cd into the directory where your PDF is, or you will need to add the full paths to the following commands.
convert -density 300 file.pdf -depth 8 file.tiff
This command means: use ImageMagick to create a 300 dpi image at a color depth of 8 bits from file.pdf and save it as file.tiff in the current folder.
Run Tesseract OCR on file.tiff
tesseract file.tiff OutputFileName
This command means: do OCR (optical character recognition) using Tesseract on file.tiff and write the result to a file called OutputFileName.txt in the same folder. Tesseract adds the .txt extension itself, so you only give it the base name.
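The two commands above can be chained so that one call does both steps. What follows is a minimal bash sketch under a few assumptions: the function name pdf_to_text is made up for illustration, the text file is named after the PDF, and the intermediate TIFF is deleted once Tesseract has finished.

```shell
#!/usr/bin/env bash
# Sketch: turn a scanned PDF into a .txt file in one step.
# pdf_to_text is a hypothetical helper name, not part of ImageMagick or Tesseract.
pdf_to_text() {
    local pdf="$1"
    local base="${pdf%.pdf}"     # file name without the .pdf extension
    local tiff="${base}.tiff"    # intermediate image

    if [ ! -f "$pdf" ]; then
        echo "No such file: $pdf" >&2
        return 1
    fi

    # Same options as the manual command: 300 dpi, 8-bit depth.
    convert -density 300 "$pdf" -depth 8 "$tiff" || return 1

    # Tesseract appends .txt to the output base name on its own.
    tesseract "$tiff" "$base" || return 1

    # Assumption: the intermediate image is no longer needed.
    rm -f "$tiff"

    # Print the name of the text file that was written.
    echo "${base}.txt"
}
```

After sourcing the file that holds this function, running pdf_to_text file.pdf should leave file.txt next to the PDF and print its name.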
I plan to turn this into a Python script to reduce this to a single step [it became a bash script instead]. I am learning Python at the moment and don’t know all the pieces I need to make the script. But, ultimately, I will use Python to do a RegEx (regular expression) find and replace on the ends of lines so that paragraphs are maintained in the final output and there are not a large number of forced line breaks. Then I can just open the .txt file in a text editor and copy and paste the contents into the website.
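The line-break cleanup can also be done without Python. Here is a rough sketch using awk in paragraph mode, assuming the OCR output separates paragraphs with blank lines (check your own files, since that depends on the scan). The function name join_paragraphs is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: join the forced line breaks inside each paragraph into spaces,
# keeping one blank line between paragraphs.
# join_paragraphs is a hypothetical helper name.
join_paragraphs() {
    # RS=""   : awk paragraph mode, records are separated by blank lines
    # $1 = $1 : rebuild the record so every run of whitespace and newlines
    #           becomes a single space
    # ORS     : write a blank line after each paragraph
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { $1 = $1; print }' "$@"
}
```

Used as join_paragraphs OutputFileName.txt > clean.txt, where clean.txt is just an example name. It also reads from standard input when no file is given. Note that it collapses all spacing inside a paragraph, so any intentional indentation is lost.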
[The blog post from Kiirani that put me on the right track.]
5 thoughts on “Use Tesseract OCR with PDF File”
To install Tesseract OCR on Debian type this in a command line:
sudo apt-get install tesseract-ocr
You can try this free online OCR tool; it can save the recognized text to a searchable PDF file.
I may try that sometime. My method has been working perfectly for me. Takes me just a few seconds to get all the text I need now that I have figured out my workflow.
Future Project ??
sudo apt install ocrmypdf
ocrmypdf input.pdf output.pdf