Use Tesseract OCR with PDF File

Goal — Copy Text from PDF Scan

If a PDF is created from a computer file then the text is embedded as part of the file. You can simply copy and paste the text from the PDF. But if the PDF is created from a scanned document, then the text in the PDF is essentially a picture and not text that can be copied and pasted. In my case I receive these PDF scans from missionaries’ prayer letters that need to be turned into blog posts or used in newsletters. I want to copy the text without having to retype the whole letter.

Setup Information

I am using Linux as the OS. The main software doing the heavy lifting is Tesseract OCR, which also has a Windows version. You can probably figure out a way to make most of these tools (or equivalents) work in a Windows environment. But, if you are using Windows, you probably don’t do this geeky kind of stuff. You are probably still retyping any document you need to do something like this on.

Besides Tesseract OCR, I am using ImageMagick to do image conversion. They also have a Windows version of their program.


You need to take the original PDF and convert it into an image file using ImageMagick. But, it is not as simple as issuing the convert command. You have to give it a couple of other parameters. One is that the image must have an 8-bit color depth or Tesseract will choke on it. The other is that it needs to be rendered at a sufficient dpi (dots per inch). ImageMagick’s convert command will output a 72 dpi file by default. My scanner scans at 300 dpi by default, so I can easily convert the PDF to a 300 dpi image, which is enough to get a decent OCR output.
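To see why the dpi setting matters, here is a small sketch of the arithmetic. The page size is an assumption on my part (a US Letter page, 8.5 x 11 inches), and pixel_size is just a hypothetical helper name.

```python
def pixel_size(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page rendered at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumption: a US Letter page (8.5 x 11 inches).
print(pixel_size(8.5, 11, 72))   # (612, 792)
print(pixel_size(8.5, 11, 300))  # (2550, 3300)
```

At 300 dpi each letter on the page gets roughly four times as many pixels in each direction as at 72 dpi, which gives the OCR engine much more to work with.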


cd into the directory where your PDF is, or add the paths to the following commands.

Convert PDF

convert -density 300 file.pdf -depth 8 file.tiff

This string equals: use ImageMagick to create a 300 dpi image at a color depth of 8 bits from file.pdf and save it as file.tiff in the current folder.

Run Tesseract OCR on file.tiff

tesseract file.tiff OutputFileName

This string equals: Do OCR (optical character recognition) using Tesseract on file.tiff and output it to a file called OutputFileName.txt in the same folder.
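The two commands above could be wrapped in Python roughly like this. It is only a sketch: the function names are hypothetical, and it assumes the convert and tesseract programs are installed and on your PATH. The arguments are exactly the ones used in the two commands above.

```python
import subprocess
from pathlib import Path


def convert_command(pdf, tiff, dpi=300, depth=8):
    """Build the ImageMagick command from the post as an argument list."""
    return ["convert", "-density", str(dpi), str(pdf),
            "-depth", str(depth), str(tiff)]


def tesseract_command(tiff, output_base):
    """Build the Tesseract command; Tesseract adds .txt to output_base."""
    return ["tesseract", str(tiff), str(output_base)]


def ocr_pdf(pdf, output_base):
    """Run both steps and return the path of the text file Tesseract writes."""
    pdf = Path(pdf)
    tiff = pdf.with_suffix(".tiff")  # file.pdf -> file.tiff
    # check=True raises an error if either program exits with a failure code.
    subprocess.run(convert_command(pdf, tiff), check=True)
    subprocess.run(tesseract_command(tiff, output_base), check=True)
    return Path(str(output_base) + ".txt")


# Example call (needs both programs installed):
# ocr_pdf("file.pdf", "OutputFileName")
```

Passing the commands as lists instead of one long string means file names with spaces in them do not need any special quoting.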

Future Project

I plan to turn this into a Python script to simplify this into a single step. I am learning Python at the moment and don’t know all the pieces I need to know to make the script. But, ultimately I will use Python to do RegEx (regular expression) find and replace on the ends of lines so that paragraphs are maintained in the final output and there are not a large number of forced line breaks. Then I can just open the .txt file in a text editor and copy and paste the contents into the website.
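A rough sketch of what that find and replace might look like is below. It assumes that paragraphs in the OCR output are separated by a blank line and that every other line break is a forced one. The reflow name is hypothetical.

```python
import re


def reflow(text):
    """Join hard-wrapped lines into paragraphs, keeping blank-line breaks."""
    # Normalize Windows line endings.
    text = text.replace("\r\n", "\n")
    # Drop trailing spaces so "blank" lines are truly empty.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Replace a lone newline (no newline before or after it) with a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse any runs of spaces left over from the join.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "Dear friends,\nthank you for\npraying.\n\nWe arrived\nsafely.\n"
print(reflow(sample))
```

The sample prints two paragraphs with a blank line between them. Words that were hyphenated across a line break are not rejoined here. That would need another pattern and some care, since real hyphens also show up at the ends of lines.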

[The blog post from Kiirani that put me on the right track.]

April Fool’s Day Fun

Upside Down Day

I’ve eaten leftover pizza for breakfast many times. But yesterday, my wife made pizza for breakfast. She didn’t do any crazy toppings on it. Just the fact that we had fresh baked pizza for breakfast was enough. She even served me a glass of Diet Dr Pepper to go with the pizza. The kids loved it.

Picture of a slice of pizza

Picture of breakfast food

My mom asked if that meant we would have cereal for supper. I told her I did not know, but that my wife had sent lunch to the office with instructions to not peek until lunch time. Well, that is when the cereal came out. Everyone at the office enjoyed the story of pizza for breakfast which explained the strange lunch.

For supper last night we had eggs, sausage, hash browns, and cinnamon rolls. Yum!

But the biggest prank of the day was what happened at the office.


It all started with an innocent phone call on Monday evening from our African Director in the office. I had been in the office late that night so he called to see if I was still there. I wasn’t but my wife and I were headed back to the office area later in the evening for a friend’s violin recital. I said I could take care of whatever he needed.

That is how the plan to trash the financial secretary’s office came about.

The Deed

Picture of an office with papers on the floor

We (my wife and I) took a couple of thick file folders from my office and tossed the contents all over the floor of the financial secretary’s office. We then opened several file drawers and randomly pulled file folders out of place so it looked like the files on the floor came out of her filing cabinets. Chairs and lamps were overturned, and desk drawers were pulled open. Without getting violent and breaking anything, it looked pretty well tossed.

The big boss in the office is staying in a trailer on site. Since the financial secretary usually gets to the office before everyone else, we thought it was best to alert the boss to a possible frantic visit early on April 1st.

Show Time

When the financial secretary called me—why did she suspect me in the first place?—yesterday morning she was very calm. “Peach, are you messing with me?” I innocently responded that I might be, but I needed to know more information as to what the accusation was about. She told me about her office and I asked her if the alarm had gone off. She said she didn’t think so since she had to turn it off when she came into the building (I didn’t want the building to be un-protected for a simple prank). I then told her she should call the African director to see if the alarm company or police had called him because of the alarm going off in the night. She was already at the boss’ trailer and he denied any knowledge of the events.

Picture of me with an ice cream bucket on my head

Because she had convinced herself that I had done something to her office, she was pretty calm when we started our phone conversation. However, by the time she got off the phone with me and got ready to call the African director, she was fairly well worked up. I’m glad I wasn’t there in person or I could not have kept up the ruse.

I heard that the conversation between the financial secretary and the African director resulted in me getting ratted out pretty quickly. He could not string her along much further without fear of serious retribution (which I still somewhat expect).

The brains of the operation—I’m just the muscle—called me after he got through talking with her and we had a good laugh. After giving her a few minutes I sent a text message asking if she was OK. Her reply was, “Yes and congrats to the best April fools joke on the planet.” Then she proceeded to tell me why it was a terrible day to pull a joke on her (first of the month is the busiest day for her outside of the day she has to close out the books for the month). However, she was relieved to find out that all the paper on the floor belonged to me and that she did not have to figure out where it all belonged.

Needless to say, when I went into the office yesterday, I felt like I needed to protect myself. As defense I wore an ice cream bucket on my head and safety goggles. I also carried a large wooden stick in case I needed to get aggressive in protecting myself.

I am almost afraid to go into the office today.

Strange Dream About a Stolen Hard Drive

A few nights ago I had a strange dream, a nightmare really. It was that someone had broken into my house and stolen a hard drive from one of my computers. The strange thing is that they did not touch my newer computers. They didn’t even go to the computers that had really important data on them. They went for the hard drive on my favorite computer.

The machine they targeted—which happens to be the one I am using to write this—is a 6 year old notebook computer on which I have recently wiped the drive and installed the latest version of Linux Mint. I have an almost empty drive with 220 GB of the 250 GB hard drive free. On top of that, I am using this computer as a cloud based system. All of my files on this machine are being saved to either Google Drive or Dropbox. There is nothing that lives exclusively on this machine. So if it were stolen or busted, I would lose nothing.

The Dream

I am not sure what kind of anxiety I was having with this computer that might have caused the targeted nature of this dream.

The bad guys broke into the house and broke open the computer. It was almost like they went in from the keyboard side of the notebook and ripped the hard drive out violently. There was a rectangular hole in the keyboard where the hard drive used to be. Strange dream.

The Result

But it affected me in an odd way. I immediately started thinking about my backup plan and password strategy. Over the next two days I ended up writing three articles at Missionary Geek about passwords and protecting them.

Image of a lock and the word passwords

The first is about how to build good unique passwords that you can remember but that are complicated enough to be secure. You have to avoid words that are found in a dictionary and you should not use the same password at multiple sites.

The second article was about using password managers. I use LastPass, but there are other good password managers. Since my wife and I share multiple accounts, we also share our LastPass vault. When my friend’s wife died suddenly last year, I thought much about the nightmare it would be for my wife to try and get into accounts to either take possession of them herself, or close them down. Because of sharing a LastPass vault, she is able to get into any of my 120+ accounts that are managed there. Besides that convenience, password managers help you generate unique passwords that are stronger than you would probably make on your own since there is no need to memorize 120+ passwords anymore.

The final article in the series is written for those who travel and have to use computers that they don’t completely control. In these situations you should always be leery of keylogging hardware or software. That article has a pretty solid strategy for avoiding having your passwords stolen by keystroke loggers.

I have not come up with a series of articles about backing up your hard drives yet, but in the meantime you should do some reading and put a plan in place if you are not doing something already.

I hope my nightmare can be a help to you in building a better password strategy.

WordPress Dashboard Loses Formatting

View of broken WordPress Dashboard

I have had a problem when setting up WordPress in a test environment where the obvious issue was that the WordPress Dashboard (control panel area) loses its formatting. It looks like the CSS style sheet (is that redundant?) is not being read. I say this is the obvious issue, but there are other problems that I consistently had with this symptom that I hadn’t connected to the same problem.

Here is the other major issue that, if you experience it, would tell you that you are having the same problem as I: when trying to do various tasks on the site you end up with your domain being duplicated and receiving a 404 error. You can still get to just about any page you want, but you have to take out the duplicated URL. Let’s use as the example URL. If you try to log in to the site at you can pull up the page. But when you put in your username/password you will end up at a site like You see that the is duplicated in front of where the system takes you next. If you manually delete everything from the first up to the next and then reload the page, the site will load as expected until you submit more information and it needs to reload the site. It duplicates the URL in front of where it is trying to go.

This is caused by putting an incorrect address in your “WordPress Address URL” within the General tab of the Settings section of the WordPress Dashboard. I had been incorrectly identifying my domain name (in my case Instead of the domain itself you must put in the http:// protocol handler. Or, at least that is true if you are using an IP address like I was. So by putting in everything now seems to be working correctly.
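Here is a sketch of one plausible mechanism for the duplicated address, not a confirmed description of what WordPress does internally. If the site redirects to an address with no http:// in front, a browser treats it as a relative path and tacks it onto the current site. The IP address 192.0.2.10 and the has_scheme helper are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin, urlsplit


def has_scheme(address):
    """Return True if the address starts with an http:// or https:// scheme."""
    return urlsplit(address).scheme in ("http", "https")


# Hypothetical test server address, for illustration only.
login_page = "http://192.0.2.10/wp-login.php"

# Without the scheme, the target is resolved as a relative path.
print(urljoin(login_page, "192.0.2.10/wp-admin/"))
# http://192.0.2.10/192.0.2.10/wp-admin/

# With the scheme, the target is used as written.
print(urljoin(login_page, "http://192.0.2.10/wp-admin/"))
# http://192.0.2.10/wp-admin/

print(has_scheme("192.0.2.10"))         # False
print(has_scheme("http://192.0.2.10"))  # True
```

A check like has_scheme could be run against the address before saving it in the settings.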

I am now able to do testing on my new internal server before I point DNS to the new site and make it go live.

At the recent New Media Expo in Las Vegas, several podcasters got to interview one another and do some cross-promotion. Two of my favorite podcasters bumped into one another: Len Edgerly from The Kindle Chronicles met Mignon Fogarty, a.k.a. Grammar Girl.

Grammar Girl

Grammar Girl’s podcasts are all scripted, which adds a little stiffness to her speech, but has the wonderful benefit of packing a ton of information into a short podcast, something I absolutely love. I have heard Mignon speak without notes before and love the honesty and wonder you can hear in her voice. A highlight for me was when she called me in Argentina and interviewed me for one of her podcasts: Behind the Grammar.

In the episode of The Kindle Chronicles in which Len interviews several people from New Media Expo, he has a short conversation with Grammar Girl. I think this conversation captures their personalities well. The conversation with Mignon starts at 27 minutes and 52 seconds into the recording. Though, if you have the time, really everything from 18:38 with Cliff Ravenscraft until the end of the episode is worth listening to.

Len and I met at South By Southwest a couple of years ago (when I also met the eBook Ninjas). I was already a fan of the podcast, but being able to sit and eat breakfast with him was a great opportunity.

