Use Tesseract OCR with PDF File

Goal — Copy Text from PDF Scan

If a PDF is created from a computer file then the text is embedded as part of the file. You can simply copy and paste the text from the PDF. But if the PDF is created from a scanned document, then the text in the PDF is essentially a picture and not text that can be copied and pasted. In my case I receive these PDF scans from missionaries’ prayer letters that need to be turned into blog posts or used in newsletters. I want to copy the text without having to retype the whole letter.

Setup Information

I am using Linux as the OS. The main software I am using to do the heavy lifting is Tesseract OCR. There is also a Windows version. You can probably figure out a way to make most of these tools (or equivalents) work in a Windows environment. But, if you are using Windows, you probably don’t do this geeky kind of stuff. You are probably still retyping any document you need to do something like this on.

Besides Tesseract OCR, I am using ImageMagick to do image conversion. They also have a Windows version of their program.

Steps

You need to take the original PDF and convert it into an image file using ImageMagick. But, it is not as simple as issuing the convert command. You have to give it a couple of other parameters. One is that the file must be an 8 bit color scheme or Tesseract will choke on it. Also it needs to be scaled up to sufficient dpi (dots per inch). ImageMagick’s convert command will output a 72 dpi file by default. My scanner scans at 300 dpi by default, so I can easily convert the PDF to a 300 dpi image which is enough to get a decent OCR output.

Details

cd into the directory where your PDF is, or add the paths to the following commands.

Convert PDF

convert -density 300 file.pdf -depth 8 file.tiff

This string equals: use ImageMagick to create a 300 dpi image at a color depth of 8 bits from file.pdf and save it as file.tiff in the current folder.

Run Tesseract OCR on file.tiff

tesseract file.tiff OutputFileName

This string equals: Do OCR (optical character recognition) using Tesseract on file.tiff and output it to a file called OutputFileName.txt in the same folder.
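The two commands above can be chained into a single step. What follows is only a minimal sketch, not the bash script I ended up with: the function name pdf_to_text is made up for illustration, and it assumes both convert and tesseract are on your PATH.

```shell
#!/bin/bash
# Sketch: run both steps above on one PDF.
# pdf_to_text is a hypothetical helper name, not part of either tool.
pdf_to_text() {
    local pdf="$1"
    local base="${pdf%.pdf}"    # file.pdf -> file

    # Step 1: render the PDF as a 300 dpi, 8-bit TIFF.
    convert -density 300 "$pdf" -depth 8 "$base.tiff" || return 1

    # Step 2: OCR the TIFF; Tesseract adds .txt to the output name itself.
    tesseract "$base.tiff" "$base" || return 1
}
```

Calling pdf_to_text letter.pdf should leave letter.tiff and letter.txt next to the original PDF.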

Future Project

I plan to turn this into a Python script to simplify this into a single step [it became a bash script instead]. I am learning Python at the moment and don’t know all the pieces I need to know to make the script. But, ultimately I will use Python to do RegEx (regular expression) find and replace on the end of lines so that paragraphs are maintained in the final output and there are not a large number of forced line breaks. Then I can just open the .txt file in a text editor and copy and paste the contents into the website.
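The line-break cleanup does not strictly need Python. Here is a rough sketch that uses awk in place of the regular expression find and replace. It assumes the OCR output separates paragraphs with at least one blank line, which may not hold for every scan, and the name unwrap_paragraphs is made up for illustration.

```shell
# Sketch: unwrap OCR text so each paragraph becomes one long line.
# unwrap_paragraphs is a hypothetical name; it reads stdin and writes stdout.
unwrap_paragraphs() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" }   # blank lines separate paragraphs
         { gsub(/\n/, " "); print }        # join the lines inside a paragraph
    '
}
```

Something like unwrap_paragraphs < OutputFileName.txt > unwrapped.txt would then give a file that pastes cleanly into a web editor (unwrapped.txt is just an example name).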

[The blog post from Kiirani that put me on the right track.]

Why DRM Frustrates Legitimate Users

I have never been a fan of DRM (Digital Rights Management). This is the system that is supposed to stop people from illegally sharing files across the Internet. I don’t know of any DRM that has completely stopped file sharing. It is trivial to do a search on the Internet for the file you want and download it. DRM hasn’t accomplished its goals.

It has, however, managed to frustrate and punish legitimate users. Here is my story of a book I acquired legitimately yet find almost impossible to enjoy, to the point that I have stopped reading it.

A Trip to Amazon

I love Amazon. I am a Prime member. I buy what I can at the site when it is cheaper, which it isn’t always. I also love my Kindle. I love their customer service. But this isn’t about Amazon, it is about a book I wanted to buy there.

The book is The $100 Startup. I have heard several podcasters talk about the book and I have read quite a few reviews. I took a trip to Amazon to get the book for my Kindle. A few things stood out as soon as I got there. First the hardcover version of the book is only $13! That’s a great price for all that paper and ink. Independent bookstores are selling the book for less than $11 through the Amazon marketplace. That’s an even better deal!

But I don’t want paper, I want a Kindle version. I know plenty of people say that the ebook version is not as good as paper. I used to be one of those. However, I now see the digital version as being superior. I can highlight passages and take notes on my Kindle. I can then view all my notes and highlights online and use that information anywhere whether my Kindle is with me or not. Try that with paper. One other thing about digital is that I am paying for the content and not the paper.

Checking out the Kindle version of the book I saw the price was $11.99. That’s more than what I could buy a paper and ink version of the book for. I then noticed that Random House was the publisher. They have a history of setting prices at Amazon for their ebooks. Though they are not part of the Department of Justice’s lawsuit against Apple and five publishers for collusion, the result is the same in that they set the price of their books and not Amazon.

I am spoiled by Amazon’s price of $9.99 for Kindle books and I don’t like paying more than that. Rarely do I even pay that much for a book since I can often find good sales on books I want. And I certainly don’t like paying just $1.15 less than the hardback version of the book (or less if I buy it from a third party).

A Trip to the Library

I looked up the book at our local library hoping to score a copy for free. I did not find a physical copy there, but they offer it through their digital library system which is handled by Overdrive. “Great!”, I thought. That would be even better. I can take notes on my Kindle and have the book in a format I prefer.

When I got home I logged into the digital library system and found the book. Disappointingly it was only available in EPUB format and not the Kindle format. I didn’t think it would be that big of a deal to get the book as an EPUB and then convert it to .mobi (the format for the Kindle). After waiting in line for a couple of weeks to get the notification that it was my turn to borrow the book (yes, you still have to wait in line for other people to “finish” reading the book and “return” it to the library), I eagerly downloaded the book to my computer.

Opening the Book

The file wouldn’t open in anything I had as a reader. The file was DRMed with the Adobe Digital Editions system (ADE). This means you have to have some type of approved reader that will allow you to authenticate with an ADE account. There is no such reader available for Linux that I could find. So no way to read the book on my computer or convert it (without breaking the DRM and facing prison time for a DMCA violation).

Here is the problem with DRM. I legally obtained the book. I have done nothing inappropriate to acquire the book. Yet, because of DRM I am not allowed to read the book on the hardware I have. From my understanding, if I had a Barnes and Noble Nook eReader which has ADE on it, I still would not be able to transfer the file through my computer because I am running Linux. The file I got from the library was not the book itself that could be placed on the Nook. It was an authentication file that has to be approved by Adobe which then lets me download the book to place on the reader. All of which would have been impossible as a Linux user.

Using My Phone

I downloaded the Overdrive Media Console (the Overdrive ebook reader) for my Android phone. Thankfully I could download the book using Overdrive’s software. I even started reading the book.

The reading experience on Overdrive’s Media Console was worse than a paper book for me. I have not found any way to make notes or highlights within the text of the book. Right at halfway through the book the author gives a 39 step checklist. The perfect kind of thing you would want to highlight and save for future reference. I can’t do it. I don’t even have the option of sticking my phone on the copy machine and grabbing the list since there are so few words that appear on a page with such a small screen. The list takes up 20 screens worth of text. I don’t want to make 20 pages worth of copies to get this seemingly valuable list.

On top of that, almost every time I open the book using Overdrive’s software it opens to the page previous to the one I was reading when I stopped. I say “almost every time” because 3 times so far I have been returned to the start of a chapter and had to click through several pages before getting back to where I left off.

The app is slow too. It takes 20 seconds to open the book. Then each time I change chapters it takes 20 seconds to load in the next chapter. That is just opening the book once the software is running. My Kindle takes just under 2 seconds to go from an off state to reading a book.

My solution? I am giving up on The $100 Startup. Chris, I am sure your book is a fine one. I have heard you interviewed by several podcast hosts that I respect; however, to legally read your book within my requirements of price and convenience I just can’t do it. I spent 2 weeks waiting for the book from the library. I have had the book for 11 days and am frustrated by the reading experience (which has little to do with the quality of the book). I’m done with it.

Circumventing DRM

I will admit that I did a little digging into the process of breaking the DRM on the book. It seems trivial. I have never done it on an EPUB, but I have converted a few Kindle titles that I own that I wanted to read on another device. For the Kindle books I have converted, it takes importing the book into Calibre with some special plugins and clicking a button.

For The $100 Startup it only took a few seconds on Google to find a Kindle-formatted copy on the Internet for free. I could illegally obtain the book for my Kindle with much less hassle than the legally obtained DRM version of the book. Plus I would have a much better reading experience. However, I won’t do that. I am happy to pay the author for the content at a fair price (as determined by me). What I don’t want to do is pay a publishing company essentially the same price for the content that they are charging for the content, paper, ink, pretty cover and something I can put on my bookshelf.

Again, I don’t mind paying the author for the content. The truth is though, with the pitiful amount he will be paid by the publishing company for each copy sold, he could probably self publish the Kindle version, sell it for $3 and make 250% more per copy than he does currently. This sounds like it would be more in keeping with the spirit of a $100 startup than using a traditional publisher that has no interest in the author–only in their pocketbooks.

What are your thoughts on DRM?

A Bug in Android App Lock That Saved Me

I have a wonderful program that I use to secure certain applications on my Android phone from prying eyes. It is called App Lock. It is a simple screen that comes up prompting for a passcode when trying to access certain applications. I like this for the simple fact that I can secure some programs without locking down the whole phone.

Up until this morning the App Lock app had worked without any problems. But today I turned on the Accessibility features of my Android phone to play with a new keyboard. When the Accessibility screen comes on it puts a layer over the bottom half of the screen which lets you navigate the device with gestures. In doing so, when I tried to go back to the settings in my phone to turn off the Accessibility option, I could no longer press the numbers on the number pad. For some reason the passcode screen would not move up from behind the gesture screen to allow me to put in the numbers.

I thought that I was locked out of my phone and would have to somehow wipe the system and start over. That was not a prospect I was looking forward to.

In my research to find a solution I came across a security flaw in the App Lock software. This is a serious flaw and I assume that the App Lock guys will work to fix the problem. Until then, maybe this will help someone else get control of their device again. I realize that bad people could get this information and use it to exploit someone’s phone. I regret that the possibility exists, but I am personally thrilled that this security bug saved me from having to rebuild my phone setup from scratch.

Here are the steps that I was able to take to get control of the device again:

  • Turn on App Lock and press Protection list
  • Press the Home button on your phone
  • Turn App Lock back on again
  • Press FAQ
  • Press the back button

This will reveal your list of applications that are blocked and give you a chance to turn the block off. For me that meant that I could remove the block from my Settings menu and make the changes I needed to make. For others this means that their information isn’t as well protected as they would hope.

Obviously the makers of the App Lock software may fix the problem by the time you read this. That is a good thing…unless you are locked out of something because of turning on the Accessibility feature on your Android phone.

Linux Turns 20 Years Old and I Celebrate 10 Years With Linux

I heard on a podcast today that Linux is celebrating 20 years this year. The 0.01 version of the Linux kernel was launched in September of 1991. That makes Linux 20 years old this year.

I first heard about Linux in 1995. I was immediately drawn to it. I think it is something about my personality that wants to do things differently than everyone else. I am usually the first of my friends to try new things. Sometimes those new things become very popular and I have to move away from them to find something different (my recent move from the iPhone to Android). Sometimes that new and different thing never catches on and dies a quick and painless death (Sharp Zaurus which ran Linux).

Linux is one of those things for which I have been able to find a group of sympathetic friends who share my passion and frustration. Linux will probably never be mainstream by itself, but there are some pretty neat technologies that are built on top of Linux. While not strictly Linux, the guts of Mac OS X share the same Unix roots as Linux. Google’s Android platform is built on Linux.

I remember spending hours with the dial-up modem trying to download different distributions of Linux to try out. I would tie up the phone line as soon as I got home from work and leave the connection running all night. If there was ever a need for bittorrent technology it was back in the dial-up days.

From 1997 to 2001 I played with Linux heavily. I was never willing to commit to it as my main OS, but I spent a lot of time with it. I did not trust my work to Linux back in those days, but I probably dedicated more of my computer time to Linux.

In 2001 I took the plunge and loaded Linux as my main OS. Since then I have used it exclusively for work. I continued to dual boot for several years because of having certain games or specific programs that I wanted to use. Until recently I was booting Windows in a virtual machine because there were only one or two programs I wanted to use. Since I loaded the latest version of Kubuntu Linux (11.04) I haven’t even bothered to rebuild my virtual machine.

I am celebrating 10 years of Linux being my main OS and Linux is celebrating 20 years. It has been a fun journey.

Linux Wireless Driver for Gateway 6750

I have consistently had problems getting the wireless driver for my Gateway M-6750 notebook working in Linux. This is because the hardware is designed for Windows. There are no native Linux drivers for it. However, there is a nice little program called ndiswrapper. I had known about this program before getting this computer, but had never had to use it. All my other computers had wireless cards with Linux drivers.

This is not a new problem for me. I have had this problem since I bought the computer in early 2008. But, I go through the process of having to find instructions every time I install a new OS on my computer. Therefore, I am writing down the steps here for my own benefit in the future (assuming I can remember to look at my own blog when I need to do this again).

I got this set of steps from a thread on the Ubuntu Forums. This assumes you already have the Windows driver extracted into a folder and that you are running these commands from that folder. I remember (3.5 years ago) finding the driver and extracting it, but I don’t remember any of the process that I went through to do it. If you are reading this and need help, then you will have to look elsewhere.

lspci -nn
sudo ndiswrapper -i NetMW14x.inf
sudo ndiswrapper -a 11ab:2a08 netmw14x
sudo ndiswrapper -l
sudo ndiswrapper -m
sudo depmod -a
sudo modprobe ndiswrapper

These steps do the following.

lspci -nn gives you the name of your network adapter. In my case it says (along with a pile of other output): 02:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. Device [11ab:2a08] (rev 03). The important thing to note is the Device [11ab:2a08]. If you have a different computer than the Gateway 6750 then your output would be different. You will need to use the device id (known as the devid in ndiswrapper) for your own hardware.
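Picking the device id out of that pile of output by eye is easy to get wrong. As a small sketch, the helper below (pci_ids is a made-up name) prints only the vendor:device pair from each line of lspci -nn style text it is given. It assumes the id is the last bracketed xxxx:xxxx pair on the line, as in the output quoted above.

```shell
# Sketch: print the vendor:device id from each lspci -nn style line on stdin.
# pci_ids is a hypothetical helper name.
pci_ids() {
    # Keep only the bracketed xxxx:xxxx pair; class codes like [0200] do not match.
    sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p'
}
```

For example, lspci -nn | grep -i -e ethernet -e network | pci_ids narrows the list to likely network adapters first. The two filter words are a guess about how your adapter is labeled.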

The next line installs the .inf file for my driver. If you have the same computer, it will be the same thing, but you need to find the .inf file for your hardware.

The following line associates the driver with the particular hardware. This is where using the wrong devid is (apparently) potentially harmful. At least, I gather it is harmful based on the warnings I read.

The -l option gives you a list of drivers you have installed. This should only be the one you just now installed. The -m saves your configuration.

I do not fully understand the depmod command, but it rebuilds the list of kernel module dependencies, which prepares things for the next command: modprobe.

The final command inserts the driver module you created into your system so that it can actually be used.

For me, that was it. I was then able to look into the network manager icon in my system tray and everything worked as expected. Hopefully it works well for you. If not, I am not sure I can be much help. You can dig through the forum post where I got this information and see if you can find help for your specific issue.
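If the network manager shows nothing, one quick thing to check is whether the module actually loaded. This is a small sketch with a made-up helper name, module_loaded. It relies on /proc/modules listing one loaded module per line with the module name first.

```shell
# Sketch: succeed if a kernel module is listed as loaded.
# module_loaded is a hypothetical helper; the optional second argument only
# exists so the check can be pointed at a saved copy of /proc/modules.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

Running module_loaded ndiswrapper && echo loaded prints "loaded" only when the modprobe step worked.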

That got the driver working. Then I needed to modify the /etc/modules file and add the line ndiswrapper to the end of the file. That will insert the module at every boot up.
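Editing /etc/modules by hand works fine. If you would rather script it, here is a minimal sketch that only adds the line when it is not already there, so running it twice does no harm. The helper name add_boot_module is made up, and it needs to be run from a root shell to write to /etc/modules.

```shell
# Sketch: append a module name to a modules file unless it is already listed.
# add_boot_module is a hypothetical helper; the file defaults to /etc/modules.
add_boot_module() {
    local module="$1"
    local file="${2:-/etc/modules}"
    # -x matches the whole line, -F treats the name as plain text.
    grep -qxF "$module" "$file" 2>/dev/null || printf '%s\n' "$module" >> "$file"
}
```

From a root shell, add_boot_module ndiswrapper is the whole step.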

When the computer goes to sleep the ndiswrapper module breaks. It needs to be reloaded. I fixed this by creating a file called /usr/lib/pm-utils/sleep.d/0000wireless. This file contains the following:

#!/bin/sh
# reload ndiswrapper to get wireless to recover properly
case "$1" in
    resume|thaw)
        rmmod ndiswrapper
        modprobe ndiswrapper
        ;;
esac

The file needs to be made executable with a sudo chmod a+x /usr/lib/pm-utils/sleep.d/0000wireless command. The computer will wake from sleep like normal after that.