PDF to TIFF to TXT: Bash Script Automation

A few weeks ago I started a project to turn a scanned PDF image into text. At the time I was doing this manually: I converted the PDF file to a TIFF (image file) and then ran that through the Tesseract OCR engine to output a TXT file. Then I would run some regular expressions on the file to pull out the manual line breaks that the OCR process stuck into it.

Since I was taking a class in Python at the time I thought I would do this using Python. But when I actually started coding the steps, I realized that a bash script would do everything I needed and I already knew how to do most of it. So that is what I ended up with.

#!/bin/bash
if [ $# -lt 1 ]; then
        echo "Usage: $0 {file in.pdf}"
        exit 1
fi
if [ ! -f "$1" ]; then
        echo "Filename given $1 doesn't exist."
        exit 1
fi
convert -density 300 "$1" -depth 8 "$1.tiff"
tesseract "$1.tiff" "$1"
tr '\n' '*' < "$1.txt" | sed 's/\*\*/^^/g' | sed 's/\*/ /g' | sed 's/\^\^/\n\n/g' > "$1.converted.txt"
rm "$1.tiff"
rm "$1.txt"

I will walk you through what I understand about it. I did get help with the regular expression (line 12) and mostly understand it, so here is a stab at helping you see what is going on. I am certainly open to suggestions for making the script better.

To use the script you invoke it with the name of the script and then the name of the PDF like this: pdf2tiff2txt.sh filename.pdf

  • Line 1: This says that the script is a bash script.
  • Lines 2-5: A failsafe to tell the user that they need to type in the command for the script (my script is called pdf2tiff2txt.sh) and then the location of the PDF that is to be converted.
  • Lines 6-9: A failsafe that if a non-existent file name is given then the user knows the file does not exist in the location they indicated.
  • Line 10: Converts the input file into a TIFF with a resolution of 300 dpi and an 8-bit color depth. The output is named after the original file with a .tiff extension added (file.pdf becomes file.pdf.tiff).
  • Line 11: Uses the tesseract OCR engine to convert the file from an image to a text file. The .txt file extension is automatically added by tesseract.
  • Line 12: The magical regex line. All newlines are converted to asterisks. Then double asterisks (the blank lines between paragraphs) are converted to double carets. Single asterisks are converted to spaces. The double carets are converted to double newlines. Finally the new file is output as the original file name plus .converted.txt.
  • Lines 13 and 14: Deletes the intermediary files that were created. I don’t ever have need for these so I can safely delete them. This is the section where I think my script could be improved. I would guess there is a way to create the intermediary files as temporary files that get automatically deleted as soon as they are used.
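On the temporary file idea in the last bullet, here is a minimal sketch of one way it could work. It assumes bash with the standard mktemp and trap tools plus GNU sed, and the function name ocr_pdf is hypothetical (it is not part of my script). The intermediate files go into a scratch directory, and the trap removes that directory when the function finishes, even if one of the steps fails.

```shell
#!/bin/bash
# Sketch only: ocr_pdf is a hypothetical name, not part of the original script.
# The body runs in a subshell ( ... ), so the EXIT trap fires as soon as the
# function finishes and removes the scratch directory automatically.
ocr_pdf() (
    pdf="$1"
    workdir="$(mktemp -d)" || exit 1
    trap 'rm -rf "$workdir"' EXIT

    # Same two conversion steps as lines 10 and 11, but writing into $workdir.
    convert -density 300 "$pdf" -depth 8 "$workdir/page.tiff" || exit 1
    tesseract "$workdir/page.tiff" "$workdir/page" || exit 1

    # Same cleanup pipeline as line 12 (the \n in the last sed assumes GNU sed).
    tr '\n' '*' < "$workdir/page.txt" | sed 's/\*\*/^^/g' | sed 's/\*/ /g' \
        | sed 's/\^\^/\n\n/g' > "$pdf.converted.txt"
)
```

It would be called as ocr_pdf filename.pdf, and only filename.pdf.converted.txt should be left behind, so the two rm lines are no longer needed.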

I am currently using this in production and it does exactly what I need. I am very pleased with the output and want to thank my friend Brett for his help with the regex and the failsafe lines.

Imperial March On A Floppy

This is not a new project to the Raspberry Pi world, but since I gave myself a new Raspberry Pi model A+ for Christmas, I wanted to do a simple and fun project. I have actually been engaged in a couple of more complicated projects on the Raspberry Pi that I don’t completely understand yet. So this one is just a learning exercise to better understand how to control physical devices with the Raspberry Pi.

Setup and Code

I got my start in this project by watching a video by XtraPerianer and then reading his writeup about it. I don’t go into any details here about how to accomplish the task; I just wanted to show you my project and a couple of things I learned in the process. You need to visit XtraPerianer’s video and site to get the details.

I did run into one problem that should be noted. When I copied the C++ code for the song from the Raspberry Pi forum, there were 4 extra lines at the beginning of the code block that my compiler choked on. Make sure you don’t include these 4 lines when you make your own .cpp file.

# --------------------------------------
# Written by Scott Vincent
# 16 Feb 2014
# --------------------------------------

Of course the code author should get credit, but for the purposes of compiling the code you should eliminate these lines. At least it did not work for me to have these included. Admittedly, I am clueless as to proper C++ formatting. There may be something that I did wrong.

UPDATE: The problem is the use of # as a comment marker. Thanks to Tnwheeler for pointing out in the comments below the proper comment notation for C++ code.
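If you would rather keep the credit lines than delete them, C++ accepts // as a line comment. Here is a small sketch that rewrites the leading # on those lines. It assumes the four credit lines are the first four lines of the file, and the function name fix_header_comments is made up for this example.

```shell
# Hypothetical helper: turn a leading "#" into "//" on lines 1-4 only.
# Limiting the edit to the first four lines leaves real preprocessor lines
# such as "#include" further down the file untouched.
fix_header_comments() {
    sed '1,4s|^#|//|' "$1"
}
```

Something like fix_header_comments song.cpp > song_fixed.cpp would write the result to a new file (both file names are placeholders).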

Video

Here’s my video of the project with some annotations included.

What I Learned

I have done some GPIO programming with the Pi in the past. It has been a bit over a year and I don’t remember all the details. But, this was a simple refresher to get the software I needed for my new Raspberry Pi. The programming I did in the past (and what I have been studying for a few months) is Python. However, this project uses C++. The code has enough comments in it that I pretty much understand what is happening. Now I want to modify it and have my floppy play other songs.

One of my future projects will make heavy use of stepper motors. This is a good reminder of how they work and how to program them.

2014 Book Breakdown

Here is the breakdown of the 57 books I read in 2014. All the numbers below are how many books fell into that category.

Format

I was a little surprised by how few books I read on my Kindle. I really do prefer reading on it, but since I almost never buy books, I read in whatever format I can get them. If I were to actually spend money for a book (and had a format choice), then I would get them exclusively for my Kindle.

  • Paper Books: 29
  • Kindle/Electronic Books: 16
  • Audio Books: 12

Genre

I was able to tease out 8 major categories of books. I did have 1 book that overlapped categories. It was not strictly biographical, but it also was not what I would consider a plain history book. So the count adds up to 58 instead of the 57 that I read.

  • Communications/Business: 19
  • Religious: 12
  • Technology: 6
  • History: 5
  • Productivity: 5
  • No Category: 4
  • Fiction: 4
  • Biographical: 3

Months

  • December: 12
  • November: 8
  • 2 Months: 5
  • 3 Months: 4
  • 5 Months: 5

Ownership

  • Library: 31
  • Owned: 24
  • Borrowed: 2

Of the owned books, they broke down like this:

  • Free or given to me as a gift: 12
  • Purchased used: 9
  • Purchased new: 3

I guess you can see I don’t spend much for books even though I read quite a few. There are so many books that I already own that I have never read, I really shouldn’t spend so much time at the library. But it is so hard to resist the pull of the New Books shelf each week when we go.

We are members of 2 local libraries. The one we go to every Saturday is fairly well stocked, but seems so impersonal. Though we have been there most Saturdays for the last 3 years I still feel like we are walking into someone else’s library when we are there. I’ve never felt like the staff are friendly or personable. Their computer system has been in a constant upgrade process for 2 years and it almost never works as expected. I only remember asking for help finding a book one time and that was about 1 year ago. The lady pointed towards where the book should be. I had already looked and asked her if she could go help me look. She finally did. Though we did not find the book, it is still listed in the catalog as being on the shelf. I have told them twice that the book was missing, but they have not taken it out of their catalog or flagged it as being temporarily lost.

The library where I go during the week is in a small house. Almost too personal at about 1,200 square feet of total space (this includes stacks, offices and storage). I haven’t spent much time in there, but I know all the workers’ names and they act thrilled to help any patron try to find a book. I am excited that this during-the-week library is building a new 16,000 square foot building that will open in May of this year. That will be more than a 10X increase in size!

Over the next few days I will compile the groups of books and work on a few book reviews for you.

2014 Consumption — 2015 Creation

Today is January 1, a day typically known for goal setting and resolutions. I have been thinking about my 2015 goals for a couple of months. First let me tell you about 2014 (a year for which I did not blog about my goals). Then I will get into 2015’s goals.

2014 — A Year of Consumption

Some of the books I read in 2014

No, not a year of tuberculosis. That is a different kind of consumption than what I did in 2014. I made 2014 a year of reading. I know some people read a ton more books than I do; but for me, setting a goal is helpful to keep me reading. I did something similar in 2008 and 2009 when I had a goal of reading 800 pages a month. In 2008 the goal was an average for the year and 2009 was an attempt to not let any month drop below 800 pages.

For 2014 the goal was not a certain number of pages, but a certain number of books. I wanted to read 50 books for the year. My final total was 57 books. Total pages read was 13,275 (for an average of 1,106 pages a month).

I will go into much more analysis about the books in upcoming posts. Mainly I am breaking the analysis into smaller parts because of what it has to do with this year’s goals.

2015 — A Year of Creation

I have a big content creation goal for 2015. Of course I will still be reading through the year. I plan to write down each book I read (which I did better in 2014 than I have ever done before). It will be interesting to see if I actually read significantly less when I don’t have a specific goal.

Writing

The bulk of my content creation will be in the form of writing. I plan to create at least 4 pieces of writing content per week. That will be spread over several websites. I have identified 8 places where I should be regularly creating content. That means that the 208 new pieces of writing won’t all be here. So it may not look like as much writing is going on if you just follow this one blog. This just happens to be the best place to talk about this and the place that can catch any random thoughts I have.

This is a total goal for the year and not necessarily a week-to-week or month-to-month goal. So if I don’t have 4 each week or 16 each month, I am allowing myself to catch up towards the end of the year. It is also possible that this goal is too meager. I will re-evaluate later and see if I need to increase the number as the year goes along. Based on the last couple of years, I don’t think this will be too low. I need to get back into more consistent writing.

Evaluation

Along with a writing goal, part of my creation will be evaluating my different websites. I want to seriously look at all the themes, plugins, functionality, and overall design of each one of my major websites. So each site will get an honest evaluation.

I want to standardize many of the plugins I use. I have built my various websites over the course of 15 years, with about 9 of those years focused on WordPress, and each time I picked whatever plugin was best at the time for a particular job. But, that means that I am using about 4 different caching plugins and 3 different backup plugins (with some sites having no automatic backups at all). I want to unify my maintenance routine on the sites, which will best be done by standardizing the way I work with each site.

Personal Goals

I have some other goals related to Bible study, health and ministry. Though I probably won’t post my Bible study goals explicitly, the fruit of those goals will show up on Genuine Leather Bible. BTW, in 2014 I was able to buy the .com domain for that website. I will probably keep the main content on .net, but now the .com will point traffic to the site. That has become, by far, my biggest source of traffic.

I had thought about having a minimum number of words that are needed to be considered a “piece of content,” but I never have problems writing too few words. I am tracking everything with the knowledge that if there is ever anything that is considered too short (300 words or less) those will be balanced out by the many blog posts and articles that will approach the 3,000 word mark throughout the year.

Happy 2015!

Use Tesseract OCR with PDF File

Goal — Copy Text from PDF Scan

If a PDF is created from a computer file then the text is embedded as part of the file. You can simply copy and paste the text from the PDF. But if the PDF is created from a scanned document, then the text in the PDF is essentially a picture and not text that can be copied and pasted. In my case I receive these PDF scans from missionaries’ prayer letters that need to be turned into blog posts or used in newsletters. I want to copy the text without having to retype the whole letter.

Setup Information

I am using Linux as the OS. The main software doing the heavy lifting is Tesseract OCR. There is a Windows version, and you can probably figure out a way to make most of these tools (or equivalents) work in a Windows environment. But, if you are using Windows, you probably don’t do this geeky kind of stuff. You are probably still retyping any document you need to do something like this on.

Besides Tesseract OCR, I am using ImageMagick to do image conversion. They also have a Windows version of their program.

Steps

You need to take the original PDF and convert it into an image file using ImageMagick. But, it is not as simple as issuing the convert command. You have to give it a couple of other parameters. One is that the image must have an 8-bit color depth or Tesseract will choke on it. Also it needs to be rendered at a sufficient dpi (dots per inch). ImageMagick’s convert command will output a 72 dpi file by default. My scanner scans at 300 dpi by default, so I can easily convert the PDF to a 300 dpi image, which is enough to get a decent OCR output.

Details

CD into the directory where your PDF is or you will need to add the paths to the following commands.

Convert PDF

convert -density 300 file.pdf -depth 8 file.tiff

This string equals: use ImageMagick to create a 300 dpi image at a color depth of 8 bits from file.pdf and save it as file.tiff in the current folder.

Run Tesseract OCR on file.tiff

tesseract file.tiff OutputFileName

This string equals: Do OCR (optical character recognition) using Tesseract on file.tiff and output it to a file called OutputFileName.txt in the same folder.
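The two commands can be chained so the OCR step only runs if the conversion worked. This is a rough sketch under a couple of assumptions: the input name ends in .pdf, and the helper name pdf_to_txt is made up for the example. The ${pdf%.pdf} expansion strips the extension so the outputs are named after the PDF.

```shell
# Sketch only: pdf_to_txt is a hypothetical helper name.
pdf_to_txt() {
    local pdf="$1"
    local base="${pdf%.pdf}"    # file.pdf -> file

    # && means tesseract is skipped if convert fails.
    convert -density 300 "$pdf" -depth 8 "$base.tiff" &&
        tesseract "$base.tiff" "$base"    # tesseract adds the .txt itself
}
```

Running pdf_to_txt file.pdf should leave file.tiff and file.txt in the current folder. A loop like for f in *.pdf; do pdf_to_txt "$f"; done would handle a whole folder of scans.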

Future Project

I plan to turn this into a Python script to simplify this into a single step [it became a bash script instead]. I am learning Python at the moment and don’t know all the pieces I need to know to make the script. But, ultimately I will use Python to do RegEx (regular expression) find and replace on the end of lines so that paragraphs are maintained in the final output and there are not a large number of forced line breaks. Then I can just open the .txt file in a text editor and copy and paste the contents into the website.
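The line-break cleanup described above can be sketched in shell as well. This version uses awk's paragraph mode (RS set to an empty string) rather than regular expressions: each block of text separated by blank lines is treated as one record, its lines are joined with spaces, and a single blank line is printed between paragraphs. The function name reflow_paragraphs is made up for the example, and it assumes the OCR output really does use blank lines between paragraphs.

```shell
# Sketch only: reflow_paragraphs is a hypothetical helper name.
# Reads the files given as arguments, or standard input if there are none.
reflow_paragraphs() {
    awk 'BEGIN { RS = ""; FS = "\n" }    # one record per paragraph, one field per line
    {
        line = ""
        for (i = 1; i <= NF; i++)
            if ($i != "")
                line = (line == "" ? $i : line " " $i)
        if (NR > 1) print ""             # blank line between paragraphs
        print line
    }' "$@"
}
```

It would be used like reflow_paragraphs OutputFileName.txt > cleaned.txt (cleaned.txt is a placeholder name). Since no placeholder characters are involved, any characters already in the text pass through unchanged.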

[The blog post from Kiirani that put me on the right track.]