PDF to TIFF to TXT: Bash Script Automation

A few weeks ago I started a project to turn a scanned PDF into text. At first I did this manually: I converted the PDF to a TIFF (an image file), ran that through the Tesseract OCR engine to get a TXT file, and then used regular expressions to strip out the hard line breaks that the OCR process had inserted.

Since I was taking a Python class at the time, I thought I would write this in Python. But when I actually started coding the steps, I realized that a bash script would do everything I needed, and I already knew how to write most of it. So that is what I ended up with.

#!/bin/bash
if [ $# -lt 1 ]; then
        echo "Usage: $0 {file in.pdf}"
        exit 1
fi
if [ ! -f "$1" ]; then
        echo "Filename given $1 doesn't exist."
        exit 1
fi
convert -density 300 "$1" -depth 8 "$1.tiff"    # PDF to 300 dpi, 8-bit TIFF
tesseract "$1.tiff" "$1"                        # OCR; tesseract writes $1.txt
cat "$1.txt" | tr '\n' '*' | sed 's/\*\*/^^/g' | sed 's/\*/ /g' | sed 's/\^\^/\n\n/g' > "$1.converted.txt"
rm "$1.tiff"
rm "$1.txt"

Here is a walk-through of the script as I understand it. I had help with the regular expression (line 12) and mostly understand it, so treat this as my best attempt at explaining what is going on. I welcome any suggestions for making the script better.
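
To make the regular expression on line 12 easier to follow, here is a small sketch that runs the same tr and sed chain on a made-up sample instead of real OCR output. The function name unwrap and the sample text are placeholders for illustration, not part of the script, and the \n in the last sed replacement assumes GNU sed.

```shell
#!/bin/bash
# Sketch: the same tr/sed chain as line 12, wrapped in a hypothetical
# helper so it can be tried on sample text without running OCR.
unwrap() {
    # 1. newline -> "*"   2. "**" (blank line) -> "^^"
    # 3. remaining "*" -> space   4. "^^" -> two newlines (GNU sed)
    tr '\n' '*' | sed 's/\*\*/^^/g' | sed 's/\*/ /g' | sed 's/\^\^/\n\n/g'
}

# Made-up input: two hard-wrapped paragraphs separated by a blank line.
result=$(printf 'The quick brown\nfox jumps over\nthe lazy dog.\n\nSecond paragraph\nstarts here.\n' | unwrap)

# Each paragraph is now a single line; the blank line between them survives.
printf '%s\n' "$result"
```

One side effect worth knowing: the final newline of the input also becomes a space, so the last line ends with a trailing space. Any asterisks or carets already in the text would be rewritten as well.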

To use the script you invoke it with the name of the script and then the name of the PDF like this: pdf2tiff2txt.sh filename.pdf

  • Line 1: This says that the script is a bash script.
  • Lines 2-5: A failsafe. If no argument is given, the script tells the user to type the command for the script (my script is called pdf2tiff2txt.sh) followed by the location of the PDF to be converted.
  • Lines 6-9: A second failsafe. If the file name given does not exist, the user is told that the file is not in the location they indicated.
  • Line 10: Converts the input file into a TIFF with a resolution of 300 dpi and an 8-bit color depth. The output is named after the original file with .tiff appended (for example, filename.pdf.tiff).
  • Line 11: Uses the tesseract OCR engine to convert the image into a text file. Tesseract adds the .txt file extension automatically.
  • Line 12: The magical regex line. All newlines are converted to asterisks. Then double asterisks, which mark the blank lines between paragraphs, are converted to double carets (^^). The remaining single asterisks are converted to spaces. The double carets are then converted back to double newlines. Finally the new file is output as the original file name plus .converted.txt.
  • Lines 13 and 14: Delete the intermediary files that were created. I never need these, so I can safely delete them. This is the section where I think my script could be improved. I would guess there is a way to create the intermediary files as temporary files that get deleted automatically once they are no longer needed.
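
One possible way to get the automatic cleanup guessed at for lines 13 and 14 is mktemp plus an EXIT trap. The sketch below is an assumption about how that could look, not the script I actually run: the function name ocr_with_tempdir and the file name page are made up, and printf stands in for the convert and tesseract steps so the example runs without either tool installed.

```shell
#!/bin/bash
# Sketch only: intermediate files live in a scratch directory that is
# removed automatically, so no explicit rm lines are needed.
ocr_with_tempdir() {
    (
        # Private scratch directory; the trap removes it when this
        # subshell exits, whether the steps succeed or fail.
        workdir=$(mktemp -d) || exit 1
        trap 'rm -rf "$workdir"' EXIT

        # Real script: convert -density 300 "$1" -depth 8 "$workdir/page.tiff"
        printf 'pretend image\n' > "$workdir/page.tiff"
        # Real script: tesseract "$workdir/page.tiff" "$workdir/page"
        printf 'first line\nsecond line\n' > "$workdir/page.txt"

        # Only the final text leaves the scratch directory (here via stdout).
        cat "$workdir/page.txt"
    )
}

ocr_with_tempdir input.pdf
```

Running the steps inside a subshell keeps the trap local to that one function call, so the scratch directory disappears as soon as the function returns rather than when the whole script ends.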

I am currently using this in production and it does exactly what I need. I am very pleased with the output and want to thank my friend Brett for his help with the regex and the failsafe lines.

3 thoughts on “PDF to TIFF to TXT: Bash Script Automation”

    1. If you want to build it, I can post the details here. I don’t use Windows on a regular basis, so I don’t have any interest in working on a Windows version.
