As part of a broader data science project, I recently had the chance to undertake a digitisation exercise to augment the structured dataset we have for analysis. The exercise turned out to be quite instructive, and I came away with a few lessons that I hope to share in this blog piece.
The actual digitisation problem is easy enough to state: take a scanned hospital bill like the following
and extract all the service items into a table like the following:
There are several technical challenges:
- The bills come in different formats (1-2 formats per hospital).
- The scan quality and orientation of the images are not uniform.
- There is a mix of English and Chinese text in most bills.
- There is a need to avoid accidentally picking up names to preserve privacy.
- The image dataset is large, so the digitisation script needs to be fast, parallelisable, and easy to modify.
After playing around with the documents and several Optical Character Recognition (OCR) tools for a while, I settled on the following solution workflow:
For Image Enhancement, I adopted ImageMagick, a popular open-source image manipulation tool, to convert the supplied scanned images into grayscale TIFF files with the right bit depth, plus image sharpening when necessary. I picked ImageMagick because it is a command-line tool that allows easy scripting and parallelisation, unlike alternatives such as GIMP that rely on a GUI front-end.
One of the interesting insights I gained from this digitisation project is that converting image files into the “right” format turns out to be the single most important factor in OCR success, and this step is a bit of an art that took a fair amount of trial and error to get right. (This may be related to the OCR tool selected in this project, which is discussed later.) For the benefit of others, the exact command I used is this:
convert -density 300 input.tiff -type Grayscale -compress lzw -background white +matte -depth 32 output.tiff
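To apply the conversion to the whole dataset from R, the command can simply be shelled out once per file. This is only a sketch; the `raw/` and `enhanced/` folder names are placeholders:

```r
# Placeholder layout: raw scans in "raw/", enhanced copies written to "enhanced/"
raw_files <- list.files("raw", pattern = "\\.tiff?$", full.names = TRUE)

enhance_image <- function(infile, outdir = "enhanced") {
  outfile <- file.path(outdir, basename(infile))
  # Same ImageMagick invocation as above, applied to a single file
  cmd <- sprintf(
    "convert -density 300 %s -type Grayscale -compress lzw -background white +matte -depth 32 %s",
    shQuote(infile), shQuote(outfile))
  system(cmd)
  outfile
}

dir.create("enhanced", showWarnings = FALSE)
enhanced_files <- vapply(raw_files, enhance_image, character(1))
```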
For the Optical Character Recognition step, I adopted Tesseract, a popular open-source OCR engine, to extract English characters from the enhanced image files. We actually evaluated several other mature open-source OCR tools, including GOCR and OCRAD, and found Tesseract to be the most accurate on a sample of the English documents we had. Some commercial software, such as ABBYY OCR, is known to be better, but we decided to stick with open-source software in this project.
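To make this concrete, here is roughly how the OCR step is driven from a script; the file and folder names are just placeholders, and the real pipeline wraps this in R functions discussed below:

```r
# Run Tesseract on one enhanced image; Tesseract appends ".txt" to the
# output base name itself, so this produces output/page_001.txt
system("tesseract enhanced/page_001.tiff output/page_001 -l eng")

# Read the recognised text back into R for the later extraction steps
ocr_text <- readLines("output/page_001.txt", warn = FALSE)
```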
It is worth noting that Tesseract supports multiple languages, including Chinese (both simplified and traditional), and one can specify several languages in a single run. However, I learned that this multi-language mode doesn’t work well when a document mixes languages with vastly different character sets. The reason is that the detection of bounding boxes (one for each character), a key step in OCR, can be thrown off when characters in the same document call for very different types of bounding boxes.
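For completeness, this is what a multi-language run looks like (language packs are combined with a `+`; `chi_tra` here is just an example); in practice the English-only call above gave cleaner results on these bills:

```r
# English plus Traditional Chinese in a single pass
# (requires the chi_tra language data to be installed)
system("tesseract enhanced/page_001.tiff output/page_001_zh -l eng+chi_tra")
```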
A useful Tesseract option that is not turned on by default is the hocr option, which produces not only the detected text in a document but also the coordinates of each detected word. This is a useful piece of information if you need to do a lot of post-processing on the extracted text to correct OCR errors. Unfortunately, I did not find a way to get Tesseract to output the confidence associated with each detected character, although I know from the Tesseract algorithm description that this information is computed and stored somewhere.
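Requesting hOCR output is a matter of appending Tesseract's built-in hocr config to the command; the result is an HTML-like file (.hocr or .html, depending on the version) with a bounding box for each recognised word:

```r
# The trailing "hocr" config asks Tesseract for hOCR output, i.e. the
# recognised words together with their bounding-box coordinates
system("tesseract enhanced/page_001.tiff output/page_001 -l eng hocr")
```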
For the Extraction of Service Items step, I developed a bespoke R script to identify lines in the extracted OCR output that satisfy the pattern: the line has multiple columns, the last of which is a dollar amount (in several possible formats). The script also performs basic consistency checks and can correct a range of possible OCR errors. The reason for picking R is that it has good support for text processing and I’m already a fan of R. 🙂
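The heart of the extraction is a line filter. A minimal sketch of the idea, reusing the ocr_text lines read in earlier (the patterns in the real script are considerably more involved):

```r
# Keep only lines whose last column looks like a dollar amount,
# e.g. "1,234.50", "$85.00" or "85.00"
amount_pattern <- "\\$?\\d{1,3}(,\\d{3})*\\.\\d{2}\\s*$"

is_service_item <- function(line) {
  grepl(amount_pattern, line) &&                                # ends with a price
    length(strsplit(trimws(line), "\\s{2,}|\\t")[[1]]) >= 2     # has at least two columns
}

service_lines <- Filter(is_service_item, ocr_text)
```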
Post-processing of OCR-extracted service items is actually quite a tricky thing to get right. A service item takes the following general form:
ID (optional), ItemDescription, Units, UnitPrice, Discount (optional), Price
There are several complexities in extracting the final price, and the issues we commonly face are listed in this table:
The general technique is to use regular expressions and other string-manipulation functions to correct the different errors, taking care to account for the surrounding context.
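To make the idea concrete, here is a simplified sketch of the kind of substitutions involved; it only touches the trailing price token so that letters in the item description are left alone, and the real script handles many more cases and columns:

```r
# Fix common OCR confusions (O vs 0, l vs 1, S vs 5, ...) in the trailing
# price token only, so that letters in the item description stay untouched
fix_price_ocr <- function(line) {
  tokens <- strsplit(trimws(line), "\\s+")[[1]]
  n <- length(tokens)
  price <- chartr("OoIl|S", "001115", tokens[n])   # map letters to look-alike digits
  tokens[n] <- gsub("[^0-9.,$-]", "", price)       # drop stray punctuation
  paste(tokens, collapse = " ")
}

fix_price_ocr("Ward Charges 3 25O.OO 75O.5O")
# [1] "Ward Charges 3 25O.OO 750.50"
```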
OCR-extracted item descriptions can also be problematic and require a post-processing step. We use term-frequency vectors with a semi-automatically generated dictionary to find canonical representations of item descriptions and to remove noisy characters (usually associated with the Chinese characters at the end of an item description). We also use Google Refine to iteratively find similar items using different text-clustering algorithms.
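As a rough sketch of the dictionary idea, substituting simple edit-distance matching for the term-frequency vectors and clustering used in the real workflow, and with a made-up dictionary:

```r
# Made-up dictionary of canonical item descriptions
dictionary <- c("WARD CHARGES", "CONSULTATION FEE", "LABORATORY TEST")

canonicalise <- function(desc, dict = dictionary) {
  # Strip non-letter noise (often mangled Chinese characters at the end)
  clean <- trimws(gsub("\\s+", " ", gsub("[^A-Za-z ]", "", toupper(desc))))
  # Pick the dictionary entry with the smallest edit distance
  dict[which.min(adist(clean, dict))]
}

canonicalise("Ward Charqes \u75c5\u623f")   # OCR typo plus stray Chinese characters
# [1] "WARD CHARGES"
```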
The Removal of Sensitive Data step is also done in R, making use primarily of regular expressions.
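A hedged illustration of what this step amounts to; the patterns below are purely examples, since the real ones depend on the bill layouts:

```r
# Purely illustrative patterns: a "Name:" label and an ID-number-like token
sensitive_patterns <- c("(?i)\\bname\\s*:", "\\b[A-Z]\\d{7}[A-Z]\\b")

redact_sensitive <- function(lines, patterns = sensitive_patterns) {
  hits <- Reduce(`|`, lapply(patterns, grepl, x = lines, perl = TRUE))
  lines[!hits]   # drop every line that matches a sensitive pattern
}

service_lines <- redact_sensitive(service_lines)
```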
Finally, we also use R as the scripting language that glues the different components together. The “parallel” package in R is particularly helpful here as it allows the parallelisation of all the tasks with minimal code changes.
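For example, once the per-bill pipeline is wrapped in a single function (process_bill below is a stand-in for the real one), moving from lapply to mclapply is essentially the only change needed on Unix-like systems:

```r
library(parallel)

# process_bill() stands in for the full per-bill pipeline:
# enhance -> OCR -> extract service items -> redact sensitive data
process_bill <- function(path) {
  enhanced <- enhance_image(path)
  # ... OCR, extraction and redaction steps go here ...
  enhanced
}

# Serial version:   results <- lapply(raw_files, process_bill)
# Parallel version: the same call, with mclapply and a core count
results <- mclapply(raw_files, process_bill, mc.cores = detectCores() - 1)
```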
As you can see, digitisation can be a bit of a messy process. As is usual in such problems, the Pareto principle applies: 20% of the effort gets you 80% of the benefit. Beyond that, one can perhaps use a crowdsourcing platform like Mechanical Turk to have the digitisation results manually checked and corrected.