Start
I guess OCR is not something new for us. While there are a lot of open source artificial intelligence engines to achieve this, I need a easy-to-use one.
Recently I got a task to convert images into text. The image number is fairly big. So it’s just impossible to OCR them one by one manually. So I wrote a python script to handle this tedious task.
Basic Processing
The original images contain a identical useless frame around the part that I need. So a crop is required because it will improve the performance (of course, the image is smaller) and there are unrelated texts in the frame.
Cropping is a easy problem. Just install Pillow
package with pip
:
pip install Pillow
Then use Pillow
to do the cropping:
image_file_list = ["image1.png", "image2.png", ...]
crop_file_list = [f"crop-{image_file}" for image_file in image_file_list]
## left, top, width, height
geometry = (100, 200, 300, 400)
print("Target geometry:", geometry)
## convert to (left, top, right, bottom)
geometry_ltrb = (geometry[0], geometry[1], geometry[0] +
geometry[2], geometry[1] + geometry[3])
## crop image with geometry
for index, image_file in enumerate(image_file_list):
print(f"[{index + 1}/{len(image_file_list)}] Cropping '{image_file}' ...")
with Image.open(join(dir_path, image_file)) as image:
image.crop(geometry_ltrb).save(crop_file_list)
Now we have cropped images with original filename prefixed by crop-
.
Install PaddlePaddle
It’s not easy to install PaddlePaddle
with pip
because it needs to run some native compilation. Anaconda
is also complex to install and generates a lot of garbage files. The cleanest way is to use Docker
and with vscode
Remote Connect extensions.
Of course you need to install docker first, which is basically out of this blog’s scope.
Then run the following command to create and run the PaddlePaddle
image:
docker run -it --name ppocr -v "$PWD:/data" --network=host registry.baidubce.com/paddlepaddle/paddle:2.4.0-cpu /bin/bash
Something to note
-
You can change the mounted volumes to what you want to process.
-
This image is pulled from
Baidu
(the company creates PaddlePaddle) registry, which is fast in China. You can also pull it fromDockerHub
. -
This image’s PaddlePaddle is based on cpu. Of course you have a cpu in your computer. But if you have a GPU or even CUDA, you can select another image with correct tag. But cpu image is almost always work and using GPU is harder to configure.
-
I don’t known why
--network=host
is needed. The container does not publish any ports. But it can access Internet faster or VSCode Remote Connect needs it?
Install PaddleOCR
This image above only contain PaddlePaddle. PaddleOCR is another package based on it and needs individual install. However, this time we can just use pip
again.
pip install paddleocr
Coding
The next step is to write python codes. Also the easiest part! You can connect to the container you just created with vscode and then happy coding!
ocr = PaddleOCR(use_angle_cls=True, lang="ch") ## change the language to what you need
image_text_list = []
for index, crop_image_file in enumerate(crop_file_list):
print(f"[{index + 1}/{len(crop_file_list)}] OCRing '{crop_image_file}' ...")
result = ocr.ocr(crop_image_file, cls=True)
result = result[0] ## There is some inconsistence of official docs. Result is a list with single element.
line_text_list = [line[1][0] for line in result] ## a list of text str
image_text = "\n".join(line_text_list)
image_text_list.append(paragraph)
Now you can do any other things to the the image_text_list
.
Finally
Now just run the script. Or even better, customize it.
By the way, PaddleOCR
is far more accurate than tesseract
in Chinese. Maybe because it is created by Baidu, a Chinese local company or I missed some configuration. For English, I haven’t tested.
Last modified on 2023-01-03