一.什么是OCR
OCR (Optical Character Recognition,光学字符识别)是指电子设备(例如扫描仪或数码相机)检查纸上打印的字符,通过检测暗、亮的模式确定其形状,然后用字符识别方法将形状翻译成计算机文字的过程
方案 |
说明 |
百度OCR |
收费 |
Tesseract-OCR |
Google维护的开源OCR引擎,支持Java,Python等语言调用 |
Tess4J |
封装了Tesseract-OCR ,支持Java调用 |
二.Tesseract-OCR 的特点
三.使用案例
1.导入相关的依赖
1 2 3 4 5
| <dependency> <groupId>net.sourceforge.tess4j</groupId> <artifactId>tess4j</artifactId> <version>4.1.1</version> </dependency>
|
2.导入中文字体库
地址: https://wwvc.lanzouj.com/iuPhc1h7j46f
chi_sim.traineddata
3.编写测试类进行测试
待识别的图片
测试程序
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
| package com.heima;
import net.sourceforge.tess4j.Tesseract; import net.sourceforge.tess4j.TesseractException;
import java.io.File;
public class Main {
public static void main(String[] args) throws TesseractException { Tesseract tesseract = new Tesseract(); tesseract.setDatapath("C:\\Gong\\data\\tess4j"); tesseract.setLanguage("chi_sim"); String ocr = tesseract.doOCR(new File("C:\\Gong\\data\\tess4j\\tess4j.png")); System.out.println(ocr); } }
|
识别的结果
四.封装成工具类使用
1.创建工具类
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
| package com.heima.common.tess4j;
import lombok.Getter; import lombok.Setter; import net.sourceforge.tess4j.ITesseract; import net.sourceforge.tess4j.Tesseract; import net.sourceforge.tess4j.TesseractException; import org.springframework.boot.context.properties.ConfigurationProperties; import org.springframework.stereotype.Component;
import java.awt.image.BufferedImage;
@Getter @Setter @Component @ConfigurationProperties(prefix = "tess4j") public class Tess4jClient {
private String dataPath; private String language;
public String doOCR(BufferedImage image) throws TesseractException { ITesseract tesseract = new Tesseract(); tesseract.setDatapath(dataPath); tesseract.setLanguage(language); String result = tesseract.doOCR(image); result = result.replaceAll("\\r|\\n", "-").replaceAll(" ", ""); return result; }
}
|
2.配置文件中添加配置
1 2 3
| tess4j: data-path: C:\workspace\tessdata language: chi_sim
|