Text Recognition in Images and Text Extraction from PDFs

Extracting Text from PDFs (Windows)

Installing Poppler

Many PDF files store their text as character data. Tools such as Adobe Acrobat let you copy and extract this text. Here we use a command called pdftotext to extract it instead. On Windows, installing the software package Poppler makes the pdftotext command available.

The Windows build of Poppler can be downloaded from https://blog.alivate.com.au/poppler-windows/. Download the latest binary from that page. The downloaded file is a compressed archive with the extension .7z, so note that you may need separate software to extract it. Once it is extracted, copy the folder to "C:\Program Files (x86)\poppler".

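Once the folder is copied, set up Path. As on the previous page, open "Advanced system settings" from "About" in the Windows Settings app, then click "Environment Variables".

Under "System variables", select "Path" and click "Edit".

Add the "bin" folder of the install location, either by typing it or by copying and pasting it.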
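The Path setting above can also be checked from Python. The following is a minimal sketch, not part of the original procedure: the helper names `path_entries` and `is_on_path` are made up for illustration, and the folder passed at the bottom assumes Poppler was copied to the location described above.

```python
import os


def path_entries(path_value=None):
    """Split a PATH-style string into its individual folders."""
    if path_value is None:
        path_value = os.environ.get('PATH', '')
    return [entry for entry in path_value.split(os.pathsep) if entry]


def is_on_path(folder, path_value=None):
    """Return True if folder is one of the PATH entries.

    Paths are compared after normalization, so a trailing separator
    (and, on Windows, letter case) does not matter.
    """
    def norm(p):
        return os.path.normcase(os.path.normpath(p))

    target = norm(folder)
    return any(norm(entry) == target for entry in path_entries(path_value))


# Assumed install location, taken from the copy step above
print(is_on_path(r'C:\Program Files (x86)\poppler\bin'))
```

A process only sees the Path value that existed when it was started, so run this from a newly opened prompt after editing the environment variable.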


Using pdftotext

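Once Path is set, start Command Prompt (restart it if it was already running). To be safe, check where pdftotext is installed and which version it is. If the command does not run, likely causes are a mistake in the Path setting, or a Command Prompt that was not restarted so that the Path change has not yet taken effect.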

C:\Users\lecture>where pdftotext ⏎
C:\Program Files (x86)\poppler\bin\pdftotext.exe

C:\Users\lecture>pdftotext -v ⏎
pdftotext version 0.68.0
Copyright 2005-2018 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

C:\Users\lecture>
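The same two checks (where the command lives and which version it is) can be sketched in Python with the standard library. The function names `find_command` and `version_banner` are hypothetical; `shutil.which` plays the role of `where`, and the version text is collected from both output streams because the banner may be written to the error stream rather than standard output.

```python
import shutil
import subprocess


def find_command(name):
    """Return the full path of a command found on PATH, or None."""
    return shutil.which(name)


def version_banner(name='pdftotext'):
    """Run `<name> -v` and return what it prints, or None if not found."""
    exe = find_command(name)
    if exe is None:
        return None
    result = subprocess.run([exe, '-v'], capture_output=True, text=True)
    # Combine both streams because the banner may go to stderr
    return (result.stdout + result.stderr).strip()


print(find_command('pdftotext'))
print(version_banner('pdftotext'))
```

If both lines print `None`, the command is not visible on Path, which points to the same causes listed above.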

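Next, run pdftotext from Command Prompt. Note that the PDF files are in the tesseract_data folder. Running the command creates a file in the same folder as the PDF, with the same file name (strictly speaking, the same base name) and the extension .txt, and writes the result to it. Prepare the data as described on this page.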

C:\Users\lecture>cd Documents\python\pyocr ⏎

C:\Users\lecture\Documents\python\pyocr>dir tesseract_data /w ⏎
 ドライブ C のボリューム ラベルは OS です
 ボリューム シリアル番号は 9018-19A1 です

 C:\Users\lecture\Documents\python\pyocr\tesseract_data のディレクトリ

[.]                  [..]                 en_1.docx            en_1.pdf             en_1_img.pdf
en_1_img.png         en_1_img_trim.png    en_2.docx            en_2.pdf             en_2_img.pdf
en_2_img1.png        en_2_img1_trim.png   en_2_img2.png        en_2_img2_trim.png   ja_1.docx
ja_1.pdf             ja_1_img.pdf         ja_1_img.png         ja_1_img_trim.png
              17 個のファイル           3,007,123 バイト
               2 個のディレクトリ  79,935,844,352 バイトの空き領域

C:\Users\lecture\Documents\python\pyocr>pdftotext tesseract_data\en_1.pdf ⏎

C:\Users\lecture\Documents\python\pyocr>dir tesseract_data /w ⏎
 ドライブ C のボリューム ラベルは OS です
 ボリューム シリアル番号は 9018-19A1 です

 C:\Users\lecture\Documents\python\pyocr\tesseract_data のディレクトリ

[.]                  [..]                 en_1.docx            en_1.pdf             en_1.txt
en_1_img.pdf         en_1_img.png         en_1_img_trim.png    en_2.docx            en_2.pdf
en_2_img.pdf         en_2_img1.png        en_2_img1_trim.png   en_2_img2.png        en_2_img2_trim.png
ja_1.docx            ja_1.pdf             ja_1_img.pdf         ja_1_img.png         ja_1_img_trim.png
              18 個のファイル           3,007,700 バイト
               2 個のディレクトリ  79,902,875,648 バイトの空き領域

C:\Users\lecture\Documents\python\pyocr>type tesseract_data\en_1.txt ⏎
In this paper, we consider a nonparametric adaptive software rejuvenation schedule under a
random censored data. For u failure time data and v random censored data, we formulate
upper and lower bounds of the predictive system availability based on a nonparametric
predictive inference (NPI). Then, we derive adaptive rejuvenation policies which maximizes
the upper or lower bound. In simulation experiments, we show that estimates of the software
rejuvenation schedule are updated by acquisition of new failure data, and converge to the
theoretical optimal solution.


C:\Users\lecture\Documents\python\pyocr>
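The command-line call above can be wrapped in a small Python helper. This is a sketch under the assumption that pdftotext is on Path; the function names are invented for illustration. It relies only on the basic command form `pdftotext <PDF file> [<text file>]` and on the naming rule described above (same folder, same base name, .txt extension).

```python
import subprocess
from pathlib import Path


def default_output_path(pdf_path):
    """Same folder and base name as the PDF, with a .txt extension."""
    return Path(pdf_path).with_suffix('.txt')


def build_command(pdf_path, txt_path=None):
    """Build the argument list for the pdftotext command."""
    command = ['pdftotext', str(pdf_path)]
    if txt_path is not None:
        command.append(str(txt_path))
    return command


def run_pdftotext(pdf_path, txt_path=None):
    """Run pdftotext and return the path of the text file it writes."""
    # check=True raises CalledProcessError if pdftotext reports a failure
    subprocess.run(build_command(pdf_path, txt_path), check=True)
    if txt_path is not None:
        return Path(txt_path)
    return default_output_path(pdf_path)


print(build_command(Path('tesseract_data') / 'en_1.pdf'))
print(default_output_path(Path('tesseract_data') / 'en_1.pdf'))
```

Passing the arguments as a list avoids quoting problems with folder names that contain spaces, such as "Program Files (x86)".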

Japanese text can be extracted in exactly the same way, but it is a good idea to switch the Command Prompt code page to UTF-8 (65001) beforehand. See here for details.

C:\Users\lecture\Documents\python\pyocr>chcp ⏎
現在のコード ページ: 932

C:\Users\lecture\Documents\python\pyocr>chcp 65001 ⏎
Active code page: 65001

C:\Users\lecture\Documents\python\pyocr>dir tesseract_data /w ⏎
 Volume in drive C is OS
 Volume Serial Number is 9018-19A1

 Directory of C:\Users\lecture\Documents\python\pyocr\tesseract_data

[.]                  [..]                 en_1.docx            en_1.pdf             en_1.txt             en_1_img.pdf
en_1_img.png         en_1_img_trim.png    en_2.docx            en_2.pdf             en_2_img.pdf         en_2_img1.png
en_2_img1_trim.png   en_2_img2.png        en_2_img2_trim.png   ja_1.docx            ja_1.pdf             ja_1_img.pdf
ja_1_img.png         ja_1_img_trim.png
              18 File(s)      3,007,700 bytes
               2 Dir(s)  79,852,638,208 bytes free

C:\Users\lecture\Documents\python\pyocr>pdftotext tesseract_data\ja_1.pdf ⏎

C:\Users\lecture\Documents\python\pyocr>dir tesseract_data /w ⏎
 Volume in drive C is OS
 Volume Serial Number is 9018-19A1

 Directory of C:\Users\lecture\Documents\python\pyocr\tesseract_data

[.]                  [..]                 en_1.docx            en_1.pdf             en_1.txt             en_1_img.pdf
en_1_img.png         en_1_img_trim.png    en_2.docx            en_2.pdf             en_2_img.pdf         en_2_img1.png
en_2_img1_trim.png   en_2_img2.png        en_2_img2_trim.png   ja_1.docx            ja_1.pdf             ja_1.txt
ja_1_img.pdf         ja_1_img.png         ja_1_img_trim.png
              19 File(s)      3,008,509 bytes
               2 Dir(s)  79,842,598,912 bytes free

C:\Users\lecture\Documents\python\pyocr>type tesseract_data\ja_1.txt ⏎
研究者が⾃⾝で収集した学術論⽂の⽂献 PDF ファイルを効率的に管理し，研究活動に有
効活⽤することを⽬的として，⽂献 PDF データベースシステムを開発した．利⽤者は PDF
ファイルを Web ブラウザからサーバにアップロードすることで，PDF ファイルを⼀元的
に管理できるようになるとともに，全⽂検索，ジャーナル検索，著者検索，タグ（キーワー
ド）検索が利⽤できるようになる．また，論⽂情報の登録などに BIBTEX 情報を活⽤する
ことも本システムの特徴のひとつである．本論⽂では⽂献 PDF データベースシステムの
詳細について議論するとともに，性能評価実験の結果を考察する．


C:\Users\lecture\Documents\python\pyocr>
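Some characters in the Japanese output above (for example ⾃, ⽂, ⽤) appear to be code points from the Unicode Kangxi Radicals block rather than ordinary kanji, which can make later searching or comparison fail. The following sketch, which is not part of the original procedure, replaces only those code points with their compatibility equivalents and leaves everything else, including full-width punctuation, unchanged. The function names are hypothetical, and reading the file as UTF-8 is an assumption about the encoding pdftotext used.

```python
import unicodedata

# Range of the Kangxi Radicals block in Unicode
KANGXI_FIRST = 0x2F00
KANGXI_LAST = 0x2FD5


def replace_kangxi_radicals(text):
    """Map Kangxi radical code points to ordinary kanji.

    Characters outside the Kangxi Radicals block are returned unchanged.
    """
    return ''.join(
        unicodedata.normalize('NFKC', ch)
        if KANGXI_FIRST <= ord(ch) <= KANGXI_LAST
        else ch
        for ch in text
    )


def read_normalized(path):
    """Read a text file (assumed UTF-8) and replace Kangxi radicals."""
    with open(path, encoding='utf-8') as f:
        return replace_kangxi_radicals(f.read())


# U+2F83 and U+2F9D are the radical forms of the two kanji in the word for "oneself"
print(replace_kangxi_radicals('\u2f83\u2f9d'))
```

Applying NFKC to the whole string would also convert full-width commas and periods to ASCII, which is why the sketch restricts it to one block.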


Using pdftotext from Python (Installing pdftotext)

To use pdftotext from Python, you need to install the pdftotext package with pip. First, start Anaconda Prompt and switch to the py39ocr virtual environment that was created earlier (here).

(base) C:\Users\lecture>conda env list ⏎
# conda environments:
#
base                  *  C:\Users\lecture\anaconda3
py39ocr                  C:\Users\lecture\anaconda3\envs\py39ocr


(base) C:\Users\lecture>conda activate py39ocr ⏎

(py39ocr) C:\Users\lecture>

First, install poppler into the virtual environment.

(py39ocr) C:\Users\lecture>conda install -c conda-forge poppler ⏎

Then install pdftotext.

(py39ocr) C:\Users\lecture>pip install pdftotext ⏎
Collecting pdftotext
  Using cached pdftotext-2.2.2.tar.gz (113 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pdftotext
  Building wheel for pdftotext (setup.py) ... done
  Created wheel for pdftotext: filename=pdftotext-2.2.2-cp39-cp39-win_amd64.whl size=11578 sha256=573bfbe716faecff542c0de2bf016c458bb1d47c2a57ed8dc0672445cd871ca6
  Stored in directory: c:\users\lecture\appdata\local\pip\cache\wheels\e4\e5\45\636cc09f9a6770e3ffafdfc7317486c76b234a3b0e21409a3e
Successfully built pdftotext
Installing collected packages: pdftotext
Successfully installed pdftotext-2.2.2

(py39ocr) C:\Users\lecture>

The installation may fail partway with an error message like the one below. In that case, follow its instructions: go to https://visualstudio.microsoft.com/ja/downloads/, open "Tools for Visual Studio 2022", and download and install "Build Tools for Visual Studio 2022".

error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

You can confirm that the package is installed with the pip list command.

(py39ocr) C:\Users\lecture>pip list ⏎
Package              Version
-------------------- -----------
... (omitted) ...

pdftotext            2.2.2

... (omitted) ...
(py39ocr) C:\Users\lecture>
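As a complement to pip list, the check can also be made from inside Python. This minimal sketch uses the standard library's `importlib.util.find_spec`; the helper name `is_importable` is invented for illustration. It only reports whether the active environment can locate the module, not which version is installed.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Expected to print True inside the py39ocr environment once pip install succeeded
print(is_importable('pdftotext'))
```

If this prints False even though pip list shows the package, the interpreter being run probably belongs to a different environment than the one pip installed into.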


Using pdftotext from Python

We now use Jupyter Notebook to extract text from a PDF. First, change to the working folder and start Jupyter Notebook.

(py39ocr) C:\Users\lecture>cd Documents\python\pyocr ⏎

(py39ocr) C:\Users\lecture\Documents\python\pyocr>jupyter notebook ⏎

The following code uses pdftotext from Python to extract the text of a one-page PDF file that contains English text.


import os
import pdftotext

file_path = os.path.join('tesseract_data', 'en_1.pdf')

# Open in binary mode; the with statement closes the file automatically
with open(file_path, 'rb') as f:
    pdf = pdftotext.PDF(f)

print(pdf)
print(pdf[0])  # page 1

<pdftotext.PDF object at 0x000001B76CBC2E70>
In this paper, we consider a nonparametric adaptive software rejuvenation schedule under a
random censored data. For u failure time data and v random censored data, we formulate
upper and lower bounds of the predictive system availability based on a nonparametric
predictive inference (NPI). Then, we derive adaptive rejuvenation policies which maximizes
the upper or lower bound. In simulation experiments, we show that estimates of the software
rejuvenation schedule are updated by acquisition of new failure data, and converge to the
theoretical optimal solution.
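To keep the extracted text, as the command-line tool does with its .txt file, the page strings can be written out from Python. This is a sketch with a hypothetical helper `save_pages`; it accepts any iterable of strings, so it can be given a `pdftotext.PDF` object or a plain list. The blank-line separator between pages and the output file name are arbitrary choices for illustration.

```python
from pathlib import Path


def save_pages(pages, txt_path, separator='\n\n'):
    """Join page strings and write them to a UTF-8 text file.

    Returns the number of characters written.
    """
    text = separator.join(pages)
    Path(txt_path).write_text(text, encoding='utf-8')
    return len(text)


# Stand-in data; in the notebook, pass the pdftotext.PDF object instead
count = save_pages(['first page', 'second page'], 'example_output.txt')
print(count)
```

Writing with an explicit `encoding='utf-8'` avoids depending on the Windows default code page, which matters for the Japanese sample file.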

To extract text from a two-page PDF, you can write the following.


import os
import pdftotext

file_path = os.path.join('tesseract_data', 'en_2.pdf')

# Open in binary mode; the with statement closes the file automatically
with open(file_path, 'rb') as f:
    pdf = pdftotext.PDF(f)

print(pdf)
print('----- p.1 -----')
print(pdf[0])  # page 1
print('----- p.2 -----')
print(pdf[1])  # page 2

<pdftotext.PDF object at 0x000001B76CBC2D80>
----- p.1 -----
In this paper, we consider a nonparametric adaptive software rejuvenation schedule under a
random censored data. For u failure time data and v random censored data, we formulate
upper and lower bounds of the predictive system availability based on a nonparametric
predictive inference (NPI). Then, we derive adaptive rejuvenation policies which maximizes
the upper or lower bound. In simulation experiments, we show that estimates of the software
rejuvenation schedule are updated by acquisition of new failure data, and converge to the
theoretical optimal solution.


----- p.2 -----
For digital gadgets, such as smartphones and tablets, manufacturers (or providers) usually
oﬀer a one year free-repair warranty against failures. This paper considers two types of failures.
The ﬁrst type of failure (Type-I failure) is a wear-out failure, which is warranted by the
manufacturers. The second type of failure (Type-II failure) is an accidental failure, which is
not warranted by the manufacturers. In this paper, we propose an extended warranty service
contract covering both Type-I and Type-II failures between a provider and a customer. Aiming
to contribute to the establishment of a method for determining a suitable price to the extended
warranty service contract fee, this paper discusses the optimal strategy for the provider
considering the reaction of the customer.

In practice, however, it is better to write the code as follows so that it handles PDF files with any number of pages.


import os
import pdftotext

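file_path = os.path.join('tesseract_data', 'en_2.pdf')

# Open in binary mode; the with statement closes the file automatically
with open(file_path, 'rb') as f:
    pdf = pdftotext.PDF(f)

# Page numbers are 0-based in Python, so add 1 for display
for i, text in enumerate(pdf):
    print(f'----- p.{i + 1} -----')
    print(text)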

----- p.1 -----
In this paper, we consider a nonparametric adaptive software rejuvenation schedule under a
random censored data. For u failure time data and v random censored data, we formulate
upper and lower bounds of the predictive system availability based on a nonparametric
predictive inference (NPI). Then, we derive adaptive rejuvenation policies which maximizes
the upper or lower bound. In simulation experiments, we show that estimates of the software
rejuvenation schedule are updated by acquisition of new failure data, and converge to the
theoretical optimal solution.


----- p.2 -----
For digital gadgets, such as smartphones and tablets, manufacturers (or providers) usually
oﬀer a one year free-repair warranty against failures. This paper considers two types of failures.
The ﬁrst type of failure (Type-I failure) is a wear-out failure, which is warranted by the
manufacturers. The second type of failure (Type-II failure) is an accidental failure, which is
not warranted by the manufacturers. In this paper, we propose an extended warranty service
contract covering both Type-I and Type-II failures between a provider and a customer. Aiming
to contribute to the establishment of a method for determining a suitable price to the extended
warranty service contract fee, this paper discusses the optimal strategy for the provider
considering the reaction of the customer.
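The page loop above extends naturally to every PDF in a folder. The following is a minimal sketch with hypothetical function names; it assumes the pdftotext package is installed and that the files are in tesseract_data. The package is imported inside `extract_pages` so that the file-listing helper can be used on its own.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside folder, sorted by name."""
    return sorted(Path(folder).glob('*.pdf'))


def extract_pages(pdf_path):
    """Return the text of each page of one PDF as a list of strings."""
    import pdftotext  # imported here so list_pdfs works without the package

    with open(pdf_path, 'rb') as f:
        return list(pdftotext.PDF(f))


def extract_folder(folder):
    """Map each PDF file name to its list of page texts."""
    return {path.name: extract_pages(path) for path in list_pdfs(folder)}


# Lists the PDF files without extracting anything yet
for path in list_pdfs('tesseract_data'):
    print(path.name)
```

Calling `extract_folder('tesseract_data')` would then return a dictionary keyed by file name. Files that contain only a scanned image (the `_img.pdf` names suggest this, though that is an assumption) would likely yield little or no text, since pdftotext reads stored character data rather than recognizing images.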
