Character Recognition in Images and Text Extraction from PDFs

Extracting Text from PDFs (macOS)

Installing Poppler

Many PDF files contain embedded text information. With Adobe Acrobat or Preview on the Mac, you can copy this text to extract it. Here we use the pdftotext command to extract the text instead. On macOS, installing the software package poppler makes the pdftotext command available. First, install poppler with Homebrew.

(py39ocr) rinsaka@MacStudio2022 pyocr % brew install poppler ⏎

Check where it was installed and which version it is.

(py39ocr) rinsaka@MacStudio2022 pyocr % which pdftotext ⏎
/opt/homebrew/bin/pdftotext
(py39ocr) rinsaka@MacStudio2022 pyocr % pdftotext -v ⏎
pdftotext version 22.06.0
Copyright 2005-2022 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011, 2022 Glyph & Cog, LLC
(py39ocr) rinsaka@MacStudio2022 pyocr %
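
The same check can also be scripted. The following is a minimal sketch that uses only the Python standard library; the helper name `find_pdftotext` is hypothetical and is not part of Poppler. It relies on `shutil.which`, which searches `PATH` in much the same way as `which pdftotext` does in the shell.

```python
import shutil


def find_pdftotext():
    """Return the full path of the pdftotext command, or None if it is not on PATH."""
    return shutil.which('pdftotext')


path = find_pdftotext()
if path is None:
    print('pdftotext was not found; try "brew install poppler".')
else:
    print(f'pdftotext found at {path}')
```

On the machine used above, this would be expected to report the same location as `which pdftotext`.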


Using pdftotext

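Next, try pdftotext from the terminal. Note that the PDF files are located in the tesseract_data folder when you run pdftotext. Running the command creates a file in the same folder as the PDF, with the same file name (strictly speaking, the same base name) and the extension .txt, and the result is written to that file. Prepare the data by referring to this page.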

(py39ocr) rinsaka@MacStudio2022 pyocr % ls tesseract_data ⏎
en_1.docx		en_2.pdf		ja_1.docx
en_1.pdf		en_2_img.pdf		ja_1.pdf
en_1_img.pdf		en_2_img1.png		ja_1_img.pdf
en_1_img.png		en_2_img1_trim.png	ja_1_img.png
en_1_img_trim.png	en_2_img2.png		ja_1_img_trim.png
en_2.docx		en_2_img2_trim.png
(py39ocr) rinsaka@MacStudio2022 pyocr % pdftotext tesseract_data/en_1.pdf ⏎
(py39ocr) rinsaka@MacStudio2022 pyocr % ls tesseract_data ⏎
en_1.docx		en_2.docx		en_2_img2_trim.png
en_1.pdf		en_2.pdf		ja_1.docx
en_1.txt		en_2_img.pdf		ja_1.pdf
en_1_img.pdf		en_2_img1.png		ja_1_img.pdf
en_1_img.png		en_2_img1_trim.png	ja_1_img.png
en_1_img_trim.png	en_2_img2.png		ja_1_img_trim.png
(py39ocr) rinsaka@MacStudio2022 pyocr % cat tesseract_data/en_1.txt ⏎
In this paper, we consider a nonparametric adaptive software rejuvenation schedule under a
random censored data. For u failure time data and v random censored data, we formulate
upper and lower bounds of the predictive system availability based on a nonparametric
predictive inference (NPI). Then, we derive adaptive rejuvenation policies which maximizes
the upper or lower bound. In simulation experiments, we show that estimates of the software
rejuvenation schedule are updated by acquisition of new failure data, and converge to the
theoretical optimal solution.


(py39ocr) rinsaka@MacStudio2022 pyocr %

Japanese text can be extracted in exactly the same way.

(py39ocr) rinsaka@MacStudio2022 pyocr % pdftotext tesseract_data/ja_1.pdf ⏎
(py39ocr) rinsaka@MacStudio2022 pyocr % ls tesseract_data ⏎
en_1.docx		en_2.pdf		ja_1.pdf
en_1.pdf		en_2_img.pdf		ja_1.txt
en_1.txt		en_2_img1.png		ja_1_img.pdf
en_1_img.pdf		en_2_img1_trim.png	ja_1_img.png
en_1_img.png		en_2_img2.png		ja_1_img_trim.png
en_1_img_trim.png	en_2_img2_trim.png
en_2.docx		ja_1.docx
(py39ocr) rinsaka@MacStudio2022 pyocr % cat tesseract_data/ja_1.txt ⏎
研究者が⾃⾝で収集した学術論⽂の⽂献 PDF ファイルを効率的に管理し，研究活動に有
効活⽤することを⽬的として，⽂献 PDF データベースシステムを開発した．利⽤者は PDF
ファイルを Web ブラウザからサーバにアップロードすることで，PDF ファイルを⼀元的
に管理できるようになるとともに，全⽂検索，ジャーナル検索，著者検索，タグ（キーワー
ド）検索が利⽤できるようになる．また，論⽂情報の登録などに BIBTEX 情報を活⽤する
ことも本システムの特徴のひとつである．本論⽂では⽂献 PDF データベースシステムの
詳細について議論するとともに，性能評価実験の結果を考察する．


(py39ocr) rinsaka@MacStudio2022 pyocr %
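
The naming rule for the output file (same base name, extension .txt) can be expressed in a few lines of Python. This is only a sketch under stated assumptions: the helper names `txt_path_for` and `run_pdftotext` are hypothetical, the pdftotext command is assumed to be on `PATH`, and the output is assumed to be UTF-8, which is the documented default encoding of Poppler's pdftotext.

```python
import subprocess
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the path pdftotext writes to by default: same base name, .txt extension."""
    return Path(pdf_path).with_suffix('.txt')


def run_pdftotext(pdf_path):
    """Run the pdftotext command on one PDF and return the extracted text."""
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(['pdftotext', str(pdf_path)], check=True)
    # Assumption: the output file is UTF-8 encoded.
    return txt_path_for(pdf_path).read_text(encoding='utf-8')


print(txt_path_for('tesseract_data/en_1.pdf'))
```

Calling `run_pdftotext('tesseract_data/ja_1.pdf')` would then create ja_1.txt next to the PDF, as in the terminal session above, and return its contents as a string.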


Using pdftotext from Python (Installing pdftotext)

To use pdftotext from Python, the pdftotext package must be installed with pip.

(py39ocr) rinsaka@MacStudio2022 pyocr % pip install pdftotext ⏎
Collecting pdftotext
  Downloading pdftotext-2.2.2.tar.gz (113 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 113.9/113.9 kB 677.6 kB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pdftotext
  Building wheel for pdftotext (setup.py) ... done
  Created wheel for pdftotext: filename=pdftotext-2.2.2-cp39-cp39-macosx_11_0_arm64.whl size=7349 sha256=37d7cb3cae720a640d08031e8b6cfb6c2cbdb154bd09e19349f5dc50ecd5dcfe
  Stored in directory: /Users/rinsaka/Library/Caches/pip/wheels/e4/e5/45/636cc09f9a6770e3ffafdfc7317486c76b234a3b0e21409a3e
Successfully built pdftotext
Installing collected packages: pdftotext
Successfully installed pdftotext-2.2.2
(py39ocr) rinsaka@MacStudio2022 pyocr %
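
Because the log above shows pip building the package from source, it may be worth confirming that the module is visible to the active environment before running any scripts. The following is a small sketch using only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


if is_installed('pdftotext'):
    print('pdftotext is available.')
else:
    print('pdftotext is missing; run "pip install pdftotext".')
```

This only checks that the module can be located; it does not import it, so it will not reveal problems with the underlying Poppler libraries.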


Using pdftotext from Python

First, we use pdftotext from Python to extract the text from a one-page PDF file that contains English text.


import os
import pdftotext

file_path = os.path.join('tesseract_data', 'en_1.pdf')

# Open the file in binary mode; the with block closes it automatically.
with open(file_path, 'rb') as f:
    pdf = pdftotext.PDF(f)

print(pdf)
print(pdf[0])  # page 1

<pdftotext.PDF object at 0x107b653f0>
In this paper, we consider a nonparametric adaptive software rejuvenation schedule under a
random censored data. For u failure time data and v random censored data, we formulate
upper and lower bounds of the predictive system availability based on a nonparametric
predictive inference (NPI). Then, we derive adaptive rejuvenation policies which maximizes
the upper or lower bound. In simulation experiments, we show that estimates of the software
rejuvenation schedule are updated by acquisition of new failure data, and converge to the
theoretical optimal solution.

Text can be extracted from a Japanese PDF file in the same way.


import os
import pdftotext

file_path = os.path.join('tesseract_data', 'ja_1.pdf')

# Open the file in binary mode; the with block closes it automatically.
with open(file_path, 'rb') as f:
    pdf = pdftotext.PDF(f)

print(pdf)
print(pdf[0])  # page 1

<pdftotext.PDF object at 0x1080579f0>
研究者が⾃⾝で収集した学術論⽂の⽂献 PDF ファイルを効率的に管理し，研究活動に有
効活⽤することを⽬的として，⽂献 PDF データベースシステムを開発した．利⽤者は PDF
ファイルを Web ブラウザからサーバにアップロードすることで，PDF ファイルを⼀元的
に管理できるようになるとともに，全⽂検索，ジャーナル検索，著者検索，タグ（キーワー
ド）検索が利⽤できるようになる．また，論⽂情報の登録などに BIBTEX 情報を活⽤する
ことも本システムの特徴のひとつである．本論⽂では⽂献 PDF データベースシステムの
詳細について議論するとともに，性能評価実験の結果を考察する．
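
Some characters in the extracted Japanese text above appear to be compatibility code points (for example, Kangxi radical characters that look like ordinary kanji), which can cause searches and string comparisons to fail. One possible remedy, shown here as a sketch rather than a required step, is Unicode NFKC normalization with the standard `unicodedata` module. The helper name `normalize_extracted` is hypothetical. Note that NFKC also converts full-width punctuation such as "，" and "．" to their ASCII counterparts, which may or may not be desirable.

```python
import unicodedata


def normalize_extracted(text):
    """Apply NFKC normalization to text extracted from a PDF."""
    return unicodedata.normalize('NFKC', text)


# U+2F42 (a Kangxi radical) becomes U+6587, the ordinary CJK character.
print(normalize_extracted('\u2f42') == '\u6587')  # True
# The ligature U+FB00 becomes the two letters "ff".
print(normalize_extracted('o\ufb00er'))  # offer
```

A page could be cleaned in the same way with `normalize_extracted(pdf[0])`.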

To extract the text from a two-page PDF, you can also index each page explicitly, as follows.


import os
import pdftotext

file_path = os.path.join('tesseract_data', 'en_2.pdf')

# Open the file in binary mode; the with block closes it automatically.
with open(file_path, 'rb') as f:
    pdf = pdftotext.PDF(f)

print(pdf)
print('----- p.1 -----')
print(pdf[0])  # page 1
print('----- p.2 -----')
print(pdf[1])  # page 2

<pdftotext.PDF object at 0x107bbccc0>
----- p.1 -----
In this paper, we consider a nonparametric adaptive software rejuvenation schedule under a
random censored data. For u failure time data and v random censored data, we formulate
upper and lower bounds of the predictive system availability based on a nonparametric
predictive inference (NPI). Then, we derive adaptive rejuvenation policies which maximizes
the upper or lower bound. In simulation experiments, we show that estimates of the software
rejuvenation schedule are updated by acquisition of new failure data, and converge to the
theoretical optimal solution.


----- p.2 -----
For digital gadgets, such as smartphones and tablets, manufacturers (or providers) usually
oﬀer a one year free-repair warranty against failures. This paper considers two types of failures.
The ﬁrst type of failure (Type-I failure) is a wear-out failure, which is warranted by the
manufacturers. The second type of failure (Type-II failure) is an accidental failure, which is
not warranted by the manufacturers. In this paper, we propose an extended warranty service
contract covering both Type-I and Type-II failures between a provider and a customer. Aiming
to contribute to the establishment of a method for determining a suitable price to the extended
warranty service contract fee, this paper discusses the optimal strategy for the provider
considering the reaction of the customer.

In practice, however, it is better to write the code as follows so that it handles PDF files with any number of pages.


import os
import pdftotext

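file_path = os.path.join('tesseract_data', 'en_2.pdf')

# Open the file in binary mode; the with block closes it automatically.
with open(file_path, 'rb') as f:
    pdf = pdftotext.PDF(f)

# Iterating over the PDF object yields the text of each page in order.
for i, text in enumerate(pdf):
    print(f'----- p.{i + 1} -----')
    print(text)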

----- p.1 -----
In this paper, we consider a nonparametric adaptive software rejuvenation schedule under a
random censored data. For u failure time data and v random censored data, we formulate
upper and lower bounds of the predictive system availability based on a nonparametric
predictive inference (NPI). Then, we derive adaptive rejuvenation policies which maximizes
the upper or lower bound. In simulation experiments, we show that estimates of the software
rejuvenation schedule are updated by acquisition of new failure data, and converge to the
theoretical optimal solution.


----- p.2 -----
For digital gadgets, such as smartphones and tablets, manufacturers (or providers) usually
oﬀer a one year free-repair warranty against failures. This paper considers two types of failures.
The ﬁrst type of failure (Type-I failure) is a wear-out failure, which is warranted by the
manufacturers. The second type of failure (Type-II failure) is an accidental failure, which is
not warranted by the manufacturers. In this paper, we propose an extended warranty service
contract covering both Type-I and Type-II failures between a provider and a customer. Aiming
to contribute to the establishment of a method for determining a suitable price to the extended
warranty service contract fee, this paper discusses the optimal strategy for the provider
considering the reaction of the customer.
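
If the combined text is needed as a single string (for example, to save it to a file) rather than printed page by page, the loop above can be wrapped in a small helper. This is a sketch only: the function name `format_pages` is hypothetical, and it accepts any iterable of page strings, so a plain list is used here as a stand-in for the PDF object.

```python
def format_pages(pages):
    """Join page texts into one string, with a header line before each page."""
    parts = []
    for number, text in enumerate(pages, start=1):
        parts.append(f'----- p.{number} -----')
        parts.append(text)
    return '\n'.join(parts)


# Stand-in page texts; a pdftotext.PDF object could be passed instead,
# since iterating over it yields one string per page.
sample_pages = ['first page text', 'second page text']
print(format_pages(sample_pages))
```

Keeping the formatting separate from the PDF reading makes it possible to test the helper without any PDF file at hand.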
