Comparing Five PDF Parsers: Same Document, Different Results | Syshin's Blog

Test document

Attention Is All You Need (Vaswani et al., 2017)

15 pages, two-column layout
4 tables (Tables 1-4, including rowspan/colspan)
Many equations (inline and display)
2 figures
Footnotes and references

Original PDF, page 5 (includes a table)

1. Heading level comparison

Heading structure in the paper: 3 Model Architecture → 3.1 Encoder → 3.2 Attention → 3.2.1 Scaled Dot-Product

Original	MinerU	Marker	Docling	PyMuPDF4LLM	LiteParse
(H1) Title	`#`	`#`	`##`	`#`	(none)
3 Model Arch	`#`	`#`	`##`	`##`	(none)
3.1 Encoder	`#`	`###`	`##`	`##`	(none)
3.2.1 Scaled	`#`	`####`	`##`	`##`	(none)

Actual output:

MinerU - everything is # (H1):

# 3 Model Architecture
# 3.1 Encoder and Decoder Stacks
# 3.2.1 Scaled Dot-Product Attention

Marker - distinguishes H1/H3/H4 (the only parser that does!):

# 3 Model Architecture
### 3.1 Encoder and Decoder Stacks
#### Scaled Dot-Product Attention

Docling - everything is ## (H2):

## 3 Model Architecture
## 3.1 Encoder and Decoder Stacks
## Scaled Dot-Product Attention

PyMuPDF4LLM - mix of # and ##:

# **Attention Is All You Need**
## **3 Model Architecture**
## **3.2.1 Scaled Dot-Product Attention**

LiteParse - no heading markup (plain text):

3 Model Architecture
3.1 Encoder and Decoder Stacks

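Conclusion: only Marker distinguishes the section hierarchy. MinerU and Docling flatten every heading to a single level, PyMuPDF4LLM separates only the title (#) from the section headings (all ##), and LiteParse emits no heading markup at all.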
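When a parser flattens the levels but keeps the section numbers (as in the MinerU and PyMuPDF4LLM samples above), the hierarchy can often be recovered from the numbering alone. The following is a minimal post-processing sketch, not part of any of the five parsers; the function name `relevel_heading` and the `base_level` parameter are made up for illustration, and headings without a leading section number are passed through unchanged.

```python
import re

# A Markdown heading whose text starts with a section number such as "3.2.1",
# optionally wrapped in "**" as in the PyMuPDF4LLM sample.
_NUMBERED_HEADING = re.compile(r"^#{1,6}\s+(\**)(\d+(?:\.\d+)*)\s+(.*)$")


def relevel_heading(line: str, base_level: int = 1) -> str:
    """Rewrite a numbered heading so its level follows the section depth.

    "3" maps to base_level, "3.1" to base_level + 1, and so on (capped at 6).
    Lines that are not numbered headings are returned unchanged.
    """
    match = _NUMBERED_HEADING.match(line)
    if match is None:
        return line
    bold, number, rest = match.groups()
    depth = min(base_level + number.count("."), 6)
    return f"{'#' * depth} {bold}{number} {rest}"


if __name__ == "__main__":
    for heading in [
        "# 3 Model Architecture",
        "# 3.1 Encoder and Decoder Stacks",
        "# 3.2.1 Scaled Dot-Product Attention",
    ]:
        print(relevel_heading(heading))
```

This only helps when the number survives extraction: in the samples above, Docling and Marker drop the "3.2.1" prefix, so the heuristic has nothing to work with there.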

2. Table comparison (Table 1)

Original Table 1: complexity comparison by layer type

Original Table 1 (page 6)

MinerU - HTML `<table>`

Raw source:

<table><tr><td>Layer Type</td><td>Complexity per Layer</td>
<td>Sequential Operations</td><td>Maximum Path Length</td></tr>
<tr><td>Self-Attention</td><td>O(n² ·d)</td><td>0(1)</td><td>0(1)</td></tr>
<tr><td>Recurrent</td><td>O(n·d²)</td><td>O(n)</td><td>0(n）</td></tr>
...</table>

Rendered result:

Layer Type	Complexity per Layer	Sequential Operations	Maximum Path Length
Self-Attention	O(n² ·d)	0(1)	0(1)
Recurrent	O(n·d²)	O(n)	0(n）
Convolutional	O(k ·n·d²)	0(1)	O(logk(n))
Self-Attention (restricted)	O(r ·n·d)	0(1)	0(n/r)

Marker - Markdown + LaTeX math

Raw source:

| Layer Type | Complexity per Layer | Sequential<br>Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | $O(n^2 \cdot d)$ | O(1) | O(1) |
| Recurrent | $O(n \cdot d^2)$ | O(n) | O(n) |
| Convolutional | $O(k \cdot n \cdot d^2)$ | O(1) | $O(log_k(n))$ |
| Self-Attention (restricted) | $O(r \cdot n \cdot d)$ | O(1) | O(n/r) |

Rendered result:

Layer Type	Complexity per Layer	Sequential Operations	Maximum Path Length
Self-Attention	$O(n^2 \cdot d)$	O(1)	O(1)
Recurrent	$O(n \cdot d^2)$	O(n)	O(n)
Convolutional	$O(k \cdot n \cdot d^2)$	O(1)	$O(log_k(n))$
Self-Attention (restricted)	$O(r \cdot n \cdot d)$	O(1)	O(n/r)

Docling - clean Markdown

Raw source:

| Layer Type                  | Complexity per Layer   | Sequential Operations   | Maximum Path Length   |
|-----------------------------|------------------------|-------------------------|-----------------------|
| Self-Attention              | O ( n 2 · d )          | O (1)                   | O (1)                 |
| Recurrent                   | O ( n · d 2 )          | O ( n )                 | O ( n )               |

Rendered result:

Layer Type	Complexity per Layer	Sequential Operations	Maximum Path Length
Self-Attention	O ( n 2 · d )	O (1)	O (1)
Recurrent	O ( n · d 2 )	O ( n )	O ( n )
Convolutional	O ( k · n · d 2 )	O (1)	O ( log k ( n ))
Self-Attention (restricted)	O ( r · n · d )	O (1)	O ( n/r )

PyMuPDF4LLM - Markdown + italic math

Raw source:

|Layer Type|Complexity per Layer|Sequential|Maximum Path Length|
|---|---|---|---|
|||Operations||
|Self-Attention|_O_(_n_2 _· d_)|_O_(1)|_O_(1)|
|Recurrent|_O_(_n · d_2)|_O_(_n_)|_O_(_n_)|

Rendered result:

Layer Type	Complexity per Layer	Sequential Operations	Maximum Path Length
Self-Attention	O(_n_2 · d)	O(1)	O(1)
Recurrent	O(_n · d_2)	O(n)	O(n)
Convolutional	O(_k · n · d_2)	O(1)	O(logk(n))
Self-Attention (restricted)	O(r · n · d)	O(1)	O(n/r)

LiteParse - ASCII spatial layout

 Layer Type                Complexity per Layer  Sequential  Maximum Path Length
                                                 Operations
 Self-Attention                 O(n2 · d)           O(1)             O(1)
 Recurrent                      O(n · d2)           O(n)             O(n)
 Convolutional        O(k · n · d2)                 O(1)          O(logk(n))
 Self-Attention (restricted)  O(r · n · d)          O(1)            O(n/r)

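Conclusion:

MinerU: the HTML table preserves the structure perfectly; math comes out as Unicode (n²), although several O characters are read as the digit 0 (e.g. 0(1))
Marker: Markdown table + LaTeX math ( $O(n^2)$ ) - the richest output
Docling: clean Markdown; math is plain text with extra spaces
PyMuPDF4LLM: Markdown table; math is italic text
LiteParse: spatial layout only, no structure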
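To compare cell contents across parsers, the HTML that MinerU emits first has to be turned back into rows. The sketch below does that with Python's standard-library `html.parser`; the class and function names are illustrative, and it deliberately ignores `rowspan`/`colspan`, so it only suits simple grids like Table 1.

```python
from html.parser import HTMLParser


class _TableRows(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text chunks while inside a cell, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_rows(html: str) -> list:
    """Return the table as a list of rows, each a list of cell strings."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    sample = (
        "<table><tr><td>Layer Type</td><td>Sequential Operations</td></tr>"
        "<tr><td>Self-Attention</td><td>0(1)</td></tr></table>"
    )
    for row in html_table_rows(sample):
        print(row)
```

Once every parser's table is in this row form, a cell-by-cell diff makes recognition errors such as `0(1)` versus `O(1)` easy to spot.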

3. Equation comparison

Original equation: $\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$

MinerU:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V$$

Marker:

$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

Docling:

Attention(Q, K, V ) = softmax( QK T dk )V

PyMuPDF4LLM:

Attention( Q, K, V ) = softmax( QK T √dk )V

LiteParse:

Attention(Q, K, V ) = sof tmax( QKT     )V
                                  √dk

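Conclusion:

MinerU, Marker: converted to LaTeX ($$...$$ blocks)
Docling, PyMuPDF4LLM: extracted as plain text only (no LaTeX conversion)
LiteParse: uses spatial layout to show the √ symbol and the fraction visually (distinctive, but not structured)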
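A quick way to check which outputs contain LaTeX at all is to look for `$$...$$` display blocks. The helper below is a rough sketch with an illustrative name; it only detects the `$$` delimiter seen in the MinerU and Marker samples and would miss other styles such as `\[...\]`.

```python
import re

# Non-greedy match of everything between a pair of "$$" delimiters,
# allowing the block to span multiple lines.
_DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def display_math_blocks(markdown: str) -> list:
    """Return the contents of every $$...$$ block found in the text."""
    return [block.strip() for block in _DISPLAY_MATH.findall(markdown)]


if __name__ == "__main__":
    marker_out = r"$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$"
    docling_out = "Attention(Q, K, V ) = softmax( QK T dk )V"
    print(len(display_math_blocks(marker_out)))   # one LaTeX block found
    print(len(display_math_blocks(docling_out)))  # none: plain text only
```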

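4. Image extraction comparison

Parser	Figure extraction	Method
MinerU	Yes (saves image files + Markdown references)	YOLO layout detection → region capture
Marker	Yes (6 images saved)	Surya layout → saved as JPEG
Docling	No (not included in the default Markdown)	export_to_markdown() output contains no images
PyMuPDF4LLM	Requires configuration (`write_images=True`)	Direct extraction of images embedded in the PDF
LiteParse	No	No image support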
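For PyMuPDF4LLM, the `write_images=True` setting from the table is passed to `pymupdf4llm.to_markdown`. The wrapper below is a minimal sketch: the function name and the default `images` directory are illustrative choices, and the import is done lazily so the snippet loads even where the package is not installed.

```python
from pathlib import Path


def to_markdown_with_images(pdf_path: str, image_dir: str = "images") -> str:
    """Convert a PDF to Markdown and save its images into image_dir."""
    # Third-party dependency (pip install pymupdf4llm), imported lazily.
    import pymupdf4llm

    # Create the output directory up front so image files have a place to go.
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    return pymupdf4llm.to_markdown(
        pdf_path,
        write_images=True,     # off by default, as noted in the table
        image_path=image_dir,  # where the extracted image files are written
    )
```

A call such as `to_markdown_with_images("attention.pdf")` (a hypothetical local file name) should return Markdown with image references pointing into the `images` directory.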

5. Benchmark result charts

READoc Edit Similarity (92 arXiv papers)

READoc Edit Similarity comparison

Speed vs. quality

Processing speed comparison

Processing speed by parser

OmniDocBench per-element performance (image-based)

OmniDocBench radar chart

6. Overall comparison

Item	MinerU	Marker	Docling	PyMuPDF4LLM	LiteParse
Heading levels	All H1	Distinguishes H1/H3/H4	All H2	Mix of H1/H2	None
Tables	HTML (rowspan OK)	Markdown + LaTeX	Markdown	Markdown	ASCII
Equations	LaTeX	LaTeX	Plain text	Italic text	Spatial layout
Images	Yes	Yes	No	Requires configuration	No
Speed	69 s/doc	237 s/doc	3.4 s/doc	1.9 s/doc	0.1 s/doc
Success rate	100%	34% (READoc)	100%	100%	100%
License	AGPL	GPL	MIT	AGPL	Apache 2.0
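The per-document speeds are easier to weigh once scaled to a batch. The sketch below turns the seconds-per-document figures from the table into estimated hours, assuming strictly sequential processing on one worker and that the averages hold for other documents; both are simplifications, and the 10,000-document batch size is an arbitrary example.

```python
# Seconds per document, copied from the comparison table.
SECONDS_PER_DOC = {
    "MinerU": 69.0,
    "Marker": 237.0,
    "Docling": 3.4,
    "PyMuPDF4LLM": 1.9,
    "LiteParse": 0.1,
}


def batch_hours(n_docs: int, parser: str) -> float:
    """Estimated wall-clock hours to parse n_docs one after another."""
    return n_docs * SECONDS_PER_DOC[parser] / 3600.0


if __name__ == "__main__":
    for name in SECONDS_PER_DOC:
        print(f"{name:12s} {batch_hours(10_000, name):8.1f} h")
```

For 10,000 documents this works out to roughly 192 hours for MinerU and 658 hours for Marker, against about 9.4 hours for Docling, 5.3 hours for PyMuPDF4LLM, and 0.3 hours for LiteParse. The estimate does not account for failed documents, which matters for Marker given its 34% success rate on READoc.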

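Recommendations by use case

Use case	Recommendation	Reason
Math-heavy academic papers	Marker (short documents) or MinerU	Both convert equations to LaTeX
Commercial RAG projects	Docling	MIT license, stable, fast
OCR of scanned documents	MinerU	Strongest OCR (PaddleOCR)
Large-batch preprocessing	PyMuPDF4LLM	1.9 s/doc, no GPU required
TypeScript projects	LiteParse	Node.js native
Best quality (short documents)	Marker	Best at headings, equations, and images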
