hw725/CSP

📚 CSP (Corpus Split Parallel) Pipeline

Source-translation alignment pipeline for Classical Chinese (Hanmun) texts

📋 Project Overview

CSP automates two tasks for Classical Chinese texts: splitting paragraphs into sentences (P2S), and splitting sentences into phrases that are aligned 1:1 with their translation (S2P).

Key Features

  • P2S (Paragraph to Sentence): splits paragraphs into sentences
  • S2P (Sentence to Phrase): splits sentences into phrases and aligns them 1:1 (boundary model v3 applied by default)
  • Integrity guarantee: 100% of source characters preserved (nothing lost except whitespace)
  • GPU acceleration: CUDA-based high-speed processing

🚀 Quick Start

Running the Docker environment

docker-compose up -d
docker-compose exec csp bash

P2S pipeline

python p2s/main.py <input.csv> <output.xlsx>

S2P pipeline

python s2p/main.py <input.csv> <output.xlsx> [--batch-size 32]
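
As a rough illustration of the command line above, the sketch below shows how such an interface could be declared with Python's argparse. Only --batch-size (default 32) and --use-boundary-model (on by default) are named in this README. The positional names input_csv and output_xlsx, the build_parser helper, and the --no-use-boundary-model negation that BooleanOptionalAction generates are assumptions, so the real s2p/main.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented S2P invocation."""
    parser = argparse.ArgumentParser(prog="s2p/main.py")
    # Positional arguments: names are assumed, only their order is documented.
    parser.add_argument("input_csv", help="sentence-level input CSV")
    parser.add_argument("output_xlsx", help="phrase-aligned output workbook")
    # Documented option with its documented default.
    parser.add_argument("--batch-size", type=int, default=32)
    # Documented as defaulting to True; BooleanOptionalAction (Python 3.9+)
    # also adds a --no-use-boundary-model switch, which is an assumption here.
    parser.add_argument(
        "--use-boundary-model",
        action=argparse.BooleanOptionalAction,
        default=True,
    )
    return parser


default_args = build_parser().parse_args(["input.csv", "output.xlsx"])
custom_args = build_parser().parse_args(
    ["input.csv", "output.xlsx", "--batch-size", "8", "--no-use-boundary-model"]
)
print(default_args.batch_size, default_args.use_boundary_model)
print(custom_args.batch_size, custom_args.use_boundary_model)
```

A smaller batch size is the usual first thing to try when the GPU runs out of memory.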

📊 Monitoring & Analysis

Markdown dashboard (local server)

Automatic start (recommended):

# Windows (CMD) - from the top-level directory
start_md_server.bat

# Windows (PowerShell) - script in the hyeonto directory
.\hyeonto\start_md_server.ps1

# Or open the dashboard directly
open_dashboard.bat

Or start it manually:

cd hyeonto
python md_server.py --port 8080
# Then open http://127.0.0.1:8080/dashboard.html in a browser

Dashboard features

  • K=3 cluster analysis: distribution of hyeonto (Korean reading-gloss) markers and book types
  • Embedding visualization: 2D/3D UMAP overlay (black-and-white infographic)
  • Sankey diagram: P2S ↔ S2P flow analysis (black-and-white)
  • Validation analysis: co-occurring words and outlier detection
  • Label changes: comparison of 1:1 vs 3:1 weighting
  • Markdown viewer: renders all analysis reports

Analysis reports


📊 Processing Examples

P2S (paragraph → sentence splitting)

Input:

| Paragraph (source) | Paragraph (translation) |
| --- | --- |
| 公子開方事君 不歸視死父. 衛懿公好鶴 不恤死國. 齊桓公得子亂國 | 공자개방이 군주를 섬기며 죽은 아버지를 돌아보지 않았다. 위의공은 학을 좋아하여 나라가 죽는 것을 돌보지 않았다. 제환공은 자란국을 얻었다 |

Output:

| Paragraph ID | Sentence ID | Source (split) | Translation (split) |
| --- | --- | --- | --- |
| 1 | 1 | 公子開方事君 不歸視死父 | 공자개방이 군주를 섬기며 죽은 아버지를 돌아보지 않았다 |
| 1 | 2 | 衛懿公好鶴 不恤死國 | 위의공은 학을 좋아하여 나라가 죽는 것을 돌보지 않았다 |
| 1 | 3 | 齊桓公得子亂國 | 제환공은 자란국을 얻었다 |
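
To make the input/output shape above concrete, here is a deliberately naive sketch that splits both columns on sentence-final periods and pairs the pieces by position. It is not how CSP segments text: the README lists an embedder and a learned boundary model among its core technologies. The function name naive_p2s and the period-only splitting rule are assumptions made purely for illustration.

```python
import re


def naive_p2s(paragraph_id, source, translation):
    """Split both sides on periods and pair the pieces by position.

    Illustrative stand-in only; the real P2S step is model-based.
    """
    def split(text):
        return [part.strip() for part in re.split(r"[.。]", text) if part.strip()]

    src_parts, tgt_parts = split(source), split(translation)
    if len(src_parts) != len(tgt_parts):
        raise ValueError("source and translation split into different counts")
    # Rows follow the output table: paragraph ID, sentence ID, source, translation.
    return [
        (paragraph_id, sentence_id, src, tgt)
        for sentence_id, (src, tgt) in enumerate(zip(src_parts, tgt_parts), start=1)
    ]


rows = naive_p2s(
    1,
    "公子開方事君 不歸視死父. 衛懿公好鶴 不恤死國. 齊桓公得子亂國",
    "공자개방이 군주를 섬기며 죽은 아버지를 돌아보지 않았다. "
    "위의공은 학을 좋아하여 나라가 죽는 것을 돌보지 않았다. "
    "제환공은 자란국을 얻었다",
)
for row in rows:
    print(row)
```

Positional pairing breaks as soon as the two sides disagree on sentence count, which is exactly the case an alignment model is needed for.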

S2P (sentence → phrase splitting)

Input:

| Source (sample) | Translation (sample) |
| --- | --- |
| 公子開方事君 不歸視死父 | 공자개방이 군주를 섬기며 죽은 아버지를 돌아보지 않았다 |

Output (phrase-parallel):

| Sentence ID | Phrase ID | Source phrase | Translation phrase |
| --- | --- | --- | --- |
| 1 | 1 | 公子開方 | 공자개방이 |
| 1 | 2 | 事君 | 군주를 섬기며 |
| 1 | 3 | 不歸視死父 | 죽은 아버지를 돌아보지 않았다 |

📁 Directory Structure

CSP/
├── p2s/           # Paragraph to Sentence module
├── s2p/           # Sentence to Phrase module
├── common/        # Shared utilities (embedders, tokenizers, etc.)
├── accuracy/      # Evaluation scripts
├── datasets/      # Input datasets
│   ├── paragraph/ # Paragraph level (P2S input)
│   ├── sentence/  # Sentence level (P2S gold / S2P input)
│   └── phrase/    # Phrase level (S2P gold)
├── models/        # Trained boundary models
└── test_results/  # Test outputs

📈 Performance (as of 2026-01-23)

P2S (Paragraph → Sentence)

| Metric | Value |
| --- | --- |
| F1 | 0.8724 (87.24%) |
| Source-text similarity | 0.9174 (91.74%), measured over sentences whose translations match |

S2P (Sentence → Phrase)

| Metric | Value |
| --- | --- |
| F1 | 0.8091 (80.91%) |
| Translation similarity | 0.8323 (83.23%) |
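
The README does not spell out how F1 is computed. One common choice for segmentation tasks is boundary-level F1, sketched below under that assumption: each segmentation is reduced to the set of character offsets where a segment ends, and precision and recall are taken over those sets. boundary_offsets and boundary_f1 are hypothetical helpers, not functions from the repository, and the scripts in accuracy/ may define the metric differently.

```python
def boundary_offsets(segments):
    """Offsets (whitespace ignored) where a segment ends, excluding the final end."""
    offsets, position = set(), 0
    for segment in segments[:-1]:
        position += len("".join(segment.split()))
        offsets.add(position)
    return offsets


def boundary_f1(predicted, gold):
    """Harmonic mean of boundary precision and recall (assumed metric definition)."""
    pred, ref = boundary_offsets(predicted), boundary_offsets(gold)
    if not pred and not ref:
        return 1.0  # both agree that there is no internal boundary
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_phrases = ["公子開方", "事君", "不歸視死父"]
predicted_phrases = ["公子開方", "事君不歸視死父"]  # one gold boundary missed
score = boundary_f1(predicted_phrases, gold_phrases)
print(round(score, 4))
```

In this example the single predicted boundary is correct (precision 1.0) but only one of the two gold boundaries is found (recall 0.5), so the score is 2/3.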

Integrity Verification (2026-01-23)

| Pipeline | Global integrity | Description |
| --- | --- | --- |
| P2S | PASS | Source text 100% preserved (after normalization) |
| S2P | FAIL (-1 char) | One character (曰) confirmed missing; otherwise a 99.999% match. |

🔄 Migration Log (2026-01-19)

PA/SA → P2S/S2P transition

  • Folder rename: pa/ → p2s/, sa/ → s2p/
  • Dataset reorganization: datasets/paragraph, datasets/sentence, datasets/phrase

S2P configuration changes

  • Default changed: --use-boundary-model now defaults to True (to reproduce F1 0.8315)
  • Argument added: --batch-size (default 32)
  • Integrity check improved: spaces, newlines, and tabs are ignored, which resolves false positives

Verification results (10 samples)

  • P2S: F1 0.85, translation match rate 100%
  • S2P: testing in progress (boundary model v3 working normally)
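
The whitespace-insensitive integrity check described in the last item can be sketched as follows. The idea is to strip all whitespace from the original sentence and from the concatenated output segments, compare the two, and report which characters went missing or appeared from nowhere. check_integrity, strip_whitespace, and the shape of the returned report are assumptions for illustration; the repository's own check may normalize text differently.

```python
from collections import Counter


def strip_whitespace(text):
    """Remove spaces, newlines, tabs and any other whitespace."""
    return "".join(text.split())


def check_integrity(original, segments):
    """Compare the original text with its segments, ignoring whitespace."""
    before = strip_whitespace(original)
    after = strip_whitespace("".join(segments))
    before_counts, after_counts = Counter(before), Counter(after)
    return {
        "ok": before == after,  # also catches reordered characters
        "missing": dict(before_counts - after_counts),
        "extra": dict(after_counts - before_counts),
    }


original = "公子開方事君 不歸視死父"
clean_report = check_integrity(original, ["公子開方", "事君", "不歸視死父"])
# Hypothetical faulty output in which one character was dropped.
faulty_report = check_integrity(original, ["公子開方", "事", "不歸視死父"])
print(clean_report)
print(faulty_report)
```

Reporting the missing characters, rather than only a pass/fail flag, makes a single-character loss easy to trace back to the segment that dropped it.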

🛠️ Core Technologies

  • Embedder: BGE-M3 FlagModel (GPU-accelerated)
  • Boundary model: cross-attention-based Boundary Tagger v3
  • Syntactic parsing: SuPar-Kanbun (Classical Chinese) + Stanza (Korean)
  • Tokenizers: SikuBERT (Classical Chinese) + Kiwipiepy (hyeonto)

🐳 Development Environment

Docker (recommended)

Docker is recommended because of the GPU/CUDA dependencies and for reproducibility.

# Build and start the container
docker-compose up -d
docker-compose exec csp bash

# Helper script (Windows)
./docker.ps1 python scripts/example.py

Package management

  • Docker: fast package installation with uv (2-10x faster than pip)
  • Local: .venv + requirements.txt (torch 2.9.1)
  • Version difference: Docker uses torch==2.6.0 (based on the official image), while the local environment uses torch==2.9.1

Security check

# Scan the local environment for vulnerabilities
./scripts/safety_check.ps1

# Scan the Docker environment for vulnerabilities
./scripts/safety_check.ps1 -Docker

# Attempt automatic fixes
./scripts/safety_check.ps1 -Fix

Last updated: 2026-02-04 - reconciled the monitoring dashboard and K=3 analysis documents
