Poylib/pdfToJson

PDF Crawling / WIPS Link Experiment Tool

A simple Streamlit app that extracts links from the application-number column of an Excel (.xlsx) file, then collects, previews, and downloads the body text (description of the invention) and the top summary metadata from WIPS detail pages.

Quick Start

# 1) Create and activate a virtual environment (macOS/Linux)
python3 -m venv .venv
source .venv/bin/activate

# 2) Install dependencies
pip install -U pip
pip install -r requirements.txt

# 3) Install the Playwright browser binary (required)
python -m playwright install chromium

# 4) Run the app
streamlit run app.py

The browser opens automatically. If it does not, open the URL printed in the terminal (for example, http://localhost:8501).

Prerequisites

  • Python 3.10 or later recommended
  • macOS or Linux (Windows also works, but the commands may differ)
  • Network access (the app needs to reach WIPS pages)

How to Run (Detailed)

  1. Prepare the virtual environment and install dependencies

    • requirements.txt includes streamlit, openpyxl, requests, and playwright.
    • After installation, be sure to run python -m playwright install chromium to download the Chromium binary.
  2. Run the app

    • Start the app with the command below.
    streamlit run app.py
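
Before launching the app, it can help to confirm that the packages named in requirements.txt are importable in the active virtual environment. The following is a minimal sketch, not part of app.py; the helper name `missing_modules` is hypothetical, and the check does not verify that the Chromium binary itself has been downloaded.

```python
import importlib.util

# Import names of the packages listed in requirements.txt.
REQUIRED_MODULES = ["streamlit", "openpyxl", "requests", "playwright"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported missing, re-running pip install -r requirements.txt inside the activated environment is the likely fix.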

Usage

  • Upload an Excel file

    • Upload a .xlsx file with the uploader at the top of the page.
    • The app searches the top 50 rows for the application-number column automatically. If auto-detection fails, you can select the column manually on screen.
    • Review the extracted links in a table and download them as JSONL.
  • Collect the description of the invention (body text)

    • Select any one of the extracted links, or enter a URL directly, and run the collection.
    • Preview the body HTML in per-language tabs.
  • Collect the top summary metadata (docSummaryInfo)

    • You can choose the engine.
      • Browser rendering (recommended): renders the page with Playwright Chromium, then extracts
      • Request/regex (fast): fetches the HTML and extracts with regular expressions
    • Review the results as a table or JSON and download them as a JSONL file.
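
The column auto-detection in the Excel upload step can be sketched roughly as below. This is an illustration under stated assumptions, not the code in app.py: the header keywords, the function names `find_application_column` and `load_links`, the output field names, and the idea that links are stored as cell hyperlinks are all hypothetical.

```python
# Hypothetical header labels; the real app may match different text.
HEADER_KEYWORDS = ("출원번호", "application number")
MAX_SCAN_ROWS = 50  # the README states that the top 50 rows are searched


def find_application_column(rows, max_rows=MAX_SCAN_ROWS):
    """Return (row_index, col_index) of the first matching header cell, or None.

    `rows` is a list of rows, each a list of cell values (0-based indices).
    """
    for r, row in enumerate(rows[:max_rows]):
        for c, value in enumerate(row):
            if value is None:
                continue
            text = str(value).strip().lower()
            if any(keyword in text for keyword in HEADER_KEYWORDS):
                return r, c
    return None


def load_links(path):
    """Collect hyperlink targets from the detected column (assumes cell hyperlinks)."""
    from openpyxl import load_workbook  # imported lazily so detection works without it

    ws = load_workbook(path).active
    rows = [[cell.value for cell in row] for row in ws.iter_rows(max_row=MAX_SCAN_ROWS)]
    found = find_application_column(rows)
    if found is None:
        return []  # the caller would fall back to manual column selection
    header_row, col = found
    links = []
    # openpyxl rows and columns are 1-based; data starts on the row after the header.
    for (cell,) in ws.iter_rows(min_row=header_row + 2, min_col=col + 1, max_col=col + 1):
        if cell.hyperlink is not None and cell.hyperlink.target:
            links.append({"application_number": cell.value, "url": cell.hyperlink.target})
    return links
```

Keeping the detection logic independent of openpyxl makes it easy to test with plain lists, and returning None gives the UI a clear signal to offer the manual column picker.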

Included Files

  • app.py: Streamlit app entry point
  • requirements.txt: Python dependency list
  • wips.xlsx: example Excel file (if present, upload it for testing)
  • wips_meta_one.jsonl: example/sample output (if present, use it as a reference for the structure)
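
JSONL stores one JSON object per line, which is the format the app offers for download. The sketch below shows one way to write and read such data with the standard library; the function names and the record fields are hypothetical and do not reflect the actual schema of wips_meta_one.jsonl.

```python
import json


def to_jsonl(records):
    """Serialize an iterable of dicts to JSONL text, one object per line."""
    # ensure_ascii=False keeps Korean text readable instead of \u escapes.
    return "".join(json.dumps(rec, ensure_ascii=False) + "\n" for rec in records)


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


if __name__ == "__main__":
    # Hypothetical record; real field names depend on what the app extracts.
    sample = [{"url": "https://example.com/doc/1", "title": "예시 문서"}]
    text = to_jsonl(sample)
    print(text, end="")
    print(from_jsonl(text) == sample)
```

In a Streamlit app, a string produced this way could be handed to `st.download_button` as the file contents.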

Troubleshooting

  • If you get a "Playwright not installed" error

    • Run the command below, as instructed in the message.
    python -m playwright install chromium
  • If the browser does not start on macOS because of a permission or security warning

    • Unblock it in the Security & Privacy settings and run again, or re-run the install command above in the terminal.
  • Network/connection problems

    • Behind a corporate proxy, you may need to set the HTTP_PROXY/HTTPS_PROXY environment variables.
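
The requests library reads HTTP_PROXY/HTTPS_PROXY from the environment by default. For the browser-rendering engine, one option is to pass the proxy explicitly when launching Chromium. The sketch below only builds the settings dictionary; the helper name `playwright_proxy_from_env` is hypothetical, and whether app.py handles proxies this way is an assumption.

```python
import os


def playwright_proxy_from_env(environ=None):
    """Build a proxy settings dict from proxy environment variables, or return None."""
    env = os.environ if environ is None else environ
    # Prefer the HTTPS proxy; fall back to the HTTP proxy.
    for key in ("HTTPS_PROXY", "https_proxy", "HTTP_PROXY", "http_proxy"):
        server = env.get(key)
        if server:
            return {"server": server}
    return None


if __name__ == "__main__":
    print(playwright_proxy_from_env())
```

The returned dictionary has the shape Playwright accepts for the `proxy` argument of `chromium.launch`, so it could be passed there when it is not None.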

License

Built for internal experiments and demos. Check your company's policy before distributing it externally.