
1. ๋ชจ๋˜ ๋ฐ์ดํ„ฐ ์—”์ง€๋‹ˆ์–ด๋ง

๊ณผ๊ฑฐ์˜ ๋ฐ์ดํ„ฐ ๊ด€๋ฆฌ ๋ฐฉ์‹

  • ETL
  • E: ์ถ”์ถœ Extract
  • T: ์Šค๋ฏธ์นด์— ๋งž๊ฒŒ ๋ณ€ํ™˜ Transform
  • L: ๋””๋น„์— ์ €์žฅ Load

๋‹ค์–‘ํ•ด์ง€๋Š” ๋ฐ์ดํ„ฐ ํ˜•์‹

  • ๋ฐ์ดํ„ฐ๋กœ ํ•  ์ˆ˜ ์žˆ๋Š” ์ผ์ด ๋‹ค์–‘ํ•ด์ง€๊ณ  ํ˜•ํƒœ๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ๋ถˆ๊ฐ€๋Šฅํ•ด์ง€๋ฉด์„œ ์Šคํ‚ค๋งˆ๋ฅผ ์ •์˜ํ•˜๊ธฐ ํž˜๋“ค์–ด ์กŒ๋‹ค.
    • ์‹ค์‹œ๊ฐ„์„ฑ์„ ์š”๊ตฌํ•˜๋Š” ๊ธฐ๋Šฅ๋“ค
    • ๋นจ๋ผ์ง€๋Š” ๊ธฐ๋Šฅ ์ถ”๊ฐ€
    • ์‹ค์‹œ๊ฐ„ ๋กœ๊ทธ
    • ๋น„์ •ํ˜• ๋ฐ์ดํ„ฐ
    • ์„œ๋“œ ํŒŒํ‹ฐ ๋ฐ์ดํ„ฐ

์ €๋ ดํ•ด์ง€๋Š” ์ปดํ“จํŒ… ํŒŒ์›Œ

  • ์ตœ๋Œ€ํ•œ ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฏธ๋ฆฌ ์ €์žฅํ•ด๋‘๊ณ  ๋งŽ์€ ์–‘์˜ ํ”„๋กœ์„ธ์‹ฑ์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋๋‹ค.
  • ์ปดํ“จํŒ… ํŒŒ์›Œ์— ๋Œ€ํ•œ ๋น„์šฉ ์ตœ์ ํ™”๋ณด๋‹ค ๋น„์ฆˆ๋‹ˆ์Šค์™€ ์†๋„๋ฅผ ์ตœ์ ํ™”ํ•˜๋Š” ์ชฝ์ด ์ด๋“์ด ํฌ๊ฒŒ ๋๋‹ค.

ํ˜„์žฌ ๋ฐ์ดํ„ฐ๋ฅผ ์šด์˜ํ•˜๋Š” ๋ฐฉ์‹

  • ELT
  • E: ๋ฐ์ดํ„ฐ ์ถ”์ถœ Extract
  • L: ์ผ๋‹จ ์ €์žฅ
  • T: ์“ฐ์ž„์ƒˆ์— ๋”ฐ๋ผ ๋ณ€ํ™˜
  • ์˜ˆ
    • ๋ฐ์ดํ„ฐ์˜ ๋กœ๊ทธ๋ฅผ Spark๋‚˜ FLink๋ฅผ ํ†ตํ•ด ์–ด๋А์ •๋„ ์ •๋ฆฌ ํ›„ ์ €์žฅ (E&L)
    • ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜ ํ˜น์€ ๋ถ„์„ ํˆด์—์„œ ์ด์šฉ ๊ฐ€๋Šฅํ•˜๋„๋ก ๋ณ€ํ™˜ (T)
  • ์‹œ์Šคํ…œ์˜ ๋ณต์žก๋„์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ ์ถ”์ถœ๊ณผ ์ ์žฌ๋ฅผ ํ•œ๋ฒˆ์— ํ•˜๊ธฐ๋„ ํ•œ๋‹ค.

๋ฐ์ดํ„ฐ ์ธํ”„๋ผ ํŠธ๋žœ๋“œ

  • ํด๋ผ์šฐ๋“œ ์›จ์–ดํ•˜์šฐ์Šค - Snowflake, Google Big Query
  • Hadoop -> Databricks, Presto
  • ์‹ค์‹œ๊ฐ„ ๋น…๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ (Stream Processing)
  • ETL -> ELT
  • Dataflow ์ž๋™ํ™” (Airflow)
  • ๋ฐ์ดํ„ฐ ๋ถ„์„ ํŒ€์„ ๋‘๊ธฐ ๋ณด๋‹จ ๋ˆ„๊ตฌ๋‚˜ ๋ถ„์„ํ•  ์ˆ˜ ์žˆ๋„๋ก
  • ์ค‘์•™ํ™” ๋˜๋Š” ๋ฐ์ดํ„ฐ ํ”Œ๋žซํผ ๊ด€๋ฆฌ (access control, data book)

๋ฐ์ดํ„ฐ ์•„ํ‚คํ…์ณ ๋ถ„์•ผ๋ฅผ ํฌ๊ฒŒ 6๊ฐ€์ง€๋กœ

image

  • ์†Œ์Šค: ๋น„์ฆˆ๋‹ˆ์Šค์™€ ์šด์˜ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ
  • ์ˆ˜์ง‘ ๋ฐ ๋ณ€ํ™˜: ์šด์˜ ์‹œ์Šคํ…œ์—์„œ ๋ฐ์ดํ„ฐ ์ถ”์ถœ -> ์ถ”์ถœ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•˜๊ณ  ์Šคํ‚ค๋งˆ ๊ด€๋ฆฌ -> ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋ณ€ํ™˜
  • ์ €์žฅ: ๋ฐ์ดํ„ฐ๋ฅผ ์ฟผ๋ฆฌ์™€ ์ฒ˜๋ฆฌ ์‹œ์Šคํ…œ์ด ์“ธ ์ˆ˜ ์žˆ๋„๋ก ์ €์žฅ, ๋น„์šฉ๊ณผ ํ™•์žฅ์„ฑ๋ฉด์œผ๋กœ ์ตœ์ ํ™”
  • ๊ณผ๊ฑฐ&์˜ˆ์ธก: ๋ฐ์ดํ„ฐ ๋ถ„์„์„ ์œ„ํ•œ ์ธ์‚ฌ์ดํŠธ ๋งŒ๋“ค๊ธฐ(Query), ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด ์ฟผ๋ฆฌ๋ฅผ ์‹คํ–‰ํ•˜๊ณ  ํ•„์š”์‹œ ๋ถ„์‚ฐ์ฒ˜๋ฆฌ(Processing), ๊ณผ๊ฑฐ์— ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚ฌ๋Š”์ง€ ํ˜น์€ ๋ฏธ๋ž˜์— ๋ฌด์Šจ์ผ์ด ์ผ์–ด๋‚ ์ง€(ML)
  • ์ถœ๋ ฅ: ๋ฐ์ดํ„ฐ ๋ถ„์„์„ ๋‚ด๋ถ€์™€ ์™ธ๋ถ€ ์œ ์ €์—๊ฒŒ ์ œ๊ณต, ๋ฐ์ดํ„ฐ ๋ชจ๋ธ์„ ์šด์˜ ์‹œ์Šคํ…œ์— ์ ์šฉ

๋ฐ์ดํ„ฐ ์—”์ง€๋‹ˆ์–ด๋ง ๋„๊ตฌ๋“ค

image

  • Sources, Storage, Query: ์„œ๋น„์Šค ๋ ˆ๋ฒจ ๋ณด๋‹ค๋Š” ๋กœ์šฐ ๋ ˆ๋ฒจ ๋ฌธ์ œ๋“ค์„ ํ‘ธ๋Š” ๋ถ„์•ผ
  • Ingestion & Transformation, Processing: ์ผ๋ฐ˜์ ์ธ ์—”์ง€๋‹ˆ์–ด๋ง์ด ์ง‘์ค‘ํ•˜๋Š” ๋ถ„์•ผ

2. Batch & Stream Processing

What batch processing is

  • Batch == a lump of work
  • Batch processing == processing in bulk
  • Processing a large amount of data all at once at a scheduled time
    1. A bounded, large volume of data
    2. At a specific time
    3. Processed in one pass
  • The traditional way of processing data

When to use batch processing

  • When real-time results are not required
  • When the data can be processed all at once
  • For heavy workloads, e.g. ML training
  • Examples
    • Predicting the next 14 days of supply and demand every day
    • Sending weekly marketing emails to users who showed interest on the site
    • A weekly newsletter
    • Retraining a machine learning model on fresh data every week
    • Web scraping/crawling every morning
    • Monthly payroll

์ŠคํŠธ๋ฆผ ํ”„๋กœ์„ธ์‹ฑ ์ด๋ž€

  • ์‹ค์‹œ๊ฐ„์œผ๋กœ ์Ÿ์•„์ง€๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๊ณ„์† ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ
  • ์ด๋ฒคํŠธ๊ฐ€ ์ƒ๊ธธ ๋•Œ ๋งˆ๋‹ค, ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์˜ฌ ๋•Œ ๋งˆ๋‹ค ์ฒ˜๋ฆฌ
  • ๋ทธ๊ทœ์น™์ ์œผ๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์˜ค๋Š” ํ™˜๊ฒฝ์ผ ๋•Œ
  • ์—ฌ๋Ÿฌ๊ฐœ์˜ ์ด๋ฒคํŠธ๊ฐ€ ํ•œ๊บผ๋ฒˆ์— ๋“ค์–ด์˜ฌ ๋•Œ
  • ์˜ค๋žœ ์‹œ๊ฐ„ ๋™์•ˆ ์ด๋ฒคํŠธ๊ฐ€ ํ•˜๋‚˜๋„ ๋“ค์–ด์˜ค์ง€ ์•Š์„ ๋–„

๋ฐฐ์น˜ vs ์ŠคํŠธ๋ฆผ ํ”„๋กœ์„ธ์‹ฑ

๋ถˆ๊ทœ์น™์ ์œผ๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์˜ฌ ๋–„๋ฅผ ๊ฐ€์ •

  • ๋ฐฐ์น˜ ํ”„๋กœ์„ธ์‹ฑ
    • ๋ฐฐ์น˜๋‹น ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ์ดํ„ฐ ์ˆ˜๊ฐ€ ๋‹ฌ๋ผ์ง€๋ฉด์„œ ๋ฆฌ์†Œ์Šค๋ฅผ ๋น„ํšจ์œจ์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๊ฒŒ ๋œ๋‹ค.
  • ์ŠคํŠธ๋ฆผ ํ”„๋กœ์„ธ์‹ฑ
    • ๋ฐ์ดํ„ฐ๊ฐ€ ์ƒ์„ฑ๋˜์–ด ์š”์ฒญ์ด ๋“ค์–ด๋กœ ๋•Œ ๋งˆ๋‹ค ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค.

์ŠคํŠธ๋ฆผ ํ”„๋กœ์„ธ์‹ฑ์€ ์–ธ์ œ ์จ์•ผ๋ ๊นŒ

  • ์‹ค์‹œ๊ฐ„์„ฑ์„ ๋ณด์žฅํ•ด์•ผ ๋  ๋•Œ
  • ๋ฐ์ดํ„ฐ๊ฐ€ ์—ฌ๋Ÿฌ ์†Œ์Šค๋กœ๋ถ€ํ„ฐ ๋“ค์–ด์˜ฌ ๋•Œ
  • ๋ฐ์ดํ„ฐ๊ฐ€ ๊ฐ€๋” ๋“ค์–ด์˜ค๊ฑฐ๋‚˜ ์ง€์†์ ์œผ๋กœ ๋“ค์–ด์˜ฌ ๋•Œ
  • ๊ฐ€๋ฒผ์šด ์ฒ˜๋ฆฌ๋ฅผ ํ•  ๋•Œ (Rule-based)
  • ์˜ˆ์‹œ
    • ์‚ฌ๊ธฐ ๊ฑฐ๋ž˜ ํƒ์ง€ (Fraud Detection)
    • ์ด์ƒ ํƒ์ง€ (Anomaly Detection)
    • ์‹ค์‹œ๊ฐ„ ์•Œ๋ฆผ
    • ๋น„์ฆˆ๋‹ˆ์Šค ๋ชจ๋‹ˆํ„ฐ๋ง
    • ์‹ค์‹œ๊ฐ„ ์ˆ˜์š”/๊ณต๊ธ‰ ์ธก์ • ๋ฐ ๊ฐ€๊ฒฉ ์ฑ…์ •
    • ์‹ค์‹œ๊ฐ„ ๊ธฐ๋Šฅ์ด ๋“ค์–ด๊ฐ€๋Š” ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜

Batch vs Stream ํ”Œ๋กœ์šฐ

  • ์ผ๋ฐ˜์ ์ธ ๋ฐฐ์น˜ ํ”Œ๋กœ์šฐ
    1. ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ์•„์„œ
    2. ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์—์„œ ์ฝ์–ด์„œ ์ฒ˜๋ฆฌ
    3. ๋‹ค์‹œ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ๋‹ด๊ธฐ
  • ์ผ๋ฐ˜์ ์ธ ์ŠคํŠธ๋ฆผ ์ฒ˜๋ฆฌ ํ”Œ๋กœ์šฐ
    1. ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์˜ฌ ๋•Œ ๋งˆ๋‹ค(ingest)
    2. ์ฟผ๋ฆฌ/์ฒ˜๋ฆฌ ํ›„ State๋ฅผ ์—…๋ฐ์ดํŠธ
    3. DB์— ๋‹ด๊ธฐ

๋งˆ์ดํฌ๋กœ ๋ฐฐ์น˜๋ž€

  • ๋ฐ์ดํ„ฐ๋ฅผ ์กฐ๊ธˆ์”ฉ ๋ชจ์•„์„œ ํ”„๋กœ์„ธ์‹ฑํ•˜๋Š” ๋ฐฉ์‹
  • Batch ํ”„๋กœ์„ธ์‹ฑ์„ ์ž˜๊ฒŒ ์ชผ๊ฐœ์„œ ์ŠคํŠธ๋ฆฌ๋ฐ์„ ํ‰๋‚ด๋‚ด๋Š” ๋ฐฉ์‹
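The idea above can be sketched in plain Python (illustrative only, not Spark code): group an incoming stream of records into small fixed-size batches and process each batch in one pass.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a (possibly endless) stream of records into small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:          # the stream is exhausted
            return
        yield batch

# Process a simulated stream of 7 events in micro-batches of 3.
events = range(7)
processed = [sum(batch) for batch in micro_batches(events, 3)]
# batches [0,1,2], [3,4,5], [6] -> partial results 3, 12, 6
```

A real micro-batch engine (e.g. Spark Structured Streaming) additionally triggers on time intervals, but the batching shape is the same.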

3. Dataflow Orchestration

What orchestration is

  1. Task scheduling
  2. Distributed execution
  3. Managing dependencies between tasks

Why orchestration is needed

  1. As the service grows, the data platform's complexity grows
  2. Data is increasingly tied directly to users (if a workflow breaks, the service breaks)
  3. Every single task matters
  4. Dependencies form between tasks

์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜ ์—†์ด ๋ฌธ์ œ๊ฐ€ ์ƒ๊ฒผ๋‹ค๋ฉด

image

A: 3์‹œ๊นŒ์ง€ Task 4 ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์™€์•ผ ํ•˜๋Š”๋ฐ ์•„์ง ๋ชป๋ฐ›์•˜์–ด์š”. A: Task 4๋ฅผ ๋ณด๋‹ˆ ์•„์ง ์ „ ์ž‘์—…์„ ๊ธฐ๋‹ค๋ฆฌ๋Š” ๊ฒƒ ๊ฐ™์€๋ฐ, Task 2,3๋Š” ๋ˆ„๊ฐ€ ๋งŒ๋“ค์—ˆ๋‚˜์š”? A: ์ฝ”๋“œ๊ฐ€ ์–ด๋”จ๋Š”์ง€ ๋ชจ๋ฅด๊ฒ ๋„ค์š” B: ๋‹ค๋ฅธ ํŒ€์—์„œ ๋งŒ๋“  ๊ฒƒ ๊ฐ™์€๋ฐ ํ•œ๋ฒˆ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. B: Task2 ๋‹ค์‹œ ๋Œ๋ฆฌ๋‹ˆ๊นŒ ๋˜๋„ค์š”..? A & B: ???

์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜์ด ์žˆ์„ ๋•Œ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ฒผ๋‹ค๋ฉด

image

1. Task 2 ์—๋Ÿฌ ํ›„ ๋กœ๊ทธ ๋‚จ๊ธฐ๊ณ  ์•Œ๋ฆผ 2. ์‹คํŒจ ์‹œ๋‚˜๋ฆฌ์˜ค์— ๋”ฐ๋ผ Task 2 ๋‹ค์‹œ ์‹คํ–‰ ํ›„ ์„ฑ๊ณต 3. Task 4 ์‹คํ–‰ 4. ์„ฑ๊ณต

์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜์˜ ์˜ˆ์‹œ

  • Apache Airflow

image

4. Apache Spark Environment Setup

Required environment & packages

  1. Python
  2. Jupyter Notebook
  3. Java
  4. Spark
  5. PySpark

Installing Anaconda

  • https://www.anaconda.com/products/distribution
  • Installing Anaconda installs Python and its base packages automatically, which makes it easy to set up Python and Jupyter Notebook at the same time.

Installing Java

brew install --cask adoptopenjdk8

Installing Scala

brew install scala

Installing Spark

brew install apache-spark

Installing PySpark

pip --version # confirm the path points to anaconda
pip install pyspark

pyspark # confirm the Spark shell starts

์‹ค์Šต ํŒŒ์ผ ๋‹ค์šด๋กœ๋“œ

git clone https://github.com/keon/data-engineering.git

๋ชจ๋นŒ๋ฆฌํ‹ฐ ๋ฐ์ดํ„ฐ ๋‹ค์šด๋กœ๋“œ

ํ•„๋“œ ์ด๋ฆ„ ์„ค๋ช…
hvfhs_license_num ํšŒ์‚ฌ ๋ฉดํ—ˆ ๋ฒˆํ˜ธ
dispatching_base_num ์ง€์—ญ ๋ผ์ด์„ผ์Šค ๋ฒˆํ˜ธ
pickup_datetime ์Šน์ฐจ ์‹œ๊ฐ„
dropoff_datetime ํ•˜์ฐจ ์‹œ๊ฐ„
PULocationID ์Šน์ฐจ ์ง€์—ญ ID
DOLocationID ํ•˜์ฐจ ์ง€์—ญ ID
SR_Flag ํ•ฉ์Šน ์—ฌ๋ถ€ Flag

5. Spark Basics

Characteristics of Hadoop

  • HDFS file system
  • YARN resource management
  • MapReduce compute engine -> Spark replaces this.

Characteristics of Spark

  • Fast = in-memory computation over big data
  • Nodes can be added as needed
  • Scales horizontally
  • Faster than Hadoop MapReduce
    • ~100x in memory
    • ~10x on disk
  • Lazy evaluation
    • No computation happens when a task is defined; it runs only when a result is needed.
    • While waiting, the computation plan can be optimized.
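Lazy evaluation can be mimicked with Python generators — a loose analogy, not the Spark API: the "transformations" below only build a pipeline, and nothing is computed until a result is actually requested.

```python
data = range(1_000_000)

# "Transformations": nothing runs yet, only a plan is chained together.
mapped = (x * 2 for x in data)
filtered = (x for x in mapped if x % 3 == 0)

# "Action": computation happens only now, and only for the 3 needed results.
result = [next(filtered) for _ in range(3)]
# -> [0, 6, 12]; the remaining ~million elements are never touched
```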

Spark cluster

  • Composed of a Driver Program, a Cluster Manager, and Worker Nodes.
  • Driver Program: the machine we use; tasks are defined in a script such as Python, Java, or Scala.
  • Cluster Manager: distributes the defined tasks, i.e. the work.
    • On Hadoop, the YARN cluster manager can be used.
    • On AWS, Elastic MapReduce can be used.
  • Worker Node
    • One node deployed per CPU core
    • Performs the in-memory computation.

Why Spark feels slow on a personal computer

  • Spark was designed with scalability in mind

spark์˜ ํ•ต์‹ฌ ๋ฐ์ดํ„ฐ ๋ชจ๋ธ RDD

  • Resilient Distributed Dataset (RDD)
  • ์—ฌ๋Ÿฌ ๋ถ„์‚ฐ ๋…ธ๋“œ์— ๊ฑธ์ณ์„œ ์ €์žฅ
  • ๋ณ€๊ฒฝ์ด ๋ถˆ๊ฐ€๋Šฅ
  • ์—ฌ๋Ÿฌ๊ฐœ์˜ ํŒŒํ‹ฐ์…˜์œผ๋กœ ๋ถ„๋ฆฌ

Pandas vs Spark

Pandas Spark
1๊ฐœ์˜ ๋…ธ๋“œ ์—ฌ๋Ÿฌ๊ฐœ์˜ ๋…ธ๋“œ
Eager Execution - ์ฝ”๋“œ๊ฐ€ ๋ฐ”๋กœ ์‹คํ–‰ Lazy Execution - ์‹คํ–‰์ด ํ•„์š”ํ•  ๋•Œ ๊นŒ์ง€ ๊ธฐ๋‹ค๋ฆผ
์ปดํ“จํ„ฐ ํ•˜๋“œ์›จ์–ด์— ์ œํ•œ์„ ๋ฐ›์Œ ์ˆ˜ํ‰์  ํ™•์žฅ์ด ๊ฐ€๋Šฅ
In-Memory ์—ฐ์‚ฐ In-Memory ์—ฐ์‚ฐ
Mutable Data Immutable Data

Spark by version

  • Spark 1.0
    • Officially released in 2014
    • In-memory processing via RDDs
    • DataFrame (v1.3)
    • Project Tungsten - engine upgrade optimizing memory and CPU efficiency
  • Spark 2.0
    • Released in 2016
    • Simplified and faster
    • Structured Streaming
    • The Dataset, an extension of the DataFrame, appears
    • Catalyst optimizer project - the same performance regardless of language - Scala, Java, Python, R
  • Spark 3.0
    • Released in 2020
    • MLlib features added
    • Spark SQL features added
    • About 2x faster than Spark 2.4 - adaptive execution, dynamic partition pruning
    • PySpark usability improvements
    • Stronger deep learning support - GPU nodes, integration with machine learning frameworks
    • GraphX - distributed graph computation
    • Python 2 support dropped
    • Stronger Kubernetes support
  • New features keep landing and performance keeps improving, but the fundamentals do not change.

Spark components

  • Spark Core
  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX

What an RDD is

lines = sc.textFile("") # lines == RDD
  • Resilient Distributed Dataset

RDD trait 1 - data abstraction

  • The data is scattered across the cluster but can be used as if it were a single file

RDD trait 2 - resilient & immutable

  • RDDs are resilient and immutable (Resilient & Immutable)
  • What if, while computing across many places, one of the nodes fails? (network fault | hardware/memory problems | any number of unknown reasons)
  • Because the data is immutable, it can be restored when a problem occurs.
  • When RDD1 goes through a transformation, RDD1 is not modified; a new RDD2 is created. (Immutable)
  • Every transformation leaves a record of the computation.
  • The chain of RDD transformations forms an acyclic graph, and thanks to this it is easy to roll back to a previous RDD when something goes wrong.
  • If Node 1 fails mid-computation, the data is restored and the computation reruns on Node 2. (Resilient)

RDD trait 3 - type-safe

  • Types are checked at compile time, so problems surface early.

RDD trait 4 - unstructured / structured data

  • Can hold both structured and unstructured data.
  • Unstructured data - logs or natural language
  • Structured data - RDB tables or DataFrames

RDD trait 5 - lazy

  • No computation happens until a result is needed
  • There are two kinds of operations, T = transformation and A = action, e.g. RDD1 -> T -> RDD2 -> T -> RDD3 -> A --> RDD4
  • Transformations do not run until an action is called.
  • When an Action is encountered, the pending transformations (T) run.

Spark operations

  • Spark operation = Transformations + Actions

Why use RDDs

  • Flexible
  • Much can be done in little code
  • Makes you think about how rather than what while developing (how-to)
    • Thanks to lazy evaluation, you think about how the data will be transformed
    • It feels like paving the road the data will travel

Counting Uber trips

spark-submit count_trips.py # count the trips

python3 visualiza_trips_date.py # draw it as a chart

jupyter notebook . # open count_trips.ipynb in Jupyter -> detailed explanations of the code

6. Parallel and Distributed Processing

Plain parallel processing

RDD.map(<task>)
  1. Split the data into pieces
  2. Apply the task to each piece in separate threads
  3. Merge the partial results

๋ถ„์‚ฐ๋œ ํ™˜๊ฒฝ์—์„œ์˜ ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ

  1. ๋ฐ์ดํ„ฐ๋ฅผ ์—ฌ๋Ÿฌ๊ฐœ๋กœ ์ชผ๊ฐœ์„œ ์—ฌ๋Ÿฌ ๋…ธ๋“œ๋กœ ๋ณด๋‚ธ๋‹ค.
  2. ์—ฌ๋Ÿฌ ๋…ธ๋“œ์—์„œ ๊ฐ์ž ๋…๋ฆฝ์ ์œผ๋กœ task๋ฅผ ์ ์šฉ
  3. ๊ฐ์ž ๋งŒ๋“  ๊ฒฐ๊ณผ๊ฐ’์„ ํ•ฉ์น˜๋Š” ๊ณผ์ •
  • ๋…ธ๋“œ๊ฐ„ ํ†ต์‹  ๊ฐ™์ด ์‹ ๊ฒฝ์จ์•ผ๋  ๊ฒƒ์ด ๋Š˜์–ด๋‚œ๋‹ค
  • Spark๋ฅผ ์ด์šฉํ•˜๋ฉด ๋ถ„์‚ฐ๋œ ํ™˜๊ฒฝ์—์„œ๋„ ์ผ๋ฐ˜์ ์ธ ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๋ฅผ ํ•˜๋“ฏ์ด ์ฝ”๋“œ๋ฅผ ์งœ๋Š”๊ฒŒ ๊ฐ€๋Šฅํ•˜๋‹ค.
  • Spark๋Š” ๋ถ„์‚ฐ๋œ ํ™˜๊ฒฝ์—์„œ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ๋ชจ๋ธ์„ ๊ตฌํ˜„ํ•ด์„œ ์ถ”์ƒํ™” ์‹œ์ผœ์ฃผ๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€๋Šฅํ•œ๊ฒƒ์ด๋‹ค. (RDD)
  • ๊ทธ๋ ‡๋‹ค๊ณ  ์ƒ๊ฐ ์—†์ด spark ์ฝ”๋”ฉ์„ ํ•˜๋ฉด ์„ฑ๋Šฅ์„ ๋Œ์–ด๋‚ด๊ธฐ๋Š” ํž˜๋“ค๋‹ค. (๋…ธ๋“œ๊ฐ„ ํ†ต์‹  ์†๋„๋ฅผ ์‹ ๊ฒฝ์จ์•ผ ํ•จ)

Distributed processing problems

  • Moving to distributed processing brought many new concerns.
  • Partial failure - a few nodes fail for reasons unrelated to the program
  • Speed - work that requires a lot of network communication slows down
RDD.map(A).filter(B).reduceByKey(C).take(100) # 1
RDD.map(A).reduceByKey(C).filter(B).take(100) # 2

"""
#1 is the better-performing code.
reduceByKey pulls data from multiple nodes, which requires communication,
so it is more efficient to shrink the data with filter first and then process it.

Memory > disk > network in speed, so do as much as possible in memory.
The network is roughly a million times slower than memory operations.
"""

7. RDD

What a Key-Value RDD is

  • One of the tools that lets structured data be used with Spark.
  • A Key-Value RDD holds key-value pairs
  • Because it holds (Key, Value) pairs, it is also called a Pairs RDD
  • Can be treated like a simple database.

Single-Value RDD vs Key-Value RDD

  • Single-Value RDD: counting the words in a text (date) -> one-dimensional computation
  • Key-Value RDD: the average rating per Netflix drama (date, passenger count) -> higher-dimensional computation

Key-Value RDD concepts

  • Holds key-value pairs
    • e.g. How many taxi trips per location ID?
      • Key: location ID
      • Value: trip count
    • Another example: collecting ratings per drama and computing the average
    • Another example: computing the average rating per product on an e-commerce site
  • The code does not look very different
pairs = rdd.map(lambda x: (x, 1))

"""
[
  location
  location
]

[
  (location, 1)
  (location, 1)
]
"""

Key-Value RDD - Reduction

  • reduceByKey() - process tasks grouped by key
  • groupByKey() - group the values by key
  • sortByKey() - sort by key
  • keys() - extract the keys
  • values() - extract the values
pairs = rdd.map(lambda x: (x, 1))
count = pairs.reduceByKey(lambda a, b: a + b)

"""
jjajangmyeon
jjajangmyeon
jjamppong
jjamppong

(jjajangmyeon, 1)
(jjajangmyeon, 1)
(jjamppong, 1)
(jjamppong, 1)

(jjajangmyeon, 2)
(jjamppong, 2)
"""

Key-Value RDD - Join

  • join
  • rightOuterJoin
  • leftOuterJoin
  • subtractByKey

Key-Value RDD - Mapping values

  • When the key does not change, prefer mapValues(), which touches only the value, over map()
    • Spark can keep the partitioning intact internally, which is more efficient.
  • mapValues() and flatMapValues() both operate on values only; the RDD's keys are preserved

Key-Value RDD - example

1-spark/category-review-average.ipynb

RDD Transformations vs Actions

  • Transformations
    • Return a new RDD as the result
    • Deferred - Lazy Execution
    • map()
    • flatMap()
    • filter()
    • distinct()
    • reduceByKey()
    • groupByKey()
    • mapValues()
    • flatMapValues()
    • sortByKey()
  • Actions
    • Compute a result and output or store it (return a Python object)
    • Run immediately - Eager Execution
    • collect()
    • count()
    • countByValue()
    • take()
    • top()
    • reduce()
    • fold()
    • foreach()
1-spark/rdd_transformations_actions.ipynb

Transformations

  • transformations = Narrow + Wide

Narrow Transformations

  • 1:1 transformations
  • filter(), map(), flatMap(), sample(), union()
  • Operating on one row needs no data from other rows or partitions.

Wide Transformations

  • Shuffling
  • intersection, join, distinct, cartesian, reduceByKey(), groupByKey()
  • A partition of the output RDD may receive data from other partitions
  • Demands a lot of resources, so it should be minimized and optimized.

Lazy ์—ฐ์‚ฐ์˜ ์žฅ์ 

  • ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ตœ๋Œ€ํ•œ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. (๋””์Šคํฌ, ๋„คํŠธ์›Œํฌ ์—ฐ์‚ฐ์„ ์ตœ์†Œํ™” ํ•  ์ˆ˜ ์žˆ๋‹ค.)
  • ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ๋Š” task๋Š” ๋ฐ˜๋ณต๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์•„์„œ(ex ๋จธ์‹ ๋Ÿฌ๋‹ ํ•™์Šต), Lazy๋กœ ์ฒ˜๋ฆฌํ•˜๋ฉด ๋น„ํšจ์œจ์ ์ธ๋ถ€๋ถ„์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค.
    • Task -> Disk -> Task -> Disk ๋กœ ์ž‘์—…์„ ํ•˜๋ฉด Disk์— ์ž์ฃผ๋“ค๋ฅด๊ฒŒ ๋˜์–ด์„œ ๋น„ํšจ์œจ์ ์ด๋‹ค.
    • Task -> Task ๋กœ ๋„˜์–ด๊ฐˆ ๋•Œ in-memory๋กœ ์ฃผ๊ณ ๋ฐ›์œผ๋ฉด ํšจ์œจ์ ์ด๋‹ค.
    • in-memory๋กœ ์ฃผ๊ณ  ๋ฐ›์œผ๋ ค๋ฉด ์–ด๋–ค ๋ฐ์ดํ„ฐ๋ฅผ ๋ฉ”๋ชจ๋ฆฌ์— ๋‚จ๊ฒจ์•ผ ํ•  ์ง€ ์•Œ์•„์•ผ ๊ฐ€๋Šฅํ•˜๋‹ค.
    • Transformations๋Š” ์ง€์—ฐ ์‹คํ–‰ ๋˜๊ธฐ ๋•Œ๋ฌธ์— ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅํ•ด๋‘˜ ์ˆ˜ ์žˆ๋‹ค.

Storage Level

  • MEMORY_ONLY: stored in memory only
  • MEMORY_AND_DISK: stored in memory and on disk; if not in memory, fall back to disk
  • MEMORY_ONLY_SER: serialized to save memory (adds a deserialize step on read)
  • MEMORY_AND_DISK_SER: serialized in memory and on disk
  • DISK_ONLY: disk only

Cache & Persist

  • Functions for keeping data in memory
categoryReviews = filtered_lines.map(parse)

result1 = categoryReviews.take(10)
result2 = categoryReviews.mapValues(lambda x: (x, 1)).collect()

# categoryReviews is computed twice, once for result1 and once for result2.
# Adding .persist() keeps it in memory for reuse:
# categoryReviews = filtered_lines.map(parse).cache()
  • Cache
    • Uses the default storage level
    • RDD: MEMORY_ONLY
    • DF: MEMORY_AND_DISK
  • Persist
    • The storage level can be set to whatever the user wants

Master Worker Topology

  • Spark is organized in a master-worker topology.
  • Things to remember while using Spark:
    • The data is always distributed across multiple places
    • Even the same operation runs across multiple nodes

How Spark executes

(image)

  1. The Driver Program creates a SparkContext, which makes the application.
  2. The SparkContext connects to the Cluster Manager.
  3. The Cluster Manager allocates resources.
  4. The Cluster Manager gathers the Executors on the cluster's nodes.
  5. The Executors perform the computation and store data.
  6. The SparkContext sends the Executors the Tasks to run, and then
  7. The executed Tasks produce results, which are sent back to the Driver Program.
RDD.foreach(lambda x: print(x))
"""
Running the code above in the Driver Program prints nothing.
foreach is an action, so it runs directly on the Executors, not on the Driver.
"""
foods = sc.parallelize(["jjajangmyeon", "malatang", ...])
three = foods.take(3)

"""
The result three is stored in the Driver Program.
In general, an action includes the Driver Program receiving the data back from the Worker Nodes.
In effect, the Driver tells the Executors to run the take operation and asks for the result back.
"""

Reduction Operations

  • Reduction: gathering neighboring elements and combining them into one result; many Spark operations are reductions.
  • Most actions are reductions.
  • Some actions are not reductions, e.g. saving to a file or collect().

Parallel Reduction

  • Distributed parallel processing is possible only when each partition can be processed independently.
  • If a partition depends on another partition's result, one task must wait for the previous task; work can no longer run concurrently, parallelism is lost, and distribution becomes pointless.

Reduction Actions

  • Representative reduction actions: reduce, fold, groupBy, aggregate

Reduce

from operator import add
sc.parallelize([1,2,3,4,5]).reduce(add)
# 15

Partition

  • It is hard for the programmer to know exactly how the partitions will be split.
  • To guarantee the same result regardless of evaluation order, the operation must satisfy:
    • the commutative law (a*b = b*a)
    • the associative law ((a*b)*c = a*(b*c))
# Otherwise the result depends on the partitioning.
# Think separately about the per-partition computation and the merge step.

>>> sc.parallelize([1,2,3,4]).reduce(lambda x,y: (x*2)+y) # no partition count given
26
>>> sc.parallelize([1,2,3,4],1).reduce(lambda x,y: (x*2)+y) # 1 partition
26
>>> sc.parallelize([1,2,3,4],2).reduce(lambda x,y: (x*2)+y) # 2 partitions
18
>>> sc.parallelize([1,2,3,4],3).reduce(lambda x,y: (x*2)+y) # 3 partitions
18
>>> sc.parallelize([1,2,3,4],4).reduce(lambda x,y: (x*2)+y) # 4 partitions
26

"""
(1,2,3,4) -> ((1*2+2)*2+3)*2+4 = 26          # 1 partition
(1,2)(3,4) -> ((1*2+2)*2 + (3*2)+4) = 18     # 2 partitions
"""

Fold

  • Similar to reduce, except that fold takes a start value.
from operator import add
sc.parallelize([1,2,3,4,5]).fold(0, add)
# 15

Fold & Partition

rdd = sc.parallelize([2,3,4], 4)
rdd.reduce(lambda x, y: x*y) # 24
rdd.fold(1, lambda x, y: x*y) # 24

rdd.reduce(lambda x, y: x+y) # 9 (2+3+4 = 9)
rdd.fold(1, lambda x, y: x+y) # 14: (1+1) + (1+2) + (1+3) + (1+4) = 14, each partition starts from 1

GroupBy

rdd = sc.parallelize([1,1,2,3,5,8])
result = rdd.groupBy(lambda x: x % 2).collect()
sorted([(x, sorted(y)) for (x, y) in result])
# [(0, [2, 8]), (1, [1, 1, 3, 5])]

Aggregate

  • Used when the RDD's element type differs from the action's result type
  • Per-partition results go through a final merge step
  • RDD.aggregate(zeroValue, seqOp, combOp)
    • zeroValue: the initial value accumulated in each partition
    • seqOp: the function that changes the type while accumulating
    • combOp: the merging function
  • A frequently used reduction action
  • Most data work goes from a large, complex data type -> refined data
seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))

sc.parallelize([1,2,3,4]).aggregate((0,0), seqOp, combOp) # (10,4)
sc.parallelize([]).aggregate((0,0), seqOp, combOp) # (0,0)

(image)

Key-Value RDD Transformations & Actions

  • Transformations
    • groupByKey
    • reduceByKey
    • mapValues
    • keys
    • join (+leftOuterJoin, rightOuterJoin)
  • Actions
    • countByKey
  • Why Key-Value RDDs have so many transformations: the intermediate results can be very large, even when partitioning is not preserved, so they stay as RDDs rather than coming back to the driver

Key-Value RDD - GroupByKey

  • groupBy: group by a function
rdd = sc.parallelize([1,1,2,3,5,8])
result = rdd.groupBy(lambda x: x % 2).collect()
sorted([(x, sorted(y)) for (x, y) in result])
# [(0, [2, 8]), (1, [1, 1, 3, 5])]
  • groupByKey: group by key
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.groupByKey().mapValues(len).collect())
# [('a', 2), ('b', 1)]

sorted(rdd.groupByKey().mapValues(list).collect())
# [('a', [1, 1]), ('b', [1])]

Key-Value RDD - ReduceByKey

  • reduce: combine elements with a function (action)
sc.parallelize([1,2,3,4,5]).reduce(add)
# 15
  • reduceByKey: group by key and combine (transformation)
rdd = sc.parallelize([("a",1), ("b",1), ("a",1)])
sorted(rdd.reduceByKey(add).collect())
# [('a', 2), ('b', 1)]
  • Conceptually groupByKey + reduction
  • But much faster than groupByKey

Key-Value RDD - mapValues

  • Applies a function to the values only
  • Leaves partitions and keys as they are (not moving partitions and keys around reduces network cost)
x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
def f(x): return len(x)
x.mapValues(f).collect()
# [('a', 3), ('b', 1)]

Key-Value RDD - countByKey

  • Counts the elements held by each key
rdd = sc.parallelize([("a",1), ("b",1), ("a",1)])
sorted(rdd.countByKey().items())
# [('a', 2), ('b', 1)]

Key-Value RDD - keys()

  • A transformation
  • Creates an RDD of all the keys
m = sc.parallelize([(1,2), (3,4)]).keys()
m.collect()
# [1, 3]

Key-Value RDD - Joins

  • Transformations
  • Used to combine multiple RDDs
  • Two main join styles:
    • Inner join (join)
    • Outer join (leftOuterJoin, rightOuterJoin)
rdd1 = sc.parallelize([("foo",1), ("bar",2), ("baz",3)])
rdd2 = sc.parallelize([("foo",4), ("bar",5), ("bar",6), ("zoo",1)])

rdd1.join(rdd2).collect()
# [('bar', (2, 5)), ('bar', (2, 6)), ('foo', (1, 4))]

rdd1.leftOuterJoin(rdd2).collect()
# [('baz', (3, None)), ('bar', (2, 5)), ('bar', (2, 6)), ('foo', (1, 4))]

rdd1.rightOuterJoin(rdd2).collect()
# [('bar', (2, 5)), ('bar', (2, 6)), ('zoo', (None, 1)), ('foo', (1, 4))]

Shuffling

  • Happens when grouping moves data from one node to another
  • Degrades performance (a lot)
  • Also occurs with groupByKey.
  • Nodes exchange data with each other, so the network cost is high
  • Operations that can trigger a shuffle:
    • join, leftOuterJoin, rightOuterJoin
    • groupByKey
    • reduceByKey
    • combineByKey
    • distinct
    • intersection
    • repartition
    • coalesce
  • When a shuffle happens:
    • when an output RDD references other elements of the source RDD, or another RDD

Optimizing with a Partitioner (minimizing shuffles)

  • Pre-partition, cache, then run reduceByKey
  • Pre-partition, cache, then run join
  • Both combine partitioning and caching so that as much computation as possible runs locally
  • Minimizing shuffles can yield a 10x performance improvement.

Example: groupByKey vs reduceByKey

# reduceByKey
(textRDD
 .flatMap(lambda line: line.split())    # runs on the same node
 .map(lambda word: (word, 1))           # runs on the same node
 .reduceByKey(lambda a, b: a + b))      # shuffle happens

# groupByKey
(textRDD
 .flatMap(lambda line: line.split())
 .map(lambda word: (word, 1))
 .groupByKey()                          # shuffle happens
 .map(lambda wc: (wc[0], sum(wc[1])))]  # (word, counts) -> (word, total)

# Whenever possible, replace groupByKey with reduceByKey.

Partition์˜ ๋ชฉ์ 

  • ๋ฐ์ดํ„ฐ๋ฅผ ์ตœ๋Œ€ํ•œ ๊ท ์ผํ•˜๊ฒŒ ํผํŠธ๋ฆฌ๊ณ , ์ฟผ๋ฆฌ๊ฐ€ ๊ฐ™์ด ๋˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ตœ๋Œ€ํ•œ ์˜†์— ๋‘์–ด ๊ฒ€์ƒ‰ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ

Partition ํŠน์ง•

  • RDD๋Š” ์ชผ๊ฐœ์ ธ์„œ ์—ฌ๋Ÿฌ ํŒŒํ‹ฐ์…˜์— ์ €์žฅ๋จ
  • ํ•˜๋‚˜์˜ ํŒŒํ‹ฐ์…˜์€ ํ•˜๋‚˜์˜ ๋…ธ๋“œ(์„œ๋ฒ„)์—
  • ํ•˜๋‚˜์˜ ๋…ธ๋“œ๋Š” ์—ฌ๋Ÿฌ๊ฐœ์˜ ํŒŒํ‹ฐ์…˜์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์Œ
  • ํŒŒํ‹ฐ์…˜์˜ ํฌ๊ธฐ์™€ ๋ฐฐ์น˜๋Š” ์ž์œ ๋กญ๊ฒŒ ์„ค์ • ๊ฐ€๋Šฅํ•˜๋ฉฐ ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์นจ
  • Key-Value RDD๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ๋งŒ ์˜๋ฏธ๊ฐ€ ์žˆ๋‹ค.
  • ์ŠคํŒŒํฌ์˜ ํŒŒํ‹ฐ์…”๋‹ == ์ผ๋ฐ˜ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ์ž๋ฃŒ๊ตฌ์กฐ๋ฅผ ์„ ํƒํ•˜๋Š” ๊ฒƒ

Partition์˜ ์ข…๋ฅ˜

  • Hash Partitioning
  • Range Partitioning

Hash Partitioning

  • Distributes data evenly across partitions
  • Distributes in a dictionary-like fashion
  • Misuse
    • The scheme is meant to spread data evenly, but:
    • [extreme example] with 2 partitions, a dataset whose keys are all even, and a hash function of (x % 2), only one partition is ever used.

Range Partitioning

  • Ordered, sorted partitioning
  • Sorts by key order
  • Sorts by the order of the key set
  • If the service's query pattern is mostly by date, consider daily range partitions

Memory & Disk Partitioning

  • On disk: partitionBy() (this is what is usually used)
  • In memory: repartition(), coalesce()

Disk Partitioning

  • partitionBy(): creates an RDD with the user-specified partitioning
  • After creating the partitions, call persist()
    • Otherwise the partitioning is recomputed every time the next operation needs it (the shuffle repeats)
pairs = sc.parallelize([1,2,3,4,2,4,1]).map(lambda x: (x, x))
pairs.collect()
# [(1,1), (2,2), (3,3), (4,4), (2,2), (4,4), (1,1)]

pairs.partitionBy(2).glom().collect()
# [[(2,2), (4,4), (2,2), (4,4)], [(1,1), (3,3), (1,1)]]

pairs.partitionBy(2, lambda x: x % 2).glom().collect()
# [[(2,2), (4,4), (2,2), (4,4)], [(1,1), (3,3), (1,1)]]

# glom() shows the partition layout along with the data

Repartition & Coalesce

  • Both adjust the number of partitions
  • Both involve shuffling, so they are very expensive
  • repartition: used to shrink or grow the number of partitions
  • coalesce: used to shrink it (when shrinking, it performs better than repartition)

Operations that create partitions mid-computation

  • join (+ outer joins)
  • groupByKey
  • reduceByKey
  • foldByKey
  • partitionBy
  • sort
  • mapValues (parent)
  • flatMapValues (parent)
  • filter (parent)
  • etc.
  • mapValues, flatMapValues, and filter reuse the parent RDD's partitioning if one is defined

map vs mapValues

  • Why don't map and flatMap create partitions? => they may change the key, which can make the existing partitioning meaningless
  • So when partitioning is well defined, prefer mapValues and flatMapValues.

8. Spark SQL

Comparing join().filter() vs filter().join(), filter().join() is of course faster.
It would be nice if Spark handled this kind of concern automatically — how is that possible?
If the data is structured, the optimization can be done automatically.

Structured Data vs Unstructured Data

  • Unstructured: free form
    • log files
    • images
  • Semi-structured: rows and columns
    • CSV
    • JSON
    • XML
  • Structured: rows and columns + data types (a schema)
    • databases

Structured Data vs RDDs

  • With RDDs:
    • Spark does not know the data's structure, so handling the data is left to the developer.
    • User-defined functions are run through map, flatMap, filter, etc.
  • With structured data:
    • The structure is already known, so you only need to declare which task to perform
    • Optimization can also be done automatically

Spark SQL

  • Lets you work with structured data.
  • Tasks can be performed without the user defining functions one by one.
  • Computations are optimized automatically

Goals of Spark SQL

  • Relational processing inside Spark programs
  • Automatic optimization using schema information
  • Easy access to external datasets

Spark SQL overview

  • A package built on top of Spark
  • 3 main APIs
    • SQL
    • DataFrames
    • Datasets
  • 2 backend components
    • Catalyst - the query optimization engine
    • Tungsten - the serializer (storage optimization)

DataFrame

  • Spark Core has the RDD; Spark SQL has the DataFrame.
  • A DataFrame can be thought of as a tabular dataset
  • Conceptually, an RDD with a schema applied

SparkSession

  • Spark Core์— SparkContext๊ฐ€ ์žˆ๋‹ค๋ฉด Spark SQL์—๋Š” SparkSession์ด ์žˆ๋‹ค.
spark = SparkSession.builder.appName("test-app").getOrCreate()

DataFrame ๋งŒ๋“œ๋Š” ๋ฒ•

  • RDD์—์„œ ์Šคํ‚ค๋งˆ๋ฅผ ์ •์˜ํ•œ๋‹ค์Œ ๋ณ€ํ˜•์„ ํ•˜๊ฑฐ๋‚˜
  • CSV, JSON๋“ฑ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์•„์˜ค๋ฉด ๋œ๋‹ค

RDD๋กœ๋ถ€ํ„ฐ DataFrame ๋งŒ๋“ค๊ธฐ

  • Schema๋ฅผ ์ž๋™์œผ๋กœ ์œ ์ถ”ํ•ด์„œ DataFrame ๋งŒ๋“ค๊ธฐ
  • Schema๋ฅผ ์‚ฌ์šฉ์ž๊ฐ€ ์ •์˜ํ•˜๊ธฐ
# RDD ๋งŒ๋“ค๊ธฐ
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

lines = sc.textFile("example.csv")
data = lines.map(lambda x: x.split(","))
preprocessed = data.map(lambda x: Row(name=x[0], price=int(x[1])))

# Infer (Schema๋ฅผ ์œ ์ถ”ํ•ด์„œ ๋งŒ๋“ค๊ธฐ)
df = spark.createDataFrame(preprocessed)

# Specify (Schema๋ฅผ ์‚ฌ์šฉ์ž๊ฐ€ ์ •์˜ํ•˜๊ธฐ)
schema = StructType([
  StructField("name", StringType(), True),
  StructField("price", IntegerType(), True)
])
spark.createDataFrame(preprocessed, schema).show()

ํŒŒ์ผ๋กœ๋ถ€ํ„ฐ DataFrame ๋งŒ๋“ค๊ธฐ

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test-app").getOrCreate()

# JSON
dataframe = spark.read.json('dataset/nyt2.json')
# TXT FILE
dataframe_txt = spark.read.text('text_data.txt')
# CSV FILE
dataframe_csv = spark.read.csv('csv_data.csv')
# PARQUET FILE
dataframe_parquet = spark.read.load('parquet.data.parquet')

DataFrame์„ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค ํ…Œ์ด๋ธ”์ฒ˜๋Ÿผ ์‚ฌ์šฉํ•˜๊ธฐ

  • createOrReplaceTempView() ํ•จ์ˆ˜๋กœ temporary view๋ฅผ ๋งŒ๋“ค์–ด ์ค˜์•ผ ํ•จ.
data.createOrReplaceTempView("mobility_data")
spark.sql("SELECT pickup_datetime FROM mobility_data LIMIT 5").show()

Spark์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” SQL๋ฌธ

  • Hive Query Language์™€ ๊ฑฐ์˜ ๋™์ผ
  • Select
  • From
  • Where
  • Count
  • Having
  • Group By
  • Order By
  • Sort By
  • Distinct
  • Join

Python์—์„œ Spark SQL ์‚ฌ์šฉํ•˜๊ธฐ

  • Spark SQL์„ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•˜๋Š” SparkSession
  • SparkSession ์œผ๋กœ ๋ถˆ๋Ÿฌ์˜ค๋Š” ๋ฐ์ดํ„ฐ๋Š” DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test-app").getOrCreate()

# JSON
dataframe = spark.read.json('dataset/nyt2.json')
# TXT FILE
dataframe_txt = spark.read.text('text_data.txt')
# CSV FILE
dataframe_csv = spark.read.csv('csv_data.csv')
# PARQUET FILE
dataframe_parquet = spark.read.load('parquet.data.parquet')
  • SQL๋ฌธ์„ ์‚ฌ์šฉํ•ด์„œ ์ฟผ๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.
data.createOrReplaceTempView("mobility_data")
spark.sql("SELECT pickup_datetime FROM mobility_data LIMIT 5").show()
  • ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์ฟผ๋ฆฌ๋„ ๊ฐ€๋Šฅํ•˜๋‹ค.
df.select(df['name'], df['age'] + 1).show()

df.filter(df['age'] > 21).show()

df.groupBy("age").count().show()
  • DataFrame์„ RDD๋กœ ๋ณ€ํ™˜ํ•ด ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ๋‹ค.
    • rdd = df.rdd.map(tuple)
    • (ํ•˜์ง€๋งŒ, RDD๋ฅผ ๋œ ์‚ฌ์šฉํ•˜๋Š” ์ชฝ์ด ์ข‹๋‹ค)

RDD๋ฅผ ์‚ฌ์šฉ์•ˆํ•˜๊ณ  DataFrame์„ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ์˜ ์žฅ์ 

  • MLLib์ด๋‚˜ Spark Streaming ๊ฐ™์€ ๋‹ค๋ฅธ ์ŠคํŒŒํฌ ๋ชจ๋“ˆ๋“ค๊ณผ ์‚ฌ์šฉํ•˜๊ธฐ ํŽธํ•˜๋‹ค.
  • ๊ฐœ๋ฐœํ•˜๊ธฐ ํŽธํ•˜๋‹ค.
  • ์ตœ์ ํ™”๋„ ์•Œ์•„์„œ ๋œ๋‹ค.

Datasets

  • Type์ด ์žˆ๋Š” DataFrame
  • PySpark์—์„  ํฌ๊ฒŒ ์‹ ๊ฒฝ์“ฐ์ง€ ์•Š์•„๋„ ๋œ๋‹ค.

SQL ์‹ค์Šต

./1-spark/learn-sql.ipynb

DataFrame ํŠน์ง•

  • ๊ด€๊ณ„ํ˜• ๋ฐ์ดํ„ฐ์ด๋‹ค.
  • ํ•œ๋งˆ๋””๋กœ ๊ด€๊ณ„ํ˜• ๋ฐ์ดํ„ฐ์…‹ = RDD + Relation
  • RDD๊ฐ€ ํ•จ์ˆ˜ํ˜• API๋ฅผ ๊ฐ€์กŒ๋‹ค๋ฉด DataFrame์€ ์„ ์–ธํ˜• API
  • ์ž๋™์œผ๋กœ ์ตœ์ ํ™”๊ฐ€ ๊ฐ€๋Šฅ
  • ํƒ€์ž…์ด ์—†๋‹ค
  • RDD์˜ ํ™•์žฅํŒ
    • ์ง€์—ฐ ์‹คํ–‰ (Lazy Execution)
    • ๋ถ„์‚ฐ ์ €์žฅ
    • Immutable
    • ํ–‰(Row) ๊ฐ์ฒด๊ฐ€ ์žˆ๋‹ค
    • SQL ์ฟผ๋ฆฌ๋ฅผ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.
    • ์Šคํ‚ค๋งˆ๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๊ณ  ์ด๋ฅผ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ๋”์šฑ ์ตœ์ ํ™” ํ•  ์ˆ˜ ์žˆ๋‹ค
    • CSV, JSON, Hive ๋“ฑ์œผ๋กœ ์ฝ์–ด์˜ค๊ฑฐ๋‚˜ ๋ณ€ํ™˜์ด ๊ฐ€๋Šฅ

DataFrame์˜ ์Šคํ‚ค๋งˆ๋ฅผ ํ™•์ธํ•˜๋Š” ๋ฒ•

  • dtypes
  • show()
    • ํ…Œ์ด๋ธ” ํ˜•ํƒœ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ถœ๋ ฅ
    • ์ฒซ 20๊ฐœ์˜ ์—ด๋งŒ ๋ณด์—ฌ์ค€๋‹ค
  • printSchema()
    • ์Šคํ‚ค๋งˆ๋ฅผ ํŠธ๋ฆฌ ํ˜•ํƒœ๋กœ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

DataFrame Operations

  • SQL ๊ณผ ๋น„์Šทํ•œ ์ž‘์—…์ด ๊ฐ€๋Šฅํ•˜๋‹ค.
  • Select
  • Where
  • Limit
  • OrderBy
  • GroupBy
  • Join

DataFrame Select

  • ์‚ฌ์šฉ์ž๊ฐ€ ์›ํ•˜๋Š” Column์ด๋‚˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœ ํ•˜๋Š”๋ฐ ์‚ฌ์šฉ
df.select('*').collect()
df.select('name','age').collect()
df.select(df.name, (df.age+10).alias('age')).collect()

DataFrame Agg

  • Aggregate์˜ ์•ฝ์ž๋กœ, ๊ทธ๋ฃนํ•‘ ํ›„ ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜๋กœ ํ•ฉ์น˜๋Š” ์ž‘์—…
df.agg({"age": "max"}).collect() 
# [Row(max(age)=5)]

from pyspark.sql import functions as F
df.agg(F.min(df.age)).collect()
# [Row(min(age)=2)]

DataFrame GroupBy

  • ์‚ฌ์šฉ์ž๊ฐ€ ์ง€์ •ํ•œ column์„ ๊ธฐ์ค€์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ Groupingํ•˜๋Š” ์ž‘์—…
df.groupBy().avg().collect()
# [Row(avg(age)=3.5)]

sorted(df.groupBy('name').agg({'age': 'mean'}).collect())
# [Row(name='Alice', avg(age)=2.0), Row(name='Bob', avg(age)=5.0)]

sorted(df.groupBy(df.name).avg().collect())
# [Row(name='Alice', avg(age)=2.0), Row(name='Bob', avg(age)=5.0)]

sorted(df.groupBy(['name', df.age]).count().collect())
# [Row(name='Alice', age=2, count=1), Row(name='Bob', age=5, count=1)]

DataFrame Join

  • ๋‹ค๋ฅธ DataFrame๊ณผ ์‚ฌ์šฉ์ž๊ฐ€ ์ง€์ •ํ•œ Column์„ ๊ธฐ์ค€์œผ๋กœ ํ•ฉ์น˜๋Š” ์ž‘์—…
df.join(df2, 'name').select(df.name, df2.height).collect()
# [Row(name='Bob', height=85)]

Spark SQL๋กœ ํŠธ๋ฆฝ ์ˆ˜ ์„ธ๊ธฐ

  • ์ด์ „์— RDD๋กœ ์‹ค์Šตํ•ด๋ณด์•˜๋Š”๋ฐ, ์ด๋ฒˆ์—” Spark SQL๋กœ ํ•ด๋ณด์ž.
./1-spark/trip_count_sql.ipynb

Spark SQL๋กœ ๋‰ด์š•์˜ ๊ฐ ํ–‰์ •๊ตฌ ๋ณ„ ๋ฐ์ดํ„ฐ ์ถ”์ถœํ•˜๊ธฐ

  • ๋‘ ํ…Œ์ด๋ธ”์˜ JOIN ์‹ค์Šต
./1-spark/trip_count_sql_by_zone-Copy1.ipynb

Spark์˜ ๋‘๊ฐœ์˜ ์—”์ง„

  • ์ŠคํŒŒํฌ๋Š” ์ฟผ๋ฆฌ๋ฅผ ๋Œ๋ฆฌ๊ธฐ ์œ„ํ•ด ๋‘๊ฐ€์ง€ ์—”์ง„์„ ์‚ฌ์šฉํ•œ๋‹ค.
  • Catalyst, Tungsten

Logical Plan์ด๋ž€

  • ์ˆ˜ํ–‰ ํ•ด์•ผ ํ•˜๋Š” ๋ชจ๋“  transformation ๋‹จ๊ณ„์— ๋Œ€ํ•œ ์ถ”์ƒํ™”
  • ๋ฐ์ดํ„ฐ๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ณ€ํ•ด์•ผ ํ•˜๋Š”์ง€ ์ •์˜ํ•˜์ง€๋งŒ,
  • ์‹ค์ œ ์–ด๋””์„œ ์–ด๋–ป๊ฒŒ ๋™์ž‘ ํ•˜๋Š”์ง€๋Š” ์ •์˜ํ•˜์ง€ ์•Š์Œ

Physical Plan์ด๋ž€

  • Logical Plan์ด ์–ด๋–ป๊ฒŒ ํด๋Ÿฌ์Šคํ„ฐ ์œ„์—์„œ ์‹คํ–‰ ๋ ์ง€ ์ •์˜
  • ์‹คํ–‰ ์ „๋žต์„ ๋งŒ๋“ค๊ณ  Cost Model์— ๋”ฐ๋ผ ์ตœ์ ํ™”

Catalyst ๋ž€

  • SQL๊ณผ DataFrame์ด ๊ตฌ์กฐ๊ฐ€ ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋Š” ๋ชจ๋“ˆ
  • Logical Plan์„ Physical Plan์œผ๋กœ ๋ฐ”๊พธ๋Š” ์ผ์„ ํ•œ๋‹ค.

Catalyst Logical Plan -> Physical Plan ๋™์ž‘ ์ˆœ์„œ

  1. ๋ถ„์„: DataFrame ๊ฐ์ฒด์˜ relation์„ ๊ณ„์‚ฐ, ์นผ๋Ÿผ์˜ ํƒ€์ž…๊ณผ ์ด๋ฆ„ ํ™•์ธ
  2. Logical Plan ์ตœ์ ํ™”
    1. ์ƒ์ˆ˜๋กœ ํ‘œํ˜„๋œ ํ‘œํ˜„์‹์„ Compile Time์— ๊ณ„์‚ฐ (x runtime)
    2. Predicate Pushdown: join & filter -> filter & join
    3. Projection Pruning: ์—ฐ์‚ฐ์— ํ•„์š”ํ•œ ์นผ๋Ÿผ๋งŒ ๊ฐ€์ ธ์˜ค๊ธฐ
  3. Physical Plan ๋งŒ๋“ค๊ธฐ: Spark์—์„œ ์‹คํ–‰ ๊ฐ€๋Šฅํ•œ Plan์œผ๋กœ ๋ณ€ํ™˜
  4. ์ฝ”๋“œ ์ œ๋„ค๋ ˆ์ด์…˜: ์ตœ์ ํ™”๋œ Physical Plan์„ Java Bytecode๋กœ

Catalyst Pipeline

image

Logical Planning ์ตœ์ ํ™”

SELECT zone_data.Zone, count(*) AS trips \
  FROM trip_data JOIN zone_data \
  ON trip_data.PULocationID = zone_data.LocationID \
  WHERE trip_data.hvfhs_license_num = 'HV0003' \
  GROUP BY zone_data.Zone order by trips desc

๊ธฐ๋ณธ ์ˆœ์„œ

  1. Scan: ๋‘๊ฐœ์˜ ํ…Œ์ด๋ธ”์—์„œ ๋ฐ์ดํ„ฐ ์ถ”์ถœ
  2. Join: join
  3. Filter: trip_data.hvfhs_license_num = 'HV0003'
  4. Project: count(*) AS trips
  5. Aggregate: group by

์ตœ์ ํ™”

  1. Scan: ๋‘๊ฐœ์˜ ํ…Œ์ด๋ธ”์—์„œ ๋ฐ์ดํ„ฐ ์ถ”์ถœ
  2. Filter: trip_data.hvfhs_license_num = 'HV0003'
  3. Join: join
  4. Project: count(*) AS trips
  5. Aggregate: group by

Explain

spark.sql(query).explain(True)

image

  • explain(True) ๋ช…๋ น์–ด๋ฅผ ์ž…๋ ฅํ•˜๋ฉด ์•„๋ž˜์˜ ์ •๋ณด๋“ค์„ ๋ณด์—ฌ์ค€๋‹ค
    • Parsed Logical Plan: ์‚ฌ์šฉ์ž๊ฐ€ ์“ด ์ฝ”๋“œ ๊ทธ๋Œ€๋กœ
    • Analyzed Logical Plan: ์‚ฌ์šฉ์ž๊ฐ€ ์ง€์ •ํ•œ ํ…Œ์ด๋ธ”์˜ ๋ฌด์Šจ ์ปฌ๋Ÿผ์ด ์žˆ๋Š”์ง€ ํ™•์ธํ•œ๋‹ค.
    • Optimized Logical Plan: Filtering์ฝ”๋“œ๋ฅผ ๋” ๋นจ๋ฆฌ ํ•˜๋Š” ๋“ฑ ์ตœ์ ํ™”๋œ ์ฝ”๋“œ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค
    • Physical Plan: ๋””ํ…Œ์ผํ•œ Plan์„ ๋ณด์—ฌ์คŒ
  • explain(True ์—†์ด) ๋ช…๋ น์–ด๋ฅผ ์ž…๋ ฅํ•˜๋ฉด ์•„๋ž˜ ์ •๋ณด๋งŒ ๋‚˜์˜จ๋‹ค.
    • Physical Plan

Tungsten

  • Physical Plan์ด ์„ ํƒ๋˜๊ณ  ๋‚˜๋ฉด ๋ถ„์‚ฐ ํ™˜๊ฒฝ์—์„œ ์‹คํ–‰๋  Bytecode๊ฐ€ ๋งŒ๋“ค์–ด์ง„๋‹ค. (Code Generation)
  • ์ŠคํŒŒํฌ ์—”์ง„์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ๋ชฉ์ 
    • ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ์ตœ์ ํ™”
    • ์บ์‹œ ํ™œ์šฉ ์—ฐ์‚ฐ
    • ์ฝ”๋“œ ์ƒ์„ฑ

UDF

  • user-defined-functions
  • sql ๋ฌธ์•ˆ์—์„œ ์“ธ ์ˆ˜ ์žˆ๋Š” function์„ ๋งŒ๋“œ๋Š”๊ฒƒ

์‹ค์Šต

./1-spark/user-defined-functions.ipynb

๋‰ด์š• ํƒ์‹œ ๋ฐ์ดํ„ฐ ๋ถ„์„

./1-spark/taxi-analysis.ipynb

9. MLlib

MLlib์ด๋ž€

  • Machine Learning Library
  • ML์„ ์‰ฝ๊ณ  ํ™•์žฅ์„ฑ ์žˆ๊ฒŒ ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ
  • ๋จธ์‹ ๋Ÿฌ๋‹ ํŒŒ์ดํ”„๋ผ์ธ ๊ฐœ๋ฐœ์„ ์‰ฝ๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด

Machine Learning ์ด๋ž€

  • ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด ์ฝ”๋”ฉ์„ ํ•˜๋Š” ์ผ
  • ์ตœ์ ํ™”์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด ํŒจํ„ด์„ ์ฐพ๋Š”์ผ

MLlib์˜ ์—ฌ๋Ÿฌ ์ปดํฌ๋„ŒํŠธ

  • ์•Œ๊ณ ๋ฆฌ์ฆ˜
    • Classification
    • Regression
    • Clustering
    • Recommendation
  • ํŒŒ์ดํ”„๋ผ์ธ
    • Training
    • Evaluating
    • Tuning
    • Persistence
  • Feature Engineering
    • Extraction
    • Transformation
  • Utils
    • Linear algebra
    • Statistics

ML ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์„ฑ

  • ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ -> ์ „์ฒ˜๋ฆฌ -> ํ•™์Šต -> ๋ชจ๋ธ ํ‰๊ฐ€
  • ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ํ›„ ์œ„ ๊ณผ์ •์„ ๋‹ค์‹œ ์‹œ๋„

MLlib์œผ๋กœ ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ๋“ค

  • ํ”ผ์ณ ์—”์ง€๋‹ˆ์–ด๋ง
  • ํ†ต๊ณ„์  ์—ฐ์‚ฐ
  • ํ”ํžˆ ์“ฐ์ด๋Š” ML์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค
    • Regression (Linear, Logistic)
    • Support Vector Machines
    • Naive Bayes
    • Decision Tree
    • K-Means clustering
  • ์ถ”์ฒœ (Alternating Least Squares)

MLlib์€ DataFrame์œ„์—์„œ ๋™์ž‘ํ•œ๋‹ค.

  • ์•„์ง RDD API๊ฐ€ ์žˆ์ง€๋งŒ, "maintenance mode"
    • ์ƒˆ๋กœ์šด API๋Š” ๊ฐœ๋ฐœ์ด ๋Š๊น€
  • DataFrame์„ ์“ฐ๋Š” MLlib API๋ฅผ Spark ML์ด๋ผ๊ณ ๋„ ๋ถ€๋ฆ„

MLlib์˜ ์ฃผ์š” Components

  • DataFrame
  • Transformer
  • Estimator
  • Evaluator
  • Pipeline
  • Parameter

MLlib - Transformer

  • ํ”ผ์ณ ๋ณ€ํ™˜๊ณผ ํ•™์Šต๋œ ๋ชจ๋ธ์„ ์ถ”์ƒํ™”
  • ๋ชจ๋“  Transformer๋Š” transform() ํ•จ์ˆ˜๋ฅผ ๊ฐ–๊ณ  ์žˆ๋‹ค
  • ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•œ ํฌ๋ฉง์œผ๋กœ ๋ฐ”๊พผ๋‹ค
  • DF๋ฅผ ๋ฐ›์•„ ์ƒˆ๋กœ์šด DF๋ฅผ ๋งŒ๋“œ๋Š”๋ฐ, ๋ณดํ†ต ํ•˜๋‚˜ ์ด์ƒ์˜ column์„ ๋”ํ•˜๊ฒŒ ๋œ๋‹ค
  • ์˜ˆ)
    • Data Normalization
    • Tokenization
    • ์นดํ…Œ๊ณ ๋ฆฌ์ปฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆซ์ž๋กœ (one-hot encoding)

MLlib - Estimator

  • ๋ชจ๋ธ์˜ ํ•™์Šต ๊ณผ์ •์„ ์ถ”์ƒํ™”
  • ๋ชจ๋“  Estimator๋Š” fit() ํ•จ์ˆ˜๋ฅผ ๊ฐ–๊ณ  ์žˆ๋‹ค
  • fit()์€ DataFrame์„ ๋ฐ›์•„ Model์„ ๋ฐ˜ํ™˜
  • ๋ชจ๋ธ์„ ํ•˜๋‚˜์˜ Transformer
  • ์˜ˆ)
    • lr = LinearRegression()
    • model = lr.fit(data)

MLlib - Evaluator

  • metric์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€
    • ์˜ˆ) Root mean squared error (RMSE)
  • ๋ชจ๋ธ์„ ์—ฌ๋Ÿฌ๊ฐœ ๋งŒ๋“ค์–ด์„œ, ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ ํ›„ ๊ฐ€์žฅ ์ข‹์€ ๋ชจ๋ธ์„ ๋ฝ‘๋Š” ๋ฐฉ์‹์œผ๋กœ ๋ชจ๋ธ ํŠœ๋‹์„ ์ž๋™ํ™” ํ•  ์ˆ˜ ์žˆ๋‹ค.
  • ์˜ˆ)
    • BinaryClassificationEvaluator
    • CrossValidator (Evaluator๋ฅผ ์ด์šฉํ•ด ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜๊ณ  ํŠœ๋‹ํ•˜๋Š” ๋„๊ตฌ)

MLlib - Pipeline

  • ML์˜ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์ •์˜ํ•  ๋•Œ ์‚ฌ์šฉ
  • ์—ฌ๋Ÿฌ stage๋ฅผ ๋‹ด๊ณ  ์žˆ๋‹ค
  • ์ €์žฅ๋  ์ˆ˜ ์žˆ๋‹ค. (persist)
  • ํŒŒ์ดํ”„๋ผ์ธ ์˜ˆ: ๋ฐ์ดํ„ฐ๋กœ๋”ฉ -> ์ „์ฒ˜๋ฆฌ -> ํ•™์Šต -> ๋ชจ๋ธํ‰๊ฐ€
  • Transformer -> Transformer -> Estimator -> Evaluator -> Model

์ฒซ ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์ถ•

./1-spark/logistic-regression.ipynb
./1-spark/pipeline.ipynb

ALS ์ถ”์ฒœ ์•Œ๊ณ ๋ฆฌ์ฆ˜

  • Alternating Least Squares

image

  • ์œ ์ € A์™€ B์˜ ์ทจํ–ฅ์ด ๋น„์Šทํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.
  • ์ด๋•Œ, ์œ ์ € A์—๊ฒŒ Casablanca๋ฅผ ์ถ”์ฒœํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค.

์ถ”์ฒœ์ด๋ž€

  • ์•„์ง ๋ชป๋ณธ ์˜ํ™”๋“ค์˜ ํ‰์ ์„ ์˜ˆํ•˜๊ณ ,
  • ๊ฐ’์„ ์ •๋ ฌํ•ด์„œ ์ œ์ผ ์œ„์—์„œ ๋ถ€ํ„ฐ ์œ ์ €์—๊ฒŒ ์ „๋‹ฌํ•˜๋Š” ๊ฒƒ์ด ์ถ”์ฒœ์ด๋‹ค.

์˜ํ™” ์ถ”์ฒœ ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์ถ•

./1-spark/movie-recommendation.ipynb

Supervised Learning

  • ์ง€๋„ ํ•™์Šต
  • Regression, Classification ๋‘˜๋‹ค ์ง€๋„ํ•™์Šต์ด๋‹ค.
  • Regression: ์˜ˆ์ธก๋œ ๊ฐ’์ด ์‹ค์ˆ˜
  • Classification: ์˜ˆ์ธก๋œ ๊ฐ’์ด ํด๋ž˜์Šค(์นดํ…Œ๊ณ ๋ฆฌ)

ํƒ์‹œ๋น„ ์˜ˆ์ธกํ•˜๊ธฐ1

./1-spark/taxi-fare-prediction.ipynb

ํƒ์‹œ๋น„ ์˜ˆ์ธกํ•˜๊ธฐ2

./1-spark/taxi-fare-prediction-2.ipynb

ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”

./1-spark/taxi-fare-prediction-hyper.ipynb

๋ชจ๋ธ ์ €์žฅ & ๋กœ๋”ฉ

./1-spark/taxi-fare-prediction-hyper.ipynb

10. Spark Streaming

Spark Streaming์ด๋ž€

  • SQL ์—”์ง„ ์œ„์— ๋งŒ๋“ค์–ด์ง„ ๋ถ„์‚ฐ ์ŠคํŠธ๋ฆผ ์ฒ˜๋ฆฌ ํ”„๋กœ์„ธ์‹ฑ
  • ๋ฐ์ดํ„ฐ ์ŠคํŠธ๋ฆผ์„ ์ฒ˜๋ฆฌํ•  ๋•Œ ์‚ฌ์šฉ
  • ์‹œ๊ฐ„๋Œ€ ๋ณ„๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ํ•ฉ์ณ(aggregate) ๋ถ„์„ ํ•  ์ˆ˜ ์žˆ์Œ
  • Kafka, Amazon Kinesis, HDFS ๋“ฑ๊ณผ ์—ฐ๊ฒฐ ๊ฐ€๋Šฅ
  • ์ฒดํฌํฌ์ธํŠธ๋ฅผ ๋งŒ๋“ค์–ด์„œ ๋ถ€๋ถ„์ ์ธ ๊ฒฐํ•จ์ด ๋ฐœ์ƒํ•ด๋„ ๋‹ค์‹œ ๋Œ์•„๊ฐ€์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค.

Streaming Data๋ž€

  • ๋ฐ์ดํ„ฐ ์ŠคํŠธ๋ฆผ์€ ๋ฌดํ•œํ•œ ํ…Œ์ด๋ธ”์ด๋‹ค.
  • input Data Stream --SparkStreaming--> batches of input data --SparkEngine--> batches of processed data

Discretized Streams (DStreams)

  • Spark Stream์˜ ๊ธฐ๋ณธ์ ์ธ ์ถ”์ƒํ™”
  • ๋‚ด๋ถ€์ ์œผ๋ก  RDD์˜ ์—ฐ์†์ด๊ณ  RDD์˜ ์†์„ฑ์„ ์ด์–ด๋ฐ›์Œ

Window Operations

  • ์ง€๊ธˆ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ด์ „ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ํ•„์š”ํ•  ๋•Œ

Streaming Query: Source

  • ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋””์—์„œ ์ฝ์–ด์˜ฌ ์ง€ ๋ช…์‹œ
  • ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ์†Œ์Šค๋ฅผ ์‚ฌ์šฉํ•ด join()์ด๋‚˜ union()์œผ๋กœ ํ•ฉ์ณ ์“ธ ์ˆ˜ ์žˆ๋‹ค
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe","topic")
  .load()

Streaming Query: Transformation

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe","topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json("json", schema).alias("data"))

Streaming Query: Processing Details

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe","topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json("json", schema).alias("data"))
  .writeStream.format("parquet")
  .trigger(processingTime="1 minute") # <-- micro-batch ์‹คํ–‰ ๊ฐ„๊ฒฉ
  .option("checkpointLocation", "...")
  .start()

Transformations

  • Map
  • FlatMap
  • Filter
  • ReduceByKey

State ๊ด€๋ฆฌ

  • ์ด์ „ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ State๋กœ ์ฃผ๊ณ  ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค.
  • ์˜ˆ) ์นดํ…Œ๊ณ ๋ฆฌ๋ณ„ (ํ‚ค๊ฐ’ ๋ณ„) ์ดํ•ฉ
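
ํ‚ค๊ฐ’ ๋ณ„ ๋ˆ„์  ์ดํ•ฉ์ด๋ผ๋Š” ๊ฐœ๋…๋งŒ ๋–ผ์–ด๋‚ธ ์ˆœ์ˆ˜ ํŒŒ์ด์ฌ ์Šค์ผ€์น˜๋‹ค (์‹ค์ œ Spark์˜ state API๊ฐ€ ์•„๋‹ˆ๋ฉฐ, ๋งˆ์ดํฌ๋กœ๋ฐฐ์น˜ ๋ฐ์ดํ„ฐ๋Š” ์ž„์˜๋กœ ๋งŒ๋“  ๊ฒƒ):

```python
# ์นดํ…Œ๊ณ ๋ฆฌ(ํ‚ค๊ฐ’)๋ณ„ ๋ˆ„์  ์ดํ•ฉ์„ state๋กœ ์œ ์ง€
state = {}

def update_state(batch):
    """๋งˆ์ดํฌ๋กœ๋ฐฐ์น˜ ํ•˜๋‚˜๋ฅผ ๋ฐ›์•„ ์ด์ „ state์— ๋ˆ„์ ํ•œ๋‹ค."""
    for category, amount in batch:
        state[category] = state.get(category, 0) + amount
    return dict(state)

update_state([("food", 10), ("taxi", 5)])
totals = update_state([("food", 3)])
# totals == {"food": 13, "taxi": 5}
```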

๊ฐ„๋‹จํ•œ ์ŠคํŠธ๋ฆฌ๋ฐ ๊ตฌํ˜„

terminal1) nc -lk 9999 # ์†Œ์ผ“ ์—ด๊ธฐ 
terminal2) python3 ./1-spark/streaming.py

terminal1) test testa testb
terminal1) test test testa

11. Apache Airflow

Apache Airflow๋ž€

  • ์—์–ด๋น„์•ค๋น„์—์„œ ๊ฐœ๋ฐœํ•œ ์›Œํฌํ”Œ๋กœ์šฐ ์Šค์ผ€์ค„๋ง, ๋ชจ๋‹ˆํ„ฐ๋ง ํ”Œ๋žซํผ
  • ์‹ค์ œ ๋ฐ์ดํ„ฐ์˜ ์ฒ˜๋ฆฌ๊ฐ€ ์ด๋ฃจ์–ด์ง€๋Š” ๊ณณ์€ ์•„๋‹ˆ๋‹ค.
  • 2016๋…„ ์•„ํŒŒ์น˜ ์žฌ๋‹จ incubator program
  • ํ˜„์žฌ ์•„ํŒŒ์น˜ ํƒ‘๋ ˆ๋ฒจ ํ”„๋กœ์ ํŠธ
  • Airbnb, Yahoo, Paypal, Intel, Stripe

์›Œํฌํ”Œ๋กœ์šฐ ๊ด€๋ฆฌ ๋ฌธ์ œ

  • ๋งค์ผ 10์‹œ์— ์ฃผ๊ธฐ์ ์œผ๋กœ ๋Œ์•„๊ฐ€๋Š” ๋ฐ์ดํ„ฐ ํŒŒ์ดํ”„๋ผ์ธ์„ ๋งŒ๋“ค๋ ค๋ฉด?
  • ๊ธฐ์กด ๋ฐฉ์‹: cron script ์‚ฌ์šฉ
  • ๋งค์ผ 10์‹œ์— ์ฃผ๊ธฐ์ ์œผ๋กœ ๋Œ์•„๊ฐ€๋Š” ๋ฐ์ดํ„ฐ ํŒŒ์ดํ”„๋ผ์ธ (์™ธ๋ถ€ api๋กœ download -> process(Spark Job) -> store(DB))๋“ค์„ ์ˆ˜์‹ญ๊ฐœ ๋งŒ๋“ค์–ด์•ผ ํ•œ๋‹ค๋ฉด?

cron script์™€ ๊ฐ™์€ ๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ 

  • ์‹คํŒจ ๋ณต๊ตฌ: ์–ธ์ œ ์–ด๋–ป๊ฒŒ ๋‹ค์‹œ ์‹คํ–‰ํ•  ๊ฒƒ์ธ๊ฐ€? Backfill
  • ๋ชจ๋‹ˆํ„ฐ๋ง: ์ž˜ ๋Œ์•„๊ฐ€๊ณ  ์žˆ๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ ํž˜๋“ค๋‹ค
  • ์˜์กด์„ฑ ๊ด€๋ฆฌ: ๋ฐ์ดํ„ฐ ํŒŒ์ดํ”„๋ผ์ธ๊ฐ„ ์˜์กด์„ฑ์ด ์žˆ๋Š” ๊ฒฝ์šฐ ์ƒ์œ„ ๋ฐ์ดํ„ฐ ํŒŒ์ดํ”„๋ผ์ธ์ด ์ž˜ ๋Œ์•„๊ฐ€๊ณ  ์žˆ๋Š”์ง€ ํŒŒ์•…์ด ํž˜๋“ค๋‹ค
  • ํ™•์žฅ์„ฑ: ์ค‘์•™ํ™” ํ•ด์„œ ๊ด€๋ฆฌํ•˜๋Š” ํˆด์ด ์—†๊ธฐ ๋•Œ๋ฌธ์— ๋ถ„์‚ฐ๋œ ํ™˜๊ฒฝ์—์„œ ํŒŒ์ดํ”„๋ผ์ธ๋“ค์„ ๊ด€๋ฆฌํ•˜๊ธฐ ์–ด๋ ต๋‹ค
  • ๋ฐฐํฌ: ์ƒˆ๋กœ์šด ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ๋ฐฐํฌํ•˜๊ธฐ ํž˜๋“ค๋‹ค

AirFlow๋ž€

  • ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์ž‘์„ฑํ•˜๊ณ  ์Šค์ผ€์ค„๋งํ•˜๊ณ  ๋ชจ๋‹ˆํ„ฐ๋ง ํ•˜๋Š” ์ž‘์—…์„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋Š” ํ”Œ๋žซํผ
  • ํŒŒ์ด์ฌ์œผ๋กœ ์‰ฌ์šด ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ๊ฐ€๋Šฅ
  • ๋ถ„์‚ฐ๋œ ํ™˜๊ฒฝ์—์„œ ํ™•์žฅ์„ฑ์ด ์žˆ์Œ
  • ์›น ๋Œ€์‹œ๋ณด๋“œ (UI)
  • ์ปค์Šคํ„ฐ๋งˆ์ด์ง•์ด ๊ฐ€๋Šฅ

Workflow๋ž€

  • ์˜์กด์„ฑ์œผ๋กœ ์—ฐ๊ฒฐ๋œ ์ž‘์—…(task)๋“ค์˜ ์ง‘ํ•ฉ == DAG == Directed Acyclic Graph

Airflow์˜ ๊ตฌ์„ฑ์š”์†Œ

  • ์›น ์„œ๋ฒ„ - ์›น ๋Œ€์‹œ๋ณด๋“œ UI
  • ์Šค์ผ€์ค„๋Ÿฌ - ์›Œํฌํ”Œ๋กœ์šฐ๊ฐ€ ์–ธ์ œ ์‹คํ–‰๋˜๋Š”์ง€ ๊ด€๋ฆฌ
  • Metastore - ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ๊ด€๋ฆฌ
  • Executor - ํ…Œ์Šคํฌ๊ฐ€ ์–ด๋–ป๊ฒŒ ์‹คํ–‰๋˜๋Š”์ง€ ์ •์˜
  • Worker - ํ…Œ์Šคํฌ๋ฅผ ์‹คํ–‰ํ•˜๋Š” ํ”„๋กœ์„ธ์Šค

Operator

  • ์ž‘์—…์„ ์ •์˜ํ•˜๋Š”๋ฐ ์‚ฌ์šฉ
  • Action Operators: ์‹ค์ œ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰
  • Transfer Operators: ๋ฐ์ดํ„ฐ๋ฅผ ์˜ฎ๊น€
  • Sensor Operators: ํ…Œ์Šคํฌ๋ฅผ ์–ธ์ œ ์‹คํ–‰์‹œํ‚ฌ ํŠธ๋ฆฌ๊ฑฐ๋ฅผ ๊ธฐ๋‹ค๋ฆผ

์ž‘์—…(Task)

  • Operator๋ฅผ ์‹คํ–‰์‹œํ‚ค๋ฉด Task๊ฐ€ ๋œ๋‹ค
  • Task = Operator Instance

Airflow์˜ ์œ ์šฉ์„ฑ

  • ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ์—”์ง€๋‹ˆ์–ด๋ง ํ™˜๊ฒฝ์—์„œ ์œ ์šฉํ•˜๊ฒŒ ์“ฐ์ผ ์ˆ˜ ์žˆ๋‹ค
    • ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค
    • ๋จธ์‹ ๋Ÿฌ๋‹
    • ๋ถ„์„
    • ์‹คํ—˜
    • ๋ฐ์ดํ„ฐ ์ธํ”„๋ผ ๊ด€๋ฆฌ

Airflow์˜ One Node Architecture

image

  • WebServer, Metastore, Scheduler, Executor๊ฐ€ ์กด์žฌ
  • ๋™์ž‘ ๊ณผ์ •
    • Metastore์—์„œ dag์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ๋‹ด๊ณ  ์žˆ์–ด์„œ, Web server์™€ Scheduler๊ฐ€ ๊ทธ ์ •๋ณด๋ฅผ ์ฝ์–ด ์˜ค๊ณ  Executor๋กœ ์ด ์ •๋ณด๋ฅผ ๋ณด๋‚ด์„œ ์‹คํ–‰์„ ํ•œ๋‹ค.
    • ์ด๋ ‡๊ฒŒ ์‹คํ–‰๋œ Task Instance๋Š” metastore๋กœ ๋ณด๋‚ด์ ธ์„œ ์ƒํƒœ๋ฅผ ์—…๋ฐ์ดํŠธ ํ•œ๋‹ค.
    • ์ด๋ ‡๊ฒŒ ์—…๋ฐ์ดํŠธ๋œ ์ƒํƒœ๋ฅผ ๋‹ค์‹œ Web Server์™€ Scheduler๊ฐ€ ์ฝ์–ด์™€์„œ Task๊ฐ€ ์ž˜ ์™„๋ฃŒ๊ฐ€ ๋˜์—ˆ๋Š”์ง€ ํ™•์ธ์„ ํ•œ๋‹ค.
  • Executor์— Queue๊ฐ€ ์กด์žฌํ•ด์„œ ์ˆœ์„œ๋ฅผ ์ •ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

Airflow์˜ Multi Node Architecture

image

  • Queue๊ฐ€ Executor ๋ฐ”๊นฅ์— ์กด์žฌ ํ•œ๋‹ค (One Node Architecture์™€์˜ ํฐ ์ฐจ์ด์ )
  • Celery Broker๊ฐ€ Queue์ด๋‹ค.
  • ๋™์ž‘ ๊ณผ์ •
    • MetaStore์—์„œ dag์ •๋ณด๋ฅผ webserver์™€ scheduler๊ฐ€ ์ •๋ณด๋ฅผ ์ฝ๊ณ , celery executor๋ฅผ ํ†ตํ•ด์„œ celery broker์— task ์ˆœ์„œ๋Œ€๋กœ ๋‹ด๋Š”๋‹ค.
    • ์ˆœ์„œ๋Œ€๋กœ ๋‹ด๊ธด task๋ฅผ worker๋“ค์ด ํ•˜๋‚˜์”จ ๊ฐ€์ ธ๊ฐ€์„œ ์ˆœ์„œ๋Œ€๋กœ ์‹คํ–‰๋œ๋‹ค.
    • ์ด๋ ‡๊ฒŒ ์‹คํ–‰๋œ dag๋“ค์€ ์™„๋ฃŒ๋˜๋ฉด celery executor ๊ทธ๋ฆฌ๊ณ  metastore์— ๋ณด๊ณ ๊ฐ€ ๋œ๋‹ค.
    • ์ด๋ ‡๊ฒŒ ์™„๋ฃŒ๋œ ์ƒํƒœ๋ฅผ UI์™€ Scheduler๊ฐ€ ๋‹ค์‹œ์ฝ์–ด์™€์„œ ์™„๋ฃŒ๋˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•œ๋‹ค.

Airflow ๋™์ž‘ ๋ฐฉ์‹

  1. DAG๋ฅผ ์ž‘์„ฑํ•˜์—ฌ Workflow๋ฅผ ๋งŒ๋“ ๋‹ค. DAG๋Š” Task๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค
  2. Task๋Š” Operator๊ฐ€ ์ธ์Šคํ„ด์Šคํ™” ๋œ ๊ฒƒ
  3. DAG๋ฅผ ์‹คํ–‰์‹œํ‚ฌ ๋•Œ Scheduler๊ฐ€ DagRun ์˜ค๋ธŒ์ ํŠธ๋ฅผ ๋งŒ๋“ ๋‹ค
  4. DagRun ์˜ค๋ธŒ์ ํŠธ๋Š” Task Instance๋ฅผ ๋งŒ๋“ ๋‹ค
  5. Worker๊ฐ€ Task๋ฅผ ์ˆ˜ํ–‰ ํ›„ DagRun ์˜ ์ƒํƒœ๋ฅผ "์™„๋ฃŒ"๋กœ ๋ฐ”๊ฟ”๋†“๋Š”๋‹ค.

DAG์˜ ์ƒ์„ฑ๊ณผ ์‹คํ–‰

  • ์œ ์ €๊ฐ€ ์ƒˆ๋กœ์šด DAG๋ฅผ ์ž‘์„ฑ ํ›„ Folder DAGs ์•ˆ์— ๋ฐฐ์น˜
  • Web Server์™€ Scheduler๊ฐ€ DAG๋ฅผ ํŒŒ์‹ฑ
  • Scheduler๊ฐ€ Metastore๋ฅผ ํ†ตํ•ด DagRun ์˜ค๋ธŒ์ ํŠธ๋ฅผ ์ƒ์„ฑ
    • DagRun์€ ์‚ฌ์šฉ์ž๊ฐ€ ์ž‘์„ฑํ•œ DAG์˜ ์ธ์Šคํ„ด์Šค
    • DagRun status: Running
  • Scheduler๊ฐ€ Task Instance ์˜ค๋ธŒ์ ํŠธ (Dag run ์˜ค๋ธŒ์ ํŠธ์˜ ์ธ์Šคํ„ด์Šค == Task Instance) ๋ฅผ ์Šค์ผ€์ค„๋ง
  • Trigger๊ฐ€ ์ƒํ™ฉ์— ๋งž์œผ๋ฉด Scheduler๊ฐ€ Task Instance๋ฅผ Executor๋กœ ๋ณด๋ƒ„
  • Executor๊ฐ€ ๊ทธ Task๋ฅผ ์‹คํ–‰์‹œํ‚จ ๋‹ค์Œ, ์™„๋ฃŒํ›„ Metastore์— ์™„๋ฃŒํ–ˆ๋‹ค๊ณ  ๋ณด๊ณ ํ•œ๋‹ค. (์™„๋ฃŒ๋œ Task Instance๋Š” Dag Run์„ ์—…๋ฐ์ดํŠธ ํ•œ๋‹ค)
  • Scheduler๊ฐ€ Metastore๋ฅผ ํ†ตํ•ด์„œ DAG ์‹คํ–‰์ด ์™„๋ฃŒ๋๋‚˜ ํ™•์ธ์„ ํ•˜๊ณ  DagRun Status๋ฅผ Completed๋กœ ๋ณ€๊ฒฝํ•œ๋‹ค.
  • Web Server๊ฐ€ Metastore๋ฅผ ํ†ตํ•ด์„œ DAG ์‹คํ–‰์ด ์™„๋ฃŒ๋๋‚˜ ํ™•์ธ์„ ํ•˜๊ณ  UI ์—…๋ฐ์ดํŠธ๋ฅผ ํ•œ๋‹ค.

Airflow ์„ค์น˜

# m1 ์—์„œ๋Š” ์ด ๋ฐฉ๋ฒ•์œผ๋กœ ์„ค์น˜ ์•ˆ๋จ.
pip --version # anaconda ๋กœ ์„ค์น˜๋œ์ง€ ํ™•์ธ
pip install apache-airflow

airflow db init 
airflow webserver -p 8080
airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin
# m1 
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.1.1/docker-compose.yaml'
docker-compose up airflow-init
docker-compose up -d 

docker exec -it 64bb1d858ab5ad7babfad795a6e3dc60121e27b15a83c37bda4f54a6a /bin/sh # webserver container ์ ‘์† 
airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin

Airflow CLI command

  • airflow -h: ๊ฐ์ข… ๋ช…๋ น์–ด ์„ค๋ช… ๋ณด๊ธฐ
  • airflow webserver: webserver ์‹œ์ž‘
  • airflow users create ~~: user ์ถ”๊ฐ€
  • airflow scheduler: scheduler ์‹œ์ž‘
  • airflow db init: db์— ๊ธฐ๋ณธ์ ์ธ ํŒŒ์ดํ”„๋ผ์ธ ์ƒ์„ฑ ๋ฐ ๊ธฐ๋ณธ ์„ค์ •
  • airflow dags list: ํ˜„์žฌ ๋Œ์•„๊ฐ€๋Š” dag๋“ค ์ถœ๋ ฅ
  • airflow tasks list example_xcom: example_xcom ์•ˆ์— ์กด์žฌํ•˜๋Š” task๋“ค ์กฐํšŒ
  • airflow dags trigger -e 2022-01-01 example_xcom: ํŠน์ • dag๋ฅผ ํŠธ๋ฆฌ๊ฑฐ

Airflow DAGs ๋Œ€์‹œ๋ณด๋“œ

  • Owner: Dag ๊ด€๋ฆฌ์ž
  • Runs: ์‹คํ–‰ ์ค‘์ธ DAG์˜ ์ƒํƒœ
  • Schedule: ์ฃผ๊ธฐ๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์„ค์ •
  • Last Run: ์ตœ๊ทผ ์‹คํ–‰ ๋‚ ์งœ
  • Next Run: ๋‹ค์Œ ์‹คํ–‰์ด ์–ธ์ œ๋ ์ง€ ๋‚˜ํƒ€๋ƒ„
  • Recent Tasks: ๋ฐฉ๊ธˆ ์‹คํ–‰๋œ Task๋“ค์„ ๋ณด์—ฌ์คŒ
  • Actions: DAG๋ฅผ ์ง€์šฐ๊ฑฐ๋‚˜ ์‹คํ–‰
  • Links: ๋งˆ์šฐ์Šค๋ฅผ ๊ฐ€์ ธ๋‹ค ๋Œ€๋ฉด ์—ฌ๋Ÿฌ ๊ฐ€์ง€ Link๋“ค์ด ๋ณด์ž„

Airflow DAG View

image

  • Tree: Task๋“ค์˜ ์ƒํƒœ๋ฅผ ๋ณด๊ธฐ ํŽธํ•จ
  • Graph: Task๋“ค์˜ ์˜์กด์„ฑ์„ ํ™•์ธํ•  ๋•Œ ์ข‹์Œ, ๊ฐ Task๋“ค์˜ Log ์ •๋ณด ๋“ฑ์„ ํ™•์ธํ•˜๊ธฐ์—๋„ ์ข‹์Œ
  • Calendar: ๋‚ ์งœ๋ณ„๋กœ ์‹คํŒจ ์—†์ด ์ž˜ ๋Œ์•„๊ฐ”๋‚˜ ํ™•์ธ ๊ฐ€๋Šฅ
  • Task Duration, Task Tries, Landing Times: ๋‚ ์งœ ๊ธฐ๋ฐ˜ ์‹คํ–‰ ํ†ต๊ณ„๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๋ทฐ๋กœ, ์„ค์น˜ ์งํ›„์—๋Š” ํ‘œ์‹œ๋  ๋ฐ์ดํ„ฐ๊ฐ€ ์—†์Œ
  • Gantt: ๊ฐ๊ฐ์˜ task๊ฐ€ ์‹คํ–‰ํ•˜๋ฉด์„œ ์–ผ๋งŒํผ์˜ ์‹œ๊ฐ„์„ ์†Œ๋น„ํ–ˆ๋‚˜ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
  • Details: ์—ฌ๋Ÿฌ๊ฐ€์ง€ Metadata ํ™•์ธ
  • Code: DAG ์ฝ”๋“œ ํ™•์ธ

NFT ํŒŒ์ดํ”„๋ผ์ธ ํ”„๋กœ์ ํŠธ ์†Œ๊ฐœ

  • OpenSea ์‚ฌ์ดํŠธ์˜ NFT๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•ด ํ…Œ์ด๋ธ”์— ์ €์žฅํ•˜๊ธฐ
  • ํ…Œ์ด๋ธ” ์ƒ์„ฑ -> API ํ™•์ธ -> NFT ์ •๋ณด ์ถ”์ถœ -> NFT ์ •๋ณด ๊ฐ€๊ณต -> NFT ์ •๋ณด ์ €์žฅ

NFT ํŒŒ์ดํ”„๋ผ์ธ - DAG Skeleton

./2-airflow/01-sqlite.py  # ๊ธฐ๋ณธ dag ๊ตฌ์„ฑ
# ์ƒ์„ฑ ํ›„ dag ๋Œ€์‹œ๋ณด๋“œ์— ๋“ฑ์žฅํ•˜๋Š”์ง€ ํ™•์ธ 

Airflow - ๋‚ด์žฅ Operators

  1. BashOperator
  2. PythonOperator
  3. EmailOperator

Airflow - Action Operator

  1. Action Operator๋Š” ์•ก์…˜์„ ์‹คํ–‰ํ•œ๋‹ค (๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœ, ๋ฐ์ดํ„ฐ ํ”„๋กœ์„ธ์‹ฑ ๋“ฑ)
  2. Transfer Operator๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์˜ฎ๊ธธ ๋•Œ ์‚ฌ์šฉ
  3. Sensors: ์กฐ๊ฑด์ด ๋งž์„ ๋•Œ ๊นŒ์ง€ ๊ธฐ๋‹ค๋ฆฐ๋‹ค

NFT ํŒŒ์ดํ”„ ๋ผ์ธ - create table task ์ถ”๊ฐ€

  • Airflow ๋Œ€์‹œ๋ณด๋“œ -> Admin -> Connections -> ์ถ”๊ฐ€ -> Connection Id = db_sqlite, Connection Type = Sqlite ๋กœ Save
./2-airflow/02-create-table.py
airflow tasks test nft-pipeline creating_table 2021-01-01 # task ์‹คํ–‰ 

NFT ํŒŒ์ดํ”„ ๋ผ์ธ - Sensor ๋กœ API ํ™•์ธํ•˜๊ธฐ

  • Airflow ๋Œ€์‹œ๋ณด๋“œ -> Admin -> Connections -> ์ถ”๊ฐ€ -> Connection Id = opensea_api, Connection Type = http, Host = https://api.opensea.io/ ๋กœ Save
./2-airflow/03-sensor.py
airflow tasks test nft-pipeline is_api_available 2021-01-01 # task ์‹คํ–‰ 

NFT ํŒŒ์ดํ”„ ๋ผ์ธ - OpenSea API ์˜ค๋ฅ˜ ๋Œ€์ฒ˜๋ฒ•

์ถœ์ฒ˜: https://github.com/keon/data-engineering/tree/main/02-airflow

  • Airflow ๋Œ€์‹œ๋ณด๋“œ -> Admin -> Connections -> ์ถ”๊ฐ€ -> Connection Id = githubcontent_api, Connection Type = http, Host = https://raw.githubusercontent.com/ ๋กœ Save
./2-airflow/03-sensor.py
airflow tasks test nft-pipeline is_api_available 2021-01-01 # task ์‹คํ–‰ 

NFT ํŒŒ์ดํ”„ ๋ผ์ธ - HttpOperator๋กœ ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

./2-airflow/04-extract-data.py
airflow tasks test nft-pipeline extract_nft 2021-01-01 # task ์‹คํ–‰ 

NFT ํŒŒ์ดํ”„ ๋ผ์ธ - process

./2-airflow/05-process.py
airflow tasks test nft-pipeline process_nft 2021-01-01 # task ์‹คํ–‰ 

cat /tmp/processed_nft.csv # ๊ฒฐ๊ณผ ํ™•์ธ 

NFT ํŒŒ์ดํ”„ ๋ผ์ธ - store

./2-airflow/06-store.py

airflow tasks test nft-pipeline store_nft 2021-01-01 # task ์‹คํ–‰ 

# docker์—์„œ๋Š” 'airflow.db' ๊ฐ€ ๋”ฐ๋กœ ์—†๋Š”๋“ฏ. ๊ทธ๋ž˜์„œ ํ•ด๊ฒฐ์€ ๋ชปํ–ˆ์Œ.

NFT ํŒŒ์ดํ”„ ๋ผ์ธ - ํ…Œ์Šคํฌ๊ฐ„ ์˜์กด์„ฑ ๋งŒ๋“ค๊ธฐ

./2-airflow/07-dependency.py

airflow์—์„œ DAG ํ™œ์„ฑํ™” ํ•ด์„œ ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰๋˜๋Š”์ง€ ํ™•์ธ.

Backfill

  • ๋งค์ผ ์ฃผ๊ธฐ์ ์œผ๋กœ ๋Œ์•„๊ฐ€๋Š” ํŒŒ์ดํ”„๋ผ์ธ์„ ๋ฉˆ์ท„๋‹ค๊ฐ€ ๋ช‡์ผ ๋’ค ์‹คํ–‰์‹œํ‚ค๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ?
    • ์˜ˆ๋ฅผ ๋“ค์–ด, ํ•˜๋ฃจ์— ํ•œ๋ฒˆ์”ฉ ๋Œ์•„๊ฐ€๋Š” DAG๊ฐ€ 1์›”1์ผ์— ์‹คํ–‰๋๋‹ค๊ฐ€, 1์›”2์ผ์— ๋ฉˆ์ท„์—ˆ๊ณ  1์›”4์ผ์— ๋‹ค์‹œ ์‹œ์ž‘ํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ?
    • DAG ์„ค์ • ์ฝ”๋“œ์— catchup=False๋ฉด 1์›” 4์ผ์— ๋‹ค์‹œ ์‹œ์ž‘ํ–ˆ์„ ๋•Œ 1์›” 4์ผ ๊ธฐ์ค€์œผ๋กœ๋งŒ ๋Œ์•„๊ฐ„๋‹ค.
    • DAG ์„ค์ • ์ฝ”๋“œ์— catchup=True๋ฉด 1์›” 4์ผ์— ๋‹ค์‹œ ์‹œ์ž‘ํ–ˆ์„ ๋•Œ ๋น ์ง„ 1์›” 2์ผ ๊ธฐ์ค€๋ถ€ํ„ฐ ๋Œ์•„๊ฐ„๋‹ค.
  • DAG ์‹œ์ž‘ ๋‚ ์งœ๋ฅผ 2021-01-01๋กœ ํ•ด๋‘๊ณ , ํ˜„์žฌ 2022-08-06์— catchup=True๋กœ ํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ?
    • ๊ธฐ์กด์— ์ด๋ฏธ ์‹คํ–‰๋œ๊ฒŒ ์žˆ์œผ๋ฉด ๋Œ์•„๊ฐ€์ง€ ์•Š๋Š”๋‹ค. -> ๊ธฐ์กด DAG๋ฅผ ์ง€์šฐ๊ณ , Browse -> DAG Run -> nft-pipeline ์ œ๊ฑฐ
    • ์ œ๊ฑฐํ•˜๊ณ ๋‚˜๋ฉด ๋ฐ”๋กœ 1์›”1์ผ๋ถ€ํ„ฐ ๊ฑฐ์˜ 1๋…„์น˜๊ฐ€ ๋™์‹œ์— ๋Œ์•„๊ฐ€๊ฒŒ ๋œ๋‹ค.

Airflow๋กœ Spark ํŒŒ์ดํ”„๋ผ์ธ ๊ด€๋ฆฌํ•˜๊ธฐ - Airflow์™€ Spark ํ™˜๊ฒฝ์„ธํŒ… ๋ฐ ์‚ฌ์šฉํ•˜๊ธฐ

1. webserver docker ์ ‘์†
2. pip install apache-airflow-providers-apache-spark
3. fhvhv_tripdata_2020-03.csv ํŒŒ์ผ webserver๋กœ ์ „์†ก
# webserver docker์—์„œ count_trips.py ์ž‘์„ฑ 
# ํŒจํ‚ค์ง€๋ฅผ ๊ฐ€์ ธ์˜ค๊ณ 
from pyspark import SparkConf, SparkContext
import pandas as pd

# Spark ์„ค์ •
conf = SparkConf().setMaster("local").setAppName("uber-date-trips")
sc = SparkContext(conf=conf)

# ์šฐ๋ฆฌ๊ฐ€ ๊ฐ€์ ธ์˜ฌ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๋Š” ํŒŒ์ผ
directory = "/home/airflow/data"
filename = "fhvhv_tripdata_2020-03.csv"

# ๋ฐ์ดํ„ฐ ํŒŒ์‹ฑ
lines = sc.textFile(f"file:///{directory}/{filename}")
header = lines.first() 
filtered_lines = lines.filter(lambda row:row != header) 

# ํ•„์š”ํ•œ ๋ถ€๋ถ„๋งŒ ๊ณจ๋ผ๋‚ด์„œ ์„ธ๋Š” ๋ถ€๋ถ„
# countByValue๋กœ ๊ฐ™์€ ๋‚ ์งœ๋“ฑ์žฅํ•˜๋Š” ๋ถ€๋ถ„์„ ์„ผ๋‹ค
dates = filtered_lines.map(lambda x: x.split(",")[2].split(" ")[0])
result = dates.countByValue()

# ์•„๋ž˜๋Š” Spark์ฝ”๋“œ๊ฐ€ ์•„๋‹Œ ์ผ๋ฐ˜์ ์ธ ํŒŒ์ด์ฌ ์ฝ”๋“œ
# CSV๋กœ ๊ฒฐ๊ณผ๊ฐ’ ์ €์žฅ 
pd.Series(result, name="trips").to_csv("trips_date.csv")
Admin -> Connections -> ์ถ”๊ฐ€ 
Connection Id: spark_local
Connection Type: Spark
Host: local 

Save
airflow tasks test spark-example submit_job 2021-01-01

./2-airflow/dags/spark-example.py # ์ฝ”๋“œ ์œ„์น˜ 

ํƒ์‹œ๋น„ ์˜ˆ์ธก ํŒŒ์ดํ”„๋ผ์ธ ๋งŒ๋“ค๊ธฐ

./2-airflow/taxi-price.py

12. Kafka

์ „ํ†ต์ ์ธ ์•„ํ‚คํ…์ณ

image

  • SystemA, SystemB ๊ฐ๊ฐ ๋ฐ์ดํ„ฐ ์Œ“์ธ ๊ฒƒ์„ Data Lake๋กœ ๋ณด๋‚ด๋Š” ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ฐ๊ฐ ๋งŒ๋“ค์–ด์ค˜์•ผ ํ•จ.

์ „ํ†ต์ ์ธ ์•„ํ‚คํ…์ฒ˜์˜ ๋ฌธ์ œ์ 

image

  • ์‹œ์Šคํ…œ์„ ๋”ํ• ์ˆ˜๋ก ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ๋ณต์žกํ•ด์ง„๋‹ค.
  • ์—ฌ๋Ÿฌ๊ฐ€์ง€ ํ†ต์‹  ํ”„๋กœํ† ์ฝœ์„ ์ง€์›ํ•ด์•ผ ํ•œ๋‹ค (HTTP, GRPC, TCP, MQ)
  • ๋ฐ์ดํ„ฐ ํฌ๋ฉง๋„ ๋‹ค๋ฅด๋‹ค (CSV, JSON, XML)
  • Point-of-failure ๊ฐ€ ๋งŽ๋‹ค
    • ์‹œ์Šคํ…œ A,B,C,D,E,F ๊ฐ๊ฐ์˜ ์‹ ๋ขฐ๋„๊ฐ€ 99% ๋ผ๊ณ  ํ–ˆ์„ ๋•Œ
    • ์‹œ์Šคํ…œ A,B,C,D,E,F๋ฅผ ๋ฌถ์—ˆ์„ ๋•Œ์˜ ์‹ ๋ขฐ๋„ = 99% ^6 = 94.1%
  • ๊ฐ๊ฐ์˜ ์—ฐ๊ฒฐ๊ณ ๋ฆฌ ์–ด๋””์„œ ์—๋Ÿฌ๊ฐ€ ๋‚˜๊ณ  ์žˆ๋Š”์ง€ ๋ชจ๋‹ˆํ„ฐ๋ง ํ•˜๊ธฐ๋„ ํž˜๋“ค๋‹ค

Kafka ์†Œ๊ฐœ 1

  • LinkedIn์—์„œ ๊ฐœ๋ฐœ
  • Apache Software๋กœ ๋„˜์–ด๊ฐ€ 2011๋…„ ์˜คํ”ˆ์†Œ์Šคํ™”
  • Apple, eBay, Uber, Airbnb, Netflix ๋“ฑ์—์„œ ์‚ฌ์šฉ ์ค‘

Kafka ์†Œ๊ฐœ 2

  • ๋ถ„์‚ฐ ์ŠคํŠธ๋ฆฌ๋ฐ ํ”Œ๋žซํผ
  • Source ์‹œ์Šคํ…œ์€ Kafka๋กœ ๋ฉ”์‹œ์ง€๋ฅผ ๋ณด๋‚ด๊ณ 
  • Destination ์‹œ์Šคํ…œ์€ Kafka๋กœ ๋ถ€ํ„ฐ ๋ฉ”์‹œ์ง€๋ฅผ ๋ฐ›๋Š”๋‹ค
  • ํ™•์žฅ์„ฑ์ด ์žˆ๊ณ , ์žฅ์•  ํ—ˆ์šฉ (fault tolerant)์„ ํ•˜๋ฉฐ, ์„ฑ๋Šฅ์ด ์ข‹๋‹ค.

Kafka๋ฅผ ์ด์šฉํ•œ ์•„ํ‚คํ…์ณ

image

  • ์‹œ์Šคํ…œ๊ฐ„ ์˜์กด์„ฑ์„ ๊ฐ„์ ‘์ ์œผ๋กœ ๋งŒ๋“ ๋‹ค
  • ํ™•์žฅ์„ฑ: ์ƒˆ ์‹œ์Šคํ…œ์„ ๋”ํ•  ๋•Œ ๋งˆ๋‹ค ๋ณต์žก๋„๊ฐ€ ์„ ํ˜•์ ์œผ๋กœ ์˜ฌ๋ผ๊ฐ„๋‹ค
  • Kafka๋ฅผ ์ด์šฉํ•ด ํ†ต์‹  ํ”„๋กœํ† ์ฝœ์„ ํ†ตํ•ฉํ•˜๊ธฐ ์‰ฝ๋‹ค

Kafka์˜ ์žฅ์ ๋“ค

  • ํ™•์žฅ์„ฑ: ํ•˜๋ฃจ์— 1์กฐ๊ฐœ์˜ ๋ฉ”์‹œ์ง€๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๊ณ , Petabyte์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅ
  • ๋ฉ”์‹œ์ง€ ์ฒ˜๋ฆฌ ์†๋„: 2MS
  • ๊ฐ€์šฉ์„ฑ(availability): ํด๋Ÿฌ์Šคํ„ฐ ํ™˜๊ฒฝ์—์„œ ์ž‘๋™
  • ๋ฐ์ดํ„ฐ ์ €์žฅ ์„ฑ๋Šฅ: ๋ถ„์‚ฐ ์ฒ˜๋ฆฌ, ๋‚ด๊ตฌ์„ฑ, ์žฅ์•  ํ—ˆ์šฉ (fault tolerant)

Kafka ์‚ฌ์šฉ ์˜ˆ

  • ์‹œ์Šคํ…œ๊ฐ„ ๋ฉ”์‹œ์ง€ ํ
  • ๋กœ๊ทธ ์ˆ˜์ง‘
  • ์ŠคํŠธ๋ฆผ ํ”„๋กœ์„ธ์‹ฑ
  • ์ด๋ฒคํŠธ ๋“œ๋ฆฌ๋ธ ๊ธฐ๋Šฅ๋“ค
  • Netflix: ์‹ค์‹œ๊ฐ„ ๋ชจ๋‹ˆํ„ฐ๋ง
  • Expedia: ์ด๋ฒคํŠธ ๋“œ๋ฆฌ๋ธ ์•„ํ‚คํ…์ฒ˜
  • Uber: ์‹ค์‹œ๊ฐ„ ๊ฐ€๊ฒฉ ์กฐ์ •, ์‹ค์‹œ๊ฐ„ ์ˆ˜์š” ์˜ˆ์ธก

Kafka ๊ตฌ์„ฑ

  • Topic
  • Kafka Broker
  • Kafka Producer
  • Kafka Consumer
  • Kafka Partition
  • Kafka Message
  • Kafka Offset
  • Kafka Consumer Group
  • Kafka Cluster
  • Zookeeper

Kafka๋ฅผ ์ด์šฉํ•œ ์•„ํ‚คํ…์ฒ˜ - ์ƒ์„ธ

image

Kafka Topic

  • Producer ์™€ Consumer๊ฐ€ ์†Œํ†ต์„ ํ•˜๋Š” ํ•˜๋‚˜์˜ ์ฑ„๋„
  • ๋ฐ์ดํ„ฐ ์ŠคํŠธ๋ฆผ์ด ์–ด๋””์— Publish ๋ ์ง€ ์ •ํ•˜๋Š”๋ฐ ์“ฐ์ž„
    • ํ† ํ”ฝ์€ ํŒŒ์ผ ์‹œ์Šคํ…œ์˜ ํด๋”์˜ ๊ฐœ๋…๊ณผ ์œ ์‚ฌํ•˜๋‹ค.
  • Producer๋Š” ํ† ํ”ฝ์„ ์ง€์ •ํ•˜๊ณ  ๋ฉ”์‹œ์ง€๋ฅผ ๊ฒŒ์‹œ (Post)
  • Consumer๋Š” ํ† ํ”ฝ์œผ๋กœ๋ถ€ํ„ฐ ๋ฉ”์‹œ์ง€๋ฅผ ๋ฐ›์•„์˜ด
  • ์นดํ”„์นด์˜ ๋ฉ”์‹œ์ง€๋Š” ๋””์Šคํฌ์— ์ •๋ ฌ๋˜์–ด ์ €์žฅ ๋˜๋ฉฐ, ์ƒˆ๋กœ์šด ๋ฉ”์‹œ์ง€๊ฐ€ ๋„์ฐฉํ•˜๋ฉด ์ง€์†์ ์œผ๋กœ ๋กœ๊ทธ์— ๊ธฐ๋ก

Kafka Partition

  • Kafka Topic์ด Partition์œผ๋กœ ๋‚˜๋‰œ๋‹ค.
  • Partition์€ ๋””์Šคํฌ์— ์–ด๋–ป๊ฒŒ ์ €์žฅ์ด ๋˜๋Š”์ง€ ๊ฐ€๋ฅด๋Š” ๊ธฐ์ค€์ด ๋œ๋‹ค.
  • ์นดํ”„์นด์˜ ํ† ํ”ฝ์€ ํŒŒํ‹ฐ์…˜์˜ ๊ทธ๋ฃน์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค.
  • ๋””์Šคํฌ์—๋Š” ํŒŒํ‹ฐ์…˜ ๋‹จ์œ„๋กœ ์ €์žฅ
  • ํŒŒํ‹ฐ์…˜๋งˆ๋‹ค commit Log ๊ฐ€ ์Œ“์ด๊ฒŒ ๋œ๋‹ค
  • ํŒŒํ‹ฐ์…˜์— ์Œ“์ด๋Š” ๊ธฐ๋ก๋“ค์€ ์ •๋ ฌ์ด ๋˜์–ด ์žˆ๊ณ  ๋ถˆ๋ณ€(immutable)ํ•˜๋‹ค
  • ํŒŒํ‹ฐ์…˜์˜ ๋ชจ๋“  ๊ธฐ๋ก๋“ค์€ Offset์ด๋ผ๋Š” ID๋ฅผ ๋ถ€์—ฌ๋ฐ›๋Š”๋‹ค.

Kafka Message

  • ์นดํ”„์นด์˜ ๋ฉ”์‹œ์ง€๋Š” Byte์˜ ๋ฐฐ์—ด
  • ํ”ํžˆ ๋‹จ์ˆœ String, JSON์ด๋‚˜ Avro ์‚ฌ์šฉ
  • ํฌ๊ธฐ์—๋Š” ์ œํ•œ์ด ์—†์ง€๋งŒ, ์„ฑ๋Šฅ์„ ์œ„ํ•ด์„œ๋Š” ์ž‘๊ฒŒ ์œ ์ง€ํ•˜๋Š”๊ฒƒ์ด ์ข‹๋‹ค
  • ๋ฐ์ดํ„ฐ๋Š” ์‚ฌ์šฉ์ž๊ฐ€ ์ง€์ •ํ•œ ์‹œ๊ฐ„๋งŒํผ ์ €์žฅํ•œ๋‹ค (Retention Period), topic ๋ณ„๋กœ ์ง€์ •๋„ ๊ฐ€๋Šฅ
  • Consumer๊ฐ€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์•„๊ฐ€๊ณ  ๋‚˜์„œ๋„ ๋ฐ์ดํ„ฐ๋Š” ์ €์žฅ๋œ๋‹ค
  • Retention Period๊ฐ€ ์ง€๋‚˜๋ฉด ๋ฐ์ดํ„ฐ๋Š” ์ž๋™์œผ๋กœ ์‚ญ์ œ
    • ์žฅ์• ๊ฐ€ ์žˆ์„ ๊ฒฝ์šฐ, Retention Period ๊ธฐ๊ฐ„ ์•ˆ์— ํ•ด๊ฒฐ์„ ํ•ด์•ผ ํ•œ๋‹ค.
    • Retention Period ์ง€๋‚œ ํ›„์— ๋ฌธ์ œ๊ฐ€ ์ƒ๊ฒผ์„ ๊ฒฝ์šฐ, Data Lake ๊นŒ์ง€ ๋‚ด๋ ค๊ฐ€์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ์–ด์™€์„œ ํ”„๋กœ์„ธ์‹ฑ ํ•ด์•ผ ํ•œ๋‹ค

Kafka Offset

  • ๋ณด๋‚ด๋Š” ๋ฉ”์‹œ์ง€๋Š” Offset์„ ๊ฐ€์ง€๊ฒŒ๋œ๋‹ค.
  • Offset์€ Partition์•ˆ์— ๋ฉ”์‹œ์ง€๊ฐ€ ์ˆœ์„œ๋Œ€๋กœ ์ •๋ ฌ๋˜๋Š”๋ฐ, ์ •๋ ฌ๋œ ์ˆœ์„œ ๋ฐ ๊ฐ’์„ ์˜๋ฏธํ•œ๋‹ค.

Kafka Cluster

  • ์นดํ”„์นด ํด๋Ÿฌ์Šคํ„ฐ๋Š” ์—ฌ๋Ÿฌ๊ฐœ์˜ ์นดํ”„์นด ๋ธŒ๋กœ์ปค(์„œ๋ฒ„)๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋”ฐ
  • ์นดํ”„์นด ํ† ํ”ฝ์„ ์ƒ์„ฑํ•˜๋ฉด ๋ชจ๋“  ์นดํ”„์นด ๋ธŒ๋กœ์ปค์— ์ƒ์„ฑ๋œ๋‹ค
  • ์นดํ”„์นด ํŒŒํ‹ฐ์…˜์€ ์—ฌ๋Ÿฌ ๋ธŒ๋กœ์ปค์— ๊ฑธ์ณ์„œ ์ƒ์„ฑ๋œ๋‹ค

Kafka Broker

  • ์นดํ”„์นด์˜ ์„œ๋ฒ„๋กœ๋„ ๋ถˆ๋ฆฐ๋‹ค.
  • Topic์„ ์ „๋‹ฌํ•˜๋Š” ์—ญํ• ์„ ํ•œ๋‹ค.

Kafka Producer & Consumer

  • Producer: ๋ฉ”์‹œ์ง€๋ฅผ ์ „๋‹ฌํ•˜๋Š” ์ฃผ์ฒด
    • ์นดํ”„์นด ํ† ํ”ฝ์œผ๋กœ ๋ฉ”์‹œ์ง€๋ฅผ ๊ฒŒ์‹œ(post)ํ•˜๋Š” ํด๋ผ์–ด์ธํŠธ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜
    • ๋ฉ”์‹œ์ง€๋ฅผ ์–ด๋А ํŒŒํ‹ฐ์…˜์— ๋„ฃ์„์ง€ ๊ฒฐ์ • (key)
  • Consumer: ๋ฉ”์‹œ์ง€๋ฅผ ์ „๋‹ฌ๋ฐ›๋Š” ์ฃผ์ฒด

Kafka Consumer Group

  • Consumer๋ฅผ ๋ฌถ์–ด์„œ Consumer Group์ด๋ผ๊ณ  ํ•œ๋‹ค.
  • Consumer 1๊ฐœ๊ฐ€ Consumer Group์ด ๋  ์ˆ˜ ์žˆ๊ณ , ์—ฌ๋Ÿฌ๊ฐœ๊ฐ€ ๋  ์ˆ˜ ๋„ ์žˆ๋‹ค.
  • Consumer Group์„ ๋ณ„๋„๋กœ ์ง€์ •์•ˆํ•˜๋ฉด, Consumer 1๊ฐœ๋‹น Group1๊ฐœ์”ฉ ์ง€์ •๋œ๋‹ค.
  • ๊ฐ Consumer Group์€ ๋ชจ๋“  ํŒŒํ‹ฐ์…˜์œผ๋กœ๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค.
    • Consumer๋Š” ์ง€์ •๋œ ํŒŒํ‹ฐ์…˜์œผ๋กœ๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค.
    • Consumer 1, 2๊ฐ€ ํ•˜๋‚˜์˜ Consumer Group์„ ์ด๋ฃจ๋Š” ๊ฒฝ์šฐ, ๊ฐ Consumer๋Š” ์ž์‹ ์—๊ฒŒ ์ง€์ •๋œ ํŒŒํ‹ฐ์…˜์— ๋Œ€ํ•ด์„œ๋งŒ ๋ฐ์ดํ„ฐ๋ฅผ ์ „๋‹ฌ๋ฐ›๊ฒŒ ๋œ๋‹ค.

Rebalancing

  • With 4 partitions and 3 consumers in a consumer group, three of the partitions are assigned one consumer each, and the remaining partition is assigned to one of the consumers at random.
  • If one more consumer is then added to the group, rebalancing takes place.
  • Rebalancing hands the remaining partition over to the newly added consumer.
  • Rebalancing happens whenever a consumer is removed or added.

Zookeeper

  • ์นดํ”„์นด ํด๋Ÿฌ์Šคํ„ฐ์˜ ์—ฌ๋Ÿฌ ์š”์†Œ๋“ค์„ ์„ค์ •ํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋จ
  • ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ์„ค์ •, ํ† ํ”ฝ ์„ค์ •, Replication Factor ๋“ฑ์„ ์กฐ์ ˆํ•˜๋Š”๋ฐ ์‚ฌ์šฉ
  • ๋ถ„์‚ฐ ์‹œ์Šคํ…œ๊ฐ„์˜ ์ •๋ณด ๊ณต์œ , ์ƒํƒœ ์ฒดํฌ, ์„œ๋ฒ„๋“ค ๊ฐ„์˜ ๋™๊ธฐํ™”
  • ๋ถ„์‚ฐ ์‹œ์Šคํ…œ์˜ ์ผ๋ถ€์ด๊ธฐ ๋•Œ๋ฌธ์— ๋™์ž‘์„ ๋ฉˆ์ถ˜๋‹ค๋ฉด ๋ถ„์‚ฐ ์‹œ์Šคํ…œ์— ์˜ํ–ฅ
  • ์ฃผํ‚คํผ ์—ญ์‹œ ํด๋Ÿฌ์Šคํ„ฐ๋กœ ๊ตฌ์„ฑ
  • ํด๋Ÿฌ์Šคํ„ฐ๋Š” ํ™€์ˆ˜๋กœ ๊ตฌ์„ฑ๋˜์–ด ๋ฌธ์ œ๊ฐ€ ์ƒ๊ฒผ์„ ๊ฒฝ์šฐ ๊ณผ๋ฐ˜์ˆ˜๊ฐ€ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ผ๊ด€์„ฑ ์œ ์ง€
  • ํ•˜๋Š”์ผ
    • ํด๋Ÿฌ์Šคํ„ฐ๊ด€๋ฆฌ: ํด๋Ÿฌ์Šคํ„ฐ์— ์กด์žฌํ•˜๋Š” ๋ธŒ๋กœ์ปค๋ฅผ ๊ด€๋ฆฌํ•˜๊ณ  ๋ชจ๋‹ˆํ„ฐ๋ง
    • Topic ๊ด€๋ฆฌ: ํ† ํ”ฝ ๋ฆฌ์ŠคํŠธ๋ฅผ ๊ด€๋ฆฌํ•˜๊ณ  ํ† ํ”ฝ์— ํ• ๋‹น๋œ ํŒŒํ‹ฐ์…˜๊ณผ Replication๊ด€๋ฆฌ
    • ํŒŒํ‹ฐ์…˜ ๋ฆฌ๋” ๊ด€๋ฆฌ: ํŒŒํ‹ฐ์…˜์˜ ๋ฆฌ๋”๊ฐ€ ๋  ๋ธŒ๋กœ์ปค๋ฅผ ์„ ํƒํ•˜๊ณ , ๋ฆฌ๋”๊ฐ€ ๋‹ค์šด๋  ๊ฒฝ์šฐ ๋‹ค์Œ ๋ฆฌ๋”๋ฅผ ์„ ํƒ
    • ๋ธŒ๋กœ์ปค๋“ค๋ผ๋ฆฌ ์„œ๋กœ๋ฅผ ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์žˆ๋„๋ก ์ •๋ณด ์ „๋‹ฌ

Key ์— ๋”ฐ๋ฅธ Message ์ „์†ก

  • Key ์—†์ด ์ „์†ก: Producer๊ฐ€ ๋ฉ”์‹œ์ง€๋ฅผ ๊ฒŒ์‹œํ•˜๋ฉด Round-Robin ๋ฐฉ์‹์œผ๋กœ ํŒŒํ‹ฐ์…˜์— ๋ถ„๋ฐฐํ•œ๋‹ค.
  • Key ์™€ํ•จ๊ป˜ ์ „์†ก: ๊ฐ™์€ Key๋ฅผ ๊ฐ€์ง„ ๋ฉ”์‹œ์ง€๋“ค์€ ๊ฐ™์€ ํŒŒํ‹ฐ์…˜์—๊ฒŒ ๋ณด๋‚ด์ง„๋‹ค
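Both routing modes reduce to a simple rule, sketched below with a hand-rolled partitioner. Note this is an illustration of the idea only: kafka-python's real DefaultPartitioner uses murmur2 hashing, not CRC32.

```python
import itertools
import zlib

NUM_PARTITIONS = 3
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def pick_partition(key=None):
    if key is None:
        return next(_round_robin)  # no key: rotate over partitions
    # same key -> same hash -> same partition (deterministic)
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

print([pick_partition() for _ in range(4)])  # [0, 1, 2, 0]
assert pick_partition("user-42") == pick_partition("user-42")
```

The consequence worth remembering: keyed messages preserve per-key ordering, because one key always lands in one partition, and a partition is ordered.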

Replication Factor

image

ํŒŒํ‹ฐ์…˜ ๋ฆฌ๋”

  • ๊ฐ ๋ธŒ๋กœ์ปค๋Š” ๋ณต์ œ๋œ ํŒŒํ‹ฐ์…˜์ค‘ ๋Œ€ํ‘œ๋ฅผ ํ•˜๋Š” ํŒŒํ‹ฐ์…˜ ๋ฆฌ๋”๋ฅผ ๊ฐ€์ง€๊ฒŒ ๋œ๋‹ค.
  • ๋ชจ๋“  Read/Write๋Š” ํŒŒํ‹ฐ์…˜ ๋ฆฌ๋”๋ฅผ ํ†ตํ•ด์„œ ์ด๋ฃจ์–ด์ง€๊ฒŒ ๋จ
  • ๋‹ค๋ฅธ ํŒŒํ‹ฐ์…˜๋“ค์€ ํŒŒํ‹ฐ์…˜ ๋ฆฌ๋”๋ฅผ ๋ณต์ œ

Consumer Group & Partition & Producer

  • Partition์„ 1๊ฐœ๋กœ ๋งŒ๋“ค์–ด๋†“๊ณ  Consumer Group์•ˆ์— Consumer๋ฅผ 2๊ฐœ๋กœ ๋งŒ๋“ ๋‹ค๋ฉด
    • Producer์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์•„๋ฌด๋ฆฌ ๋ณด๋‚ด๋„ Consumer1 ๋กœ๋งŒ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณด๋‚ด๊ฒŒ๋œ๋‹ค.
  • Partition์„ 2๊ฐœ๋กœ ๋งŒ๋“ค์–ด๋†“๊ณ  Consumer Group์•ˆ์— Consumer๋ฅผ 2๊ฐœ๋กœ ๋งŒ๋“ ๋‹ค๋ฉด
    • Producer์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณด๋‚ด๋ฉด, Consumer1,2 ๊ฐ๊ฐ์— ๊ท ๋“ฑํ•˜๊ฒŒ ๋ณด๋‚ด๊ฒŒ ๋œ๋‹ค.

Installing kafka-python

pip install kafka-python

A simple kafka-python consumer/producer example

./3-kafka/consumer.py
./3-kafka/producer.py

Running ZooKeeper, Kafka, and Kafdrop with docker-compose

# Works on Apple M1 as well.
./3-kafka/docker-compose.yml

kafka topic ์ƒ์„ฑ

docker exec -it 03-kafka_kafka1_1 kafka-topics --bootstrap-server=localhost:19091 --create --topic first-cluster-topic --partitions 3 --replication-factor 1

# Topics can also be created from the Kafdrop UI.

A producer that turns a CSV file into a stream

./3-kafka/trips_producer.py
./3-kafka/trips_consumer.py

๋น„์ •์ƒ ๋ฐ์ดํ„ฐ ํƒ์ง€

  • payment_producer sends randomly generated payment data to the Kafka payment topic. (producer)
  • fraud_detector receives data from the payment topic; if a record is a Bitcoin payment it forwards it to the fraud_payments topic, and if it is normal it forwards it to the legit_payments topic. In other words, it is a consumer and a producer at the same time.
  • legit_processor is the consumer that receives and handles the normal data.
  • fraud_processor is the consumer that receives and handles the anomalous data.
./3-kafka/fraud_detection/*
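The routing rule at the heart of fraud_detector can be reduced to a pure function. The topic names match the ones described above; the payment fields used here are assumptions for illustration, not necessarily the exact schema in ./3-kafka/fraud_detection/*.

```python
# fraud_detector's decision, isolated from the Kafka plumbing: consume a
# payment, decide which topic to produce it to.

def route(payment: dict) -> str:
    """Return the topic a payment record should be forwarded to."""
    if payment["payment_type"] == "BITCOIN":  # assumed field name
        return "fraud_payments"
    return "legit_payments"

assert route({"payment_type": "BITCOIN", "amount": 99}) == "fraud_payments"
assert route({"payment_type": "CARD", "amount": 10}) == "legit_payments"
```

In the real pipeline this function sits between a KafkaConsumer on the payment topic and a KafkaProducer writing to the chosen topic.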

13. Flink and Stream Processing

What is Apache Flink

  • Spark: a framework for batch processing
  • Flink: a framework for stream processing
  • Development started in 2009; the first stable version was released in 2016

Flink overview

  • An open-source stream processing framework
  • Distributed / high performance / highly available
  • Also supports batch processing
  • Faster than Spark
  • Fault tolerance: on a system failure, it can roll back to just before the failure and resume
  • Active development: integrates with many libraries and frameworks, e.g. for graph processing, machine learning, and text processing
  • Rescalability: resources can be added while a job is running

Stream Processing์€ ์–ธ์ œ ์“ฐ์ผ๊นŒ

  • ๋ฐฐ์น˜ ํ”„๋กœ์„ธ์‹ฑ์€ ํ•œ์ •๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๋‹ค๋ค˜๋‹ค๋ฉด, ์ŠคํŠธ๋ฆผ ํ”„๋กœ์„ธ์‹ฑ์€ ๋ฌดํ•œํ•˜๊ฒŒ ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์˜ฌ ์ˆ˜ ์žˆ์„ ๋•Œ ๋‹ค๋ฃฌ๋‹ค.
  • ์ฃผ์‹ ๊ฑฐ๋ž˜์†Œ
  • ์›น ์„œ๋ฒ„
  • ์„ผ์„œ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
  • ์ด๋ฒคํŠธ ๋“œ๋ฆฌ๋ธ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜
  • ๋น„์ •์ƒ ๊ฑฐ๋ž˜ ํƒ์ง€

Batch Processing vs Stream Processing

  • Batch
    • Used for bounded data
    • The whole dataset can be read before processing
    • Jobs run periodically
    • Focus on throughput rather than processing speed
  • Stream
    • Assumes the data is unbounded
    • Processes each record as it arrives
    • Jobs run in real time
    • Focus on processing speed rather than throughput
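The contrast above fits in a few lines of plain Python, no framework involved: batch reads the whole bounded dataset before answering once, while stream emits an up-to-date answer per arriving record.

```python
# Batch: one answer after reading everything.
def batch_sum(dataset):
    return sum(dataset)  # only possible once all data is available

# Stream: a running answer per arrival; the source may never end.
def stream_sums(source):
    total = 0
    for record in source:
        total += record
        yield total

data = [3, 1, 4, 1, 5]
print(batch_sum(data))          # 14
print(list(stream_sums(data)))  # [3, 4, 8, 9, 14]
```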

Flink์˜ ๊ธฐ๋ณธ์ ์ธ ์ฒ˜๋ฆฌ ๊ตฌ์กฐ

  • Streaming dataflow:
    • Sources: there can be one or many data sources
    • Operators: transform the data (transformations)
    • Sink: the final stage of the dataflow
  • Data can be read in from several sources and, through sinks, written out to several destinations.

Hadoop vs Spark vs Flink: feature comparison

  • Hadoop
    • Batch processing
    • Reads and processes data from disk
  • Spark
    • A project built as an improvement on Hadoop
    • Fast compared to Hadoop
    • Batch processing
    • (Batch-based streaming) -> has a library that implements streaming as micro-batches
    • In-memory data processing
  • Flink
    • Stream processing
    • In-memory data processing

Hadoop vs Spark vs Flink ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ๋ฐฉ์‹ ๋น„๊ต

  • Hadoop
    • Input ---Mapper--> state 1 ---Mapper(Disk)--> Reducer --> Output
    • Because the hand-off from Mapper to Reducer goes through disk, high performance is hard to achieve (the disk round-trip costs a lot of time)
  • Spark
    • Input --> state 1 --Transformation(in-memory)--> state 2 --> Output
    • In-memory transformations make it far faster than Hadoop.
  • Flink
    • Input --> state 1 --Transformation(in-memory)--> state 2 --> Output
    • The flow itself is very similar to Spark's; the difference is batch processing versus stream processing.

Hadoop vs Spark vs Flink: developer experience

  • Hadoop
    • Data processing logic must be coded by hand
    • A low level of abstraction
  • Spark
    • A high level of abstraction
    • Easy to program
    • RDDs
  • Flink
    • A high level of abstraction
    • Easy to program
    • Dataflows
  • Both Spark and Flink have active developer communities and well-developed API libraries
    • e.g. Spark - MLlib, Flink - FlinkML

Spark vs Flink

  • Spark
    • Spark is not truly real-time data processing
    • Spark's engine is built around batch processing
    • Micro-batching
  • Flink
    • Real-time data processing
    • Flink's engine is built around stream processing

๋งˆ์ดํฌ๋กœ ๋ฐฐ์น˜ vs Window

  • ๋งˆ์ดํฌ๋กœ ๋ฐฐ์น˜: ๋ฐ์ดํ„ฐ ์ค‘ ์ผ๋ถ€๋ถ„ ๋–ผ์™€์„œ ๋ฐฐ์น˜ ํ”„๋กœ์„ธ์‹ฑ
  • Window: ์‹œ๊ฐ„์„ ์ •ํ•œ ํ›„, ๊ทธ ์‹œ๊ฐ„๋ถ€ํ„ฐ 10์ดˆ ์‚ฌ์ด์˜ ๋ฐ์ดํ„ฐ๋ฅผ window๋กœ ๋ฌถ์–ด ์‚ฌ์šฉ

Spark vs Flink: development comparison

  • Spark
    • Written in Scala
    • Efficient memory management is difficult
    • Out-of-memory errors occur frequently
    • Uses a DAG for dependency management
  • Flink
    • Written in Java
    • Built-in memory manager
    • Out-of-memory errors are rare
    • Controlled cyclic dependency graphs (optimized for iterative workloads such as ML)

Flink์˜ ๋Œ€๋‹จํ•œ ์ 

  • Flink was the first open-source framework with all of the following specs:
  • Runs as a cluster and processes events on the order of millions
  • Sub-second latency
  • Exactly-once: guarantees each event is processed exactly once -> most other systems only offer at-least-once (events may be processed more than once), or offer no guarantee at all and can duplicate or even lose data
  • Guarantees accurate results

Flink components

image

  • Storage
  • Deployment/Environment
  • Engine

Flink storage integrations

  • Like Spark, Flink is a system that only processes data.
  • It is therefore designed to integrate with a variety of storage systems:
    • HDFS
    • Local file system
    • MongoDB
    • RDBMS (MySQL, Postgres)
    • S3
    • RabbitMQ

Flink Deployment

  • Resource management can likewise be delegated to several systems:
    • Local
    • Standalone cluster
    • YARN
    • Mesos
    • AWS / GCP

Flink's internal layers

  1. SQL: High-level Language
  2. Table API: Declarative DSL
  3. DataStream / DataSet API: Core APIs
  4. Stateful Stream Processing: Low-level building block (streams, state, [event] time)
  • Layer 4 can be used directly, but usually this layer is not used.
  • In practice you use layer 3: the DataStream API for stream processing and the DataSet API for batch processing.
  • The DataSet API is falling out of use and may be deprecated before long.
  • Layer 2: lets you program declaratively, much like Spark SQL; the difference from Spark is that the tables change dynamically.
  • Layer 1: the highest level of abstraction, programming in SQL.

Flink์˜ Connectors

  • Flink can connect to a range of connectors
  • A sink is where data is written out; a source is where data is read in
    • Apache Kafka (sink / source)
    • Elasticsearch (sink)
    • HDFS (sink)
    • RabbitMQ (sink / source)
    • Amazon Kinesis (sink / source)
    • Twitter Streaming API (source)
    • Apache Cassandra (sink)
    • Redis (sink)

Flink์˜ ์จ๋“œํŒŒํ‹ฐ ํ”„๋กœ์ ํŠธ

  • Apache Zeppelin - a web-based notebook
  • Apache Mahout - a machine learning library
  • Cascading - workflow management
  • Apache Beam - a tool for creating and managing data pipelines

Flink ํ”„๋กœ๊ทธ๋žจ์˜ ์ผ๋ฐ˜์ ์ธ ํ”Œ๋กœ์šฐ

  • Source -> Operations/Transformations -> Sink
  • Sources: RDBs, Kafka, local files
  • Sinks: Kafka, HDFS, RDBs, etc.

State

  • event ๊ฐ๊ฐ์„ ๋…๋ฆฝ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋ฉด state๊ฐ€ ํ•„์š” ์—†๋‹ค
  • ์—ฌ๋Ÿฌ event๋ฅผ ํ•œ๊บผ๋ฒˆ์— ๋ณด๋ ค๊ณ  ํ•  ๋Œ€ state๊ฐ€ ํ•„์š”ํ•˜๋‹ค - stateful
  • ์˜ˆ
    • ํŒจํ„ด์„ ์ฐพ๋Š” ์ผ
    • ๋ฐ์ดํ„ฐ๋ฅผ ์‹œ๊ฐ„๋ณ„๋กœ ํ•ฉ์น˜๋Š” ์ผ
    • ๋จธ์‹ ๋Ÿฌ๋‹ ํŠธ๋ ˆ์ด๋‹
    • ๊ณผ๊ฑฐ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ฐธ๊ณ ํ•ด์•ผํ•˜๋Š” ์ผ
  • flink๋Š” ๋”ฐ๋ผ์„œ state๋ฅผ ๊ฐ–๊ณ  ์žˆ๋‹ค
  • checkpoints์™€ savepoints๋กœ state๋ฅผ ์ €์žฅํ•ด์„œ ๋‚ด๊ฒฐํ•จ์„ฑ์„ ๊ฐ–๋„๋ก ์„ค๊ณ„
  • queryable state๋ฅผ ์ด์šฉํ•ด์„œ ๋ฐ–์—์„œ state๋ฅผ ๊ด€์ฐฐํ•  ์ˆ˜๋„ ์žˆ๋‹ค

State Backend

  • When using the DataStream API, state comes into play in several ways
    • Collecting data into windows
    • Transformations (key-value state)
    • Making local variables fault-tolerant with CheckpointedFunction
  • HashMapStateBackend
    • Stores state on the Java heap
    • Keeps values and triggers in a hash table
    • Recommended for large state, long windows, and large key/value pairs
    • High-availability setups
    • Fast, since everything stays in memory
  • EmbeddedRocksDBStateBackend
    • Stores state in RocksDB
    • Data is serialized to byte arrays for storage
    • For very large state, long windows, and large key/value state
    • High-availability setups
    • Using disk plus serialization trades speed for capacity: it is slower, but can hold much more state (a trade-off)

Keyed State

  • A key-value store
  • Available only on keyed streams
  • For example, suppose each event has an (id, value) schema
    • and you want a running sum of value per id: use keyed state

Persisting state

  • ์žฅ์•  ํ—ˆ์šฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ด์ฃผ๋Š” ๊ธฐ๋Šฅ๋“ค
    • Stream replay
    • Checkpointing
  • checkpoint๋ฅผ ์–ผ๋งˆ๋‚˜ ์ž์ฃผ ์ €์žฅํ•ด์•ผ ํ•˜๋‚˜.
    • Trade off ์กด์žฌ
    • ๊ฐ€๋ฒผ์šด state๋ฅผ ๊ฐ€์ง„ ํ”„๋กœ๊ทธ๋žจ์€ ์ž์ฃผ ์ €์žฅํ•ด์ฃผ์–ด๋„ ๋œ๋‹ค
  • checkpoint๋ฅผ ํ•œ ์ดํ›„์— ์‹œ์Šคํ…œ์ด ๋ง๊ฐ€์งˆ ๊ฒฝ์šฐ
    • ํ”Œ๋งํฌ๋Š” ์ž‘๋™์„ ๋ฉˆ์ถ”๊ณ 
    • ์ฒดํฌํฌ์ธํŠธ๋กœ ๋ฆฌ์…‹

์ฒดํฌํฌ์ธํŒ…

  • ๋ถ„์‚ฐ๋œ ๋ฐ์ดํ„ฐ ์ŠคํŠธ๋ฆผ์—์„œ ์–ด๋–ป๊ฒŒ snapshot์„ ๋งŒ๋“ค๊นŒ
  • Chandy-Lamport ์•Œ๊ณ ๋ฆฌ์ฆ˜.
  • ๋น„๋™๊ธฐ์ ์œผ๋กœ ์‹คํ–‰

Barriers

  • ๋ฐ์ดํ„ฐ๋ฅผ ์‹œ๊ฐ„๋ณ„๋กœ ๋‚˜๋ˆ„๋Š” barrier๋ฅผ ์‚ฝ์ž…ํ•ด snapshot์ด ๊ฐ€๋Šฅํ•˜๋‹ค
  • barrier๋Š” ๊ฐ€๋ฒผ์›Œ์„œ ์ŠคํŠธ๋ฆผ์— ๋ฐฉํ•ด๋˜์ง€ ์•Š๋„๋ก ์„ค๊ณ„
  • Sink operator๊ฐ€ barrier๋ฅผ ๋ฐ›์•„์„œ ์ƒˆ๋กœ์šด checkpoint๋ฅผ ๋งŒ๋“ ๋‹ค

Snapshotting

  • ์‚ฌ์ด์‚ฌ์ด์— ๋ผ์–ด๋†“์€ barrier๋ฅผ ๊ธฐ๋ก์„ ํ•˜๋Š” ๊ณผ์ •์ด๋‹ค.
  • barrier์™€ state๋ฅผ ๊ธฐ๋กํ•œ๋‹ค.

์ฒดํฌํฌ์ธํŠธ ์ •๋ ฌ

  • ๋ฐ์ดํ„ฐ๊ฐ€ ์˜ค๋Š”๋Œ€๋กœ ๋ฐ›์•„๋“ค์—ฌ ์ฒดํฌํฌ์ธํŠธ ๋งŒ๋“ค๊ธฐ
  • ๋น ๋ฅธ ์†๋„๋ฅผ ์œ„ํ•œ ํ”„๋กœ๊ทธ๋žจ์„ ๋งŒ๋“ค ๋•Œ ์‚ฌ์šฉ
  • Exactly-once ๋ณด๋‹ค๋Š” at-least-once๋ฅผ ๋ณด์žฅํ•œ๋‹ค.

Recovery

  • On failure, the last checkpoint is loaded
  • The system re-deploys the entire dataflow
  • Each operator is injected with its state from the checkpoint
  • The input streams are also rewound to the checkpoint - the input stream itself must support rewinding to a checkpointed position (this is one reason Kafka pairs so well with Flink)
  • Restart

Savepoints

  • ์‚ฌ์šฉ์ž๊ฐ€ ์ง€์ •ํ•œ ์ฒดํฌํฌ์ธํŠธ
  • ๋‹ค๋ฅธ ์ฒดํฌํฌ์ธํŠธ์ฒ˜๋Ÿผ ์ž๋™์œผ๋กœ ์—†์–ด์ง€์ง€ ์•Š๋Š”๋‹ค

Exactly once vs At least once

  • ๋ถ„์‚ฐ ํ™˜๊ฒฝ์—์„œ ์ฒดํฌํฌ์ธํŠธ ์ •๋ ฌ ์—ฌ๋ถ€
  • ์†๋„๊ฐ€ ์ค‘์š”ํ•  ๊ฒฝ์šฐ at least once ์‚ฌ์šฉ

๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ์‹œ ์‹œ๊ฐ„ ๊ฐœ๋…์ด ๋“ค์–ด๊ฐˆ ๋•Œ

  • Doing time series analysis
  • Using windows
  • When event time matters

Time์˜ ์ข…๋ฅ˜

  • Event Time
  • Processing Time

Processing Time

  • ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์‹œ์Šคํ…œ์˜ ์‹œ๊ฐ„
  • Hourly time window
    • 9:15๋ถ„ ์‹œ์Šคํ…œ ์‹œ์ž‘
    • 9:15 - 10:00
    • 10:00 - 11:00
  • ๊ฐ€์žฅ ๋น ๋ฅธ ์„ฑ๋Šฅ๊ณผ Low Latency
  • ํ•˜์ง€๋งŒ ๋ถ„์‚ฐ๋˜๊ณ  ๋น„๋™๊ธฐ์ ์ธ ํ™˜๊ฒฝ์—์„œ๋Š” ๊ฒฐ์ •์ (deterministic) ์ด์ง€ ๋ชปํ•œ๋‹ค
    • ์ด๋ฒคํŠธ๊ฐ€ ์‹œ์Šคํ…œ์— ๋„๋‹ฌํ•˜๋Š” ์†๋„์— ๋‹ฌ๋ ธ๊ธฐ ๋•Œ๋ฌธ์—

Event Time

  • The time at which the event was created, at its origin
  • Recorded in the event itself before it reaches Flink
  • Time depends on the data itself, not on the system
  • Event-time programs must generate event-time watermarks

When event time and processing time disagree

  • A system that relies on event time needs its own way of measuring the passage of time
  • e.g. a one-hour window operation has to know when an hour of event time has passed
  • Event time and processing time can drift out of sync
  • e.g. a week's worth of data might be computed in a few seconds
  • The watermark is what came out of this need

Watermark

  • How Flink measures the progress of event time
  • Watermark(t) declares that event time has reached t: every event with timestamp <= t has arrived ("we have seen everything up to at least t")

๋ณ‘๋ ฌ ํ™˜๊ฒฝ์—์„œ์˜ Watermark

  • An operator that receives several input streams uses the lowest event time (the minimum of its inputs' watermarks)
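The two rules above, a watermark that trails the highest seen timestamp and a multi-input operator taking the minimum, can be sketched in plain Python. This is a hand-rolled illustration of the idea behind Flink's bounded-out-of-orderness watermarks, with the delay bound assumed for the example.

```python
MAX_DELAY = 5  # assumption: events arrive at most 5 time units late

def watermark(seen_timestamps):
    """Watermark(t): no event with timestamp <= t is expected anymore."""
    return max(seen_timestamps) - MAX_DELAY

def operator_watermark(input_watermarks):
    """A multi-input operator advances only as fast as its slowest input."""
    return min(input_watermarks)

stream_a = [12, 15, 14, 20]
stream_b = [11, 13]
print(watermark(stream_a))  # 20 - 5 = 15
print(operator_watermark([watermark(stream_a), watermark(stream_b)]))
# min(15, 8) = 8
```

Taking the minimum is what keeps a join or union correct: the operator cannot declare a point in event time passed while a slower input might still deliver events before it.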

Flink์˜ ํด๋Ÿฌ์Šคํ„ฐ ๋งค๋‹ˆ์ €

  • As a distributed system, it must distribute computing resources efficiently
  • Kinds of resource managers
    • YARN
    • Kubernetes

Flink์˜ ์•„ํ‚คํ…์ฒ˜ - Job Manager

  • Task scheduling (deciding when the next task runs)
  • Tracking failed and finished tasks
  • Checkpoint management
  • Recovery on failure
  • Three components
    • ResourceManager - manages task slots
    • Dispatcher - the REST API & web UI for submitting Flink apps
    • JobMaster - manages a single JobGraph

Flink์˜ ์•„ํ‚คํ…์ฒ˜ - Task Manager

  • aka workers
  • The processes that execute the tasks of a dataflow
  • Task slot - the smallest unit in which a TaskManager is scheduled
  • Task slots configure how many tasks can run concurrently

Flink์˜ ์•„ํ‚คํ…์ฒ˜ - Task Slots

  • A task worker (TaskManager) is a JVM process
    • It can run one or more subtasks in separate threads
    • The number of tasks a single TaskManager can take on is controlled by its task slots

PyFlink history

  • August 2019: PyFlink (Table API) beta release
  • February 2020: apache-flink became installable from PyPI
  • July 2020: Python UDFs, SQL, UDF metrics, pandas UDFs
  • Why Python?
    • The language closest to data science
    • Libraries such as the machine learning frameworks and pandas

What is PyFlink

  • A Python API layered on top of Apache Flink
  • Lets you do stream processing in Python

Pyflink์˜ ํผํฌ๋จผ์Šค ์ตœ์ ํ™”

  • Minimize data hand-offs
  • Serialization / deserialization
  • Fast Python UDFs
  • CPU utilization

Installing Flink

https://www.apache.org/dyn/closer.lua/flink/flink-1.15.1/flink-1.15.1-bin-scala_2.12.tgz # open this URL and download

Installing PyFlink

pip install apache-flink

Running the Java WordCount example

./bin/flink run examples/streaming/WordCount.jar
tail log/flink-*-taskexecutor-*.out

Running the Python WordCount example

./bin/flink run --python examples/python/datastream/word_count.py
# Job has been submitted with JobID cc2541ee4ded7e34d7b12d812134fd96

flink ํด๋Ÿฌ์Šคํ„ฐ ์‹คํ–‰ ๋ฐ ์ข…๋ฃŒ

  • Start the cluster
$ ./bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host.
Starting taskexecutor daemon on host.
  • ์‹คํ–‰ ํ™•์ธ
$ ps aux | grep flink
  • Stop the cluster
$ ./bin/stop-cluster.sh
  • ์›น ui ํ™•์ธ
http://localhost:8081/#/overview