Skip to content

Latest commit

 

History

History
12 lines (8 loc) · 651 Bytes

File metadata and controls

12 lines (8 loc) · 651 Bytes

DataEngineering_Group7

Data Uploading

To upload data to HDFS, use the following command, change the dates as needed:

wget -qO- https://files.pushshift.io/reddit/comments/RC_2011-08.zst | zstd --long=31 -d --stdout - | docker exec -i hadoop-namenode hdfs dfs -put - /reddit/RC_2011-08.json

This will download the data from the pushshift.io website, decompress it, and upload it to HDFS, all without storing anything in the middle as it is streamed. The data is stored in the /reddit directory on HDFS.

Check the README files inside horizontal solution and vertical solution folders to see the instructions to reproduce the work!