Many open datasets are available for download to practice with big data platforms. We suggest that you focus on a single domain and use its data for your work in the course. The following datasets can be used for both batch and streaming analytics and in different tasks (ingestion, processing, etc.).
- Healthcare/Pandemic
- Mobility/Transportation
  - Telecom Italia, Telecommunications - SMS, Call, Internet - TN: Source 1 and Source 2, also check the paper
  - NYC Taxi Dataset
    - For multiple monthly datasets: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
  - Chicago Taxi Dataset
- Animals and Wildlife monitoring
- Social networks/Multimedia
- Marketplace
- Environment Monitoring/Smart City
- System/Datacenter operation traces/logs
- Civil engineering/architecture/construction data
- Cybersecurity
- Industry/Manufacturing

When using these datasets, you need to comply with their corresponding licenses.
The following datasets are stored within this directory and provided by Linh Truong for the course. Keep in mind the license of the data.
- bts data: in `bts`, samples of sensor data monitored within base transceiver stations
- network operation monitoring: in `onudata`, a small dataset about network monitoring

You can also propose your own dataset for your assignment, but you must discuss it with the lecturer first.
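
Since the datasets above are meant for both batch and streaming tasks, one possible way to practice streaming ingestion with a static CSV file is to replay it in small batches. The following is only a minimal sketch using the Python standard library. The function name `replay_in_batches` and the sample columns (`station_id`, `parameter`, `value`) are hypothetical and do not describe the schema of any dataset listed here.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def replay_in_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of up to `batch_size` records parsed from CSV text lines."""
    reader = csv.DictReader(lines)  # the first line is treated as the header
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical sample data; a real dataset has its own columns.
SAMPLE_CSV = """station_id,parameter,value
s1,temperature,21.5
s1,humidity,40.0
s2,temperature,19.0
s2,humidity,42.5
s3,temperature,22.1
"""

batches = list(replay_in_batches(io.StringIO(SAMPLE_CSV), batch_size=2))
for index, batch in enumerate(batches):
    # In a real setup, each batch could be sent to a message broker or a database.
    print(index, [record["station_id"] for record in batch])
```

For a real file, an open file handle (for example `open(path, newline="")`) could be passed instead of the `io.StringIO` object. Reading the whole file at once corresponds to the batch case, while a small `batch_size` approximates a stream.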