From 7b7fdb54feb4ecaa0baff2328687128113d8b183 Mon Sep 17 00:00:00 2001
From: Diane Cloud <72690205+dianecloud@users.noreply.github.com>
Date: Thu, 29 Jan 2026 15:51:08 -0800
Subject: [PATCH] Update README.md

## Data generation using pure Python, HAWQ (with PL/Python), or MapReduce (streaming via Python)

### Instructions are included for each of the three. The MapReduce version is in its early stages and is not currently recommended.

TODO:
* location of locations_partitions.csv is hardcoded (fixed?)
* come up with a realistic template so numbers aren't out of whack
* script to calculate expected outputs based on profiles
* for transactions, accept either a folder of profiles to iterate through or a single JSON file (detected automatically)
* user input to generate config files

* test output against profiles
* add shell scripts to install Python packages
* add shell scripts to fix hardcoded values for HAWQ and MR
* clean up HAWQ and MR code
* add more/better data

* improve performance of MapReduce
* Spark streaming?

* create_pickles doesn't run if the number of years doesn't match the profile inputs
* work on making datasets repeatable via random seed
* script to replace the hashbang with the output of `which python`
* script to replace hard links
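
For the random seed item above, here is a minimal sketch of one way repeatability could work. It is not taken from this repo: `generate_amounts`, its `seed` parameter, and the 1.00 to 500.00 amount range are all hypothetical names and values chosen for illustration. The idea is to give each run its own `random.Random` instance instead of relying on the module-level generator, so that the same seed always yields the same dataset.

```python
import random


def generate_amounts(n, seed=None):
    """Return n pseudo-random transaction amounts.

    Hypothetical helper for illustration only; it is not part of this repo.
    """
    # A local generator keeps the global random state untouched, so other
    # code calling random.* cannot change the sequence produced here.
    rng = random.Random(seed)
    # The 1.00-500.00 range is an assumed placeholder, not a real profile value.
    return [round(rng.uniform(1.0, 500.0), 2) for _ in range(n)]


if __name__ == "__main__":
    first = generate_amounts(5, seed=42)
    second = generate_amounts(5, seed=42)
    # Same seed, same data: this comparison holds on every run.
    print(first == second)
```

Passing the seed in explicitly (for example from a command-line argument or a config file) would also make it possible to record which seed produced a given dataset.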