This project implements a crop price prediction system for the DataCrunch Final Round competition ("Legacy of the Market King: The Freezer Gambit"). The goal is to predict weekly fresh crop prices four weeks ahead across various economic centers.
This solution utilizes a time series forecasting approach based on a LightGBM model trained on historical price and weather data. Feature engineering includes time-based features, price lag features, and price rolling window statistics. Hyperparameters were tuned using Optuna for optimal performance. The prediction system is served via a FastAPI application packaged within a Docker container.
Source Code: https://github.com/mehara-rothila/Data_Crunch-Xforce.git
.
├── Datasets/ # Contains raw CSV data (copied into image for context)
├── deployment/ # Core application logic, model, features
│ ├── __init__.py
│ ├── data_loader.py # Loads and merges price/weather data
│ ├── preprocessing.py # Basic preprocessing (renaming, type conversion)
│ ├── feature_engineering.py # Creates time, lag, and rolling features
│ ├── model_trainer.py # Trains model (incl. Optuna tuning), saves artifacts
│ ├── predictor.py # Loads model/features, generates predictions
│ ├── lgbm_price_model.joblib # Saved final trained LightGBM model
│ └── model_features.joblib # List of features used by the final model
├── main.py # FastAPI application entry point & endpoint definitions
├── Dockerfile # Instructions to build the Docker image
├── requirements.txt # Python package dependencies
├── image_name.txt # Docker image name/URI (from Docker Hub)
├── README.md # This file
├── .gitignore # Specifies intentionally untracked files for Git
├── Documentation.pdf # Detailed solution documentation (To be created)
└── Presentation # Presentation slides (To be created)
(Note: __pycache__ directories are generated by Python and ignored by git via .gitignore)
The data processing follows these steps within the deployment modules:
- Loading (`data_loader.py`): Loads `train_data.csv` for price/weather, parses dates, merges the datasets, and drops duplicate weather entries.
- Preprocessing (`preprocessing.py`): Renames the price column and converts identifiers (`Region`, `Commodity`, `Type`) to the `category` dtype.
- Feature Engineering (`feature_engineering.py`): Creates:
  - Time features: year, month, week, day of week, day of year, etc.
  - Lag features (price): price from 28, 35, and 42 days prior.
  - Rolling window features (price): mean and standard deviation over 7, 14, and 28 days (using `.shift(1)` to prevent leakage).
  - NaN handling: rows with NaNs from feature generation are dropped before training.
- Model: LightGBM regressor (`lightgbm.LGBMRegressor`).
- Features Used: A combination of the original preprocessed columns and the engineered features (time, price lags, price rolling windows). See `deployment/model_features.joblib`.
- Top Features: `Commodity`, `Region`, the price lags (42, 35, 28 days), the price rolling features (mean/std over 7, 14, 28 days), and the original weather metrics all rank highly in importance.
- Hyperparameter Tuning: Optuna was used (30 trials) to minimize validation RMSE.
- Validation Strategy: Time-based split (Train up to 2043-04-30, Validate on May-June 2043).
- Final Validation RMSE: ~20.06 (achieved on the time-based validation set with the tuned model).
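The lag and rolling-window construction described above can be sketched in pandas. This is an illustrative toy example, not the project's exact code: the real logic lives in `deployment/feature_engineering.py`, and the column names here (`date`, `price`, `price_lag_*`, `price_roll_*`) are assumptions.

```python
import pandas as pd

# Toy price series; in the real pipeline this would be per Region/Commodity.
df = pd.DataFrame({
    "date": pd.date_range("2043-01-01", periods=60, freq="D"),
    "price": range(60),
})

# Lag features: price from 28, 35, and 42 days prior.
for lag in (28, 35, 42):
    df[f"price_lag_{lag}"] = df["price"].shift(lag)

# Rolling statistics computed on the previous day's value (.shift(1)) so the
# current row's own price never leaks into its features.
for window in (7, 14, 28):
    shifted = df["price"].shift(1)
    df[f"price_roll_mean_{window}"] = shifted.rolling(window).mean()
    df[f"price_roll_std_{window}"] = shifted.rolling(window).std()

# Rows with NaNs introduced by lagging/rolling are dropped before training.
df = df.dropna().reset_index(drop=True)
```

The `.shift(1)` before `.rolling(...)` is the leakage guard mentioned above: without it, a 7-day rolling mean at day *t* would include the price at day *t* itself.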
This application is designed to run inside a Docker container. Docker allows packaging an application with all its dependencies (libraries, code, system tools) into a standardized unit, ensuring it runs consistently across different environments.
- Docker Installation: You need Docker installed and running on your computer. Docker acts as the engine to build and run containers.
- Windows/Mac: Download and install Docker Desktop from the official Docker website: https://www.docker.com/products/docker-desktop/
- Linux: Follow the instructions for your specific distribution: https://docs.docker.com/engine/install/
- After installation, ensure the Docker service/daemon is running (Docker Desktop usually starts it automatically). You can verify by opening a terminal or command prompt and running `docker --version`; you should see a version number printed.
- Project Files: You need all the project files (downloaded or cloned from GitHub) in a single directory on your computer.
- Terminal/Command Prompt: You will need to run commands in your system's terminal (like Command Prompt or PowerShell on Windows, Terminal on Mac/Linux).
If you don't have the files locally, clone the repository using Git:
```
git clone https://github.com/mehara-rothila/Data_Crunch-Xforce.git
cd Data_Crunch-Xforce
```

Make sure your terminal's current directory is the project root (the folder containing the Dockerfile).
The Dockerfile in the project contains the recipe for creating the application image. Building the image packages the Python environment, libraries, code, model, and necessary data.
Command: Open your terminal in the project root directory and run:
```
docker build -t mehararothila/data-crunch-predictor:v1.1 .
```

Explanation:
- `docker build`: Starts the image build process.
- `-t mehararothila/data-crunch-predictor:v1.1`: Assigns a memorable name (a "tag") to the image you are building, making it easier to refer to later.
- `.`: This crucial dot tells Docker to look for the Dockerfile in the current directory and to use the directory's contents as the "build context" (files that may be copied into the image).
Process: Docker will execute the steps in the Dockerfile sequentially. This involves downloading the base Python image, installing system dependencies (like libgomp1), installing all Python packages listed in requirements.txt, and copying your application code and data files into the image. This may take several minutes, especially the first time. Watch for any error messages. A successful build ends with messages like => exporting to image and => naming to ....
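For orientation, the recipe described above might look roughly like the following. This is a hedged sketch reconstructed from the steps listed in this section (slim Python base image, `libgomp1`, `requirements.txt`, copying the code, Uvicorn on port 8000); the repository's actual Dockerfile is authoritative and may differ in base image tag and details.

```dockerfile
# Illustrative sketch only -- see the real Dockerfile in the repo root.
FROM python:3.10-slim

# LightGBM needs the OpenMP runtime (libgomp1) at import time.
RUN apt-get update && apt-get install -y --no-install-recommends libgomp1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code, model artifacts, and datasets into the image.
COPY . .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```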
Once the image is built successfully, you can run a container based on that image. This starts the FastAPI application inside the isolated container environment.
Command: In your terminal, run:
```
docker run -p 8000:8000 --rm --name predictor-app mehararothila/data-crunch-predictor:v1.1
```

Explanation:
- `docker run`: Creates and starts a container from an image.
- `-p 8000:8000`: The port mapping. It connects port 8000 on your host machine (the first 8000) to port 8000 inside the container (the second 8000, where the Uvicorn server is listening). This lets you access the API from your browser at localhost:8000. Make sure port 8000 is not already in use by another application on your host.
- `--rm`: A cleanup flag. Docker automatically removes the container (but not the image) when it stops (e.g., when you press CTRL+C in the terminal).
- `--name predictor-app`: Assigns the convenient name predictor-app to the running container instance, making it easier to manage if needed.
- `mehararothila/data-crunch-predictor:v1.1`: The image to run the container from (the one you built in the previous step).
Output: After running the command, you should see log output in your terminal, including lines from Uvicorn like:
```
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
This indicates the API server is running successfully inside the container and is accessible. The terminal will remain attached to the container's logs.
Stopping the Container: To stop the server and the container, go back to the terminal where it's running and press CTRL+C. Because we used the --rm flag, the container will be automatically removed.
(Alternative Run Command using Docker Hub image): If you didn't build the image locally but want to run the one pushed to Docker Hub, use:

```
docker run -p 8000:8000 --rm --name predictor-app mehararothila/data-crunch-predictor:v1.1
```

IMPORTANT DEPLOYMENT INFORMATION
The deployed API documentation (Swagger UI) can be accessed at the following URL:
http://api.mehara.io:8000/docs
Alternatively, you can use the direct IP address:
http://64.227.137.70:8000/docs
Note on Browser Warnings: Since the deployment uses plain HTTP on port 8000 (not HTTPS with an SSL certificate), your browser may mark the site as "Not secure" or show a warning page when you first access the link.
This is expected behavior. If a warning page appears, click the button or link labeled "Continue to site", "Advanced" -> "Proceed", or similar wording to reach the API documentation page.
With the container running, the prediction API is accessible on your local machine.
The easiest way to interact with the API is through the automatically generated Swagger UI documentation.
- Open Browser: Open your preferred web browser (Chrome, Firefox, Edge, etc.).
- Navigate: Go to the address http://localhost:8000/docs.
- Explore: You will see the API documentation page listing the available endpoints (`/api/predict`, `/api/data/weather`, `/api/data/prices`).
- Test Prediction:
  - Click on the POST `/api/predict` endpoint bar to expand it.
  - Click the "Try it out" button on the right side.
  - An editable "Request body" field will appear, pre-filled with an example. Modify the crop and region values if desired (use values known to be in the training data for meaningful results, e.g., "Cantaloupe", "Valhalla").
  - Click the blue "Execute" button.
  - Scroll down to see the "Server response". It shows the equivalent curl command, the request URL, and the response body (containing the predictions) or any error messages, along with the HTTP status code (e.g., 200 for success).
If you prefer using the command line and have curl installed:
```
curl -X POST "http://localhost:8000/api/predict" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{"crop": "Cantaloupe", "region": "Valhalla"}'
```

- `-X POST`: Specifies the HTTP POST method.
- `"http://localhost:8000/api/predict"`: The URL of the endpoint.
- `-H "accept: application/json"`: Header indicating the client accepts JSON responses.
- `-H "Content-Type: application/json"`: Header indicating the request body is JSON.
- `-d '{"crop": "Cantaloupe", "region": "Valhalla"}'`: The JSON data sent in the request body.
The response JSON will be printed directly to your terminal.
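If you prefer Python, an equivalent request can be made with only the standard library. This is a sketch mirroring the curl call above; the payload fields (`crop`, `region`) come from this README's example, and the `try/except` simply prints a message instead of crashing when the container is not running.

```python
import json
import urllib.error
import urllib.request

# Same payload and headers as the curl example above.
payload = json.dumps({"crop": "Cantaloupe", "region": "Valhalla"}).encode()
req = urllib.request.Request(
    "http://localhost:8000/api/predict",
    data=payload,
    headers={"accept": "application/json", "Content-Type": "application/json"},
    method="POST",
)

try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        # Prints the prediction JSON returned by the API.
        print(json.loads(resp.read()))
except urllib.error.URLError as exc:
    # Reached when nothing is listening on localhost:8000.
    print(f"Request failed (is the container running?): {exc}")
```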
- Accuracy (RMSE): Optimized to ~20.06 via feature engineering and Optuna tuning.
- Resources: Uses the efficient LightGBM model, `category` dtypes, and a slim base image. Image size is ~585 MB (well under the 8 GB limit); RAM usage is expected to stay under the 2 GB limit.
- Packaging: Delivered as a runnable Docker image with code and dependencies.
- API: Implements the specified API endpoints.
- Reproducibility: Uses fixed random seed for model training.