Efforts to monitor groundwater pumping for irrigation in the Western United States (Western US) are hindered by a lack of comprehensive pumping records. While previous studies have developed region-specific machine learning models using limited datasets, these models are often not transferable across regions, and a groundwater pumping dataset that goes beyond local and state boundaries remains missing. In this study, we develop a regional-scale, data-driven machine learning framework to address these limitations by integrating remote sensing datasets and in situ pumping records from Arizona, Colorado, Kansas, and Nevada. Using gridded hydroclimatic and land use variables, including effective precipitation, fraction of irrigated croplands, and evapotranspiration, the model generates spatially continuous, high-resolution (2 km, annual) historical groundwater pumping estimates from 2000 to 2023 for groundwater-dominated basins of the Western US, while predicting total irrigation in conjunctive basins. The model demonstrates good predictive performance under a randomized split, with a Nash-Sutcliffe efficiency (NSE) = 0.62, normalized root mean square error (NRMSE) = 0.50, normalized mean absolute error (NMAE) = 0.34, and percent bias (PBIAS) = 8.59% on the test set. Model evaluation over groundwater-dominated and conjunctive basins across the region shows satisfactory results. In addition, comparisons using spatial holdout analysis and power consumption-based pumping records in multiple basins indicate generalization capacity and spatial transferability within the study region. Our assessment identifies limited availability of in situ pumping records and lack of surface water irrigation datasets as the primary constraints for further advancing such regional-scale frameworks.
Overall, the findings highlight that regional transferability of machine learning models for predicting groundwater irrigation is achievable but contingent on holistic representation of the hydrologic system.
- Predicted pumping and total irrigation maps
- Running the repository
- Data availability
- Citations
- Organizations
- Funding
The repository has five main modules, organized as follows:
Codes/
├── __init__.py
├── download_preprocess/
│ ├── download.py
│ ├── download_openET.py
│ ├── preprocess.py
│ ├── dp_driver.py
│ └── dp_driver.sh
├── models/
│ ├── ann_df.py
│ ├── ann_df.sh
│ ├── ann_model.py
│ ├── ann_model.sh
│ ├── ml_driver.py
│ ├── ml_driver.sh
│ ├── ml_driver_LOBO.py
│ ├── ml_driver_LOBO.sh
│ ├── ml_uncertainty.py
│ └── ml_uncertainty.sh
├── pumping/
│ └── pumping.py
├── results_analysis/
│ ├── __init__.py
│ ├── analysis_utils.py
│ ├── basin_compile.py
│ ├── basin_compile_LOBO.py
│ ├── comparison_basinScale.ipynb
│ ├── conjuctive_basins_water_balance.ipynb
│ ├── model_diagnosis.ipynb
│ ├── plots.py
│ └── stats_ops.py
└── utils/
  ├── __init__.py
  ├── DL_ops.py
  ├── ML_ops.py
  ├── plots.py
  ├── raster_ops.py
  ├── stats_ops.py
  ├── system_ops.py
  └── vector_ops.py
Data_main/
├── pumping/
│ ├── Arizona/
│ ├── Colorado/
│ ├── Kansas/
│ ├── Nevada/
│ └── Utah/
├── ref_rasters/
├── ref_shapes/
└── shapefiles/
  └── Basins_of_interest/
1. utils - Utility scripts for core operations across the repository:
- raster_ops.py - Raster processing (read/write arrays, clipping, resampling, masking)
- vector_ops.py - Vector operations (buffering, clipping shapefiles, coordinate transformations)
- stats_ops.py - Statistical metrics (RMSE, MAE, R², NRMSE, PBIAS calculations)
- ML_ops.py - Machine learning operations using LightGBM (data preparation, training, hyperparameter tuning via Hyperopt, SHAP analysis, prediction)
- DL_ops.py - Deep learning operations using PyTorch (DataLoader, ANN model architecture, training with Optuna optimization)
- plots.py - Visualization utilities
- system_ops.py - File system operations
2. download_preprocess - Scripts for data acquisition and preprocessing:
- download.py - Functions to download data from Google Earth Engine (GRIDMET, DAYMET products)
- download_openET.py - Functions to download OpenET and irrigation fraction datasets (IrrMapper, LANID)
- preprocess.py - Data preprocessing and compilation functions
- dp_driver.py - Main driver script that executes the functionality in download.py, download_openET.py, and preprocess.py to download and preprocess all datasets
3. pumping - Pumping data processing module:
- pumping.py - Processes, filters, and rasterizes in-situ pumping records from Arizona, Colorado, Kansas, and Nevada. Includes well coordinate transformation, data quality filtering, and rasterization. The output serves as training data for the ML model.
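To make the rasterization step concrete, here is a minimal sketch of how well-level pumping volumes could be aggregated onto a regular grid. It is not the code in pumping.py: the function name `rasterize_wells`, the use of projected coordinates in meters, and the choice to sum volumes per cell are all assumptions made for illustration. The 2000 m cell size matches the 2 km resolution stated above.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (x_m, y_m, pumping_volume) in a projected CRS (meters).
WellRecord = Tuple[float, float, float]


def rasterize_wells(
    wells: Iterable[WellRecord],
    x_min: float,
    y_max: float,
    cell_size: float,
    n_rows: int,
    n_cols: int,
) -> List[List[float]]:
    """Sum well pumping volumes into a north-up grid (row 0 is the top row)."""
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    for x, y, volume in wells:
        col = int((x - x_min) // cell_size)  # columns increase eastward
        row = int((y_max - y) // cell_size)  # rows increase southward
        # Wells that fall outside the grid extent are skipped (assumption).
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] += volume
    return grid


if __name__ == "__main__":
    wells = [(500.0, 3500.0, 10.0), (1500.0, 3900.0, 5.0), (2500.0, 500.0, 7.0)]
    print(rasterize_wells(wells, x_min=0.0, y_max=4000.0, cell_size=2000.0, n_rows=2, n_cols=2))
```

Whether the repository sums, averages, or otherwise aggregates wells within a pixel should be checked against pumping.py itself.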
4. models - Core machine learning module:
- ml_driver.py - Main ML driver for model training, testing, and prediction using LightGBM DART
- ml_driver_LOBO.py - Leave-One-Basin-Out (LOBO) cross-validation driver for spatial transferability assessment
- ml_uncertainty.py - Bootstrap-based uncertainty quantification and confidence interval estimation
- Associated .sh scripts for HPC job submission
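The Leave-One-Basin-Out idea can be sketched as follows: each basin in turn is held out for testing while the model is trained on all remaining basins. This is an illustrative split generator only, not the logic in ml_driver_LOBO.py; the function name `leave_one_basin_out` and the basin labels are hypothetical, and the model training inside each fold is omitted.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_basin_out(
    basin_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_basin, train_indices, test_indices) for each basin."""
    for held_out in sorted(set(basin_ids)):
        train_idx = [i for i, b in enumerate(basin_ids) if b != held_out]
        test_idx = [i for i, b in enumerate(basin_ids) if b == held_out]
        yield held_out, train_idx, test_idx


if __name__ == "__main__":
    # One basin label per training sample (hypothetical labels).
    sample_basins = ["A", "A", "B", "C", "B"]
    for basin, train_idx, test_idx in leave_one_basin_out(sample_basins):
        # A model would be fit on train_idx and evaluated on test_idx here.
        print(basin, train_idx, test_idx)
```

Because no sample from the held-out basin is seen during training, the test score for each fold reflects transfer to an unseen basin rather than interpolation within a basin.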
5. results_analysis - Model evaluation and results compilation:
- basin_compile.py / basin_compile_LOBO.py - Compile basin-scale predicted and actual pumping data
- analysis_utils.py - Utility functions for results analysis
- comparison_basinScale.ipynb - Basin-scale comparison of actual vs predicted pumping with scatter plots, time series analysis, and performance metrics (R², RMSE, MAE) across groundwater-dominated basins
- conjuctive_basins_water_balance.ipynb - Water balance analysis for conjunctive basins (South Platte River Basin, CO and Pinal AMA, AZ) to compute total irrigation from groundwater and surface water sources
- model_diagnosis.ipynb - Model performance diagnostics
- Various notebooks for water balance analysis and result visualization
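For readers unfamiliar with the evaluation metrics reported for this model (NSE, NRMSE, NMAE, PBIAS), the sketch below shows one common way to compute them. It is not taken from stats_ops.py. Two conventions here are assumptions: errors are normalized by the mean of the observations, and PBIAS is defined so that positive values indicate overprediction. The repository may use different conventions.

```python
import math
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def nse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 minus squared error over variance of obs."""
    mean_obs = _mean(obs)
    sq_err = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sq_dev = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sq_err / sq_dev


def nrmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """RMSE divided by the mean observation (normalization is an assumption)."""
    rmse = math.sqrt(_mean([(o - p) ** 2 for o, p in zip(obs, pred)]))
    return rmse / _mean(obs)


def nmae(obs: Sequence[float], pred: Sequence[float]) -> float:
    """MAE divided by the mean observation (normalization is an assumption)."""
    mae = _mean([abs(o - p) for o, p in zip(obs, pred)])
    return mae / _mean(obs)


def pbias(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Percent bias; positive means overprediction (sign convention assumed)."""
    return 100.0 * sum(p - o for o, p in zip(obs, pred)) / sum(obs)


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [2.0, 3.0, 4.0, 5.0]
    print(nse(observed, predicted), nrmse(observed, predicted),
          nmae(observed, predicted), pbias(observed, predicted))
```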
The utils module does not require direct execution. Other modules should be executed using their respective driver files. Please reach out to the authors for additional support in running this repository.
For full model implementation, execute modules in the following order:
- download_preprocess → Run dp_driver.py to download and preprocess all input datasets
- pumping → Run pumping.py to process and rasterize in-situ pumping records (training data)
- models → Run ml_driver.py for model training/prediction or ml_driver_LOBO.py for spatial validation
- results_analysis → Use notebooks and scripts to analyze model outputs
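The execution order above can be written down as a small dry-run helper that lists the driver scripts in sequence. The script paths come from the directory tree in this README. Everything else is an assumption: the helper name `build_commands` is hypothetical, and the drivers may need configuration inside the scripts (or submission through the accompanying .sh files) rather than a bare `python <script>` call.

```python
from typing import List

# Ordered (module, driver script) pairs, taken from the directory tree above.
PIPELINE = [
    ("download_preprocess", "Codes/download_preprocess/dp_driver.py"),
    ("pumping", "Codes/pumping/pumping.py"),
    ("models", "Codes/models/ml_driver.py"),
]


def build_commands(
    python_exe: str = "python", spatial_validation: bool = False
) -> List[List[str]]:
    """Return the driver invocations in execution order (nothing is run)."""
    commands = []
    for module, script in PIPELINE:
        if module == "models" and spatial_validation:
            # Swap in the Leave-One-Basin-Out driver for spatial validation.
            script = "Codes/models/ml_driver_LOBO.py"
        commands.append([python_exe, script])
    return commands


if __name__ == "__main__":
    for command in build_commands():
        print(" ".join(command))
```

The results_analysis step is left out because it is driven mainly through notebooks.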
conda environment: This repository was implemented in a conda environment set up using Anaconda with Python 3.9. The yml_files_env folder contains .yml files for setting up similar conda environments on both Linux and Windows.
Key packages:
- lightgbm - LightGBM DART regressor for ML modeling
- hyperopt - Bayesian hyperparameter optimization
- shap - Model interpretability and feature importance
- rasterio, gdal - Geospatial raster operations
- geopandas - Vector data processing
- torch - PyTorch for deep learning (experimental ANN)
- optuna - Neural network hyperparameter tuning
- earthengine-api - Google Earth Engine data access
This repository includes the in-situ pumping datasets and associated shapefiles used to process and train the ML model. The Data_main/pumping/ folder contains state-level pumping records from Arizona, Colorado, Kansas, Nevada, and Utah, along with reference rasters, shapefiles, and basin boundaries required for data processing and model implementation.
Google Earth Engine Dataset:
The annual groundwater pumping/total irrigation estimates (2000-2023) are available as a Google Earth Engine ImageCollection:
projects/ee-westus-pumping/assets/westus_pumping
Note: The dataset represents groundwater pumping in groundwater-dominated basins, but total irrigation in conjunctive basins (where significant surface water irrigation is supplemented by groundwater supply). The groundwater-dominated vs conjunctive basin classification can be visualized using the GW_use_binary asset, which is included in the GEE code snippet linked below.
Sample code for visualization and data download from GEE is available at:
https://code.earthengine.google.com/5f5f1dcc3840126545e6860015c982e8
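For access from Python instead of the GEE Code Editor, a minimal sketch using the earthengine-api package might look like the following. It assumes the asset is readable by your account, that you have already authenticated (for example with `earthengine authenticate`), and that you supply your own Cloud project ID. The function name `summarize_collection` is hypothetical, and band names are queried rather than assumed because they are not documented here.

```python
ASSET_ID = "projects/ee-westus-pumping/assets/westus_pumping"


def summarize_collection(project: str) -> dict:
    """Return the image count and the band names of the first image."""
    import ee  # earthengine-api; imported lazily so the module loads without it

    # `project` is your own Google Cloud project ID (an assumption of this sketch).
    ee.Initialize(project=project)
    collection = ee.ImageCollection(ASSET_ID)
    return {
        "n_images": collection.size().getInfo(),
        "bands": collection.first().bandNames().getInfo(),
    }


# Example call (requires network access and Earth Engine credentials):
# print(summarize_collection(project="my-cloud-project"))
```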
HydroShare repository:
The annual groundwater pumping/total irrigation estimates (2000-2023) can also be downloaded from the following HydroShare repository.
https://www.hydroshare.org/resource/cce80224863c4933a94c51a25c4ff8f3/
Dataset Citation: Hasan, M. F., Smith, R. G., Davenport, F. V., & Majumdar, S. (2026). Dataset: Historical groundwater pumping estimates for major agricultural basins of the Western United States, HydroShare, https://doi.org/10.4211/hs.cce80224863c4933a94c51a25c4ff8f3
- Hasan, M. F., Smith, R. G., Davenport, F. V., Majumdar, S. (2026). Extending Historical Groundwater Pumping Estimates for Major Agricultural Basins of the Western United States with Machine Learning and Satellite Products. In Prep. for Journal of Hydrology.



