- To predict the value of pH using remote sensing indices, topography, and sampling data.
- This repository aims to experiment with whether pre-trained knowledge from large language models helps predict pH in a cold-start setting.
- Clone this repository.

```bash
git clone https://github.com/Gaurav0502/ph-regressor-cot.git
```

- Install all packages in the `requirements.txt` file.

```bash
pip install -r requirements.txt
```

- Ensure the directory structure is as follows:
```
.
├── README.md
├── data
│   ├── batches
│   │   ├── batch_outputs_gemini_prediction-model-2025-11-20T01_34_10.422908Z_predictions.jsonl
│   │   └── pH_regression_gemini.jsonl
│   ├── folds
│   │   └── data_pH.json
│   └── grid.csv
├── pipeline
│   ├── batch.py
│   ├── batch_inference.py
│   ├── evaluator.py
│   ├── preprocessor.py
│   ├── prompt.py
│   └── upload_to_gcp.py
├── .env
├── requirements.txt
└── .gitignore
```

In the above directory structure:
- the `*_regression_gemini.jsonl` and `data_*.json` files will be created automatically after `pipeline/preprocessor.py` is executed.
- the `batch_outputs_*_predictions.jsonl` file will be created by GCP after your batch inference task finishes executing. (This file can be downloaded from the GCP bucket; see the sketch below.)
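If you prefer to fetch the predictions file programmatically instead of via the console, here is a minimal sketch using the `google-cloud-storage` client; the bucket and blob names are placeholders, and it assumes you have already authenticated as described in the next step:

```python
from google.cloud import storage

# Placeholders: substitute your own bucket name and the blob path
# that your batch job actually wrote to.
client = storage.Client()
bucket = client.bucket("<YOUR GCP BUCKET NAME>")
blob = bucket.blob("<PATH/TO/batch_outputs_..._predictions.jsonl>")
blob.download_to_filename("data/batches/predictions.jsonl")
```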
- Install the `gcloud` CLI using the archives from Google. Then execute the following commands:

```bash
gcloud init
gcloud auth application-default login
```

This will ensure you are authenticated, and the credentials will be stored locally.
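You can verify that the Application Default Credentials are visible to Python before running any pipeline script. This uses `google-auth`, which the GCP client libraries depend on:

```python
import google.auth

# Loads the Application Default Credentials written by
# `gcloud auth application-default login`. Raises
# DefaultCredentialsError if authentication has not been set up.
credentials, project_id = google.auth.default()
print("Authenticated; default project:", project_id)
```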
- Create a `.env` file and keep two secrets in it: `PROJECT_ID` and `BUCKET_NAME`. If you are using the API-key approach to authenticate, the key goes in this file as well.

```bash
touch .env
```

Your `.env` file should look as follows:

```
PROJECT_ID=<YOUR GCP PROJECT ID>
BUCKET_NAME=<YOUR GCP BUCKET NAME>
```
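The scripts in `pipeline/` presumably read these two values from the environment; a minimal sketch of loading them with `python-dotenv` (an assumption about how the repository consumes `.env`):

```python
import os

from dotenv import load_dotenv

# Reads key=value pairs from .env into the process environment.
load_dotenv()

PROJECT_ID = os.environ["PROJECT_ID"]
BUCKET_NAME = os.environ["BUCKET_NAME"]
```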
- Execute the following Python script to create the batches:

```python
import pandas as pd

from pipeline.preprocessor import BatchCreator

df = pd.read_csv("data/grid.csv")
df.rename(columns={'Organic.Matter....': 'SOM'}, inplace=True)

targets = ["pH", "SOM"]
for target in targets:
    batch_creator = BatchCreator(df.copy(), target)
    batch_creator.create_batch()
    batch_creator.save_batches_as_jsonl()
```
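Before uploading, you can sanity-check one of the generated batch files. The file name below comes from the directory tree above, while the record schema depends on the prompt built in `pipeline/prompt.py`:

```python
import json

# Print the first request of the pH batch for a quick inspection.
with open("data/batches/pH_regression_gemini.jsonl") as f:
    first_record = json.loads(f.readline())

print(json.dumps(first_record, indent=2))
```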
- Execute the commands to push the batches to the GCP bucket. (**)

```bash
cd pipeline
python3 upload_to_gcp.py
```
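For reference, the upload boils down to something like the following `google-cloud-storage` call; this is a sketch, not necessarily how `upload_to_gcp.py` names its blobs:

```python
import os

from dotenv import load_dotenv
from google.cloud import storage

load_dotenv()

# The destination blob name here is a hypothetical choice.
client = storage.Client(project=os.environ["PROJECT_ID"])
bucket = client.bucket(os.environ["BUCKET_NAME"])
bucket.blob("batches/pH_regression_gemini.jsonl").upload_from_filename(
    "data/batches/pH_regression_gemini.jsonl"
)
```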
- Execute the commands to submit batch inferences to Vertex AI. (**)

```bash
cd pipeline
python3 batch_inference.py
```
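A minimal sketch of what such a submission looks like with the Gen AI SDK (`google-genai`); this is an assumption about how `batch_inference.py` works, and the region and GCS paths are placeholders:

```python
import os

from dotenv import load_dotenv
from google import genai
from google.genai.types import CreateBatchJobConfig

load_dotenv()

client = genai.Client(
    vertexai=True,
    project=os.environ["PROJECT_ID"],
    location="us-central1",  # assumed region
)

# Input/output URIs are hypothetical; match them to your bucket layout.
job = client.batches.create(
    model="gemini-2.5-flash",
    src=f"gs://{os.environ['BUCKET_NAME']}/batches/pH_regression_gemini.jsonl",
    config=CreateBatchJobConfig(dest=f"gs://{os.environ['BUCKET_NAME']}/outputs"),
)
print(job.name, job.state)
```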
- Run the Python code below to evaluate the regression model:

```python
import pandas as pd

from pipeline.evaluator import Evaluator

df = pd.read_csv("data/grid.csv")
evaluator = Evaluator(df=df,
                      target="pH",
                      folds_file_path="data/folds/data_pH.json",
                      predictions_file_path="<YOUR PREDICTION FILE PATH>")
```

- This should execute the code in the repository successfully. If there are problems, you can raise an issue!
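For context, fold-wise evaluation of a regressor typically reports RMSE and R²; below is a generic sketch of such metrics, independent of `pipeline/evaluator.py`, whose internals are not shown here:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def fold_metrics(y_true, y_pred):
    # Hypothetical helper: standard regression metrics for one fold.
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return {"rmse": rmse, "r2": float(r2_score(y_true, y_pred))}
```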
** This execution requires billing enabled on GCP.
Note: When I executed these batches in Vertex AI (`gemini-2.5-flash`), I observed an issue with `maxOutputTokens`. The internal limit is 65,536 tokens. In a few cases, batch requests finished early because they exceeded this token limit, and no output was returned as the model was interrupted. I check for this in `pipeline/evaluator.py` before extracting the final prediction.
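A minimal sketch of such a check, assuming each record in the predictions file follows the Vertex AI batch output schema (a truncated response is flagged with `finishReason == "MAX_TOKENS"`); this is not the repository's exact implementation:

```python
import json

def has_usable_output(record: dict) -> bool:
    # A response cut off at the output-token limit carries
    # finishReason == "MAX_TOKENS" and may contain no text at all.
    # The nesting under "response" mirrors Vertex AI batch output
    # files; treat the exact structure as an assumption.
    candidates = record.get("response", {}).get("candidates", [])
    if not candidates:
        return False
    return candidates[0].get("finishReason") != "MAX_TOKENS"

# Usage: drop truncated records before extracting predictions.
with open("data/batches/predictions.jsonl") as f:  # hypothetical path
    records = [json.loads(line) for line in f if line.strip()]
usable = [r for r in records if has_usable_output(r)]
```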