- To predict the value of pH using remote sensing indices, topography, and sampling data.
- This repository aims to experiment with whether pre-trained knowledge from large language models helps predict pH in a cold-start setting.
- Clone this repository.

```bash
git clone https://github.com/Gaurav0502/ph-regressor-cot.git
```

- Install all packages in the `requirements.txt` file.

```bash
pip install -r requirements.txt
```

- Ensure the directory structure is as follows:
```
.
├── README.md
├── data
│   ├── batches
│   │   ├── batch_outputs_gemini_prediction-model-2025-11-20T01_34_10.422908Z_predictions.jsonl
│   │   └── pH_regression_gemini.jsonl
│   ├── folds
│   │   └── data_pH.json
│   └── grid.csv
├── pipeline
│   ├── batch.py
│   ├── batch_inference.py
│   ├── evaluator.py
│   ├── preprocessor.py
│   ├── prompt.py
│   └── upload_to_gcp.py
├── .env
├── requirements.txt
└── .gitignore
```

In the above directory structure:
- the `*_regression_gemini.jsonl` and `data_*.json` files will be created automatically after `pipeline/preprocessor.py` is executed.
- the `batch_outputs_*_predictions.jsonl` file will be created by GCP after your batch inference task finishes executing. (This file can be downloaded from the GCP bucket; see the sketch below.)
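If you prefer to fetch the predictions file programmatically instead of via the console, here is a minimal sketch using the `google-cloud-storage` client; the bucket and blob names are placeholders, and it assumes you have already authenticated as described in the next step:

```python
from google.cloud import storage

# Placeholders: substitute your own bucket name and the blob path
# that your batch job actually wrote to.
client = storage.Client()
bucket = client.bucket("<YOUR GCP BUCKET NAME>")
blob = bucket.blob("<PATH/TO/batch_outputs_..._predictions.jsonl>")
blob.download_to_filename("data/batches/predictions.jsonl")
```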
- Install the `gcloud` CLI using the archives from Google. Then execute the following commands:

```bash
gcloud init
gcloud auth application-default login
```

This will ensure you are authenticated, and the credentials will be stored locally.
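You can verify that the Application Default Credentials are visible to Python before running any pipeline script. This uses `google-auth`, which the GCP client libraries depend on:

```python
import google.auth

# Loads the Application Default Credentials written by
# `gcloud auth application-default login`. Raises
# DefaultCredentialsError if authentication has not been set up.
credentials, project_id = google.auth.default()
print("Authenticated; default project:", project_id)
```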
- Create a `.env` file and keep two secrets in it: `PROJECT_ID` and `BUCKET_NAME`. If you are using the API-key approach to authenticate, the key goes in this file as well.

```bash
touch .env
```

Your `.env` file should look as follows:

```
PROJECT_ID=<YOUR GCP PROJECT ID>
BUCKET_NAME=<YOUR GCP BUCKET NAME>
```
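The scripts in `pipeline/` presumably read these two values from the environment; a minimal sketch of loading them with `python-dotenv` (an assumption about how the repository consumes `.env`):

```python
import os

from dotenv import load_dotenv

# Reads key=value pairs from .env into the process environment.
load_dotenv()

PROJECT_ID = os.environ["PROJECT_ID"]
BUCKET_NAME = os.environ["BUCKET_NAME"]
```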
- Execute the following Python script to create the batches:

```python
import pandas as pd

from pipeline.preprocessor import BatchCreator

df = pd.read_csv("data/grid.csv")
df.rename(columns={'Organic.Matter....': 'SOM'}, inplace=True)

targets = ["pH", "SOM"]
for target in targets:
    batch_creator = BatchCreator(df.copy(), target)
    batch_creator.create_batch()
    batch_creator.save_batches_as_jsonl()
```
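Before uploading, you can sanity-check one of the generated batch files. The file name below comes from the directory tree above, while the record schema depends on the prompt built in `pipeline/prompt.py`:

```python
import json

# Print the first request of the pH batch for a quick inspection.
with open("data/batches/pH_regression_gemini.jsonl") as f:
    first_record = json.loads(f.readline())

print(json.dumps(first_record, indent=2))
```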
- Execute the commands to push the batches to the GCP bucket. (**)

```bash
cd pipeline
python3 upload_to_gcp.py
```
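For reference, the upload boils down to something like the following `google-cloud-storage` call; this is a sketch, not necessarily how `upload_to_gcp.py` names its blobs:

```python
import os

from dotenv import load_dotenv
from google.cloud import storage

load_dotenv()

# The destination blob name here is a hypothetical choice.
client = storage.Client(project=os.environ["PROJECT_ID"])
bucket = client.bucket(os.environ["BUCKET_NAME"])
bucket.blob("batches/pH_regression_gemini.jsonl").upload_from_filename(
    "data/batches/pH_regression_gemini.jsonl"
)
```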
- Execute the commands to submit batch inferences to Vertex AI. (**)

```bash
cd pipeline
python3 batch_inference.py
```
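A minimal sketch of what such a submission looks like with the Gen AI SDK (`google-genai`); this is an assumption about how `batch_inference.py` works, and the region and GCS paths are placeholders:

```python
import os

from dotenv import load_dotenv
from google import genai
from google.genai.types import CreateBatchJobConfig

load_dotenv()

client = genai.Client(
    vertexai=True,
    project=os.environ["PROJECT_ID"],
    location="us-central1",  # assumed region
)

# Input/output URIs are hypothetical; match them to your bucket layout.
job = client.batches.create(
    model="gemini-2.5-flash",
    src=f"gs://{os.environ['BUCKET_NAME']}/batches/pH_regression_gemini.jsonl",
    config=CreateBatchJobConfig(dest=f"gs://{os.environ['BUCKET_NAME']}/outputs"),
)
print(job.name, job.state)
```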
- Run the Python code below to evaluate the regression model:

```python
import pandas as pd

from pipeline.evaluator import Evaluator

df = pd.read_csv("data/grid.csv")
evaluator = Evaluator(df=df,
                      target="pH",
                      folds_file_path="data/folds/data_pH.json",
                      predictions_file_path="<YOUR PREDICTION FILE PATH>")
```

- This should execute the code in the repository successfully. If there are problems, you can raise an issue!
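For context, fold-wise evaluation of a regressor typically reports RMSE and R²; below is a generic sketch of such metrics, independent of `pipeline/evaluator.py`, whose internals are not shown here:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def fold_metrics(y_true, y_pred):
    # Hypothetical helper: standard regression metrics for one fold.
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return {"rmse": rmse, "r2": float(r2_score(y_true, y_pred))}
```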
** This execution requires billing enabled on GCP.
Note: When I executed these batches in Vertex AI (`gemini-2.5-flash`), I observed an issue with `maxOutputTokens`. The internal limit is 65,536 tokens. In a few cases, batch requests finished early because they exceeded this token limit, and no output was returned as the model was interrupted. I check for this in `pipeline/evaluator.py` before extracting the final prediction.
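A minimal sketch of such a check, assuming each record in the predictions file follows the Vertex AI batch output schema (a truncated response is flagged with `finishReason == "MAX_TOKENS"`); this is not the repository's exact implementation:

```python
import json

def has_usable_output(record: dict) -> bool:
    # A response cut off at the output-token limit carries
    # finishReason == "MAX_TOKENS" and may contain no text at all.
    # The nesting under "response" mirrors Vertex AI batch output
    # files; treat the exact structure as an assumption.
    candidates = record.get("response", {}).get("candidates", [])
    if not candidates:
        return False
    return candidates[0].get("finishReason") != "MAX_TOKENS"

# Usage: drop truncated records before extracting predictions.
with open("data/batches/predictions.jsonl") as f:  # hypothetical path
    records = [json.loads(line) for line in f if line.strip()]
usable = [r for r in records if has_usable_output(r)]
```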