Hi,
I am working on a machine learning project where my objective is to predict crop yield using various features. My workflow involves identifying the best-performing regression models, simplifying them while maintaining good performance metrics, and ensuring logical interpretability. However, I am encountering challenges in implementing certain parts of the workflow, and I would greatly appreciate your assistance.
Project Goals and Workflow:
Splitting data correctly:
Since the climate differs greatly across the sites, I want to split the data in two steps.
– First, split the dataset by water balance (TWB), using the median as the threshold. This ensures that both the training and testing sets contain data points representing both low and high water balance.
– After splitting by TWB, further split the data by siteyear.
This two-step splitting approach will ensure that the model is trained and tested on diverse conditions (both high and low water balance) while avoiding any data leakage caused by overlapping siteyear groups.
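
One way the two-step split could be sketched is shown below. It is only a sketch under stated assumptions: the column names `twb` and `siteyear`, the helper `split_by_twb_and_siteyear`, and the synthetic data are hypothetical. It also makes one deliberate change to the description above: TWB is averaged per siteyear before applying the median threshold, so that every row of a siteyear lands on the same side of the split. If rows of one siteyear were labelled low/high individually and each half split separately, the same siteyear could end up in both training and testing sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_by_twb_and_siteyear(df, twb_col="twb", group_col="siteyear",
                              test_size=0.25, seed=0):
    """Hypothetical helper: stratify siteyears by low/high TWB, then split."""
    # Step 1: one TWB value per siteyear (assumption: the mean is representative).
    site_twb = df.groupby(group_col)[twb_col].mean()
    # Label each siteyear as high (1) or low (0) relative to the median.
    is_high = (site_twb > site_twb.median()).astype(int)
    # Step 2: split whole siteyears, stratified by the low/high label.
    train_groups, test_groups = train_test_split(
        site_twb.index.to_numpy(),
        test_size=test_size,
        stratify=is_high.to_numpy(),
        random_state=seed,
    )
    train = df[df[group_col].isin(train_groups)]
    test = df[df[group_col].isin(test_groups)]
    return train, test


# Synthetic stand-in data: 20 siteyears with 5 rows each.
rng = np.random.default_rng(0)
n_groups = 20
demo = pd.DataFrame({
    "siteyear": np.repeat([f"S{i}" for i in range(n_groups)], 5),
    "twb": np.repeat(rng.normal(0, 100, n_groups), 5)
           + rng.normal(0, 5, n_groups * 5),
})
train_df, test_df = split_by_twb_and_siteyear(demo)
print(train_df["siteyear"].nunique(), test_df["siteyear"].nunique())
```

Splitting at the siteyear level keeps the leakage guarantee, while the stratification keeps both water-balance regimes in each set.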
Model Comparison:
I want to evaluate different models (e.g., CatBoost, Random Forest, XGBoost, and SVT) to identify the one or two that perform best in terms of R² and RMSE.
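
A comparison loop along these lines might look as follows. This is a minimal sketch, not a definitive setup: the data are synthetic, `groups` stands in for siteyear, and only scikit-learn estimators are used. `GradientBoostingRegressor` is a stand-in for the boosted-tree libraries (CatBoost and XGBoost regressors follow the scikit-learn estimator interface and could be added to the dictionary), and `SVR` is an assumed support-vector option.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data; `groups` plays the role of siteyear.
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
groups = np.repeat(np.arange(20), 10)

models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    # SVR is scale-sensitive, so it is wrapped with a scaler.
    "svr": make_pipeline(StandardScaler(), SVR()),
}

cv = GroupKFold(n_splits=5)
results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, groups=groups, cv=cv,
        scoring=("r2", "neg_root_mean_squared_error"),
    )
    results[name] = {
        "r2": scores["test_r2"].mean(),
        # scikit-learn reports RMSE negated so that higher is better.
        "rmse": -scores["test_neg_root_mean_squared_error"].mean(),
    }

for name, metrics in results.items():
    print(f"{name}: R2={metrics['r2']:.3f}, RMSE={metrics['rmse']:.2f}")
```

Using the same `GroupKFold` splitter for every model keeps the comparison fair, since each model sees identical folds.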
Feature Selection for Simplicity:
Once the best models are identified, I aim to reduce the number of features to create a simpler model with good generalization.
The selected features should not only give good performance metrics but also make sense from a scientific perspective.
Finding the Best Number of Features:
I plan to experiment with different numbers of features and evaluate which subset provides the optimal balance between simplicity and performance.
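
One possible way to search over the number of features is sketched below, assuming a Random Forest importance ranking and a hypothetical tolerance of 0.01 in R²: rank the features once, score the top-k subsets with grouped cross-validation, and keep the smallest k that stays close to the best score. The data are synthetic. Note that ranking on the full training data before cross-validating is a simplification and can make the scores slightly optimistic; the ranking can be moved inside each fold if that matters.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic stand-in data; `groups` plays the role of siteyear.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
groups = np.repeat(np.arange(20), 10)

# Rank features by impurity-based importance, most important first.
ranker = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(ranker.feature_importances_)[::-1]

cv = GroupKFold(n_splits=5)
mean_r2 = {}
for k in range(1, X.shape[1] + 1):
    cols = order[:k]  # indices of the k top-ranked features
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    mean_r2[k] = cross_val_score(
        model, X[:, cols], y, groups=groups, cv=cv, scoring="r2"
    ).mean()

best = max(mean_r2.values())
tolerance = 0.01  # hypothetical: accepted loss in R2 for a simpler model
chosen_k = min(k for k, score in mean_r2.items() if score >= best - tolerance)
print(chosen_k, order[:chosen_k])
```

The chosen subset can then be checked against domain knowledge, and features that make no scientific sense can be swapped out before re-scoring.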
Cross-Validation and Hyperparameter Tuning:
For cross-validation, I am using GroupKFold to respect the grouping structure defined by siteyear.
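
A minimal sketch of combining `GroupKFold` with hyperparameter tuning is shown below, assuming a small illustrative parameter grid and synthetic data in place of the real features. The `groups` array stands in for siteyear and is passed to `fit`, which forwards it to the splitter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in data; `groups` plays the role of siteyear.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
groups = np.repeat(np.arange(20), 10)

cv = GroupKFold(n_splits=5)

# Sanity check: no group appears on both sides of any fold.
folds_are_disjoint = all(
    set(groups[train_idx]).isdisjoint(groups[val_idx])
    for train_idx, val_idx in cv.split(X, y, groups)
)

# Illustrative grid only; the real search space depends on the model.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y, groups=groups)
print(search.best_params_, -search.best_score_)
```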
Final Model and Performance Evaluation:
After selecting the best model and features, I want to train the final model and compare the actual vs. predicted yields.
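
The final step could be sketched as follows, assuming a Random Forest as the chosen model and synthetic data. The held-out set is built from whole groups (standing in for siteyears), the model is fitted on the rest, and actual and predicted values are compared through R², RMSE, and the residuals.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in data; `groups` plays the role of siteyear.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
groups = np.repeat(np.arange(20), 10)

# Hold out 25% of the groups as a test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))

final_model = RandomForestRegressor(n_estimators=200, random_state=0)
final_model.fit(X[train_idx], y[train_idx])
y_pred = final_model.predict(X[test_idx])

r2 = r2_score(y[test_idx], y_pred)
rmse = float(np.sqrt(mean_squared_error(y[test_idx], y_pred)))
residuals = y[test_idx] - y_pred  # actual minus predicted

print(f"R2={r2:.3f}, RMSE={rmse:.2f}")
# First few actual/predicted pairs, side by side.
print(np.column_stack([y[test_idx], y_pred])[:5])
```

A scatter plot of `y[test_idx]` against `y_pred` with a 1:1 reference line is a common way to inspect the same comparison visually.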
Thank you