The aim of this project is to create an image classification model to detect yoga positions from videos.
A possible application is extracting a list of all the positions performed during a yoga class video, whether from YouTube or stored on your computer; you can also record your practice through your webcam to get insights about it and keep track of your improvements.
An asanas database was also created to back up the model with information about each position.
- Numpy
- Pandas
- os
- cv2
- requests, BeautifulSoup
- sklearn
- tensorflow, keras, pickle
The following information was obtained for each position the model can detect:
- Asana Sanskrit Name
- Asana English Name
- Difficulty level
- Pose Type
- Instructions
- Drishti (where to focus the sight)
- Cautions
- Benefits
This information was collected by web scraping the site yogapedia.com.
It is all stored in the dataframe data/asanas_df.csv.
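The scraping step can be sketched with BeautifulSoup as below. The HTML snippet and the class names in the selectors are illustrative assumptions, not the real yogapedia.com markup:

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for a pose page; the real yogapedia
# markup differs, so the tags and class names here are assumptions.
html = """
<div class="pose">
  <h1>Tadasana</h1>
  <p class="english">Mountain Pose</p>
  <span class="level">Beginner</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract one record; in the project, one such record per asana
# would be appended to the dataframe saved as data/asanas_df.csv.
record = {
    "sanskrit_name": soup.find("h1").get_text(strip=True),
    "english_name": soup.find("p", class_="english").get_text(strip=True),
    "difficulty": soup.find("span", class_="level").get_text(strip=True),
}
```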
The images used to train the model were obtained starting from a Yoga Asanas classification dataset on Kaggle. The dataset was slightly modified in order to:
- delete duplicates
- delete images of poorly performed positions
- delete images containing text, or with too many people or objects in the background
- correctly categorize all images
- add new data so that roughly the same number of images was available for each position.
At the end of this process, between 40 and 50 images were available for each of the 84 positions.
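The duplicate-removal step can be sketched by hashing each image file's bytes, so that only the first copy of any exact duplicate is kept. The function below is a minimal illustration, not the exact cleaning script used:

```python
import hashlib

def find_duplicates(files):
    """files: iterable of (name, raw_bytes) pairs.
    Returns the names of files that are exact byte-for-byte duplicates
    of an earlier file, keeping the first occurrence."""
    seen = {}
    duplicates = []
    for name, data in files:
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            duplicates.append(name)
        else:
            seen[digest] = name
    return duplicates
```

Near-duplicates (resized or re-encoded copies) would need a perceptual hash instead, since their raw bytes differ.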
The key considerations include choosing between Convolutional Neural Networks (CNN) and DenseNet architectures. Additionally, various strategies such as data augmentation, early stopping, learning rate decay, dropout, and experimenting with different parameters are discussed.
CNNs are a natural choice for image classification tasks due to their ability to automatically learn hierarchical features from image data. CNNs consist of convolutional layers that scan the input image with small filters, enabling the model to capture local patterns.
DenseNet, or Densely Connected Convolutional Networks, focuses on connecting each layer to every other layer in a feedforward fashion. This architecture facilitates feature reuse and enhances the flow of information through the network. DenseNet is beneficial when dealing with limited data or when a highly expressive model is needed.
- Data Augmentation: is employed to artificially expand the dataset, helping the model generalize better to unseen data. Techniques such as rotation, flipping, and zooming are applied to augment the training set.
- Early Stopping: prevents overfitting by monitoring the model's performance on a validation set. Training is halted when there is no improvement in the validation accuracy, preventing the model from learning noise in the training data.
- Learning Rate Decay: Learning rate decay involves systematically reducing the learning rate during training. This can help the model converge faster in the beginning and fine-tune more precisely towards the end of training.
- Dropout: is a regularization technique where random neurons are dropped during training, preventing the model from relying too heavily on specific features. This enhances generalization and robustness.
- Experimenting with Different Parameters: The following parameters are systematically varied to optimize model performance:
- Number of Epochs: The number of times the entire training dataset is passed through the neural network. This parameter is adjusted to find the right balance between underfitting and overfitting.
- Batch Size: The number of training examples utilized in one iteration. A smaller batch size may provide regularization effects and reduce memory requirements.
- Optimizer: Different optimization algorithms, such as Adam, SGD, or RMSprop, are tested to identify the one that works best for the specific image classification task.
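Two of the strategies above, learning rate decay and early stopping, can be sketched framework-free as below. In the actual project these would be Keras callbacks (e.g. `EarlyStopping`, `LearningRateScheduler`); the names and default values here are illustrative:

```python
def decayed_lr(initial_lr, decay_rate, epoch):
    """Exponential learning rate decay: the rate shrinks by a constant
    factor each epoch, allowing fast early progress and fine late tuning."""
    return initial_lr * (decay_rate ** epoch)

class EarlyStopper:
    """Stop training when validation accuracy fails to improve
    for `patience` consecutive epochs."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.stale = 0

    def should_stop(self, val_accuracy):
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```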
Below are reported the accuracy results for each combination of batch size and optimizer, for both the CNN model and the DenseNet model.
In the end only 30 epochs were used, since the Early Stopping technique halted the process before the thirtieth iteration in almost every case.
The final choice was the DenseNet model with 30 epochs, a batch size of 16 and the Adam optimizer.
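The parameter sweep described above amounts to a simple grid search. The sketch below shows the selection logic only; `train_and_evaluate` is a hypothetical stand-in for an actual training run, and the batch sizes and optimizer names are examples of the kind of grid used:

```python
from itertools import product

def train_and_evaluate(batch_size, optimizer):
    """Hypothetical stand-in for a real training run; it would train the
    model with these settings and return its validation accuracy."""
    raise NotImplementedError

def pick_best_config(results):
    """results maps (batch_size, optimizer) -> validation accuracy;
    returns the configuration with the highest accuracy."""
    return max(results, key=results.get)

# The sweep itself would look like:
# results = {}
# for batch_size, optimizer in product([16, 32, 64], ["adam", "sgd", "rmsprop"]):
#     results[(batch_size, optimizer)] = train_and_evaluate(batch_size, optimizer)
# best = pick_best_config(results)
```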
I tested the model with a short yoga sequence video; some results that highlight possible next steps for improving the model are reported below.
This example shows how the model can get confused when no specific position is being performed and the body is transitioning from one position to the next.
In this case the position tadasana was guessed incorrectly on one occasion, and this may have happened because the model focused on the shape of the cloud instead of the body.
First of all, the training data should be expanded so that at least one hundred images are available for each position. It may also be useful to find a way to remove the background both from the training images and from the video screenshots the model is applied to.
After this, the different combinations of model parameters should be tested again in order to find the best ones.
https://www.canva.com/design/DAF2GWbsSOI/9EPb6kACrElXApp04keUYA/edit

