This project explores clustering techniques to identify patterns within a dataset. The primary goal was to apply different clustering algorithms, evaluate their effectiveness, and determine the best method for grouping data points.
Dimensionality reduction techniques, such as Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), were used to visualize the high-dimensional data effectively. Clustering algorithms, including K-Means and DBSCAN, were applied to identify meaningful clusters, and various evaluation metrics were used to assess their performance.
- Loaded the dataset and checked for missing values or anomalies.
- Applied feature scaling to standardize the dataset before clustering.
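
The preprocessing steps above can be sketched roughly as follows. The project's dataset is not named here, so the array below is a synthetic stand-in, and dropping incomplete rows is an assumed handling strategy rather than the one actually used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the project's dataset (the real one is not named).
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(300, 8))
X_raw[5, 2] = np.nan  # simulate a single missing value

# Count missing values per feature before deciding how to handle them.
missing_per_feature = np.isnan(X_raw).sum(axis=0)
print("Missing values per feature:", missing_per_feature)

# Assumed handling: drop every row that contains a missing value.
X_clean = X_raw[~np.isnan(X_raw).any(axis=1)]

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_clean)
```

Scaling matters for both K-Means and DBSCAN because they rely on distances, so a feature with a large numeric range would otherwise dominate the result.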
- Applied PCA (Principal Component Analysis) to reduce the dataset dimensions while preserving variance.
- Applied t-SNE (t-Distributed Stochastic Neighbor Embedding) for better cluster visualization.
- Observed that t-SNE outperformed PCA in visual separability, as it captured local structures more effectively.
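
A minimal sketch of the two projections, assuming scikit-learn and a synthetic dataset from `make_blobs` in place of the real one. The `perplexity` and `random_state` values are illustrative choices, not the settings used in the project.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: 200 points in 10 dimensions.
X, _ = make_blobs(n_samples=200, n_features=10, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# PCA: linear projection onto the two directions of highest variance.
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 PCA components: {explained:.2%}")

# t-SNE: nonlinear embedding that tries to keep neighboring points close.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X_scaled)
```

Both outputs are two-column arrays that can be passed to a scatter plot. Distances between far-apart groups in a t-SNE plot are generally not meaningful, so it is best treated as a visualization aid rather than as input for distance-based evaluation.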
- Used the Elbow Method to determine the optimal number of clusters (`K=2`).
- Applied K-Means clustering with the chosen `K` value.
- Evaluated cluster quality using:
  - Silhouette Score: 0.2269 (low but indicates some structure)
  - Davies-Bouldin Index: 1.6356 (moderate cluster separation)
  - Inertia: 29,322,746 (sum of squared distances to cluster centroids)
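
The K-Means workflow above might look like the sketch below. It assumes scikit-learn and synthetic data, so the printed scores will not match the values reported for the real dataset. The range of `k` values scanned for the elbow is also an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data standing in for the scaled dataset.
X, _ = make_blobs(n_samples=300, n_features=6, centers=2, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Elbow Method: record inertia for a range of k and look for the bend.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

# Fit the final model with the chosen number of clusters (K=2 in this project).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Silhouette: higher is better (range -1 to 1). Davies-Bouldin: lower is better.
sil = silhouette_score(X_scaled, labels)
dbi = davies_bouldin_score(X_scaled, labels)
print(f"Silhouette: {sil:.4f}  Davies-Bouldin: {dbi:.4f}  Inertia: {kmeans.inertia_:.1f}")
```

Inertia depends on the number of samples and the feature scale, so it is useful for comparing values of `k` on the same data but not for comparing across datasets.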
- Applied DBSCAN (Density-Based Spatial Clustering of Applications with Noise) with different `eps` and `min_samples` values.
- Despite tuning, DBSCAN failed to form meaningful clusters, likely because the dataset lacks dense regions separated by sparser ones.
- Most points were either labeled as noise or assigned to a single cluster.
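
The parameter search described above can be sketched as a small grid over `eps` and `min_samples`, counting clusters and noise points for each combination. The grid values and the single-blob synthetic data are assumptions chosen only to illustrate the procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder: one Gaussian blob with no clear density contrast.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X_scaled = StandardScaler().fit_transform(X)

results = []
for eps in (0.5, 1.0, 2.0, 3.0):
    for min_samples in (5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
        # DBSCAN marks noise points with the label -1.
        n_clusters = len(set(labels) - {-1})
        noise_fraction = float(np.mean(labels == -1))
        results.append((eps, min_samples, n_clusters, noise_fraction))

for eps, min_samples, n_clusters, noise_fraction in results:
    print(f"eps={eps} min_samples={min_samples} "
          f"clusters={n_clusters} noise={noise_fraction:.0%}")
```

On data without density contrast, small `eps` values tend to label most points as noise and large values tend to merge everything into one cluster, which matches the behavior reported above.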
- K-Means performed better in this scenario: it produced two distinguishable clusters, although the silhouette score of 0.2269 shows that they overlap considerably.
- DBSCAN struggled, indicating that the dataset might not have the density-based structure it relies on.
- Dimensionality reduction improved visualization but did not significantly impact DBSCAN’s clustering quality.
- PCA vs. t-SNE: t-SNE provided better cluster separability than PCA in visualization.
- K-Means vs. DBSCAN: K-Means produced usable clusters on this data, whereas DBSCAN was ineffective because the data lacks the density contrast it relies on.
- Real-World Relevance: In practical applications like speech or text clustering, choosing the right algorithm based on dataset characteristics is crucial.
This project demonstrated the importance of dimensionality reduction in clustering, the effectiveness of K-Means for structured data, and the challenges of using DBSCAN when density variations are absent. The results highlight the need for data-driven algorithm selection to achieve meaningful clustering outcomes.
- Experiment with different distance metrics through K-Means variants such as K-Medoids, since standard K-Means minimizes squared Euclidean distance.
- Apply Hierarchical Clustering to explore alternative approaches.
- Improve DBSCAN performance by adjusting feature scaling or selecting alternative similarity measures.
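
As a starting point for the hierarchical clustering item, the sketch below compares a few linkage strategies using scikit-learn's `AgglomerativeClustering`. It was not part of the project, and it assumes synthetic data and two clusters to mirror the K-Means setup.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data standing in for the scaled dataset.
X, _ = make_blobs(n_samples=300, n_features=6, centers=2, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Compare linkage strategies with the same metric used to evaluate K-Means.
scores = {}
for linkage in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X_scaled)
    scores[linkage] = silhouette_score(X_scaled, labels)

for linkage, score in scores.items():
    print(f"{linkage}: silhouette={score:.4f}")
```

Reusing the silhouette score keeps the comparison with the K-Means result on the same footing.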