A web scraping tool for extracting and analyzing placement data from the college placement portal.
This project automates the extraction of company information from placement.******.ac.in, transforming it into a structured format suitable for data analysis. It handles authentication, data extraction, and standardization of inconsistent information.
- Python 3.7+
- Edge WebDriver
- Python packages: requests>=2.28.0, beautifulsoup4>=4.11.0, selenium>=4.0.0, pandas>=1.4.0
- Install required packages
- Configure credentials in extractor.ipynb (username/password):
export USER="whatever"
export PASS="whatever"
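
If the credentials are exported as shown above, the notebook can read them from the environment rather than hard-coding them. The following is a minimal sketch, not the project's actual code: the function name `load_credentials` and the optional `env` argument are hypothetical and exist only to make the example testable.

```python
import os


def load_credentials(env=None):
    """Return (username, password) read from the USER and PASS variables.

    `env` is a hypothetical hook for testing; by default the real
    process environment is used.
    """
    env = os.environ if env is None else env
    # Collect every variable that is unset or empty so the error names them all.
    missing = [name for name in ("USER", "PASS") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["USER"], env["PASS"]
```

Note that on most Unix-like systems `USER` is already set to the login name, so exporting it overrides that value for the current shell session.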
The extraction process involves two steps:
- Run the link extraction script:
This generates companies.csv, which lists all company URLs.
- Run the extractor.ipynb notebook (requires manual intervention) to:
  - Process the extracted links
  - Scrape detailed company information
  - Standardize the data structure
  - Create a pandas DataFrame
  - Export the data to companies_processed.csv
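
The link extraction step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's script: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the names `LinkCollector`, `extract_links`, `write_links_csv`, the `must_contain` filter, the `url` column header, and the `portal.example` address are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, must_contain="company"):
    """Return unique links containing `must_contain`, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    seen = set()
    unique = []
    for url in parser.links:
        if must_contain in url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def write_links_csv(urls, handle):
    """Write one URL per row under a single 'url' header."""
    writer = csv.writer(handle)
    writer.writerow(["url"])
    for url in urls:
        writer.writerow([url])


# Made-up listing page; the real portal markup is not known here.
sample_html = """
<a href="/company/1">A</a>
<a href="/company/2">B</a>
<a href="/company/1">A again</a>
<a href="/logout">Log out</a>
"""
links = extract_links(sample_html, "https://portal.example/")
buffer = io.StringIO()
write_links_csv(links, buffer)
```

In practice the file handle would come from `open("companies.csv", "w", newline="")` instead of the in-memory buffer used here.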
Companies #136 and #137 have inconsistent data structures that require manual handling.
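
One possible way to keep such manual fixes separate from the general standardization logic is a per-company override table. The sketch below is an assumption, not the notebook's actual approach: the column names in `COLUMNS`, the `ALIASES` mapping, the `standardize` function, and the sample records are all invented for illustration, and the real portal fields may differ.

```python
# Hypothetical canonical columns for the output table.
COLUMNS = ["name", "role", "ctc"]

# Hypothetical mapping from inconsistent field labels to canonical columns.
ALIASES = {
    "company": "name",
    "company name": "name",
    "profile": "role",
    "package": "ctc",
}


def standardize(record, number, overrides=None):
    """Map one scraped record onto COLUMNS.

    `overrides` maps a company number to hand-written values that
    replace whatever was scraped for that company.
    """
    overrides = overrides or {}
    row = {column: None for column in COLUMNS}
    for key, value in record.items():
        label = key.strip().lower()
        canonical = ALIASES.get(label, label)
        if canonical in row:
            row[canonical] = value.strip() if isinstance(value, str) else value
    # Manual fixes are applied last so they always win.
    row.update(overrides.get(number, {}))
    return row


# Invented records, purely for demonstration.
row = standardize({"Company Name": " Acme ", "Profile": "SDE", "Stipend": "n/a"}, 1)
fixed = standardize({"Company": "Foo"}, 136, {136: {"role": "filled in by hand"}})
```

A list of such rows can be passed to `pandas.DataFrame` and written out with `DataFrame.to_csv("companies_processed.csv", index=False)`.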
For new issues: sgsonu132@gmail.com (Sumit)
Made for personal use. The code can be adapted to other purposes with minor tweaks.