An automated job data extraction system that scrapes job listings from Indeed, processes the data, and exports it to Excel format for analysis.
- Dynamic Job Querying: Specify custom job titles and result limits
- Real-time Data Fetching: Utilizes Apify's Indeed Job Scraper API
- Intelligent Monitoring: Tracks scraping progress with partial results
- Data Cleaning: Converts HTML job descriptions to clean, readable text
- Excel Export: Generates structured spreadsheets with comprehensive job data
- High Performance: Processes up to 1,000 job listings in 1-2 minutes
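The fetching and monitoring features above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names (`run_actor_url`, `run_status_url`, `dataset_items_url`, `wait_for_run`), the polling interval, and the timeout are all assumptions. The URL patterns follow Apify's public v2 REST API, and the status callable is injected so the loop can be exercised without network access.

```python
import time

API_BASE = "https://api.apify.com/v2"

# Statuses after which an Apify run will not change any further.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def run_actor_url(actor_id):
    """URL to POST to in order to start a run of the given actor."""
    return f"{API_BASE}/acts/{actor_id}/runs"


def run_status_url(run_id):
    """URL to GET for the current state of a run."""
    return f"{API_BASE}/actor-runs/{run_id}"


def dataset_items_url(dataset_id):
    """URL to GET for the items a run has stored so far (partial results)."""
    return f"{API_BASE}/datasets/{dataset_id}/items"


def wait_for_run(get_status, poll_interval=5.0, timeout=300.0,
                 sleep=time.sleep, on_progress=print):
    """Poll get_status() until the run reaches a terminal status.

    get_status is any zero-argument callable returning the status string,
    for example a small wrapper around requests.get(run_status_url(run_id)).
    """
    waited = 0.0
    while True:
        status = get_status()
        on_progress(f"run status: {status} (waited {waited:.0f}s)")
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"run still {status} after {timeout:.0f}s")
        sleep(poll_interval)
        waited += poll_interval
```

Passing `sleep` and `on_progress` as parameters keeps the loop easy to test and lets the caller decide how progress is reported.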
Before running this project, ensure you have:
- Python 3.x installed
- An Apify account with API token
- Required Python packages (see installation section)
- Clone the repository:
  git clone https://github.com/yourusername/indeed-job-scraper.git
  cd indeed-job-scraper
- Install required packages:
  pip install requests pandas beautifulsoup4
- Set up your Apify API token:
  - Sign up at Apify
  - Get your API token from the dashboard
  - Replace [SECURE_TOKEN] in the code with your actual token
- Run the script:
  python job_scraper.py
- Enter the required information when prompted:
  - Job title (e.g., "Software Engineer", "Data Scientist")
  - Maximum number of results to fetch (up to 1,000)
- Wait for the scraping process to complete
- Find your results in the generated Excel file:
  {job_title}_cleaned_jobs.xlsx
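The input handling behind these steps might look something like the sketch below. Everything here is an assumption for illustration: the function names are hypothetical, reading the token from an `APIFY_TOKEN` environment variable is an alternative to editing the token into the source file, and the filename sanitising is an extra precaution (the actual script may use the job title verbatim).

```python
import os
import re

MAX_RESULTS = 1000  # upper bound stated in this README


def load_token(env=os.environ):
    """Read the Apify token from the environment instead of the source file."""
    token = env.get("APIFY_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the APIFY_TOKEN environment variable first")
    return token


def parse_limit(raw):
    """Turn the prompt answer into an int between 1 and MAX_RESULTS."""
    value = int(raw.strip())
    if value < 1:
        raise ValueError("Maximum number of results must be at least 1")
    return min(value, MAX_RESULTS)


def output_filename(job_title):
    """Build '{job_title}_cleaned_jobs.xlsx' using filesystem-safe characters."""
    # Replace runs of anything other than letters, digits, '_' or '-'.
    safe = re.sub(r"[^\w\-]+", "_", job_title.strip()).strip("_")
    return f"{safe}_cleaned_jobs.xlsx"
```

Keeping the token out of the source file also makes it harder to commit it to version control by accident.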
The generated Excel file contains the following columns:
| Column | Description |
|---|---|
| Job Title | Position name |
| Company | Employer name |
| Location | Job location |
| Salary | Salary information (if available) |
| Job Type | Employment type (full-time, part-time, etc.) |
| Rating | Company rating |
| Reviews | Number of company reviews |
| Posted | Job posting date |
| Apply Link | Direct application link |
| Description | Clean job description text |
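One way to produce these columns is to map each scraped item onto the table's headers before handing the rows to pandas. The sketch below is illustrative only: the source keys on the right-hand side of `COLUMN_SOURCES` (such as `positionName` or `reviewsCount`) are assumed field names, not confirmed output of the scraper, and `to_row` and `export_to_excel` are hypothetical helpers.

```python
# Spreadsheet column -> key in the scraped item (keys are ASSUMED, adjust
# them to whatever the scraper actually returns).
COLUMN_SOURCES = {
    "Job Title": "positionName",
    "Company": "company",
    "Location": "location",
    "Salary": "salary",
    "Job Type": "jobType",
    "Rating": "rating",
    "Reviews": "reviewsCount",
    "Posted": "postedAt",
    "Apply Link": "url",
    "Description": "description",
}


def to_row(item):
    """Map one scraped item to a row keyed by the spreadsheet columns."""
    row = {}
    for column, key in COLUMN_SOURCES.items():
        value = item.get(key)
        # Missing or null fields become empty cells; 0 is kept as a value.
        row[column] = "" if value is None else value
    return row


def export_to_excel(rows, path):
    """Write the rows to an .xlsx file with the columns in table order."""
    # Imported here so the mapping above works without pandas installed.
    # Writing .xlsx also needs an Excel engine such as openpyxl.
    import pandas as pd

    frame = pd.DataFrame(rows, columns=list(COLUMN_SOURCES))
    frame.to_excel(path, index=False)
```

Passing `columns=` explicitly fixes the column order even when some items lack fields.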
The system follows a linear processing pipeline:
User Input → API Trigger → Status Monitoring → Data Retrieval → Data Processing → Excel Export
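As a rough sketch of that linear flow, each stage can be passed in as a callable and invoked in order. The function name `run_pipeline` and its parameters are hypothetical, and stopping on any status other than `SUCCEEDED` is an assumption; the actual script may instead keep whatever partial results exist.

```python
def run_pipeline(job_title, max_results, *, trigger, wait, fetch, clean, export):
    """Run the stages after user input in order; each stage is injected."""
    run_id = trigger(job_title, max_results)   # API Trigger
    status = wait(run_id)                      # Status Monitoring
    if status != "SUCCEEDED":
        # Assumption: treat anything but success as a hard failure.
        raise RuntimeError(f"scrape ended with status {status}")
    items = fetch(run_id)                      # Data Retrieval
    rows = [clean(item) for item in items]     # Data Processing
    return export(job_title, rows)             # Excel Export
```

Because the stages are plain callables, each one can be replaced by a stub in tests or swapped for a different implementation without touching the pipeline itself.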
Technologies used:
- Python 3.x - Core programming language
- Requests - HTTP API communication
- Apify API - Web scraping service
- BeautifulSoup4 - HTML parsing and text extraction
- Pandas - Data manipulation and Excel export
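The HTML-to-text step handled by BeautifulSoup4 could be as small as the sketch below. The function name `html_to_text` is hypothetical, and flattening the description to a single line of space-separated text is an assumption about what "clean" means here; the project may preserve line breaks instead.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Strip tags from an HTML job description and collapse whitespace."""
    soup = BeautifulSoup(html or "", "html.parser")
    # Join the text fragments with spaces so adjacent elements
    # (list items, paragraphs) do not run into each other.
    text = soup.get_text(separator=" ", strip=True)
    # Collapse any remaining runs of whitespace, including newlines.
    return " ".join(text.split())
```

For instance, `html_to_text("<p>Build <b>APIs</b></p><ul><li>Python</li><li>SQL</li></ul>")` yields `Build APIs Python SQL`.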
Performance:
- Processing Capacity: Up to 1,000 job listings per run
- Execution Time: 1-2 minutes for typical queries
- Success Rate: High reliability with intelligent error handling
- Time Savings: 90% reduction compared to manual job searching
Current limitations:
- Single Platform: Limited to Indeed.com only
- API Dependency: Requires Apify service availability
- Rate Limits: Subject to Apify's usage policies
- Local Storage: Files saved locally (no cloud integration)
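Because the tool depends on a rate-limited external API, requests are usually wrapped in a retry with exponential backoff. The sketch below is one possible approach under stated assumptions: `RateLimited` and `call_with_backoff` are hypothetical names, and the caller is expected to raise `RateLimited` when the API answers with HTTP 429.

```python
import time


class RateLimited(Exception):
    """Raised by the caller's request function on an HTTP 429 response."""


def call_with_backoff(request, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call request(); on RateLimited wait base_delay * 2**attempt and retry."""
    for attempt in range(retries + 1):
        try:
            return request()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries, surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the waits between attempts are 1, 2, 4, and 8 seconds before the error is re-raised.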
Planned enhancements:
- Multi-platform support (LinkedIn, Monster, etc.)
- Real-time dashboard integration
- Machine learning for job recommendations
- Cloud deployment and storage
- Advanced filtering and search options
- Automated scheduling and notifications
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the project
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This tool is for educational and research purposes. Please ensure compliance with Indeed's Terms of Service and robots.txt when using this scraper. Be respectful of the website's resources and implement appropriate rate limiting.
Murali Krishna M.
- Project Type: Individual Project
- Focus: Web Scraping & Data Automation
- Apify for providing the web scraping infrastructure
- Indeed for being the data source
- Python community for excellent libraries and documentation
If you encounter any issues or have questions, please open an issue on GitHub or contact murali.krishna1591@gmail.com
⭐ Star this repository if you found it helpful!