This repository contains a collection of Python-based automation scripts designed for business intelligence and lead generation. The toolkit is divided into two main modules, each serving a distinct data extraction purpose.
A powerful web monitoring tool that continuously scans a list of websites for specific keywords and logs the results to a Google Sheet.
Features:
- Continuous Monitoring: The script runs in a loop, automatically checking for new websites added to `sites.txt` (see the sketch after this list).
- Dynamic Content Handling: Uses the Playwright library to control a headless Chromium browser, enabling it to scrape modern, JavaScript-heavy websites.
- Cloud Integration: Authenticates with Google Sheets using a `secret.json` service account file and appends results directly to a specified spreadsheet.
- Resilient: Includes error handling and retry logic for network issues.
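A minimal sketch of that monitoring loop is shown below. It assumes `sites.txt` holds one URL per line and `palavras.txt` one keyword per line; the spreadsheet name, sheet layout, polling interval, and error handling are placeholders rather than the exact behaviour of `extrator.py`.

```python
# Sketch of the keyword-monitoring loop (not the actual extrator.py).
# Assumes sites.txt = one URL per line, palavras.txt = one keyword per line.
import time
import gspread
from oauth2client.service_account import ServiceAccountCredentials
from playwright.sync_api import sync_playwright

SCOPES = ["https://spreadsheets.google.com/feeds",
          "https://www.googleapis.com/auth/drive"]

def open_sheet():
    creds = ServiceAccountCredentials.from_json_keyfile_name("secret.json", SCOPES)
    client = gspread.authorize(creds)
    return client.open("Keyword Monitor").sheet1  # spreadsheet name is illustrative

def check_sites():
    sites = [s.strip() for s in open("sites.txt") if s.strip()]
    keywords = [k.strip() for k in open("palavras.txt") if k.strip()]
    sheet = open_sheet()
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for url in sites:
            try:
                page.goto(url, timeout=30000)
                text = page.content().lower()
                for kw in keywords:
                    if kw.lower() in text:
                        sheet.append_row([url, kw, time.strftime("%Y-%m-%d %H:%M")])
            except Exception as exc:
                print(f"Failed to check {url}: {exc}")  # simple skip-and-continue policy
        browser.close()

if __name__ == "__main__":
    while True:          # continuous monitoring: sites.txt is re-read on every pass
        check_sites()
        time.sleep(600)  # polling interval is an assumption
```

Re-reading `sites.txt` on each pass is what allows new targets to be picked up without restarting the script.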
Use Cases:
- Competitive analysis.
- Brand mention tracking.
- Market research and product hunting.
- Monitoring websites for specific updates or content changes.
A specialized web scraper designed to extract detailed company information from the Brazilian government portal cib.dpr.gov.br (Cadastro de Intervenientes em Operações de Comércio Exterior).
Features:
- Targeted Extraction: Precisely parses the HTML of the CIB portal to extract valuable company data.
- Data Points: Collects Company Name, CNPJ (Tax ID), Email, Website, Key Contact Person, Import Range, and Address.
- Lightweight & Efficient: Uses the `requests` and `BeautifulSoup` libraries for fast, efficient scraping of server-rendered pages (see the sketch after this list).
- Local Storage: Saves all extracted data neatly into an `empresas.csv` file for easy access with Excel or other data analysis tools.
- Evolved Scripts: Includes several versions of the script (`cib.py`, `cib2.py`, etc.), showcasing different functionalities such as saving to CSV vs. Google Sheets.
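For orientation, here is a condensed sketch of the `requests` + `BeautifulSoup` flow. The URL, CSS selectors, and column names are illustrative placeholders, not the portal's real markup or the exact parsing done in `cib.py`.

```python
# Condensed sketch of the CIB scraping flow (selectors and columns are hypothetical).
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://cib.dpr.gov.br/"          # entry point; the real script may walk result pages
FIELDS = {                               # CSV column -> hypothetical CSS selector
    "empresa": ".nome",
    "cnpj": ".cnpj",
    "email": ".email",
    "website": ".site",
    "contato": ".contato",
    "faixa_importacao": ".faixa",
    "endereco": ".endereco",
}

def text_of(node, selector):
    """Return stripped text for a selector, or '' when the element is missing."""
    el = node.select_one(selector)
    return el.get_text(strip=True) if el else ""

def scrape():
    resp = requests.get(URL, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # One row per company entry; "div.empresa" is a placeholder selector.
    rows = [{col: text_of(card, sel) for col, sel in FIELDS.items()}
            for card in soup.select("div.empresa")]

    with open("empresas.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(FIELDS))
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    scrape()
```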
Use Cases:
- Building lead lists for sales and marketing teams.
- Market analysis of import/export companies.
- Creating a database of potential business partners or suppliers.
- Clone the repository.
- Install dependencies: `pip install requests beautifulsoup4 playwright gspread oauth2client google-api-python-client colorama`, then run `playwright install` to download the browser binaries.
- Configure Credentials: Populate the `secret.json` files with your own Google Cloud Platform service account credentials to enable Google Sheets integration (a quick credential check follows these steps).
- Customize Inputs: Edit the `.txt` files in each module (`sites.txt`, `palavras.txt`) to match your specific targets.
- Run the scripts: `python Palavra-chave/extrator.py` or `python CIB/cib.py`.
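Before running either module, it can help to confirm that the service-account credentials actually work. The snippet below is a sanity check, not part of the repository; it assumes only `secret.json` and the `gspread`/`oauth2client` dependencies installed above.

```python
# Quick check that the service-account credentials in secret.json can reach Google Sheets.
import gspread
from oauth2client.service_account import ServiceAccountCredentials

SCOPES = ["https://spreadsheets.google.com/feeds",
          "https://www.googleapis.com/auth/drive"]

creds = ServiceAccountCredentials.from_json_keyfile_name("secret.json", SCOPES)
client = gspread.authorize(creds)

# Lists the spreadsheets shared with the service account. If the target sheet
# is missing, share it with the client_email found inside secret.json.
for sheet in client.openall():
    print(sheet.title)
```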