GitHub - neha-chawla/DataCollection: Data-scraping application using selenium

README for webscraperso.py

This program uses Selenium to scrape Stack Overflow, retrieving questions and their respective answers. The result is stored in a CSV file in which each row is a unique question matching the search.
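The one-row-per-question layout could be produced roughly as follows. This is a minimal sketch, not the code in webscraperso.py: the column names, the answer separator, and the function names are assumptions made for illustration.

```python
import csv
import io

# Hypothetical column names; the real script may use different ones.
FIELDS = ["url", "question", "answers"]


def rows_to_csv_text(rows):
    """Return CSV text with a header and one row per unique question URL."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(FIELDS)
    seen = set()
    for url, question, answers in rows:
        if url in seen:
            continue  # skip questions that were already written
        seen.add(url)
        # All answers share one cell; the separator is an arbitrary choice.
        writer.writerow([url, question, " ||| ".join(answers)])
    return buffer.getvalue()


def save_csv(path, rows):
    """Write the CSV text to disk without translating line endings."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(rows_to_csv_text(rows))
```

Calling save_csv("questions.csv", rows) with a list of (url, question, answers) tuples would write the file; duplicate URLs are dropped so that each row stays unique.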

stackoverflow.com/robots.txt states that bots are not allowed to search. However, Selenium can read an already open page that has passed the captcha verification. The website also lets Selenium move through the result pages, since this is not technically a new search. This 'page flipping' continues until either a captcha verification is required again or the last page has been scraped.
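The page-flipping loop described above can be sketched as a small function with the two stop conditions made explicit. The three callables (scrape_page, go_to_next, captcha_shown) and the max_pages safety limit are hypothetical names introduced here; in practice they would wrap Selenium calls on the open page.

```python
def flip_pages(scrape_page, go_to_next, captcha_shown, max_pages=1000):
    """Collect items page by page until a captcha appears or pages run out.

    scrape_page()   -> list of items found on the current page
    go_to_next()    -> True if a next page was opened, False on the last page
    captcha_shown() -> True if the site is asking for verification again
    """
    results = []
    for _ in range(max_pages):  # safety limit against endless loops
        if captcha_shown():
            break  # stop: a human has to pass the captcha again
        results.extend(scrape_page())
        if not go_to_next():
            break  # stop: the last page has been scraped
    return results
```

Passing the page actions in as functions keeps the loop independent of the browser, so it can be tried with plain Python stand-ins before pointing it at a real page.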

A Chrome driver must therefore be downloaded first.

Application of program:

Once the driver is downloaded and set up, open Chrome with remote debugging enabled by running this in a terminal: Google\ Chrome --remote-debugging-port=9222 --user-data-dir="~/ChromeProfile"
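A script can then attach to that already running Chrome window instead of launching a new one. The sketch below assumes the selenium package is installed and that Chrome is listening on port 9222 as in the command above; the function names are made up for illustration and may not match webscraperso.py.

```python
def debugger_address(port=9222, host="127.0.0.1"):
    """Build the host:port string Chrome exposes for remote debugging."""
    return f"{host}:{port}"


def attach_to_chrome(port=9222):
    """Return a Selenium driver bound to the Chrome started by hand."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # "debuggerAddress" tells ChromeDriver to reuse the existing browser.
    options.add_experimental_option("debuggerAddress", debugger_address(port))
    return webdriver.Chrome(options=options)
```

After driver = attach_to_chrome(), driver.current_url should report the page currently shown in that window, which is the page that has already passed the captcha.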

In the Chrome window that appears, search for the desired tag on Stack Overflow. Then run webscraperso.py, which prompts you to enter the current URL. The program should then go through the search result pages, collect all question URLs, and visit each URL to retrieve the question and its answers.
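Collecting the question URLs from a search page could look roughly like this. It is a sketch under the assumption that question links have paths of the form /questions/<numeric id>/...; the regular expression and the function name are illustrative and not taken from webscraperso.py.

```python
import re
from urllib.parse import urljoin, urlsplit

# Assumed shape of a question link: /questions/<numeric id>/optional-slug
QUESTION_PATH = re.compile(r"^/questions/(\d+)(?:/|$)")


def question_urls(page_url, hrefs):
    """Return unique absolute question URLs, in the order first seen."""
    seen_ids = set()
    urls = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        match = QUESTION_PATH.match(parts.path)
        if not match or match.group(1) in seen_ids:
            continue  # not a question link, or already collected
        seen_ids.add(match.group(1))
        # Drop query strings and fragments so duplicates compare equal.
        urls.append(f"{parts.scheme}://{parts.netloc}{parts.path}")
    return urls
```

With Selenium, the hrefs could come from driver.find_elements(By.TAG_NAME, "a") followed by get_attribute("href") on each element, skipping links that have no href; each returned URL would then be opened with driver.get to read the question and its answers.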

Files in the repository:

README.md
StopWords_DatesandNumbers.txt
StopWords_GenericLong.txt
StopWords_Names.txt
webscraperso.py
