neha-chawla/DataCollection

README for webscraperso.py

This program uses Selenium to scrape Stack Overflow search results, retrieving each question and its answers. The results are stored in a CSV file, with one row per unique question matching the search.
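
As a rough illustration of the one-row-per-question CSV layout, here is a minimal sketch using only the standard library. The column names, the `" ||| "` answer separator, and the function name `write_questions_csv` are assumptions made for this example and are not taken from webscraperso.py.

```python
import csv
import io


def write_questions_csv(rows, out_file):
    """Write one CSV row per unique question URL; return the number of rows written.

    `rows` is an iterable of (url, question_text, list_of_answer_texts).
    The column layout is hypothetical, chosen only for this sketch.
    """
    writer = csv.writer(out_file)
    writer.writerow(["url", "question", "answers"])
    seen = set()
    written = 0
    for url, question, answers in rows:
        if url in seen:
            continue  # keep every row a unique question
        seen.add(url)
        # All answers go into a single cell so each question stays on one row.
        writer.writerow([url, question, " ||| ".join(answers)])
        written += 1
    return written


if __name__ == "__main__":
    sample = [
        ("https://stackoverflow.com/questions/1/a", "How do I do X?", ["Use Y.", "Try Z."]),
        ("https://stackoverflow.com/questions/1/a", "How do I do X?", ["Use Y.", "Try Z."]),
    ]
    buffer = io.StringIO()
    write_questions_csv(sample, buffer)
    print(buffer.getvalue())
```

To write to disk instead of memory, pass a file opened with `open(path, "w", newline="", encoding="utf-8")`, which is how the `csv` module documentation recommends opening output files.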

stackoverflow.com/robots.txt states that bots are not allowed to run searches. However, Selenium can read a page that is already open and has passed captcha verification. The site also lets Selenium move through the result pages, since paging through existing results is not technically a new search. This 'page flipping' continues until either captcha verification is required again or the last page has been scraped.
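
The page-flipping loop described above could be sketched roughly as follows. This assumes result pages are selected through a `page` query parameter and that the caller supplies three hypothetical callbacks (`load_page`, `needs_captcha`, `is_last_page`); the actual script may instead click a "next" link, so treat this as one possible shape rather than the program's real logic.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query["page"] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def flip_pages(start_url, load_page, needs_captcha, is_last_page, max_pages=1000):
    """Yield (page_number, page) until a captcha appears or the last page is done.

    load_page(url) returns some page object (for example the page source);
    needs_captcha(page) and is_last_page(page) inspect that object.
    All three callbacks are hypothetical and supplied by the caller.
    """
    # Start from the page named in the URL, or from page 1 if there is none.
    first = int(parse_qs(urlsplit(start_url).query).get("page", ["1"])[0])
    for page_number in range(first, first + max_pages):
        page = load_page(with_page(start_url, page_number))
        if needs_captcha(page):
            break  # stop so a person can pass the verification again
        yield page_number, page
        if is_last_page(page):
            break
```

With Selenium, `load_page` might be a small function that calls `driver.get(url)` and returns `driver.page_source`; how a captcha page or the last page is recognised depends on the site's markup and is left to the callbacks.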

Because the program attaches Selenium to a Chrome window that is already open, ChromeDriver must be downloaded.

Application of program:

Once ChromeDriver is downloaded and set up, start Chrome with remote debugging enabled by running the following in a terminal: Google\ Chrome --remote-debugging-port=9222 --user-data-dir="~/ChromeProfile"
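
For reference, attaching Selenium to a Chrome instance started with the remote debugging port above typically looks something like the sketch below. The helper names `debugger_address` and `attach_to_running_chrome` are made up for this example, and the host `127.0.0.1` is an assumption (Chrome running on the same machine); webscraperso.py may set this up differently.

```python
def debugger_address(port=9222, host="127.0.0.1"):
    """Build the host:port string for Chrome's remote debugging endpoint."""
    return f"{host}:{port}"


def attach_to_running_chrome(port=9222):
    """Return a Selenium driver bound to an already-running Chrome window.

    Assumes Chrome was started with --remote-debugging-port=<port>.
    """
    # Imported here so debugger_address() can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # "debuggerAddress" tells ChromeDriver to attach instead of launching Chrome.
    options.add_experimental_option("debuggerAddress", debugger_address(port))
    return webdriver.Chrome(options=options)
```

Calling `attach_to_running_chrome()` and then reading `driver.current_url` is a quick way to check that the driver is attached to the window that was opened by hand.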

In the Chrome window that appears, search for the desired tag on Stack Overflow. Then run webscraperso.py; it will prompt you to enter the current URL. The program then goes through the search result pages, collects all question URLs, and visits each URL to retrieve the question and its answers.
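
The step of collecting question URLs from a results page might be approximated as below. This sketch assumes question links have paths of the form `/questions/<numeric id>/...` and filters a plain list of `href` values; the function name `question_urls` and the regular expression are illustrative and not taken from the script.

```python
import re
from urllib.parse import urljoin, urlsplit

# Assumed shape of a question link: /questions/<numeric id>/<optional slug>
QUESTION_PATH = re.compile(r"^/questions/(\d+)(?:/|$)")


def question_urls(hrefs, base="https://stackoverflow.com"):
    """Return unique absolute question URLs from `hrefs`, in first-seen order."""
    base_host = urlsplit(base).netloc
    seen_ids = set()
    urls = []
    for href in hrefs:
        if not href:
            continue  # anchors without an href yield None or ""
        absolute = urljoin(base, href)
        parts = urlsplit(absolute)
        match = QUESTION_PATH.match(parts.path)
        if parts.netloc != base_host or match is None:
            continue  # not a question link on this site
        if match.group(1) in seen_ids:
            continue  # same question already collected
        seen_ids.add(match.group(1))
        urls.append(absolute)
    return urls
```

With a Selenium driver, the input list could be built with `[a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]`, after importing `By` from `selenium.webdriver.common.by`.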
