WebMD Spider

In this project, I use Python's Scrapy library to crawl https://www.webmd.com/drugs/2/index and extract the name and usage of every drug listed there. The code runs in a Jupyter notebook, Google Colab, or any similar notebook environment. The crawled results are exported to a .csv file. The resulting data can be cross-referenced with other medical datasets, such as those from the CDC, for exploratory data analysis (EDA) or other investigations of drug data.

I manually identified each relevant XPath on WebMD's drug pages using Chrome DevTools (Inspect Element) to make retrieval more precise. Not all WebMD pages share the same layout. For example, https://www.webmd.com/drugs/2/drug-63164/adderall-xr-oral/details is structured differently from https://www.webmd.com/drugs/2/drug-7277/percocet-oral/details, so the two pages are built from different HTML elements and each requires its own XPath expressions.

The spider extracts the following data from each drug page:

Drug Name
Usage
Condition
How to Use
Generic Name
Brand Name
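To make the shape of the exported data concrete, the sketch below maps the six fields listed above to CSV columns using only the standard library. The column names and the sample row are placeholders invented for the example. In the actual project Scrapy's own feed export would normally write the file.

```python
import csv
import io

# Assumed column names, one per field in the list above.
FIELDS = [
    "drug_name",
    "usage",
    "condition",
    "how_to_use",
    "generic_name",
    "brand_name",
]


def rows_to_csv(rows):
    """Serialize item dicts to CSV text; fields missing from an item become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row, not real scraped data.
sample = [{"drug_name": "example-drug", "usage": "placeholder usage text"}]
csv_text = rows_to_csv(sample)
```

Using `restval=""` keeps the columns aligned when a page layout lacks one of the fields, which matters here because not every drug page exposes the same elements.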

Here's a snapshot of what the output .csv file looks like:

Tools

Python 3
Scrapy: https://scrapy.org/ (conda-forge package: https://anaconda.org/conda-forge/scrapy)
