This repository contains scripts for scraping artist and artwork information from WikiArt using Selenium. The dataset generated from these scripts includes detailed information about artists and their artworks. The base list of artists to scrape comes from a Wikipedia-expanded artist dataset on Kaggle.
Author: Yangyu Wang
Date: January 18, 2025
This Jupyter notebook contains the code for scraping artist information and their artworks from WikiArt. The main steps include:
- Generating artist names from an existing dataset.
- Opening Firefox using Selenium WebDriver.
- Extracting artist information and artworks.
- Saving the extracted data into CSV files.
- Re-scraping items that were not found on the first pass.
- Results: see artist_data_new.csv
- Results: see artist_artwork.csv
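The steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names (`artist_slug`, `artist_url`, `open_firefox`, `scrape_all`) are hypothetical, and the URL pattern `https://www.wikiart.org/en/<slug>` is an assumption that should be checked against the live site. The `fetch` callable stands in for whatever Selenium-based extraction the notebook performs.

```python
import re
import unicodedata


def artist_slug(name):
    """Turn a display name into a lowercase, hyphen-separated ASCII slug."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def artist_url(name, base="https://www.wikiart.org/en/"):
    # Assumed URL pattern; verify it before relying on it.
    return base + artist_slug(name)


def open_firefox(headless=True):
    """Start Firefox through Selenium 4 (requires Firefox and Geckodriver)."""
    # Imported lazily so the pure-Python helpers work without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    if headless:
        options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def scrape_all(names, fetch, passes=2):
    """Call fetch(name) for every name and retry the misses on later passes.

    fetch should return a dict of scraped fields, or None when nothing is found.
    Returns (found, still_missing).
    """
    found = {}
    pending = list(names)
    for _ in range(passes):
        missing = []
        for name in pending:
            try:
                row = fetch(name)
            except Exception:
                # Treat any scraping error as "not found" and retry later.
                row = None
            if row is None:
                missing.append(name)
            else:
                found[name] = row
        pending = missing
        if not pending:
            break
    return found, pending
```

Keeping the retry loop independent of the browser makes it possible to exercise it with a stub `fetch` before running a long scrape.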
This Jupyter notebook focuses on scraping detailed information about artworks from WikiArt. The main steps include:
- Extracting artwork information from provided URLs.
- Handling previously found and unfound URLs.
- Saving the extracted data into CSV files.
- Results: see artwork_data_image_done.csv
- Merged version: see artwork_data_merged.csv
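One way to handle previously found and unfound URLs is to treat the output CSV as a checkpoint: read the URLs already saved, skip them, and append only new rows. The sketch below uses the standard `csv` module; the function names and the `url` column name are assumptions made for illustration, not taken from the notebook.

```python
import csv
from pathlib import Path


def load_done_urls(csv_path, url_field="url"):
    """Return the set of URLs already present in the output CSV (empty if no file)."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as f:
        return {row[url_field] for row in csv.DictReader(f) if row.get(url_field)}


def split_pending(all_urls, done_urls):
    """Return URLs that still need scraping, in order and without duplicates."""
    seen = set(done_urls)
    pending = []
    for url in all_urls:
        if url not in seen:
            seen.add(url)
            pending.append(url)
    return pending


def append_rows(csv_path, rows, fieldnames):
    """Append rows to the CSV, writing the header only for a new or empty file."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Appending in small batches means an interrupted run loses at most the current batch, and the next run resumes from what is already on disk.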
This Jupyter notebook is designed for scraping art images using the Requests library. The main steps include:
- Loading artwork data from a CSV file.
- Renaming columns for consistency.
- Saving the modified data to a new CSV file.
- Listing files in the target directory.
- Downloading images using the img2dataset library.
- Displaying an example of the scraped images.
- Results: see Wikiart Images
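The column renaming and download steps might look like the sketch below. The column names (`image_url`, `title`), the file names, and the chosen `image_size`, `processes_count`, and `thread_count` values are placeholders, not the notebook's settings. Renaming is done with the standard `csv` module here to keep the example dependency-free; the notebook may well use a different approach.

```python
import csv


def rename_columns(src_csv, dst_csv, mapping):
    """Copy src_csv to dst_csv, renaming header fields according to mapping."""
    with open(src_csv, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        old_fields = reader.fieldnames or []
        new_fields = [mapping.get(name, name) for name in old_fields]
        rows = [[row[name] for name in old_fields] for row in reader]
    with open(dst_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(new_fields)
        writer.writerows(rows)
    return new_fields


def download_images(url_csv, output_folder, url_col="url", caption_col=None):
    """Download every image listed in url_csv with img2dataset."""
    # Imported lazily so rename_columns works without img2dataset installed.
    from img2dataset import download

    download(
        url_list=url_csv,
        input_format="csv",
        url_col=url_col,
        caption_col=caption_col,
        output_folder=output_folder,
        output_format="files",  # plain image files rather than webdataset shards
        image_size=512,  # placeholder value
        resize_mode="keep_ratio",
        processes_count=4,  # placeholder value
        thread_count=32,  # placeholder value
    )


if __name__ == "__main__":
    # Hypothetical file and column names, for illustration only.
    rename_columns(
        "artwork_data.csv",
        "artwork_urls.csv",
        {"image_url": "url", "title": "caption"},
    )
    download_images("artwork_urls.csv", "wikiart_images", caption_col="caption")
```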
This Jupyter notebook is dedicated to scraping artist information from Wikipedia. The main steps include:
- Loading artist data from a CSV file.
- Defining a function to scrape Wikipedia pages.
- Iterating through the list of artist URLs.
- Handling errors and saving the scraped data into HTML and TXT files.
- Results: see artist_wikipedia_content
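A rough sketch of the loop described above, using only the standard library. The notebook's own scraping function is not shown in this README, so everything here (`html_to_text`, `fetch_html`, `scrape_pages`, the User-Agent string, the file naming scheme) is an illustrative assumption. The text extraction is deliberately crude: it keeps every text node outside `<script>` and `<style>` tags, one per line.

```python
import re
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch_html(url, timeout=30):
    # Placeholder User-Agent; use one that identifies your own project.
    request = Request(url, headers={"User-Agent": "artist-scraper-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape_pages(artists, out_dir, fetch=fetch_html):
    """Save each (name, url) page as .html and .txt; return (name, error) failures."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures = []
    for name, url in artists:
        try:
            html = fetch(url)
        except Exception as exc:
            # Record the failure and keep going so one bad page does not stop the run.
            failures.append((name, str(exc)))
            continue
        stem = re.sub(r"[^\w\-]+", "_", name).strip("_")
        (out / f"{stem}.html").write_text(html, encoding="utf-8")
        (out / f"{stem}.txt").write_text(html_to_text(html), encoding="utf-8")
    return failures
```

The returned list of failures can be fed back into `scrape_pages` for a second attempt.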
The dataset generated from these scripts includes:
- artist_data_new.csv: Contains detailed information about artists.
- artist_artwork.csv: Contains information about artworks associated with artists.
- artwork_data_all.csv: Contains detailed information about individual artworks.
- Wikiart Images: Contains all of the image data and their links to artwork_data_all.csv.
- artist_wikipedia_content: Contains text files of artist Wikipedia pages.
- Python 3.10.0
- uv
After installing uv, run uv sync to synchronize all the requirements for scraping. For Jupyter notebooks, use the .venv generated by uv.
- The scraping process may take a significant amount of time due to the large number of artists and artworks.
- Ensure that the Geckodriver version is compatible with the installed Firefox version.
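To see which versions are installed before a long run, a small helper such as the one below can print them side by side. It assumes both `geckodriver` and `firefox` are on the `PATH` and accept `--version`; the function names are hypothetical. It does not decide compatibility itself, so the printed versions still need to be compared against Mozilla's published Geckodriver support table.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Return the first dotted version number found in text, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None


def tool_version(executable):
    """Run `<executable> --version` and return the parsed version, or None."""
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_version(result.stdout or result.stderr)


if __name__ == "__main__":
    for tool in ("geckodriver", "firefox"):
        print(f"{tool}: {tool_version(tool) or 'not found'}")
```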