
Demographic2Art: How Artist Demographics Influence Artistic Production

Author: Yangyu Wang
Date: May 27, 2025

Table of Contents

  1. Introduction
  2. How Can Scalable Computing Benefit the Research?
  3. Large-Scale Computing Pipeline
    a. Step 1: Scalable Data Collection (WikiArt Scraping & GPT API Usage)
    b. Step 2: Diffusion Model Fine-Tuning (Stable Diffusion v1.5 with LoRA)
    c. Step 3: Deploying Interactive Website
  4. Future Work and Conclusion
  5. Requirements
  6. References

Introduction

How do artist demographics and experiences influence artistic production? Sociologists of art have long noted that “behind every piece of art lies a story intricately tied to the artist’s background”¹ – factors like nationality, gender, education, and personal experiences “can significantly influence an artist’s creative style, content, and audience reception”². The environment an artist grows up in (culture or region) profoundly shapes their artistic endeavors, and attributes such as an artist’s identity, training, and socio-political milieu “can shape their unique creative style and viewpoints”³. Understanding these influences is important not only for academic insight but also for promoting diversity in the arts. Recent work in the humanities emphasizes that diversity in creative fields enhances the social benefits of art, with these benefits increasing when there are no barriers to who can create art and when audiences can engage with works that reflect their own identities⁴. In other words, studying how artists’ demographic backgrounds affect their art can illuminate why representation matters and how cultural experiences manifest in artwork.

From a digital humanities perspective, this research question is especially compelling in the age of big data. Large repositories of art and metadata (artist biographies, etc.) provide a new opportunity to quantitatively explore patterns that were traditionally studied qualitatively. For example, the WikiArt dataset is a massive online collection with “over 250,000 high quality images of historically significant artworks by over 3,000 artists, ranging from the 15th century to the present day”⁵, making it a “rich source for the potential mining of patterns and differences among artists, genres, and styles”⁶. However, such scale also presents challenges: “such datasets are often difficult to analyze and use for answering complex questions... because of their raw formats as image files”⁷. Traditional art-historical methods alone struggle to process thousands of images, but “recent developments in machine learning and image processing open the door” to extracting meaningful patterns from large art collections⁸. Indeed, computational art history projects have shown that digital methods can link thousands of images to identify stylistic motifs or trends over time, revealing insights unobtainable with analog methods⁹. Therefore, our project adopts a computational, scalable approach to tackle the research question at a breadth and depth not feasible before.

How Can Scalable Computing Benefit the Research?

Analyzing hundreds of thousands of artworks and artist records is a big data problem. Attempting to do this with conventional serial workflows would be prohibitively slow. Parallel and cloud-based computing offers a solution. By dividing the work across multiple processors or machines, I can dramatically speed up data collection and analysis. Parallel computing maximizes performance and efficiency in resource use, making it “perfect for high-speed computation and real-time data analysis”¹⁰. In this project, I leverage cloud infrastructure (AWS EC2 instances and S3 storage) and GPUs on a high-performance cluster to achieve in hours what might otherwise take weeks. This scalable approach has several advantages:

  1. Speedup through Parallelism: By running tasks concurrently on 8 separate EC2 instances (rather than one computer), we achieved roughly 8× faster web scraping. Similarly, using multiple GPUs in parallel allows training complex models in a fraction of the time of a single-GPU or CPU run.

  2. Handling Large Data Volumes: Cloud storage (Amazon S3) provides a robust pipeline for tens of thousands of images and text records. Instead of one machine’s limited disk and memory, we utilize distributed storage and memory, enabling us to work with the full WikiArt dataset without downsizing.

  3. Specialized Hardware (GPUs): Training an image generation model like Stable Diffusion is extremely computation-heavy. GPUs are designed for such parallel math operations. Using GPU-accelerated workflows (both on cloud and on our Midway3 cluster) means we can fine-tune models that would be impractically slow on CPUs.

Note that it would be preferable to keep the entire pipeline, including this project's website, on the cloud. However, due to student account restrictions, I cannot access AWS GPUs directly and must use Midway3 instead, which requires additional data transfers and alternative tooling (e.g., Weights & Biases sweeps). These workarounds are marked with an asterisk below.

In summary, the large scale of data and complexity of deep learning models demanded a scalable approach. Without cloud and HPC resources, collecting and analyzing this data would not be feasible within a reasonable timeframe. My cloud-based, GPU-accelerated pipeline ensures that the research can progress efficiently and handle the full scope of the question.

Large-Scale Computing Pipeline

Step 1: Scalable Data Collection (WikiArt Scraping & GPT API Usage)

The first phase of the project focused on data acquisition and preprocessing. I utilized the WikiArt dataset as my primary source of artworks and artist information. WikiArt is an online art encyclopedia that provides images of artworks along with metadata about the pieces and artists. Given its size (hundreds of thousands of images) and the GPT API latency (roughly 3 s per call), scraping WikiArt and extracting information with GPT demanded a distributed strategy.

Before this concurrent data acquisition, I had already scraped the WikiArt index with Selenium, a serial process that is not parallelizable. The resulting data is at "/project/jevans/yangyu/artist_data/artist_data.csv" and "/project/jevans/yangyu/artwork_data/artwork_data_merged.csv"; for details of that scraping process, see my other repository here. The subsequent data collection involves two tasks with a similar pipeline, shown in the figure below: URLs from the databases are split into batches, and EC2 instances are launched to scrape them and save the results to an S3 bucket. *For the student account only: the contents are then copied to Midway3 project folders to take advantage of its GPUs.

By distributing URLs into batches and running them simultaneously on different EC2 instances, I can (1) scrape the images of the artworks and (2) call the GPT API to extract information from the artists' online Wikipedia pages.

For task (1), the whole automated pipeline is in ec2_scraping.ipynb, which distributes image URLs into scraper_batches and uses launch_scrapers to launch the scrapers. *For the Midway3 process only: after the images are scraped into an S3 bucket, s3_to_midway.py downloads them all to the Midway3 project folder.
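
As an illustration, here is a minimal sketch of such a launcher, assuming boto3; the AMI ID, bucket name, instance type, and scraper path are all placeholders, and the actual logic lives in ec2_scraping.ipynb:

import boto3

N_INSTANCES = 8
AMI_ID = "ami-0123456789abcdef0"  # placeholder AMI with Python and dependencies baked in
BUCKET = "demographic2art-data"   # hypothetical S3 bucket name

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_scrapers(batch_names):
    # One instance per URL batch; each instance runs the scraper via its user data,
    # writes its images to S3, and shuts itself down (and thus terminates) when done.
    for i, batch in enumerate(batch_names):
        user_data = (
            "#!/bin/bash\n"
            f"python3 /opt/scraper.py --batch s3://{BUCKET}/batches/{batch} "
            f"--out s3://{BUCKET}/images/batch_{i}/\n"
            "shutdown -h now\n"
        )
        ec2.run_instances(
            ImageId=AMI_ID,
            InstanceType="t3.medium",
            MinCount=1,
            MaxCount=1,
            UserData=user_data,
            InstanceInitiatedShutdownBehavior="terminate",
        )

launch_scrapers([f"scraper_batch_{i}.csv" for i in range(N_INSTANCES)])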

For task (2), the whole automatic pipeline is in ec2_gpt.ipynb, which distributed wikipedia urls to gpt_batches, and using the launch_gptloaders to launch gpt_loader, which would use a structured output in GPT-4o-mini to extract artist information from wikipedia urls. *For the midway3 process only: After extracting all the demographic information of artists into an S3 bucket, the s3_to_midway.py would download all json files onto midway3 project folder.
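
A minimal sketch of such a structured-output call is below, using the OpenAI Python SDK's parse helper; the schema fields are hypothetical, and the real schema lives in gpt_loader:

from openai import OpenAI
from pydantic import BaseModel

class ArtistInfo(BaseModel):
    # Hypothetical demographic fields, for illustration only
    name: str
    nationality: str | None
    gender: str | None
    birth_year: int | None
    education: str | None

client = OpenAI()

def extract_artist_info(wiki_text: str) -> ArtistInfo:
    # Ask GPT-4o-mini to return artist demographics as schema-validated JSON
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract artist demographics from the page text."},
            {"role": "user", "content": wiki_text},
        ],
        response_format=ArtistInfo,
    )
    return completion.choices[0].message.parsed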

Note that I checked how much space these resources occupy in the Midway3 project folder with the following command:

du -h --max-depth=1 /project/jevans/yangyu

The results below show that they still occupy a manageable amount of space:

35M     /project/jevans/yangyu/invalid_images
1.5M    /project/jevans/yangyu/artist_data
2.6G    /project/jevans/yangyu/diffusion_model
9.0M    /project/jevans/yangyu/artist_demographics
14G     /project/jevans/yangyu/artwork_images
73M     /project/jevans/yangyu/artwork_data
17G     /project/jevans/yangyu

I also set up AWS Simple Notification Service (SNS) to send a message when the EC2 jobs finished, using a Lambda function and DynamoDB; a sketch of this design follows.
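
The exact wiring is not shown in this README, so the following is only one plausible sketch of the design: each EC2 job writes a "done" item to a DynamoDB table when it finishes, and a Lambda function (triggered on table updates) publishes to SNS once all batches report in. Table and topic names are placeholders:

import boto3

TOTAL_BATCHES = 8
TABLE = "scraper-status"                                    # hypothetical table name
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:jobs-done"  # placeholder topic ARN

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

def lambda_handler(event, context):
    # Count the "done" items written by the EC2 jobs; notify once all have finished
    done = dynamodb.Table(TABLE).scan(Select="COUNT")["Count"]
    if done >= TOTAL_BATCHES:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Scraping complete",
            Message=f"All {TOTAL_BATCHES} EC2 batches finished.",
        )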

By the end of Step 1, I had built a scalable web scraping pipeline that harvested the WikiArt content and distilled it into a research-ready format. This large-scale dataset will enable us to examine correlations between artist background factors and visual/artistic characteristics in their work.

Step 2: Diffusion Model Fine-Tuning (Stable Diffusion v1.5 with LoRA)

With a substantial collection of art images and associated artist demographics in hand, the next step was to use this data to train a generative art model. The goal was to create a model that could produce images in styles influenced by certain artist demographics or experiences, thereby allowing us to explore our research question in a novel visual way. I chose Stable Diffusion v1.5, a state-of-the-art text-to-image diffusion model, as the base model for fine-tuning. Stable Diffusion is pre-trained on a broad range of images, but I wanted to specialize it on our WikiArt data so that it learns to generate art reflective of the styles and patterns in that dataset. I downloaded the diffusion model baseline with download_sdv15.py. The fine-tuning approach follows the pipeline below:
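
For reference, the baseline download can be as simple as the following sketch (cf. download_sdv15.py; the Hugging Face Hub ID and save path are assumptions):

from diffusers import StableDiffusionPipeline

# Download the Stable Diffusion v1.5 weights once and save them locally
# (the Hub ID and target path below are illustrative)
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.save_pretrained("/project/jevans/yangyu/diffusion_model/sd-v1-5")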

To efficiently fine-tune Stable Diffusion, I employed LoRA (Low-Rank Adaptation), a technique for adapting large models without full retraining. LoRA inserts small trainable weight matrices into the model, drastically reducing the number of parameters that need to be updated during fine-tuning. This makes the process much faster and more memory-efficient than naive full-model training. In fact, LoRA is known as “a groundbreaking and efficient fine-tuning technique that harnesses the power of advanced models for custom tasks and datasets without straining resources or incurring excessive costs.” Using LoRA, we can fine-tune Stable Diffusion on our art data with modest GPU memory and within a reasonable time, which was critical given our resource constraints.
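
To make the idea concrete, here is a minimal, self-contained PyTorch sketch of the LoRA mechanism (illustrative only, not the project's training code): the pretrained weight is frozen, and only a low-rank update scaled by alpha/r is learned.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight W
        # Trainable low-rank factors: W' = W + (alpha / r) * B @ A
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0 => no change at init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# A 768->768 projection gains only 2 * 4 * 768 = 6,144 trainable parameters
# instead of the 589,824 in the full weight matrix.
layer = LoRALinear(nn.Linear(768, 768))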

To fine-tune the model with the LoRA approach, I ran a grid search over hyperparameter settings of 3 learning rates and 2 seeds, as written in grid_config.py and train_lora_desc2art.py. Because there are 6 grid points in total, I parallelized across GPUs using job arrays on Midway3, as in train_lora_desc2art.sbatch, which can be started automatically by train_lora_desc2art.sh; a sketch of the array-index-to-grid mapping is shown below. This step can also be run through Weights & Biases, an online AI developer tool; the files prepared for Weights & Biases are in the sweep_ files.
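
A minimal sketch of how a SLURM array index can select a grid point (cf. grid_config.py; the learning rate 1e-3 and seed 42 appear in the results below, while the other values are assumptions):

import itertools
import os

LEARNING_RATES = [1e-4, 5e-4, 1e-3]  # 3 learning rates (first two values assumed)
SEEDS = [42, 123]                    # 2 seeds (second value assumed)
GRID = list(itertools.product(LEARNING_RATES, SEEDS))  # 6 combinations in total

# SLURM sets this variable for each array task, e.g. with --array=0-5 in the sbatch file
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
lr, seed = GRID[task_id]
print(f"Task {task_id}: training with lr={lr}, seed={seed}")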

After the grid search, I evaluated the outputs both qualitatively, by inspecting generated images (saved in samples), and quantitatively, by checking loss curves (saved in plots). I selected the best fine-tuned model: learning rate 1e-03 and seed 42. This model's final output at step 20,000 and its loss curve are shown below. Though the loss curve fluctuates, the model qualitatively produces good results; I would like to train it on a more thoroughly cleaned dataset in future work.

By leveraging Midway3’s parallel computing for this step, I dramatically accelerated the model development. Training a diffusion model is computationally intense, but parallelizing across GPUs meant we could experiment freely and get results within the limited time (1.5 hours in total).

Step 3: Deploying Interactive Website

The final step of the project was to make the fine-tuned model accessible through an interactive web application. The idea is to allow the public to explore how the model generates images and to see firsthand how changing input parameters (especially those related to artist demographics or experiences) might influence the art produced. This serves both as a demonstration of our research and as a tool for potential qualitative analysis (by letting users visualize the model's "imagination" of art under different conditions). The whole deployment pipeline is as follows:

For visualization, I built a simple web front-end in app.py using Flask, a lightweight Python web framework. The Flask app provides an interface where a user can input a text prompt. When the user submits a prompt, the app uses our fine-tuned Demographic2Art model (running on a GPU server) to generate an image for that prompt. The model inference (turning the text prompt into an image) is executed on a GPU on Midway3 via app.sbatch, because generating a 512×512 image with diffusion in a reasonable time (a few seconds) benefits greatly from GPU acceleration.
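
A minimal sketch of what such a front-end might look like (cf. app.py; the model path, LoRA checkpoint name, and route are illustrative):

import torch
from io import BytesIO
from flask import Flask, request, send_file
from diffusers import StableDiffusionPipeline

app = Flask(__name__)

# Load the fine-tuned pipeline once at startup (paths are hypothetical)
pipe = StableDiffusionPipeline.from_pretrained(
    "/project/jevans/yangyu/diffusion_model/sd-v1-5"
).to("cuda")
pipe.load_lora_weights("lora_weights")  # hypothetical fine-tuned LoRA checkpoint

@app.route("/generate")
def generate():
    # Turn the user's text prompt into a 512x512 image and stream it back as PNG
    prompt = request.args.get("prompt", "a painting")
    image = pipe(prompt, num_inference_steps=30).images[0]
    buf = BytesIO()
    image.save(buf, format="PNG")
    buf.seek(0)
    return send_file(buf, mimetype="image/png")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)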

Currently, the deployment is limited by the constraints of the student AWS account. The Flask app runs on the Midway3 cluster for demonstration purposes, so access is restricted: the site is reachable only from within the cluster node. I forwarded it to my local machine with the following command (the IP address can be found in app.out):

ssh -L 8000:<ip-address>:8000  <cnetid>@midway3.rcc.uchicago.edu

An example of the website is shown below:

*Note that with a personal or unrestricted AWS account, I could fully deploy this web app on a cloud GPU instance (for example, a g4dn or g5 instance on EC2) and make it openly accessible on the internet. This would allow broader scalability – multiple users could generate images, and I could even autoscale with demand or deploy on a service like AWS SageMaker or Heroku with GPU support if available. The current "website" demonstrates the functionality on the Midway3 GPU. Expanding to a full cloud deployment in the future (under a personal account or grant resources) would enable anyone to access the model at any time, fulfilling the goal of public engagement and reproducibility.

Nonetheless, even with the limited deployment, the website stands as a proof-of-concept interface. It is the culmination of our pipeline – from data to model to interactive exploration – and shows how the research can be presented in an accessible, visual format. We consider this an important outcome of the project: moving beyond static analysis to an engaging tool that invites further inquiry.

Future Work and Conclusion

In this project, I built a scalable pipeline to investigate how artists' demographics and experiences might influence their art, harnessing large-scale computing tools including AWS and Midway3. I collected a large art dataset, fine-tuned a generative model, and deployed an interactive demo. While the technical infrastructure and model are in place, in-depth analysis of the results is an important next step. For example, researchers could conduct systematic visual analyses of images generated with different demographic prompts to identify biases or stylistic differences introduced by those prompts. One might generate a portfolio of images for paintings by "an Artist from Group A" vs. "an Artist from Group B" and compare them for distinctive features. Such studies could yield insights into the model's internalization of demographic influences and, by extension, prompt discussions about the real-world art dynamics it learned from.

In conclusion, this project demonstrates an approach to a complex sociological and humanities research question using scalable computing. By addressing the limitations of serial processing through cloud parallelism and GPU acceleration, we successfully assembled a novel tool, a fine-tuned diffusion model and web app, that can be used to experimentally probe the influence of artist demographics on art. We believe this kind of interdisciplinary effort, combining sociological theory, digital humanities data, and cutting-edge machine learning, paves the way for new forms of cultural analysis. As digital archives grow and generative models improve, researchers in the arts and social sciences have unprecedented means to explore questions of creativity, representation, and influence on a vast scale.

Requirements

Ensure you have access to AWS (with an up-to-date credentials file) and Midway3. After logging into Midway3, load the Python module and install the packages with the following commands in the terminal:

module load python
conda create -n sd-gpu python=3.10 -y
conda activate sd-gpu

pip install --upgrade pip
pip install -r requirements.txt

References

  1. Becker, H. S. (1982). Art Worlds. University of California Press.
  2. Bourdieu, P. (2010). Distinction: A social critique of the judgement of taste. Routledge.
  3. DiMaggio, P. (1982). Cultural entrepreneurship in nineteenth-century Boston: The creation of an organizational base for high culture in America. Media, Culture & Society, 4(1), 33–50. https://doi.org/10.1177/016344378200400104
  4. Risam, R. (2015). Beyond the Margins: Intersectionality and the Digital Humanities. Digital Humanities Quarterly, 9(2). https://www.digitalhumanities.org/dhq/vol/9/2/000208/000208.html
  5. WikiArt. (2025). About WikiArt. www.wikiart.org. Retrieved May 27, 2025, from https://www.wikiart.org/en/about
  6. Saleh, B., & Elgammal, A. (2015). Large-scale Classification of Fine-Art Paintings: Learning The Right Metric on The Right Feature (arXiv:1505.00855). arXiv. https://doi.org/10.48550/arXiv.1505.00855
  7. Elgammal, A., Liu, B., Elhoseiny, M., & Mazzone, M. (2017). CAN: Creative Adversarial Networks, Generating “Art” by Learning About Styles and Deviating from Style Norms (arXiv:1706.07068). arXiv. https://doi.org/10.48550/arXiv.1706.07068
  8. Wasielewski, A. (2023). Computational Formalism: Art History and Machine Learning. The MIT Press. https://doi.org/10.7551/mitpress/14268.001.0001
  9. Brachmann, A., & Redies, C. (2017). Computational and Experimental Approaches to Visual Aesthetics. Frontiers in Computational Neuroscience, 11. https://doi.org/10.3389/fncom.2017.00102
  10. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685). arXiv. https://doi.org/10.48550/arXiv.2106.09685
