Ask the experts: sourcing a high-quality nutrition counseling dataset through Human-AI collaboration
Large Language Models (LLMs) are being employed by end-users for various tasks, including sensitive ones such as health counseling, disregarding potential safety concerns. It is thus necessary to understand how adequately LLMs perform in such domains. We conduct a case study on ChatGPT in nutrition counseling, a popular use-case where the model supports a user with their dietary struggles. We crowd-source real-world diet-related struggles, then work with nutrition experts to generate supportive text using ChatGPT. Finally, experts evaluate the safety and text quality of ChatGPT’s output. The result is the HAI-coaching dataset, containing ~2.4K crowdsourced dietary struggles and ~97K corresponding ChatGPT-generated and expert-annotated supportive texts. We analyse ChatGPT’s performance, discovering potentially harmful behaviours, especially for sensitive topics like mental health. Finally, we use HAI-coaching to test open LLMs on various downstream tasks, showing that even the latest models struggle to achieve good performance.
You can find the published paper on the ACL Anthology.
If you use HAI-Coaching, please cite it as:
@inproceedings{balloccu-etal-2024-ask,
title = "Ask the experts: sourcing a high-quality nutrition counseling dataset through Human-{AI} collaboration",
author = "Balloccu, Simone and
Reiter, Ehud and
Li, Karen Jia-Hui and
Sargsyan, Rafael and
Kumar, Vivek and
Reforgiato, Diego and
Riboni, Daniele and
Dusek, Ondrej",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.674",
pages = "11519--11545",
abstract = "Large Language Models (LLMs) are being employed by end-users for various tasks, including sensitive ones such as health counseling, disregarding potential safety concerns. It is thus necessary to understand how adequately LLMs perform in such domains. We conduct a case study on ChatGPT in nutrition counseling, a popular use-case where the model supports a user with their dietary struggles. We crowd-source real-world diet-related struggles, then work with nutrition experts to generate supportive text using ChatGPT. Finally, experts evaluate the safety and text quality of ChatGPT{'}s output. The result is the HAI-coaching dataset, containing {\textasciitilde}2.4K crowdsourced dietary struggles and {\textasciitilde}97K corresponding ChatGPT-generated and expert-annotated supportive texts. We analyse ChatGPT{'}s performance, discovering potentially harmful behaviours, especially for sensitive topics like mental health. Finally, we use HAI-coaching to test open LLMs on various downstream tasks, showing that even the latest models struggle to achieve good performance. HAI-coaching is available at https://github.com/uccollab/hai-coaching/",
}
The HAI-coaching dataset is in the dataset.xlsx file and has the following structure:
- TAB "DATASET": the actual dataset, containing the following columns:
  - Columns related to the struggles and the topic they cover:
    - doc_no: Document number for that specific annotator.
    - annotator: Which annotator (number) worked on the given struggle. If "ALL", then the struggle was used for IAA, hence annotated by everyone.
    - struggle: The struggle, after typo correction.
    - cluster_auto: Coarse clustering, obtained automatically. Hyperparameters were set to capture main topics. Has been used as an aiding tool during topic modelling with experts.
    - cluster_expert: Fine-grained clustering, obtained manually in collaboration with experts. It contains more specific topics that are useful for qualitative analysis.
    - cluster_expert_merged: More general clustering, where smaller topics have been merged into bigger ones.
    - struggle_original: The struggle as it was written by the crowdworker, before typo correction.
    - full_embeddings: Embeddings for the struggle, from all-mpnet model. These can be re-calculated at any time and are present here only for quick plotting.
    - reduced_embeddings: Embeddings after PCA, for 3D plotting purposes.
  - Candidates from ChatGPT and their annotation. For IAA, majority voting was used.
    - OT: Whether the struggle is off-topic or not.
    - reflection_candidates: Reflections generated by ChatGPT, divided by the "###" separator.
    - reflection_annotation: Whether each candidate is safe or not, divided by the "###" separator.
    - reflection_from_expert: Optional candidates written by experts, divided by the "###" separator.
The same structure is repeated for every other kind of supportive text (reframing, comfort and suggestion).
- TAB "STATS": presents some basic counts based on the clustering.
- TAB "INFO": recaps the dataset structure (like this README) and also shows the merging logic for clusters.
- Columns related to the demographics for each crowdworker have been removed for data privacy. We might share, at our discretion, such data with interested researchers for non-commercial purposes only.
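The majority voting mentioned above for IAA struggles can be illustrated with a short sketch. This is only an assumption about the general technique: this README does not specify how labels are encoded or how ties were handled, so the label strings and the tie behaviour (returning `None`) below are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None when the top two labels tie.

    The tie handling is an assumption made for this sketch; it is not
    taken from the dataset documentation.
    """
    if not labels:
        raise ValueError("at least one label is required")
    ranked = Counter(labels).most_common()
    # A tie exists when the two most frequent labels have the same count.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical labels from three annotators for one candidate.
votes = ["safe", "unsafe", "safe"]
winner = majority_vote(votes)
```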
The Jupyter notebook dataset_parsing.ipynb contains some basic code to read, parse and work with the dataset.
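Independently of that notebook, the "###"-separated columns described above can be handled with a few lines of Python. The following is a minimal sketch, not the authors' code: the helper names (`split_field`, `pair_candidates`, `load_rows`) are made up here, the example row is invented for illustration, and the column names for kinds other than reflection are assumed to follow the same `<kind>_candidates` / `<kind>_annotation` pattern.

```python
SEPARATOR = "###"


def split_field(cell):
    """Split one "###"-separated cell into a list of stripped strings."""
    # Empty spreadsheet cells usually arrive as None or NaN (NaN != NaN).
    if cell is None or cell != cell:
        return []
    return [part.strip() for part in str(cell).split(SEPARATOR)]


def pair_candidates(row, kind="reflection"):
    """Pair each candidate of the given kind with its annotation.

    `row` is a dict-like record. The `<kind>_candidates` naming for kinds
    other than "reflection" is an assumption of this sketch.
    """
    candidates = split_field(row.get(f"{kind}_candidates"))
    annotations = split_field(row.get(f"{kind}_annotation"))
    if len(candidates) != len(annotations):
        raise ValueError(
            f"{len(candidates)} candidates but {len(annotations)} annotations"
        )
    return list(zip(candidates, annotations))


def load_rows(path="dataset.xlsx"):
    """Read the DATASET tab as a list of dicts (needs pandas and openpyxl)."""
    import pandas as pd  # imported lazily so the helpers above work without it

    frame = pd.read_excel(path, sheet_name="DATASET")
    return frame.to_dict(orient="records")


# Hypothetical row, invented for illustration; real label values may differ.
example_row = {
    "reflection_candidates": "First candidate ### Second candidate",
    "reflection_annotation": "label_a ### label_b",
}
pairs = pair_candidates(example_row)
```

With the real file, `load_rows()` would supply the records to pass to `pair_candidates`. The length check is deliberate: silently zipping misaligned lists would attach annotations to the wrong candidates.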
We also release all the relevant documents and material used in our experiments for struggles collection, clustering, prompt engineering, prompting ChatGPT and safety annotation. Each step has its own directory and README file.
The code to reproduce our NLP baselines can be found in the "evaluation" folder.
This work has been funded by the EC in the H2020 Marie Skłodowska-Curie PhilHumans project (contract no. 812882) and the European Research Council (Grant agreement No. 101039303 NG-NLG).
