Chain-of-Thought vs. Few-Shot: A Comparative Study of Prompting Strategies for Code Generation
This repository accompanies the research study and provides the code, dataset, and analysis artifacts referenced in the paper (see Associated Publication).
- Koorosh Nobakhtfar — Technical Analyst [GitHub | LinkedIn]
- Kenan Çakılcı — Dataset Architect [GitHub | LinkedIn]
- Ruken Zilan — Research Supervisor [LinkedIn]
- Authored by us; uses the `tiktoken` library for tokenization.
- Installation instructions for `tiktoken` are available in the official repository: https://github.com/openai/tiktoken.
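A minimal token-counting sketch using `tiktoken` (the `cl100k_base` encoding and the file name below are illustrative assumptions, not values prescribed by the study):

```python
import tiktoken

# Load a BPE encoding; "cl100k_base" is only an example -- substitute the
# encoding that matches the model being analyzed.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the number of tokens in `text` under the chosen encoding."""
    return len(encoding.encode(text))

if __name__ == "__main__":
    # Hypothetical file name following the dataset naming convention.
    with open("task_031_prompt.txt", encoding="utf-8") as f:
        prompt = f.read()
    print(f"Prompt tokens: {count_tokens(prompt)}")
```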
- Contains the raw data for prompts, responses, and human evaluations.
- Includes basic analytics for quick inspection.
- ANOVA conducted using the Analysis ToolPak add-in.
- Effect sizes (eta-squared, partial eta-squared, and omega-squared) were calculated manually using standard definitions (see Effect Sizes).
- To enable the add-in in Excel:
- File → Options → Add-ins
- From Manage, select Excel Add-ins, click Go…
- Check Analysis ToolPak, click OK.
- Organized by the combination of Reasoning-Style (CoT vs. Non-CoT) and Example-Context (Zero-Shot vs. Few-Shot).
- Each combination contains 20 tasks (cases).
- Each task has three files:
- Prompt file: the prompt authored by the LLM.
- Response file: the LLM’s response to that prompt.
- Data file: structured metadata, evaluation results, and other task-level information.
Example layout:
Dataset/
├─ CoT Few-Shot (CFS)/
│ ├─ CFS 1/
│ │ ├─ task_031_data.json
│ │ ├─ task_031_prompt.txt
│ │ └─ task_031_response.txt
│ ├─ CFS 2/
│ │ ├─ ...
│ │ └─ ...
│ └─ ...
├─ CoT Zero-Shot (CZS)/
│ ├─ CZS 1/
│ │ ├─ ...
│ │ └─ ...
│ └─ ...
├─ Non-CoT Few-Shot (NCFS)/
│ ├─ NCFS 1/
│ │ ├─ ...
│ │ └─ ...
│ └─ ...
└─ Non-CoT Zero-Shot (NCZS)/
├─ NCZS 1/
│ ├─ ...
│ └─ ...
└─ ...
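A rough sketch of how the layout above can be traversed in Python (the condition folder names follow the example tree; the helper itself and any assumptions about the JSON contents are illustrative, not part of the released tooling):

```python
import json
from pathlib import Path

DATASET_ROOT = Path("Dataset")
CONDITIONS = [
    "CoT Few-Shot (CFS)",
    "CoT Zero-Shot (CZS)",
    "Non-CoT Few-Shot (NCFS)",
    "Non-CoT Zero-Shot (NCZS)",
]

def iter_tasks(condition: str):
    """Yield (prompt, response, data) triples for every task folder in a condition."""
    for task_dir in sorted((DATASET_ROOT / condition).iterdir()):
        if not task_dir.is_dir():
            continue
        prompt = next(task_dir.glob("*_prompt.txt")).read_text(encoding="utf-8")
        response = next(task_dir.glob("*_response.txt")).read_text(encoding="utf-8")
        data = json.loads(next(task_dir.glob("*_data.json")).read_text(encoding="utf-8"))
        yield prompt, response, data

for condition in CONDITIONS:
    print(condition, sum(1 for _ in iter_tasks(condition)), "tasks")
```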
- Data files store structured information about each prompt and response.
- These files were originally produced by the LLM and then edited by humans to correct metadata and add missing information where necessary, ensuring accuracy and completeness.
Notes on stored evaluations:
- Self-evaluation refers to the model’s own assessment (values typically on a 1–10 scale). These were retained for completeness but disregarded in analysis due to unreliability.
- Supervised evaluations were performed by human evaluators. The rubric fields are:
| field | weight | range | explanation |
|---|---|---|---|
| factual_correctness | 25% | 1 to 5 | Are the facts and steps correct? |
| reasoning_quality | 25% | 1 to 5 | Is the logic transparent? |
| coherency_and_clarity | 20% | 1 to 5 | Is the response clear and easy to follow? |
| completeness | 20% | 1 to 5 | Does it cover all required aspects? |
| understanding_depth | 10% | 1 to 5 | Does it show insight beyond surface level? |
| weighted_total | N/A | 0 to 100 (%) | Final composite score computed from the weights |
For more information regarding the evaluation of accuracy, see the Accuracy Evaluation Process and Criteria file.
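For orientation, a hypothetical sketch of the composite score implied by the rubric (the weights come from the table above; the flat `scores` dictionary and the 1-to-5 → 0-to-100 normalization are assumptions and may differ from the actual data files):

```python
# Weights taken from the rubric table above.
RUBRIC_WEIGHTS = {
    "factual_correctness": 0.25,
    "reasoning_quality": 0.25,
    "coherency_and_clarity": 0.20,
    "completeness": 0.20,
    "understanding_depth": 0.10,
}

def weighted_total(scores: dict) -> float:
    """Combine 1-5 rubric scores into a 0-100 composite using the rubric weights."""
    weighted = sum(RUBRIC_WEIGHTS[field] * scores[field] for field in RUBRIC_WEIGHTS)
    # Assumed normalization: divide by the maximum score (5) and express as a
    # percentage; the study's exact mapping to 0-100 may differ.
    return weighted / 5 * 100

example = {
    "factual_correctness": 5,
    "reasoning_quality": 4,
    "coherency_and_clarity": 4,
    "completeness": 5,
    "understanding_depth": 3,
}
print(weighted_total(example))  # 87.0
```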
- Eta-squared (η²), partial eta-squared (ηp²), and omega-squared (ω²) were derived from the ANOVA results using their standard formulas based on sums-of-squares (SS), mean-squares (MS) and degrees-of-freedom (df).
- Some formulas are not presented in the paper.
- For the full definitions, see the reference below; a short computational sketch follows it.
Reference for effect-size formulas
B. G. Tabachnick and L. S. Fidell, Using Multivariate Statistics, 6th ed., Upper Saddle River, NJ: Pearson Education, 2013, pp. 54–55.
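As a rough computational sketch of those standard definitions (the variable names are illustrative; the SS, df, and MS values should be taken from the ANOVA output):

```python
def effect_sizes(ss_effect: float, df_effect: int,
                 ss_error: float, df_error: int,
                 ss_total: float) -> dict:
    """Standard sums-of-squares effect-size estimates for an ANOVA effect.

    eta-squared         = SS_effect / SS_total
    partial eta-squared = SS_effect / (SS_effect + SS_error)
    omega-squared       = (SS_effect - df_effect * MS_error) / (SS_total + MS_error)
    """
    ms_error = ss_error / df_error
    return {
        "eta_squared": ss_effect / ss_total,
        "partial_eta_squared": ss_effect / (ss_effect + ss_error),
        "omega_squared": (ss_effect - df_effect * ms_error) / (ss_total + ms_error),
    }

# Illustrative numbers only, not results from the paper.
print(effect_sizes(ss_effect=12.0, df_effect=1, ss_error=60.0, df_error=76, ss_total=80.0))
```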
This repository contains the source code and materials for the work described in our paper, "Chain-of-Thought vs. Few-Shot: A Comparative Study of Prompting Strategies for Code Generation," by K. Nobakhtfar, K. Çakılcı, and R. Zilan.
Current Status: Accepted by the 5th International Informatics and Software Engineering Conference (IISEC 2026).
Note: The final, peer-reviewed version of the paper may contain minor changes. We will update this section upon publication.