LLM and Embeddings Experiments

This repository contains hands-on experiments with:

  • text generation using small language models
  • prompt engineering techniques
  • word and sentence embeddings
  • semantic similarity using cosine distance
  • transformer-based applications

The goal is to understand practical behaviour of LLMs and embeddings through experimentation.

Repository Structure

  • llm_foundations.py – Text generation and tokenisation experiments
  • prompt_engineering.py – Summarisation, Q&A, and creative prompting
  • embeddings.py – Word and sentence embeddings using GloVe
  • sentiment_analysis.py – Transformer-based sentiment analysis demo

Critical Conclusions & Design Decisions

This project uses different models for different tasks based on observed behaviour rather than convenience. During experimentation, it became clear that no single small language model performs well across all use cases.

  • DistilGPT-2 was used for free-form text generation because it produces diverse and creative continuations when sampling is enabled, making it suitable for open-ended prompts. However, it was less reliable for structured tasks such as summarisation or factual question answering.

  • Flan-T5 (instruction-tuned) was used for summarisation, Q&A, and prompt engineering experiments. It followed explicit instructions more consistently, and its outputs improved significantly when few-shot examples and decoding controls such as beam search and repetition penalties were applied. Flan-T5 was also preferred for tokenisation experiments due to its cleaner and more interpretable subword tokens. A short decoding sketch contrasting the two models follows this list.

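The snippet below is a minimal sketch of this split, not the repository's actual scripts: DistilGPT-2 with sampling enabled for open-ended generation, and Flan-T5 with beam search plus a repetition penalty for an instruction-style task. The prompts and parameter values are illustrative.

```python
# Sketch only: contrasting sampling-based generation with beam-search decoding.
from transformers import pipeline

# Free-form generation: sampling gives diverse, creative continuations.
generator = pipeline("text-generation", model="distilgpt2")
story = generator(
    "Once upon a time in a quiet village,",
    max_new_tokens=50,
    do_sample=True,      # enable sampling for diversity
    temperature=0.9,
    top_p=0.95,
)
print(story[0]["generated_text"])

# Instruction following: beam search and a repetition penalty give more
# consistent summaries and answers from the instruction-tuned model.
summariser = pipeline("text2text-generation", model="google/flan-t5-small")
summary = summariser(
    "Summarise: Transformers use self-attention to weigh the relevance "
    "of every token to every other token in a sequence.",
    max_new_tokens=40,
    num_beams=4,             # beam search instead of sampling
    repetition_penalty=1.2,  # discourage repeated phrases
)
print(summary[0]["generated_text"])
```
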
A key insight from the project is the difference between LLMs and embeddings. LLMs are probabilistic and generative, making them powerful but sometimes inconsistent. Embeddings, on the other hand, are deterministic representations of meaning and are highly reliable for similarity, clustering, and retrieval tasks. Using GloVe embeddings with cosine similarity demonstrated how semantic relationships can be captured without text generation.
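
The sketch below illustrates that determinism with GloVe vectors and cosine similarity. It assumes the vectors are loaded through gensim's downloader; the "glove-wiki-gigaword-50" variant is an assumption about which GloVe model is used, not necessarily what embeddings.py does.

```python
# Sketch only: word similarity with GloVe embeddings and cosine similarity.
import gensim.downloader as api
from numpy import dot
from numpy.linalg import norm

glove = api.load("glove-wiki-gigaword-50")  # 50-dimensional GloVe vectors

def cosine_similarity(a, b):
    """Cosine similarity: close to 1.0 for similar meanings, near 0 for unrelated words."""
    return dot(a, b) / (norm(a) * norm(b))

# Deterministic: the same word pair always yields the same score.
print(cosine_similarity(glove["king"], glove["queen"]))   # high similarity
print(cosine_similarity(glove["king"], glove["carrot"]))  # low similarity
```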

Overall, the main takeaway is that effective NLP systems are task-driven and often combine generative models with embeddings, rather than relying on a single model for all problems.
