Skip to content
View omarkamali's full-sized avatar

Block or report omarkamali

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Maximum 250 characters. Please don't include any personal information such as legal names or email addresses. Markdown supported. This note will be visible to only you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
omarkamali/README.md

Omar Kamali

Building multilingual LLMs, data systems, and agentic tooling with a bias toward low‑resource languages and big lab-style training with scruffy resources.

  • Founder of Omneity Labs, independent GenAI R&D lab for low‑resource languages, cultural alignment and sovereign Gen AI stack development
  • Focus: pretraining data pipelines, LLM training curricula, instruction residuals, hybrid search, and multi‑provider LLM infra
  • Values: pragmatic engineering, reproducible research, and tools that make real systems easier to build

TL;DR – Start here

If you are new here, these are the best entry points:

  • Universal LLM client: borgllm – drop‑in LangChain‑compatible client for 20+ providers with API key rotation and rate‑limit handling.
  • Wikipedia pretraining data: wikisets + wikipedia-monthly – fresh monthly multilingual Wikipedia dumps and a flexible dataset builder.
  • Instruction residuals / task arithmetic: residuals – small, focused library for task vectors and efficient continuous pretraining workflows.
  • Hybrid search engine: semango – lexical + semantic search (BM25 + vectors) with an HTTP API, MCP server, and search UI.
  • GPU monitoring: picomon – minimal curses dashboard for AMD GPU monitoring.

Everything else tends to plug into one of these pillars: data, training, or serving.


What I work on

  • Multilingual & low‑resource LLMs. From data acquisition and cleaning to curriculum learning, instruction tuning, and task vectors – with a focus on languages often ignored by big tech.
  • Data systems for pretraining. Large‑scale text pipelines, monthly‑updated Wikipedia dumps, and tools for querying terabyte‑scale corpora without fully downloading them.
  • Agentic & production tooling. Universal LLM routers, hybrid search engines, and small utilities that make real‑world deployments less painful.

If you are building something in this space and want to collaborate, open an issue on the relevant repo or reach out via the links at the end.


Projects by area

1. LLM training, curricula, and task vectors

  • craft – Contrastive Representation Aware Fine‑Tuning toolkit for representation‑sensitive LLM finetuning experiments.
  • curriculus – Progressive curriculum learning for LLM training with fine‑grained schedule and difficulty control.
  • residuals – Lightweight library for instruction residuals / task vectors, aimed at efficient continuous pretraining and task arithmetic workflows.

Use these if you care about shaping how a model learns (curriculum) and what gets injected post‑hoc (task vectors, residuals).


2. Data & pretraining pipelines

  • wikisets – Flexible Wikipedia dataset builder with language‑aware sampling and pretraining‑oriented splits, built on top of the monthly Wikipedia dumps.
  • wikipedia-monthly – Fresh, cleaned Wikipedia dumps for 300+ languages, updated monthly and ready to load via Hugging Face Datasets.
  • hypersets – Query huge datasets with simple SQL using DuckDB; work with terabyte‑scale Hugging Face datasets without fully downloading them.
  • unscript – Script‑aware text cleaning for NLP and training, with attention to multilingual and multi‑script corpora.
  • vocabulous – Bootstrapping language detection from noisy and ambiguous data, useful for messy multilingual sources.

If you are assembling a multilingual or low‑resource pretraining corpus, this is the stack to look at first.


3. LLM infrastructure, agents, and search

  • borgllm – Zero‑config universal LLM client with support for many providers, API key rotation, rate‑limit management, and LangChain compatibility.
  • semango – Hybrid search engine combining BM25 and vector search, with an HTTP API, MCP server, and an embedded search UI for quick experiments.
  • Omneity Labs API (external) – Sovereign Gen AI platform serving Multilingual LLMs, embeddings, translation, and transliteration for low‑resource languages, powering production systems for languages like Moroccan Arabic.

These are the right tools if you are wiring LLMs into applications, need routing across providers, or want search that actually bridges lexical and semantic retrieval.


4. Monitoring, utilities, and misc

  • picomon – Minimal terminal dashboard for monitoring AMD GPUs via amd-smi, ideal for small GPU boxes or home labs.
  • Other small utilities – This account also contains smaller experiments and tooling prototypes, chat templates etc; expect pragmatic code focused on solving a specific pain point rather than polished frameworks.

If you are running AMD‑based training or experimentation setups, picomon is usually the quickest win.


Writing, talks, and where to find me

If you are working on multilingual LLMs, low‑resource NLP, or agentic systems and want to compare notes, you are very welcome to reach out.

Pinned Loading

  1. borgllm borgllm Public

    A zero-config OpenAI client with support for 20+ providers, API key rotation, rate limits, optional LangChain integration and more.

    Python 19 3

  2. picomon picomon Public

    Beautiful TUI dashboard for monitoring GPUs (AMD, NVIDIA, Apple Silicon)

    Python 13

  3. wikilangs/wikisets wikilangs/wikisets Public

    Flexible Wikipedia dataset builder with sampling and pretraining support. Built on top of wikipedia-monthly, providing fresh, clean Wikipedia dumps updated monthly.

    Python 1

  4. zippydata/zippy zippydata/zippy Public

    A high-performance, human-readable document store for modern data workflows.

    Python 1

  5. residuals residuals Public

    A lightweight Python package implementing instruction residuals (task vectors) for efficient LLM continuous pre-training, based on the task arithmetic paradigm.

    Python

  6. semango semango Public

    🥭 Semango is a hybrid search engine that combines lexical (BM25) and semantic (vector) search. It ships with an MCP server, a simple HTTP API and optional embedded UI.

    C++