
Add LinkedIn scraper/chat CLI, Dockerfile, and docker-compose for RAG Gradio UI#1

Open
siva01c wants to merge 1 commit into main from codex/convert-jupyter-notebook-to-python-script

Conversation

@siva01c (Owner) commented Mar 13, 2026

Motivation

  • Convert the previous Jupyter notebook into a reusable CLI Python script to scrape LinkedIn posts and provide a RAG-powered chat UI.
  • Provide a reproducible containerized environment so the scraper and Gradio UI can run consistently across systems.
  • Integrate a retrieval pipeline so scraped posts can be indexed and queried with OpenAI + ChromaDB.
  • Expose a simple developer workflow for running scraping in headless/headed modes and launching the chat UI.

Description

  • Add linkedin_tool.py, which implements LinkedIn login (with cookie save/load), lazy-loading scroll, post extraction, a RagChat class (ChromaDB persistent collection plus OpenAI chat completions), and a Gradio chat UI, exposed through the CLI subcommands scrape, chat, and all.
  • Add Dockerfile that uses python:3.11-slim, installs Chromium and required system libraries, installs Python dependencies from requirements.txt, and sets the default CMD to run the chat server with python linkedin_tool.py chat --host 0.0.0.0 --port 7860.
  • Add docker-compose.yml which defines an osint service built from the repo and a selenium standalone-chrome service, maps ports (7860 for the app and 4444/7900 for Selenium), mounts the project, and injects SELENIUM_REMOTE_URL for remote browser usage.
  • Update README.md with environment variable setup (cp .env.default .env), instructions for running scrape, chat, or all, Docker Compose usage, and notes about storage locations (cookies/, chromadb/, posts.json).
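The CLI surface described above can be sketched with argparse. This is a hypothetical reconstruction, not the PR's actual code: only the subcommand names (scrape, chat, all) and the --host/--port flags are taken from the description; the --headless flag and defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the linkedin_tool.py CLI surface (assumed, not verbatim)."""
    parser = argparse.ArgumentParser(prog="linkedin_tool.py")
    sub = parser.add_subparsers(dest="command", required=True)

    scrape = sub.add_parser("scrape", help="log in and scrape LinkedIn posts")
    scrape.add_argument("--headless", action="store_true",
                        help="run the browser without a visible window")

    chat = sub.add_parser("chat", help="launch the RAG-powered Gradio chat UI")
    chat.add_argument("--host", default="127.0.0.1")
    chat.add_argument("--port", type=int, default=7860)

    run_all = sub.add_parser("all", help="scrape, then launch the chat UI")
    run_all.add_argument("--headless", action="store_true")
    return parser
```

With this layout, the Dockerfile's default CMD (`python linkedin_tool.py chat --host 0.0.0.0 --port 7860`) parses cleanly into the chat subcommand.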

Testing

  • No automated tests were added or executed for this change.

Codex Task

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dbd49a00f3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +200 to +202
```python
chroma_id = f"topic_{idx}"
if self.linkedin.get(ids=[chroma_id]).get("ids"):
    continue
```

P1: Derive Chroma document IDs from post IDs instead of indices

The indexing logic uses chroma_id = f"topic_{idx}" and skips any existing ID, which makes the persistent linkedin_posts collection silently stale whenever posts.json changes order/content or a different profile is scraped later. In those cases, rows at the same index are treated as already indexed and never refreshed, so retrieval can return outdated or wrong-person context even though a new scrape was provided.

Useful? React with 👍 / 👎.
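One way to address this suggestion, sketched here rather than taken from the PR (make_chroma_id and the post dict shape are assumptions), is to derive the document ID from stable post content instead of the list index, so that reordering posts.json or scraping a different profile cannot make a new post collide with an already-indexed ID:

```python
import hashlib


def make_chroma_id(post: dict) -> str:
    """Build a ChromaDB document ID that is stable across re-scrapes.

    Prefers a platform-assigned post ID when present; otherwise falls
    back to a hash of the post text, so the ID follows the content
    rather than its position in posts.json.
    """
    post_id = post.get("id")
    if post_id:
        return f"post_{post_id}"
    digest = hashlib.sha256(post["text"].encode("utf-8")).hexdigest()[:16]
    return f"post_{digest}"
```

With content-derived IDs, the existing "skip if already indexed" check remains valid: an unchanged post is skipped, while a new or edited post gets a fresh ID and is indexed.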

```yaml
  - "7860:7860"
depends_on:
  - selenium
command: ["python", "linkedin_tool.py", "chat", "--host", "0.0.0.0", "--port", "7860"]
```

P2: Avoid defaulting the docker-compose command to chat-only mode

The compose service always launches python linkedin_tool.py chat ..., but run_chat hard-fails when posts.json is missing, so docker compose up --build on a fresh checkout exits immediately instead of bringing up a usable stack. This default command requires pre-scraped data that new environments do not have yet, making the advertised startup flow fail unless users run scrape manually first.

Useful? React with 👍 / 👎.
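One minimal way to act on this review note (load_posts here is a hypothetical helper, not code from this PR) is to tolerate a missing posts.json instead of hard-failing, so the default compose command can still bring the UI up on a fresh checkout:

```python
import json
from pathlib import Path


def load_posts(path: str = "posts.json") -> list[dict]:
    """Load scraped posts, degrading gracefully when none exist yet.

    Instead of exiting when posts.json is missing (the failure mode the
    review describes), start the chat UI with an empty corpus and let
    the user run the scrape subcommand later.
    """
    posts_file = Path(path)
    if not posts_file.exists():
        print(f"warning: {path} not found; starting chat with no indexed posts")
        return []
    return json.loads(posts_file.read_text(encoding="utf-8"))
```

An alternative fix at the compose level would be to default the service to the all subcommand (scrape, then chat) rather than chat alone; either approach keeps `docker compose up --build` from exiting immediately.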

