Add LinkedIn scraper/chat CLI, Dockerfile, and docker-compose for RAG Gradio UI #1
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dbd49a00f3
```python
chroma_id = f"topic_{idx}"
if self.linkedin.get(ids=[chroma_id]).get("ids"):
    continue
```
Derive Chroma document IDs from post IDs instead of indices
The indexing logic uses `chroma_id = f"topic_{idx}"` and skips any existing ID, which makes the persistent `linkedin_posts` collection silently stale whenever `posts.json` changes order/content or a different profile is scraped later. In those cases, rows at the same index are treated as already indexed and never refreshed, so retrieval can return outdated or wrong-person context even though a new scrape was provided.
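A minimal sketch of the suggested fix. It assumes each scraped post is a dict that may carry a stable LinkedIn activity URN and always has a `text` field; the `urn` key and the `stable_chroma_id` helper are illustrative, not code from this PR:

```python
import hashlib


def stable_chroma_id(post: dict) -> str:
    """Derive a Chroma document ID from the post itself, not its row index."""
    # Prefer LinkedIn's own identifier if the scraper captured one, so
    # re-scrapes of the same post map to the same Chroma document.
    urn = post.get("urn")
    if urn:
        return f"post_{urn}"
    # Otherwise hash the post text: reordering posts.json can no longer
    # alias two different posts to the same ID.
    digest = hashlib.sha256(post["text"].encode("utf-8")).hexdigest()[:16]
    return f"post_{digest}"
```

With IDs like these, the existing `self.linkedin.get(ids=[chroma_id])` skip check becomes a genuine dedupe rather than an index collision.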
| - "7860:7860" | ||
| depends_on: | ||
| - selenium | ||
| command: ["python", "linkedin_tool.py", "chat", "--host", "0.0.0.0", "--port", "7860"] |
Avoid defaulting docker-compose command to chat-only mode
The compose service always launches `python linkedin_tool.py chat ...`, but `run_chat` hard-fails when `posts.json` is missing, so `docker compose up --build` on a fresh checkout exits immediately instead of bringing up a usable stack. This default command requires pre-scraped data that new environments do not have yet, making the advertised startup flow fail unless users run `scrape` manually first.
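One way to soften the failure is a startup guard inside `run_chat` that exits with an actionable message instead of a bare traceback. This is a sketch only; the `ensure_posts_or_exit` helper and the auto-advice wording are assumptions, not code from the PR:

```python
import os
import sys

POSTS_PATH = "posts.json"  # storage location named in the README


def ensure_posts_or_exit(path: str = POSTS_PATH) -> None:
    """Abort early with guidance when no scraped data exists yet."""
    # On a fresh checkout the container starts before any scrape has run,
    # so tell the user exactly which subcommand produces the missing file.
    if not os.path.exists(path):
        sys.exit(
            f"{path} not found. Run `python linkedin_tool.py scrape` "
            "(or the `all` subcommand) before starting the chat server."
        )
```

Alternatively, the compose `command` could default to the `all` subcommand, at the cost of requiring LinkedIn credentials at container startup.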
Motivation
Description
- `linkedin_tool.py`, which implements login (cookie save/load), lazy-loading scrolling, post extraction, `RagChat` (Chromadb persistent collection and OpenAI chat completions), and a Gradio chat UI with CLI subcommands `scrape`, `chat`, and `all`.
- `Dockerfile` that uses `python:3.11-slim`, installs Chromium and required system libraries, installs Python dependencies from `requirements.txt`, and sets the default `CMD` to run the chat server with `python linkedin_tool.py chat --host 0.0.0.0 --port 7860`.
- `docker-compose.yml`, which defines an `osint` service built from the repo and a `selenium` standalone-chrome service, maps ports (`7860` for the app and `4444`/`7900` for Selenium), mounts the project, and injects `SELENIUM_REMOTE_URL` for remote browser usage.
- `README.md` with environment variable setup (`cp .env.default .env`), instructions for running `scrape`, `chat`, or `all`, Docker Compose usage, and notes about storage locations (`cookies/`, `chromadb/`, `posts.json`).

Testing
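The `scrape` / `chat` / `all` subcommand layout described in this PR could be wired up roughly as below. This is a sketch with `argparse`; the parser structure and defaults are assumptions, not the PR's actual implementation:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="linkedin_tool.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # Log in (reusing saved cookies), scroll the feed, extract posts.
    sub.add_parser("scrape", help="scrape posts and write posts.json")

    # Serve the Gradio RAG chat UI over the persisted Chroma collection.
    chat = sub.add_parser("chat", help="serve the Gradio chat UI")
    chat.add_argument("--host", default="127.0.0.1")
    chat.add_argument("--port", type=int, default=7860)

    # Convenience mode: scrape first, then launch the chat UI.
    both = sub.add_parser("all", help="scrape, then launch the chat UI")
    both.add_argument("--host", default="127.0.0.1")
    both.add_argument("--port", type=int, default=7860)
    return parser
```

The Dockerfile's default `CMD` then corresponds to `parse_args(["chat", "--host", "0.0.0.0", "--port", "7860"])`.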