Semantic Language-Indexed Code Extraction with Backward Slicing for Repository-Scale Code Generation.
This repository contains the instructions to reproduce the LatinX in AI at NeurIPS 2025 paper.
The code is split across multiple repositories:
- pykagcee: A tool to create code knowledge graphs from Python projects.
- pyastran: A tool to generate descriptions and embeddings for code symbols using LLMs.
- slice: A command-line interface (CLI) for semantic search over an enriched code knowledge graph.
To run the pipeline you will also need:
- Neo4j DBMS (local or remote). We recommend the Neo4j Desktop application because of its better performance.
- uv, to manage the Python virtual environment and dependencies.
- A chat model API, to generate the descriptions.
- An embedding model API, to generate the embeddings.
Create your .env file. You can use the .env.example file as a template:
cp .env.example .env
Set your Neo4j connection details in the .env file:
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
For the chat and embedding models, the project comes with the langchain-openai integration.
If you use a provider other than OpenAI, add its integration package with the uv add langchain-{provider} command.
See the LangChain documentation for the available providers.
CHAT_PROVIDER=openai
CHAT_MODEL=gpt-4.1-nano
CHAT_API_KEY=sk-proj-fakekey123
EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_API_KEY=sk-proj-fakekey123
If you are serving an OpenAI-compatible API (e.g., with vLLM),
you can set the CHAT_BASE_URL and EMBEDDING_BASE_URL variables while keeping the provider as openai:
CHAT_PROVIDER=openai
CHAT_BASE_URL=
EMBEDDING_PROVIDER=openai
EMBEDDING_BASE_URL=
You will need to set CHAT_MAX_CONTEXT to avoid exceeding the model's context length when generating
the descriptions:
CHAT_MAX_CONTEXT=3000
By default, when generating the description for a symbol, we include n randomly chosen related symbols
to give the model more context. You can configure how many related symbols to include
by setting the MAX_RELATION_CONTEXT variable in the .env file. The default value is 3.
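The effect of this setting can be illustrated with a small sketch. This is not the pyastran implementation: the function name `pick_relation_context`, the `seed` parameter, and the symbol names are hypothetical, and the sketch only assumes that the related symbols are sampled uniformly at random up to the configured limit.

```python
import random


def pick_relation_context(related_symbols, max_relation_context=3, seed=None):
    """Return up to `max_relation_context` randomly chosen related symbols.

    Hypothetical helper for illustration; the real selection logic may differ.
    """
    rng = random.Random(seed)  # a seed makes the choice reproducible
    # Never ask for more symbols than exist, and never for a negative count.
    k = max(0, min(max_relation_context, len(related_symbols)))
    return rng.sample(list(related_symbols), k)


# Illustrative symbol names, not taken from a real graph.
related = ["Encoder.forward", "make_bands", "Embedder.__init__", "sin_cos", "Config"]
chosen = pick_relation_context(related, max_relation_context=3, seed=0)
print(chosen)
```

A larger limit gives the model more surrounding context per symbol, at the cost of a longer prompt that has to fit within CHAT_MAX_CONTEXT.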
MAX_RELATION_CONTEXT=3
Install the slice command line interface (CLI):
uv tool install ./ --editable
And that's it! You can now use the slice command.
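Before running the pipeline, it can help to confirm that the .env file defines the variables described above. The following is a minimal sketch using only the standard library. The helpers `parse_env` and `missing_keys` are hypothetical, and the list of required keys is an assumption based on the variables mentioned in this README, not a list taken from the project itself.

```python
# Assumption: these are the variables the tools expect to find in .env.
REQUIRED_KEYS = [
    "NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD",
    "CHAT_PROVIDER", "CHAT_MODEL", "CHAT_API_KEY",
    "EMBEDDING_PROVIDER", "EMBEDDING_MODEL", "EMBEDDING_API_KEY",
]


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


sample = """\
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
CHAT_PROVIDER=openai
CHAT_MODEL=gpt-4.1-nano
CHAT_API_KEY=sk-proj-fakekey123
EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-small
"""
values = parse_env(sample)
print(missing_keys(values))  # ['EMBEDDING_API_KEY']
```

To check a real file, pass its contents instead of the sample string, for example `parse_env(open(".env").read())`.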
If you have a folder with multiple Python projects, you can create a knowledge graph with pykagcee for each of them by running:
uv run --isolated pykagcee build-all /path/to/multiple/projects
Important:
The command above runs in its own isolated environment, but not with its own Python interpreter.
This means that it is constrained to the same Python version as the one used for this project.
If you try to build a graph for a project that uses an older Python version, it may fail.
In that case, install pykagcee and run it with uv run --python <PYTHON> pykagcee.
Next, generate the descriptions and embeddings for each symbol in the graph using pyastran:
uv run pyastran describe --all /path/to/multiple/projects --max-concurrent-queries 50 --max-concurrent-tasks 2
uv run pyastran embed --all /path/to/multiple/projects --max-concurrent-queries 50 --max-concurrent-tasks 2
Due to a bug in pykagcee, we also need to fix the file_path property on some nodes in every graph:
uv run pyastran fix-paths --all /path/to/multiple/projects
You can then run a semantic search against a project's graph (here, the EasyVolcap project):
slice semanthic-search --format json --max-top-k 5 "PositionalEncodingEmbedder class is a neural network module designed to enhance input tensors with positional encoding. It utilizes frequency bands and periodic functions to transform input data" EasyVolcap
Please read our contributing guidelines for more information.
If you encounter any issues, please check the following.
If you are running SLICE on WSL and your Neo4j instance is running on Windows,
ensure that you are using the correct NEO4J_URI in your .env file.
You can find your Windows IP address by running the following command inside WSL:
ip route | grep default
Take the IP address that follows via and use it in your NEO4J_URI, like so:
NEO4J_URI=bolt://<WINDOWS_IP_ADDRESS>:7687
Also, ensure that your Neo4j instance is configured to accept connections from external IP addresses.
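In the conf/neo4j.conf file, ensure that the following lines are set (on Neo4j 5.x the equivalent settings are named server.default_listen_address and server.bolt.listen_address):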
dbms.default_listen_address=0.0.0.0
dbms.connector.bolt.listen_address=:7687
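The WSL gateway lookup described above can also be scripted. The following is a minimal sketch that extracts the address after `via` from `ip route` output and builds the corresponding NEO4J_URI. The helper names are hypothetical, and the sample output and IP address are illustrative rather than taken from a real machine.

```python
def gateway_from_ip_route(output):
    """Return the address that follows `via` on the default route, or None."""
    for line in output.splitlines():
        fields = line.split()
        if fields[:1] == ["default"] and "via" in fields:
            idx = fields.index("via")
            if idx + 1 < len(fields):
                return fields[idx + 1]
    return None


def neo4j_uri_for(host, port=7687):
    """Build a Bolt URI for the given host and port."""
    return f"bolt://{host}:{port}"


# Illustrative `ip route` output; real output will differ per machine.
sample = (
    "default via 172.20.144.1 dev eth0 proto kernel\n"
    "172.20.144.0/20 dev eth0 proto kernel scope link src 172.20.150.12\n"
)
host = gateway_from_ip_route(sample)
print(neo4j_uri_for(host))  # bolt://172.20.144.1:7687
```

Inside WSL, the real output could be obtained with `subprocess.run(["ip", "route"], capture_output=True, text=True).stdout` and passed to `gateway_from_ip_route`.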