miosomos/slice

SLICE

Semantic Language-Indexed Code Extraction with Backward Slicing for Repository-Scale Code Generation.

This repository contains the instructions to reproduce the LatinX in AI at NeurIPS 2025 paper.

The code is split across multiple repositories:

  • pykagcee: A tool to create code knowledge graphs from Python projects.
  • pyastran: A tool to generate descriptions and embeddings for code symbols using LLMs.
  • slice: Contains the code to perform semantic search on an enriched code knowledge graph through a command line interface (CLI).

Requirements

  • Neo4j DBMS (local or remote). We recommend using the Neo4j Desktop application due to its better performance.
  • uv, to manage the Python virtual environment and dependencies.
  • Chat model API to generate the descriptions.
  • Embedding model API to generate the embeddings.

Installation

Create your .env file. You can use the .env.example file as a template.

cp .env.example .env

Set your Neo4j connection details in the .env file.

NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
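Before starting a long run, it can help to confirm that the connection variables are actually filled in. The following is a minimal sketch that uses only the Python standard library; the helpers parse_env and missing_keys are hypothetical names introduced for illustration and are not part of slice.

```python
# Hypothetical helper, not part of slice: checks that required keys are set.
REQUIRED_KEYS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Illustrative input with an empty password.
sample = """NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=
"""
print(missing_keys(parse_env(sample)))
```

To check a real file, pass the contents of your .env file to parse_env instead of the sample string.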

For the chat and embedding models, the project comes with the langchain-openai integration. If you use a provider other than OpenAI, add its integration package with the uv add langchain-{provider} command. See available providers.

CHAT_PROVIDER=openai
CHAT_MODEL=gpt-4.1-nano
CHAT_API_KEY=sk-proj-fakekey123

EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_API_KEY=sk-proj-fakekey123

If you are serving an OpenAI-compatible API (e.g., with vLLM) you can set the CHAT_BASE_URL and EMBEDDING_BASE_URL variables, keeping the provider as openai.

CHAT_PROVIDER=openai
CHAT_BASE_URL=

EMBEDDING_PROVIDER=openai
EMBEDDING_BASE_URL=

Set CHAT_MAX_CONTEXT to avoid exceeding the model's context length when generating the descriptions.

CHAT_MAX_CONTEXT=3000
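As an illustration of what a context budget implies, the sketch below keeps a symbol's own source and adds extra context only while the budget allows. It assumes the value is a token count and uses a rough four-characters-per-token heuristic; approx_tokens and fit_to_budget are hypothetical names, and this is not how pyastran is known to enforce the limit.

```python
def approx_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token), not a real tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(source: str, extras: list[str], max_context: int) -> list[str]:
    """Keep the symbol source, then add extras in order while the budget allows."""
    kept = [source]
    used = approx_tokens(source)
    for extra in extras:
        cost = approx_tokens(extra)
        if used + cost > max_context:
            break  # stop at the first extra that would overflow the budget
        kept.append(extra)
        used += cost
    return kept


source = "def encode(x): return x" + " " * 17  # 40 characters, about 10 tokens
extras = ["a" * 40, "b" * 400]                 # about 10 and 100 tokens
kept = fit_to_budget(source, extras, max_context=25)
print(len(kept))
```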

By default, when generating the description for a symbol, we include n random related symbols to provide more context to the model. You can configure how many related symbols to include by setting the MAX_RELATION_CONTEXT variable in the .env file. The default value is 3.

MAX_RELATION_CONTEXT=3
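The idea of including n random related symbols can be sketched as a capped random sample. This is an illustration only: pick_related and the symbol names are hypothetical, and the actual selection logic lives in pyastran.

```python
import random


def pick_related(related: list[str], max_relation_context: int = 3, seed=None) -> list[str]:
    """Return up to max_relation_context symbols chosen at random without repeats."""
    rng = random.Random(seed)
    count = min(max_relation_context, len(related))
    return rng.sample(related, count)


# Hypothetical symbol names, for illustration.
related_symbols = ["Encoder.forward", "make_bands", "PositionalEmbedder", "sin_cos", "Config"]
chosen = pick_related(related_symbols, max_relation_context=3, seed=0)
print(chosen)
```

When a symbol has fewer related symbols than the configured maximum, the sketch simply returns all of them.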

Install the slice command line interface (CLI):

uv tool install ./ --editable

And that's it! You can now use the slice command.

Code Knowledge Graph Construction

If you have a folder with multiple Python projects, you can create a knowledge graph with pykagcee for each of them by running:

uv run --isolated pykagcee build-all /path/to/multiple/projects

Important

The command above runs in its own isolated environment, but not with its own Python interpreter. This means it is constrained to the same Python version used for this project. If you try to build a graph for a project that uses an older Python version, it may fail. In that case, install pykagcee and run it with uv run --python <PYTHON> pykagcee.

Next, generate the descriptions and embeddings for each symbol in the graph using pyastran:

uv run pyastran describe --all /path/to/multiple/projects --max-concurrent-queries 50 --max-concurrent-tasks 2
uv run pyastran embed --all /path/to/multiple/projects --max-concurrent-queries 50 --max-concurrent-tasks 2
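When picking a value for --max-concurrent-queries, it may help to picture the limit as a cap on in-flight requests to the model API. The sketch below shows that general pattern with asyncio.Semaphore. It assumes the flag works this way, which this README does not confirm; run_bounded and fake_query are hypothetical names, and fake_query stands in for a real API call.

```python
import asyncio


async def run_bounded(items, worker, max_concurrent: int):
    """Run worker over items with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


active = 0  # calls currently running
peak = 0    # highest number of simultaneous calls observed


async def fake_query(symbol: str) -> str:
    """Stand-in for a request to a chat model."""
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)
    active -= 1
    return f"description of {symbol}"


symbols = [f"symbol_{i}" for i in range(20)]
results = asyncio.run(run_bounded(symbols, fake_query, max_concurrent=5))
print(len(results), peak)
```

A lower limit reduces the chance of hitting provider rate limits, at the cost of a longer run.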

Finally, due to a bug in pykagcee, fix the file_path property on some nodes in every graph:

uv run pyastran fix-paths --all /path/to/multiple/projects

Usage

slice semanthic-search --format json --max-top-k 5 "PositionalEncodingEmbedder class is a neural network module designed to enhance input tensors with positional encoding. It utilizes frequency bands and periodic functions to transform input data" EasyVolcap 
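To call the command above from a script, one option is a thin subprocess wrapper. This is a sketch under stated assumptions: it reuses only the flags shown in the command above, and it assumes that --format json prints a single JSON document to standard output, which this README does not specify. The names build_search_command and run_search are hypothetical.

```python
import json
import shutil
import subprocess


def build_search_command(query: str, project: str, max_top_k: int = 5) -> list[str]:
    """Build the argument list, mirroring the command shown in this README."""
    return [
        "slice", "semanthic-search",
        "--format", "json",
        "--max-top-k", str(max_top_k),
        query, project,
    ]


def run_search(query: str, project: str, max_top_k: int = 5):
    """Run the CLI and parse stdout as JSON (assumed output shape)."""
    if shutil.which("slice") is None:
        raise RuntimeError("slice CLI not found on PATH")
    completed = subprocess.run(
        build_search_command(query, project, max_top_k),
        capture_output=True, text=True, check=True,
    )
    return json.loads(completed.stdout)


command = build_search_command(
    "module that adds positional encoding to input tensors", "EasyVolcap"
)
print(command)
```

Passing the query as a single list element avoids shell quoting issues with long natural-language descriptions.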

Contributing

Please read our contributing guidelines for more information.

Troubleshooting

If you encounter any issues, please check the following.

Neo4j Connection

If you are running SLICE on WSL and your Neo4j instance is running on Windows, ensure that you are using the correct NEO4J_URI in your .env file.

You can find your Windows IP address by running the following command inside WSL:

ip route | grep default

Take the IP address that appears after via in the output and use it in your NEO4J_URI:

NEO4J_URI=bolt://<WINDOWS_IP_ADDRESS>:7687
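The extraction step can also be scripted. The sketch below parses text in the shape that ip route usually prints and builds the URI. The sample output and the helper name windows_host_ip are illustrative assumptions; your addresses will differ.

```python
import re


def windows_host_ip(ip_route_output: str):
    """Return the address after 'via' on the default route, or None if absent."""
    for line in ip_route_output.splitlines():
        match = re.match(r"default\s+via\s+(\S+)", line.strip())
        if match:
            return match.group(1)
    return None


# Illustrative output; real addresses will differ.
sample = (
    "default via 172.24.240.1 dev eth0 proto kernel\n"
    "172.24.240.0/20 dev eth0 proto kernel scope link src 172.24.245.17\n"
)
ip = windows_host_ip(sample)
uri = f"bolt://{ip}:7687"
print(uri)
```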

Also, ensure that your Neo4j instance is configured to accept connections from external IP addresses. In the conf/neo4j.conf file, ensure that the following lines are set:

dbms.default_listen_address=0.0.0.0
dbms.connector.bolt.listen_address=:7687
