A streamlined web interface for converting various file formats to Markdown using the MarkItDown library from Microsoft. Transform any document into clean, LLM-ready Markdown with this powerful conversion tool.
Markdown Converter UI builds on Microsoft's MarkItDown, an open-source Python library that transforms a wide range of document types into clean, LLM-ready Markdown:
- Universal Format Support: Convert PDFs, PowerPoint presentations, Word documents, audio files, and even images into consistent Markdown
- Advanced Processing: Extracts EXIF data, performs OCR on images, generates transcripts from audio, and adds AI-generated image captions
- LLM Integration: Seamlessly prepare documents for local LLM applications like Cursor, Windsurf, Cline, and Claude Desktop
- AI Workflow Optimization: Instantly prepare data for fine-tuning and RAG (Retrieval-Augmented Generation) workflows without manual cleanup
- Scalable Document Processing: Batch support for processing multiple documents simultaneously
This tool effectively serves as an AI data engineer in your workflow, turning any knowledge base into prompt-ready content for AI assistants.
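As a rough sketch of the conversion step that sits behind the UI: MarkItDown's Python API provides a `MarkItDown` class whose `convert()` method returns a result object with a `text_content` attribute. The wrapper name `convert_to_markdown` and its injectable `converter` parameter are assumptions made for illustration; they are not taken from this project's source.

```python
from pathlib import Path
from typing import Any, Optional


def convert_to_markdown(path: str, converter: Optional[Any] = None) -> str:
    """Convert one file to Markdown text.

    `converter` is any object with a MarkItDown-style `convert(path)` method.
    When omitted, a real MarkItDown instance is created, which requires the
    `markitdown` package to be installed.
    """
    if converter is None:
        # Imported lazily so the helper can be exercised without the package.
        from markitdown import MarkItDown
        converter = MarkItDown()
    result = converter.convert(str(Path(path)))
    return result.text_content
```

With MarkItDown installed, `print(convert_to_markdown("report.pdf"))` would print the Markdown for a local file named `report.pdf` (a placeholder file name).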
- Professional UI: Clean, modern interface with intuitive controls
- Simple Upload Interface: Drag and drop or select files for conversion
- Multiple Format Support: Convert various document formats (DOCX, PDF, HTML, etc.) to clean Markdown
- Live Preview: Instantly see the converted Markdown in the browser with vertical scrolling for large documents
- Download Options: Save the converted Markdown to your local machine
- Batch Processing: Convert multiple files at once, with each result shown in its own tab
- Error Handling: Clear feedback when conversion issues occur
- Large File Support: Process files up to 50MB with progress indicators
- Automatic Cleanup: Temporary files are automatically removed after 2 hours
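The two-hour cleanup in the feature list could be implemented along the following lines. This is a minimal sketch, not the project's actual `cleanup.py`: the function name `remove_stale_files`, the use of file modification time as the age measure, and the flat (non-recursive) directory scan are all assumptions.

```python
import time
from pathlib import Path
from typing import List, Optional

MAX_AGE_SECONDS = 2 * 60 * 60  # two hours, matching the feature list


def remove_stale_files(directory: str, max_age: float = MAX_AGE_SECONDS,
                       now: Optional[float] = None) -> List[str]:
    """Delete regular files in `directory` older than `max_age` seconds.

    Age is measured from the file's last modification time (an assumption).
    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in Path(directory).iterdir():
        if entry.is_file() and now - entry.stat().st_mtime > max_age:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

A function like this would typically be called on app start-up or on a timer, pointed at whatever temporary directory the uploads are written to.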
- Python 3.8+
- Streamlit
- MarkItDown library from Microsoft with extended dependencies
- Clone this repository:
    git clone https://github.com/ajitpal/markdown-converter-ui.git
    cd markdown-converter-ui
- Create and activate a virtual environment:
    # Create virtual environment
    python -m venv venv
    # On Windows
    .\venv\Scripts\activate
    # On macOS/Linux
    source venv/bin/activate
- Install the required packages:
    pip install -r requirements.txt
    # Install MarkItDown with all optional dependencies (PDF, DOCX, etc. support)
    pip install "markitdown[all]"
- Start the Streamlit application:
    streamlit run app.py
- Open your web browser and navigate to the URL displayed in the terminal (typically http://localhost:8501)
- Use the application:
  - Upload one or more files using the file upload interface
  - Adjust conversion settings in the sidebar if needed
  - View converted files in the preview tab
  - Download the converted Markdown files using the download buttons
  - Use the clean, structured Markdown with your favorite LLM tools
- LLM Integration:
  - Feed the converted Markdown directly into LLM applications
  - Use for training data preparation in fine-tuning workflows
  - Build RAG systems with the consistently formatted content
  - Create knowledge bases that are instantly AI-ready
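For the RAG use case above, converted Markdown usually needs to be split into chunks before indexing. The sketch below shows one simple approach, splitting at ATX headings (`#` to `######`). The function name `split_by_headings` and the chunk dictionary shape are illustrative assumptions, and the splitter does not skip `#` lines that appear inside fenced code blocks.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, starting a new chunk at each ATX heading."""
    chunks: List[Dict[str, str]] = []
    title, lines = "", []

    def flush() -> None:
        # Keep a chunk only if it has a title or some non-blank text.
        if title or any(l.strip() for l in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading as `title`, which can be stored as metadata alongside the embedded text.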
Run the app locally as described in the Usage section above.
- Push your code to GitHub
- Visit Streamlit Cloud
- Connect your GitHub repository
- Deploy the app with the following settings:
  - Main file path: `app.py`
  - Python version: 3.8 or higher
  - Requirements: `requirements.txt`
  - Advanced settings > Packages: `packages.txt`
Important Note for Streamlit Cloud:
If you encounter PDF conversion errors like MissingDependencyException, ensure that:
- Your `requirements.txt` includes `markitdown[all]>=0.1.0` (not just `markitdown>=0.1.0`)
- You have a `packages.txt` file with the necessary system dependencies: `poppler-utils`, `tesseract-ocr`, `libreoffice`, `ffmpeg`
- If issues persist, you may need to use the Streamlit secrets management to set environment variables for the PDF processing libraries.
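The two file checks above can be automated before deploying. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, `packages.txt` is treated as a whitespace-separated list of package names, and comments or version pins in that file are not handled.

```python
import re
from typing import Iterable, List

REQUIRED_SYSTEM_PACKAGES = ("poppler-utils", "tesseract-ocr", "libreoffice", "ffmpeg")


def has_markitdown_all(requirements_text: str) -> bool:
    """True if some requirement line names markitdown with the `all` extra."""
    pattern = re.compile(r"^\s*markitdown\s*\[([^\]]*)\]", re.IGNORECASE)
    for line in requirements_text.splitlines():
        match = pattern.match(line)
        if match:
            extras = {e.strip().lower() for e in match.group(1).split(",")}
            if "all" in extras:
                return True
    return False


def missing_system_packages(packages_text: str,
                            required: Iterable[str] = REQUIRED_SYSTEM_PACKAGES) -> List[str]:
    """Return required packages that are absent from packages.txt content."""
    listed = set(packages_text.split())
    return [name for name in required if name not in listed]
```

For example, reading both files with `Path("requirements.txt").read_text()` and `Path("packages.txt").read_text()` and passing the text to these helpers gives a quick pre-deployment check.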
- Create a Dockerfile in the project root:
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies for PDF processing
RUN apt-get update && apt-get install -y \
poppler-utils \
tesseract-ocr \
libreoffice \
ffmpeg \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Explicitly install MarkItDown with all dependencies
RUN pip install "markitdown[all]"
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py"]
- Build and run the Docker container:
docker build -t markdown-converter-ui .
docker run -p 8501:8501 markdown-converter-ui
- Access the application at http://localhost:8501
For cloud deployments on AWS, Azure, or GCP:
- Build the Docker container as shown above
- Push the container to a container registry (ECR, ACR, etc.)
- Deploy using a service like:
  - AWS App Runner
  - Azure Container Instances
  - Google Cloud Run
Each service will have specific steps for deployment from a container.
markdown-converter-ui/
├── app.py                         # Main application entry point
├── requirements.txt               # Python dependencies
├── src/                           # Source code directory
│   ├── main.py                    # Core application logic
│   ├── config.py                  # Configuration settings
│   ├── ui/                        # UI components
│   │   ├── components.py          # Reusable UI components
│   │   ├── styles.py              # CSS styles
│   │   └── layout.py              # Page layout configuration
│   └── utils/                     # Utility functions
│       ├── cleanup.py             # Temporary file cleanup
│       ├── file_helpers.py        # File handling utilities
│       └── markdown_converter.py  # Markdown conversion logic
├── static/                        # Static assets
├── tests/                         # Test files
├── docs/                          # Documentation
└── venv/                          # Virtual environment (not in git)
- Fork the repository
- Create your feature branch: `git checkout -b feature/amazing-feature`
- Commit your changes: `git commit -m 'Add some amazing feature'`
- Push to the branch: `git push origin feature/amazing-feature`
- Open a Pull Request
You can customize the application by:
- Adjusting the `MAX_FILE_SIZE_MB` constant in src/config.py
- Modifying the CSS styles in src/ui/styles.py for UI appearance
- Changing the max height for preview sections by editing the `.preview-container` and `.stCodeBlock` CSS classes
- Adding additional conversion options in the sidebar
- Updating the header and footer in src/ui/components.py
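To illustrate the first option, here is a sketch of how a `MAX_FILE_SIZE_MB` constant might be used to validate uploads. Only the constant name comes from this README; the helper `is_within_size_limit`, the default of 50, and the choice of 1 MB = 1024 * 1024 bytes are assumptions, and the real `src/config.py` may differ.

```python
MAX_FILE_SIZE_MB = 50  # assumed default, matching the documented 50MB limit


def is_within_size_limit(size_bytes: int, limit_mb: int = MAX_FILE_SIZE_MB) -> bool:
    """Return True when an upload of `size_bytes` fits within the limit."""
    return size_bytes <= limit_mb * 1024 * 1024
```

Raising or lowering the constant changes the threshold for every caller that relies on the default argument.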
This project is licensed under the MIT License - see the LICENSE file for details.
- Microsoft MarkItDown for the powerful conversion library that makes documents LLM-ready
- Streamlit for the web application framework
- The AI and LLM community for inspiring tools that bridge the gap between traditional documents and AI-ready content
- Import Errors
  - If you see import errors, make sure you're running the application from the project root directory
  - Ensure all dependencies are installed: `pip install -r requirements.txt`
  - Check that your Python path includes the project root
- File Conversion Issues
  - Verify that the input file format is supported
  - Check file size limits (default is 50MB)
  - Ensure you have write permissions in the temporary directory
- PDF Conversion Issues
  - If you see `MissingDependencyException` errors, ensure you've installed MarkItDown with PDF support:
      pip install "markitdown[all]"
      # or specifically for PDF
      pip install "markitdown[pdf]"
  - Make sure you have the necessary system dependencies installed:
    - On Ubuntu/Debian: `sudo apt-get install poppler-utils tesseract-ocr libreoffice ffmpeg`
    - On macOS with Homebrew: `brew install poppler tesseract libreoffice ffmpeg`
    - On Windows: Install the appropriate binaries and ensure they're in your PATH
  - For Streamlit Cloud deployment, ensure your `packages.txt` file includes these dependencies
  - Check that your PDF files are not corrupted or password-protected
- UI Issues
  - Clear your browser cache if the UI is not loading properly
  - Ensure you're using a modern browser (Chrome, Firefox, or Edge recommended)
  - Check the browser console for any JavaScript errors
  - If the "Clean All" button is not visible after uploading files, try refreshing the page
  - For large documents, use the vertical scrolling in the preview and raw sections
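When diagnosing the dependency problems listed above, it can help to check which external tools are actually visible on `PATH`. The sketch below uses the standard library's `shutil.which`. The mapping from package to executable name (`pdftotext`, `tesseract`, `soffice`, `ffmpeg`) and the function name `find_missing_tools` are assumptions for illustration; which tools MarkItDown actually invokes depends on its version and the file type.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Executable names assumed to be provided by each system package.
TOOLS: Dict[str, str] = {
    "poppler": "pdftotext",
    "tesseract": "tesseract",
    "libreoffice": "soffice",
    "ffmpeg": "ffmpeg",
}


def find_missing_tools(which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return package names whose executable cannot be found on PATH."""
    return [pkg for pkg, exe in TOOLS.items() if which(exe) is None]
```

Calling `find_missing_tools()` with no arguments checks the current environment; an empty list means every assumed executable was found.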
If you encounter issues not covered here:
- Check the application logs for detailed error messages
- Review the documentation in the `docs/` directory
- Open an issue on the project's GitHub repository




