AI Compilers Meet FPGAs: A HW/SW Codesign Approach for Vision Transformers
https://www.youtube.com/watch?v=RXjw670piBA
- Note
- Summary of Changes
- FPGA System Architecture
- Getting Started: Running the Full FPGA System
- Contributors
- Original README Content
- Contributors
- License
- References
This project is a fork of the excellent ITA by Gamze İslamoğlu (gislamoglu@iis.ee.ethz.ch) and Philip Wiese (wiesep@iis.ee.ethz.ch). We have ported the original ASIC-targeted accelerator core to an FPGA platform, building a custom memory subsystem and the necessary control logic to manage the full inference dataflow. Our sincere thanks go to the original creators for their foundational work.
This fork adapts the original accelerator, which was designed for an ASIC, to a fully functional FPGA implementation. Our primary contributions were to refactor the core for FPGA synthesis and build the entire surrounding memory and control infrastructure required to run a complete inference pipeline.
The key architectural changes are:
- **Core Refactoring for FPGA Synthesis**: The original ASIC-optimized ITA core was modified to be FPGA-friendly. This involved replacing all latch-based memory elements with synthesizable flip-flops and removing the non-synthesizable clock-gating infrastructure. The core computational logic remains the same.
- **New Memory Subsystem (ITA-URAM-DMA Adapter)**: We replaced the original design's dependency on a 32-port Tightly-Coupled Data Memory (TCDM) with a custom memory controller built for FPGAs. This new adapter is the central hub of the memory system and is responsible for:
  - Emulating the TCDM: It uses 32 parallel on-chip URAM banks to provide the high-bandwidth memory access required by the ITA core.
  - Memory Interleaving: It maps sequential addresses across the URAM banks, allowing a single wide request from the ITA to be serviced in parallel (see the address-interleaving sketch after this list).
  - Adding a DMA Interface: It introduces a streaming interface for efficient, bulk data transfers to and from external memory (e.g., DDR), which is essential for loading model weights and input data.
  - Arbitration: It manages access to the URAMs, granting control to either the ITA core during computation or the DMA during data transfers.
- **Holistic Control and Dataflow Logic**: A multi-level control system was created from scratch to manage the entire end-to-end inference process.
  - Top-Level FSM: Orchestrates the high-level sequence: initiating DMA transfers to load data, triggering the ITA core for computation, and initiating DMA transfers to write back results.
  - Sequencer Module: Acts as a low-level driver for the ITA core. It translates high-level commands (e.g., "compute Q matrix") into the series of precise register writes needed to process each data tile.
  - DMA Address Generator: A dedicated helper module that generates the correct address streams for the memory controller during bulk DMA transfers.
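To make the interleaving scheme concrete, here is a minimal Python sketch of a low-order-interleaved address mapping across the 32 URAM banks. The exact bank/offset split is defined by the adapter RTL, so treat the arithmetic below as illustrative rather than the authoritative mapping.

```python
NUM_BANKS = 32  # one bank per TCDM port emulated by the adapter

def interleave(word_addr: int) -> tuple[int, int]:
    """Map a flat word address to (bank index, offset within that bank).

    Low-order interleaving: consecutive words land in consecutive banks,
    so a single wide ITA request covering 32 consecutive words touches
    every bank exactly once and can be serviced in one parallel access.
    """
    return word_addr % NUM_BANKS, word_addr // NUM_BANKS

# A 32-word aligned burst uses each URAM bank exactly once.
banks = [interleave(addr)[0] for addr in range(64, 64 + 32)]
assert sorted(banks) == list(range(NUM_BANKS))
```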
Below are diagrams of the top-level system and the memory controller we designed.
This diagram shows the ITA core integrated with our custom control FSM, sequencer, DMA address generator, and the ITA-URAM-DMA Adapter, forming a complete inference pipeline.
This diagram illustrates the internal logic of our memory controller, including the 32 URAM banks, the arbitration MUX, and the crossbar for routing memory requests.
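As a rough behavioural model of this dataflow, the Python sketch below steps through the high-level sequence the top-level FSM orchestrates (DMA load, ITA compute, DMA write-back). The state names and handshake signals are illustrative; the actual RTL FSM may use different encodings and additional intermediate states.

```python
from enum import Enum, auto

class TopState(Enum):
    IDLE = auto()
    DMA_LOAD = auto()       # stream weights/inputs from external memory into the URAM banks
    COMPUTE = auto()        # sequencer drives the ITA core tile by tile
    DMA_WRITEBACK = auto()  # stream the results back out to external memory
    DONE = auto()

def next_state(state: TopState, start: bool, dma_done: bool, ita_done: bool) -> TopState:
    """One step of the high-level control flow of the top-level FSM (illustrative)."""
    if state is TopState.IDLE and start:
        return TopState.DMA_LOAD
    if state is TopState.DMA_LOAD and dma_done:
        return TopState.COMPUTE
    if state is TopState.COMPUTE and ita_done:
        return TopState.DMA_WRITEBACK
    if state is TopState.DMA_WRITEBACK and dma_done:
        return TopState.DONE
    return state

# Example: a complete run steps IDLE -> DMA_LOAD -> COMPUTE -> DMA_WRITEBACK -> DONE.
```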
This guide will walk you through setting up the Vivado project and running a simulation of our complete FPGA system. The goal is to verify the end-to-end functionality, from loading data via a streaming interface to computing with the ITA core and streaming the results back out.
- Vivado 2023.1 or a compatible version.
- A Python environment with the packages from `requirements.txt` installed. You can set this up by following the instructions in the original documentation under the "Test Vector Generation" section.
The process involves three main stages: generating test data, parameterizing the hardware, and running the simulation.
Our testbench reads input data (weights, queries, keys, etc.) and golden results from files. You must first generate these files using the original authors' PyITA scripts.
- Navigate to the `PyITA` directory.
- Run the test generator script. The parameters you choose (e.g., `-S` for sequence length) define the dimensions of the Transformer attention layer being tested.
  ```
  # Example for generating test vectors for a sequence length of 64:
  python testGenerator.py -H 1 -S 64 -E 128 -P 192 -F 256 --no-bias --activation=identity
  ```
- This command will create the necessary stimuli and golden reference files in the `simvectors` directory. The hardware you configure in the next step must be able to support the dimensions you select here (a quick sanity check is sketched below).
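As a quick sanity check before moving on, the sketch below verifies that the expected vector directory exists and lists its contents. The `data_S{S}_E{E}_P{P}_H{H}_B{B}` naming follows the path shown later in this README (`simvectors/data_S64_E128_P192_H1_B1/...`); adjust the dimensions to match your `testGenerator.py` invocation.

```python
from pathlib import Path

# Dimensions must match the testGenerator.py command above.
S, E, P, H, B = 64, 128, 192, 1, 1
vec_dir = Path("simvectors") / f"data_S{S}_E{E}_P{P}_H{H}_B{B}"

if not vec_dir.is_dir():
    raise SystemExit(f"Expected test vectors in {vec_dir}; re-run testGenerator.py")

# List the generated stimuli and golden-reference files the testbench will read.
for f in sorted(vec_dir.iterdir()):
    print(f.name)
```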
We provide a Tcl script that automates the creation of the Vivado project, setting up all the necessary files and IP.
- From the root directory of the repository, run the following command in your terminal:
# This will create a 'vivado_prj' directory containing the project vivado -mode batch -source setup_vivado.tcl - Optional - Hardware Parameterization:
- If you wish to change the level of parallelism in the hardware, you can edit the
setup_vivado.tclfile before running the command. - Locate the
ITA_Nparameter. This value defines the number of parallel processing engines in the accelerator. - The script uses a default value 8, but you can change it to explore different hardware configurations (e.g., setting it to
16).
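If you prefer to script this edit, a helper along the following lines can patch the parameter in place. The assumption that the script contains a Tcl assignment of the form `set ITA_N 8` is ours; verify the pattern against your copy of `setup_vivado.tcl` before relying on it.

```python
import re
from pathlib import Path

def set_ita_n(tcl_path: str = "setup_vivado.tcl", ita_n: int = 16) -> None:
    """Rewrite the ITA_N assignment in the Vivado setup script.

    The regex assumes a Tcl-style assignment such as `set ITA_N 8`;
    adapt the pattern if your setup_vivado.tcl uses a different form.
    """
    text = Path(tcl_path).read_text()
    new_text, count = re.subn(r"(set\s+ITA_N\s+)\d+", rf"\g<1>{ita_n}", text)
    if count == 0:
        raise ValueError("No ITA_N assignment found; edit setup_vivado.tcl manually")
    Path(tcl_path).write_text(new_text)

set_ita_n(ita_n=16)
```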
- Open Vivado and use "Open Project" to open the project located in the newly created `vivado_prj` directory.
- In the "Sources" panel on the left, expand the "Simulation Sources" hierarchy.
- To run the primary system simulation, right-click on `tb_ITA_FPGA_WRAPPER.sv` and select "Set as Top".
- Click the "Run Simulation" button in the Flow Navigator pane.
This testbench instantiates our complete FPGA wrapper. In the simulation, you can observe the entire inference process: input data is streamed into the on-chip memories, the main controller triggers the ITA core for computation, and the final results are streamed back out. This is the recommended testbench for verifying end-to-end functionality.
The primary goal of the simulation is to confirm that our hardware produces bit-accurate results.
The `tb_ITA_FPGA_WRAPPER.sv` testbench is designed to write the final output matrix from the hardware into a file within the simulation directory. You can then compare this output file against the "golden" reference files that were created by the Python `testGenerator.py` script in Step 1.
A successful run is one where the hardware's output perfectly matches the golden reference file, confirming the correctness of our design.
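A comparison along these lines can be scripted. The file names below and the assumption that both files hold one integer per line are hypothetical placeholders; point them at the actual testbench output and the golden file produced by `testGenerator.py`, and adapt the parsing to their real format.

```python
from pathlib import Path

def load_values(path: str) -> list[int]:
    """Read whitespace-separated decimal integers; switch to int(tok, 16) for hex dumps."""
    return [int(tok) for tok in Path(path).read_text().split()]

# Hypothetical file names; substitute the real testbench output and golden file paths.
hw_out = load_values("vivado_prj/ita_fpga_output.txt")
golden = load_values("simvectors/data_S64_E128_P192_H1_B1/golden_output.txt")

mismatches = [(i, a, b) for i, (a, b) in enumerate(zip(hw_out, golden)) if a != b]
if len(hw_out) != len(golden) or mismatches:
    print(f"FAIL: {len(mismatches)} mismatching elements "
          f"(lengths {len(hw_out)} vs {len(golden)})")
else:
    print("PASS: hardware output is bit-accurate against the golden reference")
```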
- Gamze İslamoğlu (gislamoglu@iis.ee.ethz.ch)
- Philip Wiese (wiesep@iis.ee.ethz.ch)
- Ipek Akdeniz (Technical University of Munich) - ipek.akdeniz@tum.de
- Osman Yaşar (Technical University of Munich) - osman.yasar@tum.de
- Agustin Coppari Hollmann (Technical University of Munich) - agustin.coppari-hollmann@tum.de
- Michael Lobis (Technical University of Munich) - michael.lobis@tum.de
Click to expand the original documentation for the standalone ITA core
The Integer Transformer Accelerator is a hardware accelerator for the Multi-Head Attention (MHA) operation in the Transformer model. It targets efficient inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values. By computing on-the-fly in streaming mode, our softmax implementation minimizes data movement and energy consumption. ITA achieves competitive energy efficiency with respect to state-of-the-art transformer accelerators with 16.9 TOPS/W, while outperforming them in area efficiency with 5.93 TOPS/mm2 in 22 nm fully-depleted silicon-on-insulator technology at 0.8 V.
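For intuition only, the sketch below shows one way a softmax can be evaluated with integer-only, streaming arithmetic (a running maximum, power-of-two exponentials, and shift-based renormalisation). It is a simplified illustration, not the ITA softmax algorithm described in the paper.

```python
def integer_streaming_softmax(scores, frac_bits=16, out_bits=8):
    """Illustrative integer-only softmax over a stream of quantized scores.

    exp() is approximated by powers of two so only shifts and integer adds
    are needed; partial sums are rescaled on the fly whenever a new maximum
    arrives. This is a didactic sketch, not the ITA implementation.
    """
    running_max = None
    denom = 0        # integer accumulator of approximated exponentials
    exps = []        # per-element approximated exponentials

    for s in scores:
        s = int(s)
        if running_max is None or s > running_max:
            if running_max is not None:
                shift = s - running_max
                denom >>= shift                      # renormalise the partial sum
                exps = [e >> shift for e in exps]    # and the stored terms
            running_max = s
        diff = running_max - s                        # always >= 0
        e = (1 << (frac_bits - diff)) if diff < frac_bits else 0
        exps.append(e)
        denom += e

    # Normalise to unsigned out_bits-wide integers.
    scale = (1 << out_bits) - 1
    return [(e * scale) // denom if denom else 0 for e in exps]

# Example: larger scores receive proportionally larger integer weights.
print(integer_streaming_softmax([3, 1, 5]))  # -> [48, 12, 194]
```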
This repository contains the RTL code and test generator for the ITA.
The repository is structured as follows:
- `modelsim` contains Makefiles and scripts to run the simulation in ModelSim.
- `PyITA` contains the test generator for the ITA.
- `src` contains the RTL code.
- `tb` contains the testbenches for the ITA modules.
- Source your Vivado license (in our case we used Vivado 2023.1).
- Clone ITA.
- Get all the files and checkouts using Bender.
- Get the Python test files using PyITA as intended by the original authors. Make sure to specify the ITA_N parameter while generating the data.
- Adjust the value of the ITA_N parameter accordingly inside `setup_vivado.tcl`.
- Run the following command to set up the project:
  ```
  vivado -mode batch -source setup_vivado.tcl
  ```
- Run simulation.
We use Bender to generate our simulation scripts. Make sure you have Bender installed, or install it in the ITA repository with:
```
$> make bender
```
To run the RTL simulation, execute the following command:
```
$> make sim
$> s=64 e=128 p=192 make sim        # To use different dimensions
$> target=sim_ita_hwpe_tb make sim  # To run ITA with HWPE wrapper
```
While running synthesis, add the following flag to the synthesis settings:
```
$> -mode out_of_context
```
The test generator creates ONNX graphs and, in the case of MHA (Multi-Head Attention), additional test vectors for RTL simulations. The relevant files for ITA are located in the `PyITA` directory.
In the `tests` directory, several tests are available to verify the correctness of the ITA. To run the example test, execute the following command:
```
$> ./tests/run.sh
```
To run a series of tests, execute the following command:
```
$> ./tests/run_loop.sh
```
Test granularity and stalling can be set with the following commands before running the script:
```
$> export granularity=64
$> export no_stalls=1
```
To install the required Python packages, create a virtual environment. Make sure to first deactivate any existing virtual environment or conda/mamba environment. Then, create a new virtual environment and install the required packages:
```
$> python -m venv venv
$> source venv/bin/activate
$> pip install -r requirements.txt
```
If you want to enable pre-commit hooks, which perform code formatting and linting, run the following command:
```
$> pre-commit install
```
In case you want to compare the softmax implementation with the QuantLib implementation, you need to install the QuantLib library and additional dependencies. To do so, install the additional packages in your virtual environment:
```
$> pip install torch torchvision scipy pandas
```
and install QuantLib from GitHub:
```
$> git clone git@github.com:pulp-platform/quantlib.git
```
To get an overview of possible options, run:
```
$> python testGenerator.py -h
```
To generate an ONNX graph and test vectors for RTL simulations for an MHA operation, run:
```
$> python testGenerator.py -H 1 -S 64 -E 128 -P 192 -F 256 --no-bias --activation=identity
```
To visualize the ONNX graph after generation, run:
```
$> netron simvectors/data_S64_E128_P192_H1_B1/network.onnx
```
- Gamze İslamoğlu (gislamoglu@iis.ee.ethz.ch)
- Philip Wiese (wiesep@iis.ee.ethz.ch)
This repository makes use of two licenses:
- for all software: Apache License Version 2.0
- for all hardware: Solderpad Hardware License Version 0.51
For further information have a look at the license files: LICENSE.hw, LICENSE.sw
ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers
```
@INPROCEEDINGS{10244348,
  author={Islamoglu, Gamze and Scherer, Moritz and Paulin, Gianna and Fischer, Tim and Jung, Victor J.B. and Garofalo, Angelo and Benini, Luca},
  booktitle={2023 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)},
  title={ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers},
  year={2023},
  volume={},
  number={},
  pages={1-6},
  keywords={Quantization (signal);Embedded systems;Power demand;Computational modeling;Silicon-on-insulator;Parallel processing;Transformers;neural network accelerators;transformers;attention;softmax},
  doi={10.1109/ISLPED58423.2023.10244348}}
```
This paper was published on IEEE Xplore and is also available on arXiv:2307.03493 [cs.AR].

