A brief introduction to FPGAs
1. Introduction
This document is meant as an introduction to FPGAs. It covers different aspects related to FPGA hardware, architecture and software tools, as well as the various application domains.
FPGAs (Field-Programmable Gate Arrays) are customizable integrated circuits consisting of a network of blocks that can be configured programmatically after manufacturing. FPGAs can be used to implement solutions to any computable problem (architectural specialization), although they were initially designed to add glue logic between circuit board components. Nowadays, the use cases for FPGAs have broadened due to the advent of Big Data and Machine Learning, and the need for higher data processing performance and customizable hardware.
2. FPGA architecture
Every FPGA comes with a limited number of configurable logic blocks, fixed function blocks (adder, multiplier, …), and I/O blocks that a programmer can arrange into any desired configuration programmatically. The following subsections present the major FPGA components and their key characteristics.
2.1. Logic and fixed function blocks
The most common FPGA block, called the configurable logic block (CLB for short), consists of logic cells that can be configured to perform combinational functions, to act as a logic gate (e.g. or, and, xor, …), or to act as a memory element (e.g. a D flip-flop or a RAM block). In addition to the CLBs, I/O blocks allow the FPGA to communicate with the outside world. All these blocks are connected through a configurable network that provides very high flexibility and adaptability.
Below (Figure 1) is a diagram showing an FPGA logic cell (b) as well as the topology and placement of the main blocks (a).
Figure 1: Topology of an FPGA (a), CLB (b)
As we can observe in the figure above, a CLB contains two main elements: a LUT (Look-Up Table) and a D flip-flop. Most of the CLB's logic (or, and, xor, …) is implemented as truth tables stored in SRAM cells in the form of LUTs: the inputs act as an address, and a multiplexer selects the stored bit for that address. The figure below shows the process of implementing a 3-input combinational logic function (y = (a & b) | !c) using a 3-input LUT. For more about how a multiplexer is implemented, see Appendix A.
Figure 2: CLB LUT programming
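The LUT programming step can also be sketched in software. The Python snippet below is only an illustrative model, not vendor tooling: the helper names program_lut and lut_read are hypothetical, and the choice of a as the most significant address bit is an assumption made for this example.

```python
from itertools import product

def program_lut(func, n_inputs=3):
    """Fill the LUT: store one output bit per input combination (the truth table)."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]

def lut_read(table, *bits):
    """Evaluate the LUT: the input bits form an address into the stored table."""
    address = 0
    for bit in bits:  # first argument ends up as the most significant bit
        address = (address << 1) | bit
    return table[address]

# Target function from the text: y = (a & b) | !c
def target(a, b, c):
    return (a & b) | (1 - c)

lut = program_lut(target)

# Print the truth table that would be written into the SRAM cells.
for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "->", lut_read(lut, a, b, c))
```

Once the table is filled, evaluating the function is just a memory read, which is why any 3-input function costs the same single LUT.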
In addition to logic operators, most modern FPGAs offer fixed function blocks that enhance performance and give programmers ready-made atomic building blocks for larger operations. This is useful for building arithmetic units such as 32-bit adders or multipliers for signal processing. For example, the Xilinx Virtex-5 FPGA comes with prebuilt multiply-accumulate units (DSP48E slices) aimed at accelerating DSP algorithms (see figure below).
Figure 3: Excerpt from the Virtex-5 FPGA brochure
2.2. Phase-Locked Loop
Another key component of FPGA circuitry is the PLL (Phase-Locked Loop). A PLL is an electronic circuit that generates an output signal whose phase is related to the phase of an input signal. Using PLLs, developers can derive multiple clock frequencies from a single frequency source (e.g. a crystal oscillator). The figure below shows a rough diagram of how a PLL can be combined with a frequency divider placed in its feedback path (here the component divides the frequency by 2) in order to generate a 2000 Hz frequency from an initial 1000 Hz source frequency. For more details about PLLs, see Appendix B.
Figure 4: Signal frequency doubling using a PLL
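To see why a divide-by-2 in the feedback path doubles the frequency, here is a deliberately simplified Python sketch. It is an assumption-laden toy model: a real PLL compares phase rather than frequency, and the function name simulate_pll, the gain value, and the step count are all made up for illustration. The loop keeps adjusting the oscillator until the divided output matches the reference, which only happens when the output runs at divider times the reference frequency.

```python
def simulate_pll(f_ref_hz, divider, gain=0.5, steps=50, f_start_hz=0.0):
    """Toy frequency-domain model of a PLL with a divider in the feedback path.

    Assumption: the comparator sees frequencies directly (real PLLs compare phase).
    """
    f_vco = f_start_hz
    for _ in range(steps):
        feedback = f_vco / divider          # frequency divider in the feedback path
        error = f_ref_hz - feedback         # comparator: reference vs. divided output
        f_vco += gain * divider * error     # loop filter nudges the oscillator
    return f_vco

# A 1000 Hz reference with a divide-by-2 feedback settles near 2000 Hz.
print(round(simulate_pll(1000.0, 2), 3))
```

At lock the error is zero, so f_vco / divider equals f_ref and the output frequency is f_ref multiplied by the divider ratio.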
2.3. Logic Elements
Generally, FPGA integrated circuits implement from a few thousand to millions of Logic Elements (or Logic Cells, …). However, this parameter is not a straightforward basis for comparing FPGA chips, because each FPGA manufacturer has its own definition of what a Logic Element (a Logic Cell, or a slice) represents. For example, Intel defines a MAX10 Logic Element as follows:
Figure 5: Intel's definition of a Logic Element
This definition may be inaccurate (or even wrong) when discussing another FPGA chip. In response to this lack of a standard definition, the general consensus amongst developers is to use the number of LUTs as a metric for comparing FPGA chips. The more LUTs, the more complex the developer's designs can be. Besides the number of LUTs, other features such as the number and type of arithmetic fixed function blocks and the number of I/O blocks are crucial in determining whether an FPGA chip is the right choice for a given design.
2.4. Parallelism
Parallelism is the ability of a processing unit to operate on multiple blocks of data simultaneously by using a large number of concurrent computing resources. For example, multi-core and many-core CPUs, as well as GPUs, contain multiple independent physical cores running in parallel. Contrary to CPUs, GPUs use a very simple core design, which allows thousands of cores to be integrated and operated in a SIMD fashion. SIMD (Single Instruction Multiple Data) is an architectural concept based on processing data in blocks (also called vectors or packets). The figure below (Figure 6) shows the difference between a scalar and a SIMD multiplication:
Figure 6: Scalar vs. SIMD
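The scalar vs. SIMD distinction can be illustrated with a small Python model that counts how many multiply instructions are issued. This is only a sketch: Python itself does not execute SIMD instructions here, and the lane width of 4 and the function names are assumptions chosen for the example.

```python
def scalar_multiply(a, b):
    """One multiply instruction per pair of elements."""
    out, instructions = [], 0
    for x, y in zip(a, b):
        out.append(x * y)
        instructions += 1
    return out, instructions

def simd_multiply(a, b, lanes=4):
    """One multiply instruction per block (vector) of `lanes` elements."""
    out, instructions = [], 0
    for i in range(0, len(a), lanes):
        # Modelled as a single instruction operating on a whole vector.
        out.extend(x * y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        instructions += 1
    return out, instructions

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [2, 2, 2, 2, 2, 2, 2, 2]
scalar_result, scalar_count = scalar_multiply(a, b)
simd_result, simd_count = simd_multiply(a, b)
print(scalar_count, simd_count)  # both produce the same products
```

Both functions return the same products; only the number of issued instructions differs, and it shrinks in proportion to the lane width.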
There are different types of parallelism implemented in different chip architectures. Figure 7 shows the different parallel architectures and how effective they are at exploiting fine-grain parallelism. In general, each architecture compromises on fine-grain parallelism to accommodate a certain type of compute pattern or workload. Among these, FPGAs represent the best option for handling parallel or concurrent applications, given the large number of logic and I/O blocks available and the possibility to organize them as needed.
Figure 7: Chip architectures and fine-grain parallelism
The following figure (Figure 8) shows the difference between a scalar DSP unit implementing a dot product operation for 256 input values, and the same operation implemented on an FPGA. The parallel nature of the FPGA architecture allows data to flow in independently from multiple I/O blocks (or data lanes) straight into multiple multipliers, which then forward their results to an adder that performs the final reduction. The whole process costs around 1 cycle. The DSP, on the other hand, performs the operations sequentially and requires 256 cycles to process the whole set of input values. In this case, the FPGA allows an efficient implementation of the dot product kernel using SIMD parallelism, speeding up the execution by a considerable factor: 256x.
Figure 8: DSP vs. FPGA execution
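The two execution styles can be contrasted with a short Python model. This is a sketch under stated assumptions, not a description of Figure 8: the sequential version assumes one multiply-accumulate per cycle, and the parallel version assumes the final reduction is organized as a binary adder tree with one stage per tree level. How those stages map onto clock cycles depends on the actual design (a fully pipelined datapath can still accept a new set of inputs every cycle).

```python
def dot_sequential(a, b):
    """Sequential unit: one multiply-accumulate per cycle (assumed)."""
    acc, cycles = 0, 0
    for x, y in zip(a, b):
        acc += x * y
        cycles += 1
    return acc, cycles

def dot_parallel(a, b):
    """Parallel datapath: all multipliers at once, then a binary adder tree (assumed)."""
    values = [x * y for x, y in zip(a, b)]  # stage 1: every product in parallel
    stages = 1
    while len(values) > 1:
        if len(values) % 2:
            values.append(0)                # pad odd levels with a zero input
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        stages += 1                         # one more adder-tree level
    return (values[0] if values else 0), stages

a = list(range(256))
b = [1] * 256
seq_result, seq_cycles = dot_sequential(a, b)
par_result, par_stages = dot_parallel(a, b)
print(seq_cycles, par_stages)
```

In this model the sequential version needs as many steps as there are inputs, while the number of parallel stages grows only with the logarithm of the input count.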
2.5. Form factors
FPGA chips come packaged in different form factors depending on the application or use case. They can be directly integrated into the circuit board or added as an external device through a high speed connector (Die-to-die interconnect, PCIe, Network, …). In the industrial world, developers generally use development boards and PCIe extension cards to test and validate their prototype designs. The following subsections cover the most popular FPGA form factors (devboards and PCIe extensions) and their use cases.
2.5.1. Development boards
Development boards (or devboards) offer a large range of developer-friendly interfaces and tools for rapid prototyping, testing, debugging and tracing. For example, the Digilent Arty Z7 devboard shown below (Figure 9) offers two HDMI ports (input and output), a USB port, an Ethernet port, an audio port, and a large set of GPIO ports alongside an SD card reader. This board comes with a dual-core ARM Cortex-A9 CPU running at 650 MHz and tightly integrated with the Xilinx FPGA, which can be easily programmed/debugged using the micro-USB connector. This board was mainly designed for hobbyists who want to implement retro-computer architectures (Z80, 6502, …) and run retro games on more efficient modern hardware.
Figure 9: Digilent Arty Z7 FPGA devboard
Another interesting devboard is the Terasic Intel Pathfinder for RISC-V designed for promoting RISC-V software and hardware development. This board comes with a Cyclone IV FPGA containing 114,480 Logic Elements, 3,888 Kbits of embedded memory, and 128 MB of SDRAM. It offers a slew of I/O capabilities for embedded systems development: 2 Ethernet ports, 4 USB ports, TV decoder, RS232 port, …
Figure 10: Terasic Intel Pathfinder for RISC-V board
For years, FPGA devices were very expensive and only accessible to those with the financial means to invest in the hardware and the software stack. Nowadays, with the democratisation of electronics, FPGA boards have become more accessible and affordable to the general public. For example, Lattice, an embedded semiconductor and FPGA manufacturer, offers multiple affordable boards for students, universities, and hobbyists for less than $50. The board shown below in Figure 11 is the Lattice Icestick HX1K. It costs $48, is very low-power, and comes with 1K LUTs, 5 LEDs, 12 GPIO pins, and an infrared receiver/transmitter.
Figure 11: Lattice Icestick HX1K
For more devboards, you can visit the following links:
2.5.2. PCIe cards
FPGAs also come in a PCIe card format, mainly used for acceleration. These cards can be added to any computer system to extend its compute capabilities with custom designs aimed at accelerating specific compute workloads.
Intel, as well as AMD (Xilinx), offer multiple PCIe FPGA accelerator cards. Below are two examples, the Arria 10GX and the AMD (Xilinx) Alveo U200/U250.
Figure 12: Intel Arria 10GX accelerator card
Figure 13: Xilinx Alveo U200/U250
For more information about the Arria 10GX FPGA and the Alveo U200/U250, you can visit Intel's and Xilinx's references:
- https://ark.intel.com/content/www/us/en/ark/products/210381/intel-arria-10-gx-1150-fpga.html
- https://www.xilinx.com/content/dam/xilinx/support/documents/data_sheets/ds962-u200-u250.pdf
The PCIe form factor is mainly used in HPC (High Performance Computing) solutions for accelerating certain specific code patterns with custom hardware. Modern HPC systems using FPGAs for acceleration can be customized programmatically to better fit the computational needs of the target workload. More details on the use of FPGAs in HPC are discussed in the next section.
3. Applications
In the mid to late 1980s, FPGAs were mainly used to allow circuit board components using different communication protocols to talk to each other. For example, an FPGA could be used to interface a device using RS232 with another device using SPI by translating the commands/requests from one protocol to the other. This feature was primarily useful for telecommunications and networking devices. A decade later, FPGAs saw a massive growth in production volume and circuit sophistication, and by the end of the 1990s, they had found their way into consumer products as well as automotive and industrial applications.
In this section, we will present some of the key applications that benefited greatly from the advent of FPGAs and their democratization.
3.1. Digital Electronics Design
Digital electronics design is known to be the first discipline to benefit from the high configurability and flexibility of FPGAs. FPGAs allowed electronics engineers to implement, test, and validate their designs before building ASIC (Application Specific Integrated Circuit) chips, greatly improving the quality of the designs and the pace at which they are updated.
3.2. Digital Signal Processing
Digital signal processing is another field that benefited greatly from FPGAs. Many electronic devices, such as sound cards, embedded FPGAs within their circuit boards to implement noise filters and other signal/sound processing algorithms. Generally, for performance reasons, some signal processing primitives are implemented using DSPs. But in the last few years, FPGA technology has evolved
3.2.1. Sound processing
The advent and evolution of electronics brought numerous devices to the music recording industry. From instruments (like synthesizers) to elaborate studio sound cards and effect pedals, FPGAs have long been part of the circuitry. Most modern studio equipment relies heavily on FPGAs to implement sound processing algorithms such as equalization, noise filtering, clipping, etc. Usually, such equipment is prototyped using special audio development boards similar to the ones shown below (Coveloz BACH, Audiopraise XMOS). As you can observe, both boards have numerous and diverse audio I/O ports as well as an FPGA. The Coveloz BACH board uses a SoM (System-on-Module) Cyclone V FPGA from Altera, and the Audiopraise XMOS a Spartan 6 from Xilinx.
Figure 14: The Coveloz BACH v1 audio analysis board
Figure 15: The Audiopraise FPGA XMOS v1 audio board
At the industrial level, multiple vendors offer different types of hardware. For example, AVID offers a full sound processing stack for studio recording. Below is an excerpt from the HDX PCIe card brochure.
Figure 16: AVID HDX PCIe sound processing card
This card boasts two powerful FPGAs, 18 DSP processors, and 64 input/output channels in a compact PCIe format, allowing sound engineers to apply filters and effects live with near-zero processing latency on multiple input and output channels (https://www.avid.com/products/pro-tools-hdx).
3.2.2. Image/Video processing
Video and image processing is another field where FPGA integration was crucial to the development of modern high quality video recording and processing devices. Many of the devices used by TV production studios, Special Effects studios, Industrial Machine Vision, Medical Imaging, … make extensive use of FPGAs in order to acquire and process video images.
Intel has a very interesting set of articles around the use of FPGAs for Machine Vision and Medical Imaging:
- https://www.intel.com/content/www/us/en/industrial-automation/products/programmable/applications/machine-vision.html
- https://www.intel.com/content/www/us/en/healthcare-it/products/programmable/applications/diagnostic-imaging.html
- https://www.intel.com/content/www/us/en/healthcare-it/products/programmable/overview.html
- https://www.intel.com/content/www/us/en/products/docs/programmable/ai-fpga-whitepaper.html
3.3. Cryptography & Network Security
FPGAs have also been extensively used to implement cryptographic primitives in cases where updatability and security are of major importance. Implementing cryptographic algorithms on FPGAs allows them to be updated through a firmware update when vulnerabilities are detected. For example, when the hashing algorithms SHA-1 and MD5 were broken, many devices using ASICs to implement these primitives were deemed insecure and had to be physically replaced. The devices using FPGAs, on the other hand, only needed a firmware update in order to switch to a more secure hashing algorithm. The same happened when the Data Encryption Standard (DES) was broken. Modern systems embed a mix of ASIC cryptochips and FPGAs to implement cryptographic routines.
3.4. Artificial Intelligence
3.5. BioInformatics
3.6. HPC
In the last 4 to 5 years, the HPC industry has been growing at a very fast pace due to the proliferation of data and compute-intensive applications. Underlying this growth has been a great need for better hardware with lower latency and higher efficiency. This led to the emergence of hardware accelerators, to the point where ~80% of today's HPC systems use some form of accelerator (GPU, FPGA, TPU, …). While GPUs have become very popular as accelerators, they are not always the best choice. The primary reason behind the successful integration of GPUs within the HPC stack is parallelism. However, FPGAs are an attractive option capable of delivering equal, if not better, performance compared to GPUs: the fixed architecture of the GPU supports only SIMD parallelism and requires the data to be batched in order to be used efficiently, a limitation not shared by FPGAs.
FPGAs are configurable hardware consisting of thousands of logic/compute blocks that a programmer can arrange into any computational construct. With a large number of computing resources available, FPGAs can be configured to support running multiple concurrent programs each processing multiple data blocks concurrently. Therefore, FPGAs can operate on large amounts of data from multiple sources simultaneously (video or audio streams, streams from multiple storage devices or memories, …).
4. Programming for FPGAs and software tools (WIP)
4.1. Programming languages
4.1.1. VHDL
4.1.2. Verilog
4.1.3. OpenCL
4.2. Software tools
4.2.1. Open Source
4.2.2. FloPoCo
4.2.3. ForwardCom
5. References
6. Appendix A
The following diagram shows an implementation of a digital 8 to 1 (8:1) multiplexer using logic gates. This multiplexer takes 3 selection inputs (s(0), s(1), s(2)) and outputs the selected value from a set of 8 inputs (I(0) to I(7)).
Figure 17: Digital 8:1 multiplexer
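The same multiplexer can be modelled gate by gate in Python. This is a minimal sketch rather than a transcription of Figure 17: it assumes the usual sum-of-products structure (one AND term per input, combined by an OR gate) and assumes s(0) is the least significant selection bit. The helper names NOT, AND, OR and mux8 are chosen for this example.

```python
def NOT(x):
    return 1 - x

def AND(*xs):
    return int(all(xs))

def OR(*xs):
    return int(any(xs))

def mux8(inputs, s0, s1, s2):
    """8:1 multiplexer built from NOT/AND/OR gates; s0 is assumed to be the LSB."""
    if len(inputs) != 8:
        raise ValueError("an 8:1 multiplexer needs exactly 8 inputs")
    terms = []
    for k in range(8):
        # Each AND term is active only when (s2, s1, s0) encodes the index k.
        select = (
            s0 if k & 1 else NOT(s0),
            s1 if k & 2 else NOT(s1),
            s2 if k & 4 else NOT(s2),
        )
        terms.append(AND(inputs[k], *select))
    return OR(*terms)

inputs = [0, 1, 0, 0, 0, 0, 1, 0]       # I(0) .. I(7)
print(mux8(inputs, 1, 0, 0))            # selects I(1)
print(mux8(inputs, 0, 1, 1))            # selects I(6)
```

Exactly one AND term can be active for any selection value, so the OR gate simply passes the chosen input through.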
7. Appendix B
The following is a circuit diagram showcasing how a PLL is implemented.
Figure 18: PLL circuit implementation