Features

- Single QSFP+ socket:
  - 4 ports 10GbE LAN/WAN using SFP+ modules OR
  - 1 port 40 GbE

- Dual SFP+ sockets for 10GbE

- Hosted in an 8-lane GEN1/GEN2/GEN3 PCIe slot

  - Compatible with Xilinx PCI Express Solutions
  - Compatible with Northwest Logic PCI Express Solutions
  - 16-lane mechanical
  - Short length form factor

- Fully compatible with our optional TCP Offload Engine (TOE/TOE128)

- Optional FIX board support package (DN_FBSP).

  Functioning reference design with:
  - 10 GbE MAC and 40 GbE MAC
  - TCP/IP Offload Engine (TOE)
  - Up to 128 sessions
  - FIX protocol parser
  - PCIe Interface (8-lane, GEN3)

- Memory
  - RLDRAM 3 Controller
  - DDR4 Controller

- Xilinx Kintex UltraScale+/UltraScale FPGA (A1156)
  - UltraScale+
    - KU15P-3, -2L, -1 (fastest to slowest)
    - KU11P-3, -2L, -1 (fastest to slowest)
  - UltraScale
    - XCKU095-2, -1
    - XCKU060-3, -2, -1
    - XCKU040-3, -2, -1
    - XCKU035-3, -2, -1
    - XCKU025-3, -2, -1 (note reduced functionality)

- 6M ASIC gates (ASIC measure) when stuffed with XCKU095
- 537k flip-flop/6-input LUTs (1M total FFs)
- 7,560 Kbytes total FPGA block memory (3,360 - 18 kbit blocks)
- 768 multipliers: 27 x 18
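
The block-memory total in the list above follows directly from the block count. A quick check, assuming 8 bits per byte and that "Kbytes" and "kbit" use the same prefix (the variable names are illustrative only):

```python
# Sanity check of the block RAM total quoted in the feature list.
BLOCKS = 3_360        # number of 18 kbit block RAMs (from the list above)
KBIT_PER_BLOCK = 18   # size of each block in kbit

total_kbit = BLOCKS * KBIT_PER_BLOCK
total_kbytes = total_kbit // 8   # 8 bits per byte

print(f"{total_kbit:,} kbit = {total_kbytes:,} Kbytes")
```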

- DDR4 Memory, 4GB (not available on KU15P/KU11P)
  - 72-bit data width (64-bit with 8-bit ECC)
  - 1200MHz operation, PC4-2400 with -2 or faster speed grade
  - DDR4 interface compatible with Vivado MIG

- RLDRAM 3 configured as 16M x 72 (144MB)
  - 16M x 36 with KU15P/KU11P
  - 1066MHz with -2 or faster speed grade

- SMBus-based thermal management

- GPS input for precise message time stamping and tracking
  - RS232, RS422, or RS485 interface

- Full support for embedded logic analyzers via JTAG interface
  - ChipScope, Exostiv, and other third-party debug solutions

- Eight FPGA-controlled LEDs
  - Enough debug LEDs to illuminate a small Koi pond.

### Table 1: FPGA Resources

<table>
<thead>
<tr>
<th rowspan="2">Kintex UltraScale+ (A1156)</th>
<th rowspan="2">Speed Grades (slowest to fastest)</th>
<th rowspan="2">FFs</th>
<th colspan="2">Gate Estimate</th>
<th rowspan="2">Multipliers (27x18)</th>
<th rowspan="2">Memory Blocks (UltraRAM 4k x 72bits)</th>
</tr>
<tr>
<th>Max (100% util)</th>
<th>Practical util</th>
</tr>
</thead>
<tbody>
<tr>
<td>KU15P</td>
<td>-1, -2, -3</td>
<td>1,045,000</td>
<td>10,032</td>
<td>6,020</td>
<td>1,968</td>
</tr>
<tr>
<td>KU11P</td>
<td>-1, -2, -3</td>
<td>597,000</td>
<td>5,731</td>
<td>3,440</td>
<td>2,928</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th rowspan="2">Kintex UltraScale (A1156)</th>
<th rowspan="2">Speed Grades (slowest to fastest)</th>
<th rowspan="2">FFs</th>
<th colspan="2">Gate Estimate</th>
<th rowspan="2">Multipliers (27x18)</th>
<th rowspan="2">Memory Blocks (UltraRAM 4k x 72bits)</th>
</tr>
<tr>
<th>Max (100% util)</th>
<th>Practical util</th>
</tr>
</thead>
<tbody>
<tr>
<td>KU095</td>
<td>-1, -2</td>
<td>1,075,200</td>
<td>10,322</td>
<td>6,190</td>
<td>768</td>
</tr>
<tr>
<td>KU060</td>
<td>-1, -1L, -2, -3</td>
<td>663,360</td>
<td>6,386</td>
<td>3,820</td>
<td>2,760</td>
</tr>
<tr>
<td>KU040</td>
<td>-1, -1L, -2, -3</td>
<td>484,800</td>
<td>4,654</td>
<td>2,790</td>
<td>1,920</td>
</tr>
<tr>
<td>KU035</td>
<td>-1, -1L, -2, -3</td>
<td>406,256</td>
<td>3,900</td>
<td>2,340</td>
<td>1,700</td>
</tr>
<tr>
<td>KU025</td>
<td>-1, -2</td>
<td>290,880</td>
<td>2,792</td>
<td>1,680</td>
<td>1,152</td>
</tr>
</tbody>
</table>
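
The two gate-estimate columns in Table 1 appear to track the flip-flop count closely. The sketch below uses only values copied from the table and computes gates per flip-flop and the practical-to-maximum ratio. It assumes the gate estimates are given in thousands of gates, which is inferred from the "6M ASIC gates" figure quoted for the XCKU095 and is not stated in the table. The relationship it shows is an observation about the numbers, not a documented formula.

```python
# Values copied from Table 1: (flip-flops, max gate estimate, practical gate estimate).
# Assumption: gate estimates are in thousands of gates.
TABLE_1 = {
    "KU15P": (1_045_000, 10_032, 6_020),
    "KU11P": (597_000, 5_731, 3_440),
    "KU095": (1_075_200, 10_322, 6_190),
    "KU060": (663_360, 6_386, 3_820),
    "KU040": (484_800, 4_654, 2_790),
    "KU035": (406_256, 3_900, 2_340),
    "KU025": (290_880, 2_792, 1_680),
}

def ratios(ffs, max_kgates, practical_kgates):
    """Return (gates per flip-flop at 100% utilization, practical / max)."""
    gates_per_ff = max_kgates * 1_000 / ffs
    practical_fraction = practical_kgates / max_kgates
    return gates_per_ff, practical_fraction

for device, row in TABLE_1.items():
    gates_per_ff, fraction = ratios(*row)
    print(f"{device}: {gates_per_ff:.2f} gates/FF, practical = {fraction:.1%} of max")
```

Every row works out to roughly 9.6 gates per flip-flop, with the practical figure at about 60% of the maximum.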

Block Diagrams

**UltraScale+**

**UltraScale**

[Block diagram: DNPCIe_40G_KU_LL (KU025), "Daughter of Godzilla's Bad Hair Day", v1.10. An UltraScale KU025 FPGA (FFVA1156) connects to a QSFP+ socket (4x 10GbE or 1x 40GbE), GPS time sync (RS422 / RS232 / RS485), EEPROM, SPI flash, DDR4 (512M x 72, PC4-2400), 1 Gbit configuration flash, USB 2.0 (Type B) with USB JTAG, FPGA JTAG, and 8-lane GEN3 PCIe.]

Description

Overview

The **DNPCIE_40G_KU_LL_2QSFP** is a PCIe-based FPGA board designed to minimize input-to-output processing latency on 10-Gbit or 40-Gbit Ethernet packets. The primary application is low-cost, low-latency, high-throughput trading without CPU intervention. Every possible variable that affects input-to-output latency has been analyzed and minimized. Raw 10 GbE or 40 GbE packets can be analyzed and acted upon without a MAC, interrupts, or an operating system adding delay to the process. This configurable hardware computing platform can achieve the theoretical **minimum** Ethernet packet processing latency.

The FPGA – Xilinx Kintex UltraScale+/UltraScale

We use a single FPGA from the Xilinx Kintex UltraScale+/UltraScale family in the A1156 package. This package supports 520 I/Os, the majority of which are utilized. Most are dedicated to off-chip memory peripherals, including dual RLDRAM 3 devices for low-latency, high-speed look-up and a bank of DDR4 memories for performance-oriented bulk storage. The Kintex UltraScale/UltraScale+ FPGA contains high-speed transceivers capable of 16 Gb/s without the need for an external PHY. Eight of these transceivers are used for an 8-lane GEN3 PCIe interface, with Kintex UltraScale+ capable of GEN4. A single set of 4 GTH transceivers is connected to a QSFP+ socket for 40 GbE (or 4 channels of 10 GbE). Two GTH transceivers are connected to SFP+ sockets for 10 GbE. Twelve additional GTH transceivers are attached to connectors and can be used for high-speed board-to-board communication using cables, but note that functionality is reduced when using the KU040/035/025.

Two possible UltraScale+ FPGAs can be stuffed: KU15P and KU11P. Five possible Kintex UltraScale FPGAs can be stuffed (largest to smallest): KU095, KU060, KU040, KU035, or KU025. These FPGAs come in a variety of speed grades (-1, -2/-2L, -3), with -3 the fastest. A -2 or faster speed grade is required to achieve the highest clock rates on the RLDRAM 3 and DDR4 interfaces. Table 1 lists the resources of the FPGAs with the Xilinx marketing exaggerations excised. These are large but low-cost FPGAs. The KU095 is capable of handling ~6M ASIC gates of logic; remember that the internal FPGA memory and multiplier blocks are not part of this number. UltraScale+ adds large blocks of internal RAM (UltraRAM). Features of the Kintex UltraScale/UltraScale+ FPGAs include efficient, dual-register 6-input look-up table (LUT) logic, 18 Kb (2 x 9 Kb) block RAMs, and third-generation DSP slices (each with a 27 x 18 multiplier and a 48-bit accumulator). Floating-point functions can be implemented using these DSP slices. Note that the largest device, the KU095, contains the fewest multipliers.

**Low Latency Network Interface**

1 channel of 40 GbE or 6 channels of 10 GbE via QSFP+ and dual SFP+

The Kintex UltraScale/UltraScale+ FPGA has transceivers capable of 16 Gb/s. The physical interface (PHY) is handled using a single QSFP+ module for 40 GbE. With the proper cable, this can be split into 4 separate channels of 10 GbE. Raw Ethernet packets can be accessed directly by bypassing the MAC.
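
The arithmetic behind these rates can be sketched as follows. 10GBASE-R uses 64b/66b line coding, so each 10 GbE lane runs at a serial rate slightly above 10 Gb/s, and a QSFP+ socket carries four such lanes. The 64-bit internal datapath width is an assumption made here to show where a 156.25 MHz core clock comes from, and the transceiver limit is taken as 16 Gb/s.

```python
from fractions import Fraction

PAYLOAD_GBPS = Fraction(10)          # 10 GbE data rate per lane
ENCODING = Fraction(66, 64)          # 64b/66b line-coding overhead
TRANSCEIVER_MAX_GBPS = Fraction(16)  # transceiver limit, taken as 16 Gb/s
QSFP_LANES = 4                       # lanes in one QSFP+ socket
DATAPATH_BITS = 64                   # assumed internal datapath width

lane_rate = PAYLOAD_GBPS * ENCODING              # serial rate per lane, Gb/s
aggregate_payload = PAYLOAD_GBPS * QSFP_LANES    # payload across the socket, Gb/s
core_clock_mhz = PAYLOAD_GBPS * 1000 / DATAPATH_BITS

print(f"Per-lane line rate: {float(lane_rate)} Gb/s")
print(f"QSFP+ aggregate payload: {float(aggregate_payload)} Gb/s")
print(f"Core clock for a {DATAPATH_BITS}-bit datapath: {float(core_clock_mhz)} MHz")
print(f"Lane fits within transceiver limit: {lane_rate <= TRANSCEIVER_MAX_GBPS}")
```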

**Memory**

**RLDRAM 3 – Memory designed for low latency**

We use dual 16M x 36 RLDRAM 3 devices (576Mbit) resulting in a 16M x 72-bit bank. Due to banking issues, the RLDRAM 3 on the Kintex UltraScale+ KU15P/KU11P is 16M x 36. This type of memory has shared input and output data paths. RLDRAM 3 is optimized to minimize the time between the beginning of an access cycle and the instant that the first data is available, making it attractive for network packet processing applications and data center acceleration. The maximum tested frequency of this memory is 1066 MHz (assuming a -2 speed grade FPGA or faster). To minimize processing latency, we suspect it will be best to clock the RLDRAM 3 at 937.50 MHz, which is exactly six times the internal Ethernet controller frequency of 156.25 MHz. The Kintex UltraScale/UltraScale+ FPGAs are capable of generating internal 6x clocks that are phase synchronous, eliminating the latencies associated with the tricky re-synchronization of data moving between different clock frequencies. The internal controller can be optimized in any way you choose. We, of course, provide several Verilog examples using the Xilinx MIG that you are welcome to use. All functions of the Micron RLDRAM 3 can be exploited. The only real limitation is the amount of time and effort spent in customizing the individual memory controller.
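
The choice of multiplier can be checked with a short calculation. This is a minimal sketch that lists the integer multiples of the 156.25 MHz Ethernet clock that stay within the 1066 MHz tested limit; the function name is hypothetical and the code says nothing about what a particular memory controller actually supports.

```python
from fractions import Fraction

ETHERNET_MHZ = Fraction(625, 4)   # 156.25 MHz internal Ethernet clock
RLDRAM_MAX_MHZ = 1066             # maximum tested RLDRAM 3 frequency

def synchronous_clocks(base_mhz, limit_mhz):
    """List (multiplier, frequency) pairs that stay within limit_mhz."""
    result = []
    multiplier = 1
    while base_mhz * multiplier <= limit_mhz:
        result.append((multiplier, base_mhz * multiplier))
        multiplier += 1
    return result

options = synchronous_clocks(ETHERNET_MHZ, RLDRAM_MAX_MHZ)
best_multiplier, best_mhz = options[-1]
print(f"Highest synchronous clock: {best_multiplier}x = {float(best_mhz)} MHz")
```

A 7x clock would be 1093.75 MHz, which exceeds the tested limit, so 6x is the highest phase-synchronous option for this memory.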

**DDR4 – 4GB of local bulk memory**

Nine PC4-2400 DDR4 chips are mounted on the card, providing 4GB of DDR4 memory. DDR4 is not available with the KU15P/KU11P. The memory configuration is 512M x 72. Using a -2 or -3 speed grade FPGA, this memory bank is tested at the maximum FPGA I/O frequency: 1200 MHz (2400 Mb/s with DDR).
To minimize data synchronization across clock boundaries, it probably makes sense to clock this DDR4 interface at a 7x multiple of the base Ethernet frequency of 156.25 MHz, which is 1093.75 MHz. A 7x phase-synchronous clock can be easily generated internal to the FPGA, allowing zero-latency synchronous data transfers between the Ethernet packet receiving logic and the DDR4 memory controller. The DDR4 controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the DDR4 DRAM can be exploited and optimized. Up to 8 banks can be open at once. Timing variables such as CAS latency and precharge can be tailored to the minimum given your operating frequency and the timing specification of the exact DDR4 memory utilized. As with the RLDRAM 3, the only real limitation is the amount of time and effort spent customizing the DDR4 memory controller to your needs.
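
The DDR4 figures can be checked the same way. This sketch derives the largest integer multiple of the Ethernet clock that fits under the 1200 MHz limit, the resulting transfer rate, and the usable capacity of a 512M x 72 bank in which 64 of the 72 bits carry data. Names are illustrative, and capacity is expressed in binary gigabytes.

```python
from fractions import Fraction

ETHERNET_MHZ = Fraction(625, 4)   # 156.25 MHz base Ethernet clock
DDR4_MAX_MHZ = Fraction(1200)     # maximum tested DDR4 clock
DEPTH_WORDS = 512 * 2**20         # 512M words
DATA_BITS = 64                    # data bits per word
ECC_BITS = 8                      # ECC bits per word (72-bit bus in total)

multiplier = DDR4_MAX_MHZ // ETHERNET_MHZ   # largest whole multiple that fits
clock_mhz = multiplier * ETHERNET_MHZ
transfers_mtps = 2 * clock_mhz              # double data rate: two transfers per clock
capacity_gib = DEPTH_WORDS * DATA_BITS // 8 // 2**30

print(f"{multiplier}x clock: {float(clock_mhz)} MHz ({float(transfers_mtps)} MT/s)")
print(f"Usable capacity: {capacity_gib} GB over a {DATA_BITS + ECC_BITS}-bit bus")
```

Running at the 7x synchronous clock gives up some raw bandwidth relative to the 2400 Mb/s maximum in exchange for avoiding a clock-domain crossing.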

**PCIe – Customizable 8-lane, GEN3 PCI Express**

PCIe is connected directly to the FPGA via 8 lanes of GTH transceivers. Note that the board has a 16-lane mechanical finger for stability. The interface is fully GEN2 and GEN3 capable, with GEN4 on the KU15P/KU11P. We ship GEN3 PCIe IP that is a full-function, fixed, 8-lane master/target. To gain access to the PCIe interface, this IP must be integrated with your application. The Dini Group PCIe IP provides a flexible interface that allows the user access to multiple DMA engines, scratchpad memories, interrupts, and other endpoint-related functions to maximize performance while utilizing minimal FPGA resources. Drivers (required), with C source, are included for several operating systems at no charge.
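
For a sense of scale, the raw link bandwidth of an 8-lane interface can be estimated from the per-lane transfer rate and line coding of each PCIe generation. This is a rough sketch that ignores protocol overhead such as packet headers and flow control, so achievable throughput will be lower; the function name is hypothetical.

```python
# (transfer rate in GT/s, payload bits, total bits) per PCIe generation
PCIE_GENERATIONS = {
    "GEN1": (2.5, 8, 10),      # 8b/10b encoding
    "GEN2": (5.0, 8, 10),      # 8b/10b encoding
    "GEN3": (8.0, 128, 130),   # 128b/130b encoding
    "GEN4": (16.0, 128, 130),  # 128b/130b encoding
}

def raw_bandwidth_gbytes(generation, lanes=8):
    """Raw per-direction link bandwidth in GB/s, before protocol overhead."""
    rate_gt, payload_bits, total_bits = PCIE_GENERATIONS[generation]
    return rate_gt * lanes * payload_bits / total_bits / 8

for gen in PCIE_GENERATIONS:
    print(f"{gen} x8: {raw_bandwidth_gbytes(gen):.2f} GB/s per direction")
```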

**How Everything Works …**

With direct data feeds such as NASDAQ ITCH and OUCH, the DNPCIe_40G_KU_LL contains all of the basic functions required to minimize the amount of time it takes to receive Ethernet packets, process them, and respond deterministically. By using the FPGA to process Ethernet packets, the processor and operating system are removed from the critical path, and traditional sources of latency such as interrupts and context switching no longer hinder performance. Not a single clock cycle. For algorithms requiring processing, FPGA resources can be hard-coded to perform the task. This includes real-time Monte Carlo analysis and floating-point computation.

Photos