IP-centric Media Data Center
Cover Story
TelecomPlus
Sep 2011
As broadcasters transition to file-based media production, large
disk-based storage systems are becoming the fundamental media
service of the production architecture. However, media traffic
imposes much more rigorous throughput requirements than
classical IT workloads. Storage components must handle
gigabyte-size files, large chunks of data in one I/O (typically up
to 4 MB), and continuous streams of traffic bursts over the storage
network. This article explores various aspects of the problem.
Shahid Zahid
To increase throughput, media storage solutions distribute, or
stripe, data over several distinct storage systems. Because
every server needs parallel access to every storage system,
media storage often relies on storage cluster concepts typically
used in High-Performance Computing (HPC). These clusters
employ a large number of devices, leading to complex storage
network architectures. However, while HPC clusters typically
exchange mostly small messages, media networks are
continuously loaded to full capacity, leading to network
congestion and sustained oversubscription of the switch ports.
These circumstances present a significant challenge:


How can network engineers design a scalable storage network
that can sustain the continuous throughput required by
file-based media production while maintaining high efficiency
and network use? As this article describes, the most significant
barrier is traffic interference. Previously, the VRT-media lab
demonstrated that a storage cluster architecture employing
Cisco's Data Center Bridging (DCB) technology and the PAUSE
frame mechanism defined in the IEEE 802.3x standard can
achieve higher link bandwidth use and scalability than
traditional InfiniBand (IB) solutions. However, the fundamental
impediment of traffic interference remains for both 802.3x- and
IB-based clusters.


The VRT-media lab sought to address this with priority flow
control (PFC), performing a series of comparative tests between
802.3x- and PFC-enabled storage clusters. Ultimately, it found
that PFC eliminates traffic interference and supports a highly
scalable storage network that sustains 100-percent efficiency.



Media storage architectures


Because media storage systems stripe data over several
storage systems, every server needs parallel access to every
storage system. Like classical IT storage area networks
(SANs), most first-generation media file systems use a single
Fibre Channel (FC) storage network to connect every file
server node with every storage controller. This leads to a
complex network topology that is ill-suited for media
environments. The VRT-media lab demonstrated that under
sustained media storage traffic loads, the long traffic bursts
interfere with each other in the switch buffers and create
severe efficiency loss (Figure 1).

As shown, when multiple sources deliver long bursts of traffic to
the same destination, throughput of the source links is limited
by the bandwidth of the aggregating link. However, when a
second destination requests data from the same source
storage controller (see purple traffic flow in Figure 1), the
second destination server does not receive the full bandwidth
available at the shared source link. Because the switch port
buffers are filled with “blue” traffic, the purple flow can only
pass a data frame every time a blue frame is read by the left
destination server, a problem exacerbated by the fact that
the left destination is reading from four sources simultaneously.
Traffic interference occurs, and traffic flow to the second
destination slows. Extrapolating this effect to larger media
storage network topologies, efficiency severely deteriorates,
limiting the scalability of any FC-based media storage
environment.
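
The interference described above can be reproduced with a small
time-slot model. The following Python sketch is an illustration only,
not the VRT test setup: the function name, the buffer capacity of eight
frames, and the rule that the blue destination accepts a frame from this
port only every fourth slot (because it reads from four sources) are all
assumptions made for the example. It compares one shared FIFO with one
queue per flow.

```python
from collections import deque

def simulate(slots, per_flow_queues, capacity=8, blue_period=4):
    """Count frames delivered to each destination over `slots` time slots."""
    colors = ("blue", "purple")
    if per_flow_queues:
        queue_of = {c: deque() for c in colors}
        queues = list(queue_of.values())
    else:
        fifo = deque()                       # one FIFO shared by both flows
        queue_of = {c: fifo for c in colors}
        queues = [fifo]
    delivered = {c: 0 for c in colors}
    turn = 0                                 # colour the source wants to send next
    for t in range(slots):
        # Assumption: the blue destination also reads from other sources, so
        # it takes a frame from this port only every `blue_period` slots.
        # The purple destination could take one frame in every slot.
        ready = {"blue": t % blue_period == 0, "purple": True}
        # Egress: only the frame at the head of each queue can leave.
        for q in queues:
            if q and ready[q[0]]:
                delivered[q.popleft()] += 1
        # Ingress: the source link carries at most one frame per slot and
        # sends the first colour (starting with `turn`) that has buffer room.
        for k in range(len(colors)):
            c = colors[(turn + k) % len(colors)]
            if len(queue_of[c]) < capacity:
                queue_of[c].append(c)
                turn = (turn + k + 1) % len(colors)
                break
    return delivered

shared = simulate(400, per_flow_queues=False)
separate = simulate(400, per_flow_queues=True)
print("shared FIFO:    ", shared)
print("per-flow queues:", separate)
```

With the shared FIFO, every purple frame waits behind a blue frame, so
the purple flow is held to the pace of the blue destination. With
separate queues the purple flow picks up most of the bandwidth the blue
flow cannot use.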


DCB-based WARP cluster network


These limitations can be partially overcome by splitting the
storage network into two separate networks, as shown in Figure
2. This can be accomplished using IBM's General Parallel File
System (GPFS) and a Workhorse Application Raw Power
(WARP) media storage cluster consisting of storage cluster
nodes and Network-Attached Cluster Nodes (NAN). This
architecture has a much simpler topology.
DCB transport is well-suited as the cluster network for this type
of media storage architecture. DCB allows flows to be tightly
controlled and load-balanced over the links and uses the 802.3x
PAUSE mechanism to provide link-level flow control similar to
FC, creating a “lossless” environment. The result is a notable
improvement in scalability and link bandwidth use compared
with FC or even IB; however, the fundamental effects of traffic
interference remain (Figure 3).

As shown, when multiple NAN nodes read traffic from the
storage nodes, each storage node responds with large bursts
of media traffic toward each requesting NAN node (depicted as
different colors). At the Converged Network Adapter (CNA)
network interface of the storage node, the bursts are queued in
the network interface buffer. These frames are sent to the
switch (shown here as a Cisco Nexus 5000), where they end up
in a single ingress queue buffer. Because 802.3x PAUSE link
flow control is configured, the link sends a PAUSE frame to the
storage node once the high threshold of the buffer is reached,
thereby avoiding frame loss.


In this example, three different NAN nodes are reading frames
out of this buffer and also from the other storage nodes. This
limits the total reading bandwidth on this port to only 75 percent
of the incoming traffic throughput. Hence, the buffer fills up,
and the PAUSE mechanism kicks in. If, because of the bursty
nature of the traffic per flow, the filling of the switch port buffer
is not equally distributed over the three different “colors,” one
of the colors (or traffic flows) can be depleted by the
simultaneously reading NAN nodes before the buffer reaches
its low threshold and unpauses the link. When this happens, no
frame from the depleted color is available, resulting in a
“read-miss” at the NAN node and a drop in efficiency. The issue
continues until the link is unpaused and frames of the missing
color are again provided out of the network interface queue of
the storage node. This efficiency loss can cause significant
performance degradation in the network. Fortunately, there is a
solution to this dilemma.
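
The read-miss effect can be illustrated with a simple tick-based model
of one switch ingress buffer under link-level PAUSE. This Python sketch
is a simplification under stated assumptions, not a model of the Nexus
hardware: the thresholds (24 and 6 frames), the burst length, the warm-up
period, and the schedule in which each of three readers takes one frame
every fourth tick (which yields the 75 percent drain rate mentioned
above) are all chosen for the example.

```python
def count_read_misses(ticks, burst, high=24, low=6, warmup=48):
    """Count reads that find no frame of the wanted colour in the buffer."""
    counts = [0, 0, 0]      # buffered frames per colour (one colour per NAN node)
    paused = False          # state of the whole link (802.3x-style PAUSE)
    sent = 0                # frames sent so far by the storage node
    misses = 0
    for t in range(ticks):
        # Three readers share four ticks: total drain is 75% of the fill rate.
        reader = t % 4
        if reader < 3:
            if counts[reader] > 0:
                counts[reader] -= 1
            elif t >= warmup:           # ignore the empty-buffer start-up phase
                misses += 1
        # Link-level flow control with high/low thresholds on the total fill.
        total = sum(counts)
        if not paused and total >= high:
            paused = True
        elif paused and total <= low:
            paused = False
        # The storage node sends `burst` frames of one colour, then the next.
        if not paused:
            color = (sent // burst) % 3
            counts[color] += 1
            sent += 1
    return misses

bursty = count_read_misses(400, burst=16)
interleaved = count_read_misses(400, burst=1)
print("read-misses with 16-frame bursts:", bursty)
print("read-misses with interleaved frames:", interleaved)
```

In this model, long single-color bursts leave the buffer unevenly
filled when the link is paused, so one color runs dry before the low
threshold is reached. Evenly interleaved frames do not show the same
depletion.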


Priority flow control


DCB provides another, more advanced flow control
mechanism: PFC. IEEE 802.1Q defines a tag that contains a
three-bit priority field, allowing engineers to assign priorities to
different Layer 2 traffic flows. With PFC, the network can be
configured to pause traffic labeled with a specific priority (or
“p-value”) independent of the other traffic. The mechanism works
the same way as 802.3x PAUSE but selectively, per traffic
class, instead of pausing the whole link at once. Effectively,
each traffic class gains its own independent buffers and pause
mechanism. Whereas an 802.3x DCB WARP cluster will have
traffic interference at oversubscribed ports, PFC can link
different priorities to the traffic flows between two specific
nodes of the storage cluster, allowing engineers to implement
flow control for each distinct traffic flow, as shown in Figure 4.
Ultimately, this solution eliminates traffic interference.
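
To make the three-bit priority field concrete, the following Python
sketch packs and unpacks an 802.1Q tag. The tag layout (a 16-bit TPID of
0x8100 followed by a 16-bit field holding 3 priority bits, 1 drop-eligible
bit, and a 12-bit VLAN ID) follows the standard; the function names and
the example VLAN ID are illustrative assumptions.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that identifies an 802.1Q tag

def build_vlan_tag(pcp, vid, dei=0):
    """Return the 4-byte 802.1Q tag for a priority (0-7) and VLAN ID (0-4095)."""
    if not 0 <= pcp <= 7:
        raise ValueError("priority must fit in three bits (0-7)")
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN ID must fit in twelve bits (0-4095)")
    if dei not in (0, 1):
        raise ValueError("DEI is a single bit")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)

def parse_vlan_tag(tag):
    """Return (priority, dei, vlan_id) from a 4-byte 802.1Q tag."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return tci >> 13, (tci >> 12) & 1, tci & 0xFFF

tag = build_vlan_tag(pcp=5, vid=100)   # example values, chosen arbitrarily
print(tag.hex(), parse_vlan_tag(tag))
```

Because the field is three bits wide, at most eight distinct priority
values are available on a link.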


Consider again the situation shown in Figure 3, in which
multiple NAN nodes read traffic from the storage nodes and
each storage node responds with large bursts of media traffic
toward each requesting NAN node (again depicted as different
colors). This time, however, each flow between a distinctive
source-destination pair is labeled with a different priority value.
With PFC activated on the CNA, each p-value-labeled flow has
a separate buffer in the network interface, and the bursts are
queued into the dedicated network interface buffer for each
respective color. On the other side of the link, the Nexus 5000
DCB switch port also uses dedicated queue buffers for each
p-value, providing separate sending and receiving queue
buffers at both ends of the link for each color. Frames are
picked in a round-robin fashion out of the different CNA queues
and sent over the link, where they fill up their respective
ingress queue buffers of the switch port.
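
The sending side described above can be sketched as a set of
per-priority queues served in round-robin order, where a paused priority
is simply skipped. This Python example is a minimal illustration, not the
CNA's actual scheduler; the class name, method names, and frame labels
are hypothetical.

```python
from collections import deque

class PriorityQueues:
    """Per-priority transmit queues with round-robin service."""

    def __init__(self, priorities):
        self._order = list(priorities)
        self.queues = {p: deque() for p in self._order}
        self.paused = set()      # priorities currently paused by the peer
        self._next = 0           # index of the priority to try first

    def enqueue(self, priority, frame):
        self.queues[priority].append(frame)

    def pause(self, priority):
        self.paused.add(priority)

    def resume(self, priority):
        self.paused.discard(priority)

    def dequeue(self):
        """Return (priority, frame) from the next eligible queue, or None."""
        n = len(self._order)
        for i in range(n):
            p = self._order[(self._next + i) % n]
            if p not in self.paused and self.queues[p]:
                self._next = (self._next + i + 1) % n
                return p, self.queues[p].popleft()
        return None

tx = PriorityQueues([0, 1, 2])
for p in (0, 1, 2):
    for seq in range(2):
        tx.enqueue(p, f"flow{p}-frame{seq}")

first_round = [tx.dequeue() for _ in range(3)]
tx.pause(1)                                  # peer paused priority 1 only
second_round = [tx.dequeue() for _ in range(3)]
print(first_round)
print(second_round)
```

While priority 1 is paused, frames for priorities 0 and 2 continue to
flow, and the frame queued for priority 1 stays in its own buffer until
that priority is resumed.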


In Figure 4, three different NAN nodes read frames out of the
buffers for their respective colors, and also from the other
storage nodes, once again filling the buffers. This time, the
PFC PAUSE mechanism kicks in. Now, because each flow fills
its own buffer independently, the switch can send a selective
pause frame to the storage node when necessary, pausing only one
traffic flow without interfering with others. At the same time, the
independent flow control mechanisms for each flow keep
enough frames available in the independent receiving switch
port buffers for each color. Hence, none of the streams are
depleted by the simultaneously reading NAN nodes, and the
reading links continuously operate at maximum efficiency. As
long as each storage-NAN server pair has an independent
priority value and queue, no traffic interference occurs.
Throughput scales linearly as the cluster is scaled, and the
storage cluster network achieves 100-percent efficiency.
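
For readers who want to see what a selective pause frame looks like on
the wire, the Python sketch below builds one. The field layout follows
the MAC Control format used by PFC (EtherType 0x8808, opcode 0x0101, a
priority-enable vector, and eight 16-bit pause timers); by comparison, a
classic 802.3x PAUSE frame uses opcode 0x0001 with a single timer for the
whole link. The function name and the all-zero source address are
placeholders for illustration.

```python
import struct

MAC_CONTROL_DST = bytes.fromhex("0180c2000001")  # reserved multicast address
ETHERTYPE_MAC_CONTROL = 0x8808
OPCODE_PFC = 0x0101

def build_pfc_frame(src_mac, pause_quanta):
    """Build a PFC frame (without FCS).

    pause_quanta maps priority (0-7) to a pause time in quanta (0-65535);
    a value of 0 tells the sender it may resume that priority.
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    vector = 0
    timers = [0] * 8
    for priority, quanta in pause_quanta.items():
        if not 0 <= priority <= 7:
            raise ValueError("priority must be 0-7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError("pause time must fit in 16 bits")
        vector |= 1 << priority          # mark this timer as valid
        timers[priority] = quanta
    frame = MAC_CONTROL_DST + src_mac + struct.pack(
        "!HHH8H", ETHERTYPE_MAC_CONTROL, OPCODE_PFC, vector, *timers
    )
    return frame.ljust(60, b"\x00")      # pad to the minimum frame size

# Pause only priority 3 for the maximum time; other priorities keep flowing.
frame = build_pfc_frame(bytes(6), {3: 0xFFFF})
print(len(frame), frame[12:34].hex())
```

Only the timer for priority 3 is set in this example, which is what
lets one flow be paused while the others on the same link continue.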



Conclusion


These results clearly demonstrate both the substantial impact
of traffic interference on media storage networks and the
extraordinary improvements in scalability and network
bandwidth use when using PFC. In the 802.3x clusters, traffic
interference causes a performance drop of up to 40 percent
when using four NAN nodes simultaneously. The same traffic
interference and performance drop have previously been
measured in IB-based WARP clusters. When PFC is enabled,
however, no traffic interference is observed at all. (The small
performance drop when reading from four NAN nodes is
caused by the fact that the file system can't launch prefetches
for reading requests aggressively enough to overcome the
statistical response fluctuation of the storage system when
running continuously at full throttle. This effect is not observed
when writing.)


The test proved unequivocally that the PFC-enabled cluster
network can sustain 100-percent efficiency at continuous full
throttle — demonstrating ideal scalability and an optimal
storage solution for IP media environments. Windows
performance is only marginally less than Linux performance but
still displays linear scalability and almost 100-percent use of
the available bandwidth. Clearly, the PFC-enabled cluster
network outperforms similar IB-based cluster architectures in
both throughput (especially for Windows) and linear scalability.