
Libcryptorandom


Downloads

Libcryptorandom [PDF 398KB]
Libcryptorandom Source Code [ZIP 376KB]

Libcryptorandom is a cross-platform library that allows programmers to obtain cryptographically secure random numbers from the best available entropy source on the underlying system. The library frees the programmer from having to understand and code for various OS-specific crypto implementations and/or hardware devices. The calling program merely specifies what grade of random bytes are needed, and the library returns a random number provider that will satisfy the request (if available).

This library supports Intel® Secure Key.

Underlying sources of random numbers are referred to as providers. The library chooses the best available provider from the list of defined providers that will satisfy the request. "Best" is a somewhat subjective term, but the intention is to favor high-throughput sources over lower ones, and high-quality hardware devices over OS implementations.

Linux*, OS X*, and Windows* operating systems are supported, in both 32- and 64-bit builds.

API Overview

The API includes the following functions:

int random_open(crypto_random_t *provider, unsigned int flags);

Open a random provider that meets the requirements specified in flags.

int random_close(crypto_random_t *provider);

Close an open provider and free its resources.

int random_info(crypto_random_t *provider, int parameter, void *value);

Obtain information about a provider, or about all known providers.

ssize_t random_read(crypto_random_t *provider, void *buf, ssize_t len);

Read random bytes from an open provider.

int random_reseed(crypto_random_t *provider);

Explicitly force a reseed of the underlying random provider.

const char *random_strerror(int errornum);

Obtain an error string from an error code.
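
The following sketch shows how these calls fit together in a typical open/read/close sequence. It is illustrative only: the header name, the REQUESTED_GRADE placeholder, and the assumption that the functions return 0 on success are not taken from the library's documentation, so consult the distribution for the actual constants and return conventions.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <cryptorandom.h>          /* placeholder header name */

#define REQUESTED_GRADE 0          /* placeholder: substitute the library's
                                      flag constants for the grade you need */

int main(void)
{
    crypto_random_t provider;
    unsigned char buf[32];
    ssize_t n;
    int rv;

    /* Ask the library for the best provider that satisfies the request. */
    rv = random_open(&provider, REQUESTED_GRADE);
    if (rv != 0) {                 /* assumes 0 == success */
        fprintf(stderr, "random_open: %s\n", random_strerror(rv));
        return EXIT_FAILURE;
    }

    /* Read 32 cryptographically secure random bytes. */
    n = random_read(&provider, buf, sizeof(buf));
    if (n != (ssize_t) sizeof(buf)) {
        fprintf(stderr, "random_read: short read (%zd bytes)\n", n);
        random_close(&provider);
        return EXIT_FAILURE;
    }

    /* ... use buf ... */

    random_close(&provider);
    return EXIT_SUCCESS;
}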

The random providers known to libcryptorandom are:

OS

The OS facility for obtaining cryptographically secure random numbers. On Linux and OS X this would be the /dev/random and /dev/urandom devices. On Windows, random numbers come from the CryptGenRandom() function in Microsoft's CryptoAPI.

DRNG

Intel Corporation's digital random number generator, marketed under the name Intel® Data Protection Technology with Intel Secure Key. For more information, see the DRNG Software Implementation Guide.

Building and Installation

Libcryptorandom is distributed as source code and must be built on the target platform.

Linux

Builds and installs via Gnu Autotools, using either gcc or the Intel® compiler. The build target is a shared library, libcryptorandom.

OS X

Same build procedure as Linux.

Windows

Builds via Visual Studio*, using either the Microsoft or Intel compiler. The build target is a static library.

Licensing

Libcryptorandom is an open source library distributed under the terms of the BSD 2.0 license. The license text is included in the distribution.

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. 

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.


Location Data Logger Design and Implementation, Part 7: The accuracy circle user control

$
0
0

This is part 7 of a series of blog posts on the design and implementation of the location-aware Windows* Store app "Location Data Logger". Download the source code to Location Data Logger here.

The Accuracy Circle User Control

Many, if not most, internet mapping applications and consumer navigation systems don't just display the user's position on the map, they also give the user a visual indicator of the estimated accuracy of the position information. This typically takes the form of a circle centered on the current position with a radius corresponding to the best-guess of accuracy, hence the colloquial term "accuracy circle". The point of the accuracy circle is to tell the user "I think I'm here, but it's possible I'm somewhere else within this circle". The larger the circle, the less confidence there should be about the position report.

Before I go into the specifics of how an accuracy circle was implemented in Location Data Logger, however, I want to talk a little bit about the term "accuracy", what it means, and how it's calculated.

An aside: What is accuracy?

People use the term "accuracy" a great deal in geolocation applications, but there are many misconceptions about what it really represents in the physical world, and even more about how accurate accuracy really is.

The first rule of accuracy is that accuracy is a lie.

Maybe that's being a little harsh, but it's a statement that I am going to stand behind. What consumer devices report as "accuracy" is not really an accuracy at all, but rather something people in the GNSS world refer to as the "estimated position error", or EPE. The most important word in that term is "estimated". The harsher truth sitting behind all of this is that it is just not possible for a device to determine its own accuracy because there are simply too many factors that impact it, and virtually none of them (such as atmospheric conditions, and multipath and reflection issues) are knowable to the device. The EPE reported by your device is really just a guess, and there are no standards for how that guess is made. Each device manufacturer has their own algorithm for determining EPE based on what the device itself knows about the signals it is receiving, and that value is reported to the upper layers (in our case, to the Windows Geolocation sensor) as "accuracy".

The only figure that a device can actually know about its own accuracy is something called the dilution of precision, or DOP. This is a value that can be mathematically derived from the number of satellite signals that are being tracked, and their location in the sky relative to the device, sometimes referred to as the satellite geometry or configuration. The more satellites a device can see, and the more spread out the satellites are relative to the observer, the lower the DOP.

The problem with DOP is that it's a unitless number that represents the multiplicative effects of satellite geometry on measuring position. While it has some intuitive value—lower numbers are good, and higher numbers are bad—it isn't a number you can give to a user and have them translate to something more tangible like feet or meters.

A device manufacturer's EPE is an attempt to take DOP and turn it into something more useful to the user, but these methods are imperfect, they can vary wildly from device to device, and in many cases they are even nonsensical (I once worked with a device that always reported an accuracy in 4m increments, and another that always said its accuracy was 30m no matter what). This means you shouldn't just take the Accuracy property from the Windows Geolocation sensor with a grain of salt: you may need the whole shaker.

Building the accuracy circle

Even though the Accuracy property reported by the Geolocation sensor is not a reliable measurement of a device's accuracy, it is hardly useless, and there is some value in plotting it on the map with an accuracy circle. To do this, we need two things:

  1. A user control that will hold the code and XAML to draw the circle
  2. Some math to scale the circle with the scale of the map

The code for the user control is in MapAccuracyCircle.xaml and MapAccuracyCircle.cs.

The user control

For the user control, we want to display three things: the user's position with a dot, an accuracy disc that is shaded and semi-transparent, and a border around the accuracy disc, as shown below.

To implement this, I went with a Canvas in XAML and defined three objects:

<UserControl
    x:Class="Location_Data_Logger.MapAccuracyCircle"    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
    <Canvas>      
        <Ellipse x:Name="AccuracyCircle" Fill="CornflowerBlue" Opacity="0.20"/>
        <Ellipse x:Name="AccuracyCircleBorder" Opacity="0.8" Stroke="CornflowerBlue" StrokeThickness="2"/>
        <Ellipse Height="20" Width="20" Fill="CornflowerBlue" Margin="-10,-10,0,0"/>
    </Canvas>
</UserControl>

Note that the drawing order here is important: the semi-transparent disc must be rendered first, followed by the position dot and border. To simplify the rendering of the overlay on the map, the elements should be centered at 0,0 on the canvas. The width and height of the accuracy circle itself and its border are not specified, since they will be set in the application as the accuracy changes. The size of the position dot, however, is fixed at a 10 pixel radius, and the Margin property is used to center it at our origin.

The accuracy circle must be updated whenever one of the following occurs:

  1. The device's position changes
  2. The map view changes

The former is done via the setErrorRadius method in the MapAccuracyCircle object:

public void setErrorRadius(double errorRadius)
{
    if (errorRadius >= 0) radius = errorRadius;
    else return;

    UpdateAccuracyCircle();
}

It is called from within update_position() in the MainPage object.

The latter is done with a handler on the ViewChanged event which is set up when the object is initialized:

private Map _map;

public MapAccuracyCircle (Map map)
{
    this.InitializeComponent();

    _map = map;

    // Set an event handler to update the control when the map view changes

    _map.ViewChanged += (s, e) => { UpdateAccuracyCircle(); };
}

Both of these methods call UpdateAccuracyCircle(), which is responsible for setting the radius of the accuracy circle on the map display.

The math

To draw a circle on the map, we have to turn the accuracy measurement, which is in meters, into an appropriate number of pixels. This requires us to know how many meters are represented by a single pixel in the map display. The Bing* Maps object does not provide a method to determine this so we need to determine it ourselves. This is a rather simple calculation because Microsoft* uses the Mercator projection for Bing Maps, and they even provide the necessary formula in their online article "Bing Maps Tile System":

meters_per_pixel = Math.Cos(_map.Center.Latitude * DEG_TO_RAD) * CIRCUMFERENCE_EARTH / (256 * Math.Pow(2, _map.ZoomLevel));

Once the meters per pixel are known, it's trivial to determine the circle radius in pixels by dividing our current accuracy (error radius) by that value:

pixels = radius / meters_per_pixel;

AccuracyCircle.Width = AccuracyCircleBorder.Width = AccuracyCircle.Height = AccuracyCircleBorder.Height = pixels;

The circles are centered by setting the Margin property to offset them. 

AccuracyCircle.Margin = AccuracyCircleBorder.Margin =
    new Windows.UI.Xaml.Thickness(-pixels/2, -pixels/2, 0, 0);



Intel® Data Protection Technology with Secure Key in the Virtualized Environment


The digital random number generator (DRNG) behind Intel® Data Protection Technology with Secure Key provides high-quality random numbers that are accessible via the CPU instruction RDRAND. This easy-to-use feature is of great benefit to virtualized environments where limited system entropy must be divided up among a large number of virtual machines. Secure Key’s extremely high data rates, measured in the hundreds of MB/sec, combined with its accessibility via a single CPU instruction, ensure that it can supply sufficient entropy to all of the virtual machines on a single system, even under a heavy load, without fear of starving any of them.

Introduction

In a virtual environment without the benefit of Intel® Secure Key, the operating system must rely on hardware interrupts from system activity as a source of entropy. While this can be an acceptable solution for a single client system, this method does not scale well to virtual hosts for several reasons:

  • A server hosting multiple VMs in a data center will not typically have any keyboard or mouse input to contribute to overall system entropy, limiting the quantity of random events available for sampling.
  • Hypervisors virtualize hardware interrupts and inject them into the guest, a technique which further reduces the entropy available to the guest due to quantization.
  • The entropy that remains is shared among several guest systems, and these guests do not have an accurate picture of the total entropy coming from the host: each guest OS assumes it has access to the full system entropy.

The end result is that virtual machines are dividing up a very limited entropy source and assuming that there is more entropy in their pools than is actually available. Secure Key solves this problem by providing a reliable source of entropy with extremely high throughput that can be distributed to individual processes. Each RDRAND instruction results in a random number delivered only to the thread on the virtual machine that requested it, allowing each machine to have its own, discrete source of entropy.

Information about Intel® Secure Key and the DRNG can be found in the Software Implementation Guide.

The Test Environment

To test Secure Key’s ability to meet the entropy demands of a large virtual environment, we built a test configuration designed to maximize the entropy demands of each virtual host. The hypervisor software, VMware* ESXi 5.1, was installed on a system with two pre-production Intel® Xeon® E5-2650 v2 processors and 64 GB of RAM. This hardware configuration provides 24 physical cores and 48 hardware threads.

Within ESXi we created a total of sixty virtual machines, all clones of a single OS image: Ubuntu* 12.04.2 LTS 64-bit, each with one virtual processor. Note that this setup oversubscribes the hardware.

The Ubuntu guests all ran the latest build of the rngd daemon from the rng-tools package. This was obtained from the source repository on GitHub* and ensures support for Secure Key. The purpose of rngd is to monitor the kernel’s entropy pool and fill it as needed from external hardware sources of random bytes.

The Secure Key-enabled rngd uses the DRNG as an input source. The DRNG guarantees a reseed of its hardware-based pseudorandom number generator after producing 512 128-bit samples, so rngd can produce seed-grade entropy acceptable to the Linux kernel by employing AES mixing to combine intermediate samples, per the DRNG Software Implementation Guide.

To place a maximum load on the kernel’s entropy pools, the rngtest utility from the rng-tools package was run using /dev/random as an input source. Per the man page, rngtest uses the FIPS 140-2 tests to verify the randomness of its input data and also produces statistics about the speed of the input stream. Used in this manner, rngtest consumes entropy from /dev/random faster than it can be supplied by rngd so that any bottlenecks in the system occur in the source.

The test methodology was as follows:

  1. Start with 1 virtual machine (n = 1)
  2. SSH to the n guest(s) in parallel
  3. Execute rngtest with a 15-minute timeout via the timeout command
  4. Collect the statistics, including FIPS failure counts (if any) and the average input channel speed, from all active VMs
  5. Increase the VM count by 1 (n = n + 1)
    1. If n > 60, stop
    2. If n <= 60, repeat from step 2

This procedure resulted in an increasing entropy demand on Secure Key. The more VMs active, the more random bytes the DRNG needed to deliver to the various rngd instances.

Expected Results

The performance limits of Secure Key gave us a rough idea of what to expect. On the E5-2650 v2 processor, the bus connecting the CPU cores to the DRNG limits the total number of RDRAND transactions across all hardware threads on the CPU to about 47.5 million RDRANDs/second. The round-trip latencies for a RDRAND transaction limit each individual hardware thread to about 9 million RDRANDs/second. On a 64-bit OS, a RDRAND transaction can be up to 64 bits, so we have the following limits on RDRAND throughput:

  • Single thread: 73 MB/sec
  • All threads: 380 MB/sec

RDRAND throughput scales linearly with the number of threads until the total throughput limit is reached (in this case, 380 MB/sec). However, we have two CPUs in the test system, so that doubles the maximum throughput to 760 MB/sec. Hence, we expect the DRNG to maintain a supply rate of 73 MB/sec to each VM until we have more than 10 active VMs.

When our test is running in 11 VMs, the throughput ceiling is reached, and the fixed, total entropy supply of 760 MB/sec will get divided up amongst the VMs. As more VMs are added, it should be divided even further, with each VM getting a smaller and smaller share, averaging out to 760/n MB/sec where n is the number of virtual machines. There may, however, be some jitter in the results due to congestion on the bus.

The next transition should occur at 25 VMs, where the number of active guests exceeds the physical cores in the test system. Here, we expect to see even more jitter in the results as the CPU relies on Hyper-Threading Technology to manage the additional software threads. Though DRNG performance scales with Hyper-Threading, the guest OS (and rngd) is doing more than just requesting random numbers. The average entropy rate per VM will continue to trail off, but there should be some variation in each VM’s individual supply.

The last transition is at 49 VMs. Here, the number of guest machines exceeds the physical resources of the CPU. As the threads stack up, the RDRAND requests just get serialized so each VM should see a roughly equal share of entropy, but some threads may get more than others. We expect to see the average entropy rate per VM trail off as we keep adding machines, but with some bumps in each VM’s individual supply rate.

Rngtest reports the input channel speed (in our case, the bit rate coming from /dev/random) in Kibit/sec, and rngd performs a 512:1 data reduction when generating seed-grade entropy from RDRAND. Converting MB/sec to Kibit/sec and dividing by 512 gives the following expectations from rngtest; for example, 73 MB/sec works out to (73 × 1024 × 8) / 512 = 1168 Kibit/sec.

VM Count     Average Input Channel Speed (Kibit/sec)
1-10         1168
11-60        12160/n

Table 1. Expected input channel speeds

Again, the guest OS is doing more than just requesting random numbers from the DRNG, so we should expect to see slightly lower performance figures, but these make a useful theoretical limit.

Results

The theoretical and actual performance figures are shown in Figure 1.


Figure 1. Actual vs. Expected Entropy Rates per VM

With only a few exceptions, the measured bit rate per VM very closely matched expectations. At one VM, five VMs, and nine VMs there is a curious, unexplained drop in the average bit rate. It is interesting that these anomalies occur at the start of a group of four, but the underlying architectural cause is unknown.

Above nine simultaneous VMs, the bit rate drops more quickly than expected, probably due to saturation on the bus. Still, the bit rates stay within about 10% of expectations. Above 24 VMs, the difference between expected and actual throughput is barely noticeable.

When the VM count exceeds the number of physical cores, the per-VM throughput varies significantly for each guest, as shown in Figure 2. At this point, the hypervisor is relying on Hyper-Threading to handle the additional workload. When the VM count exceeds the number of physical threads, the hypervisor is oversubscribed and thread scheduling becomes the dominant performance driver. Despite these extreme demands on the system, entropy is still available to every guest OS. At our test limit of 60 virtual machines, each VM was seeing an entropy supply rate of about 200 Kbits/sec (roughly 25 KB/sec).

Conclusions

Through our tests we were able to validate that Intel® Secure Key has sufficient throughput to supply entropy to a large number of VMs, and at very high bit rates. Even when the number of active virtual machines on the system exceeds the number of physical cores and hardware threads, there is still entropy available at bit rates measured in KB/sec.

In a production data center we would not expect to see such continuous taxing of the DRNG from multiple concurrent VMs, much less on an oversubscribed system. Although this is clearly an artificial test, what it does prove is that Intel® Secure Key is capable of serving entropy to a large number of virtual machines even under the most extreme conditions.


Figure 2. Entropy rates per VM

RDRAND: Do I need to check the carry flag, or can I just check for zero?


One question I have been getting a lot lately is whether you have to check the status of the carry flag to see if a valid random number was returned by RDRAND. This question comes up because of the following description of a RDRAND underflow condition, which appears in the DRNG Software Implementation Guide:

After invoking the RDRAND instruction, the caller must examine the carry flag (CF) to determine whether a random value was available at the time the RDRAND instruction was executed. A value of 1 indicates that a random value was available and placed in the destination register provided in the invocation. A value of 0 indicates that a random value was not available. In this case, the destination register will also be zeroed.

It is the final two sentences of this passage that are the source of the inquiry. The logic goes like this: if the RDRAND instruction places a zero in the register in addition to clearing the carry flag (CF), can a developer just check for a value of zero instead?

Strictly speaking, the answer is no. The proper way to determine whether or not RDRAND returned a valid random number is to check the status of CF. If CF=1, the number returned by RDRAND is valid. If CF=0, a random number was not available.

For the rest of this discussion, for simplicity's sake, we'll assume you want to obtain a 64-bit random value. Everything below still applies if you ask for 16- or 32-bit values.
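
With that assumption, a minimal sketch of the correct check in C might look like the following. It uses the RDRAND compiler intrinsic from <immintrin.h>, whose return value reflects the carry flag; the helper name and retry count are illustrative choices, not part of any published API.

#include <immintrin.h>
#include <stdint.h>

/* Returns 1 and stores a valid random value in *dest on success, or 0 if
 * the DRNG failed to return a value within the given number of retries.
 * Compile with -mrdrnd (gcc/clang). */
int rdrand64_retry(uint64_t *dest, unsigned int retries)
{
    unsigned long long val;

    while (retries-- > 0) {
        /* _rdrand64_step() returns the carry flag: 1 means val holds a
         * valid random number, 0 means no value was available. */
        if (_rdrand64_step(&val)) {
            *dest = val;
            return 1;
        }
    }
    return 0;
}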

Why this matters

Behind this question is the assumption that the registers are only 0 when the RDRAND instruction cannot return a random number. In other words, that the RDRAND instruction never returns 0 as a random number.

For very early implementations of Intel Data Protection with Secure Key, this was true. The physical hardware, specifically the signaling method used on the bus, did not support an out-of-band method of transmitting an error condition. In order to support error reporting on these early architectures, it was necessary to appropriate one of the possible values in the 64-bit random number space and use that to indicate an error condition. For simplicity, the designers of the DRNG chose the value "zero". On these early architectures, random 64-bit values of zero are discarded, and thus never returned as a valid random number. A zero is only sent when an underflow occurs. This effectively reduces the random number space for RDRAND from 2^64 to 2^64 - 1, since the legal range is (1, 2^64 - 1) instead of (0, 2^64 - 1).

Newer architectures, however, do not have this limitation. Future implementations of Secure Key can return a value of 0 as a valid random number. They return values in the full 2^64 space, (0, 2^64 - 1). On these architectures, checking for a value of zero instead of for the correct condition of CF=0 throws away valid random numbers. In other words, you'll be issuing an extra RDRAND roughly every 2^64 executions on average.

What if I just ignore results of zero?

This is a logical question: if I don't care about zero, and don't mind that my valid range is (1, 2^64 - 1), then I can just ignore a zero. But using this as the test for a valid number is still technically incorrect. Software should not act upon the secondary effects of an instruction when making decisions. Only the published error checking procedures should be followed, as those are the ones that are guaranteed to be accurate and work both in the present and the future.

What if I need that zero in my range?

Since early architectures cannot return a zero, the recommended way to expand the range to (0, 2^64 - 1) is to XOR two RDRAND values together. This is guaranteed to produce a uniform random number in the full range of (0, 2^64 - 1) because XOR'ing a value with a uniformly random value results in a uniformly random value. Any bias towards a particular value will be negligible, though if you are paranoid about bias you can XOR multiple values: each XOR operation will result in an increasingly uniform result.
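
A minimal sketch of this technique, building on the carry-flag-checking helper above (again, names and retry counts are illustrative):

#include <stdint.h>

/* Produce a 64-bit random value whose range includes zero by XOR'ing two
 * RDRAND results together. rdrand64_retry() is the carry-flag-checking
 * helper sketched earlier; the retry count of 10 is arbitrary. */
int rdrand64_full_range(uint64_t *dest)
{
    uint64_t a, b;

    if (!rdrand64_retry(&a, 10) || !rdrand64_retry(&b, 10))
        return 0;      /* the DRNG could not supply two values */

    *dest = a ^ b;     /* covers the full (0, 2^64 - 1) range */
    return 1;
}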

 


To Concatenate or Not Concatenate RDRAND?


At the heart of Intel® Data Protection with Secure Key is the digital random number generator (DRNG), a NIST* SP800-90A compliant pseudorandom number generator which is accessed using the RDRAND instruction. Beginning with Intel CPUs code-named Broadwell, Secure Key will also include an SP800-90B and C compliant true random number generator, called an enhanced nondeterministic random number generator in the NIST specifications, that will be accessible via the RDSEED instruction.

Last year, I wrote a short blog post explaining the difference between RDRAND and RDSEED, and providing some recommendations on which instruction to use and under what circumstances. This blog posting was not without some controversy due to the nature of the DRNG and the numbers returned by the RDRAND instruction. While the DRNG generates 128-bit random numbers, the RDRAND instruction can return, at most, 64-bit random numbers. At issue was how to properly create a 128-bit random number from RDRAND when you can only fetch 64 bits at a time.

(Note that this article does not apply to RDSEED. The values returned by RDSEED can always be safely concatenated, as RDSEED is intended for seeding other pseudorandom number generators.)

How do you generate a 128-bit random number using RDRAND?

As one of our DRNG engineers put it, this is a question "that has two and a half answers, depending on who you talk to". The "two answers" part is that there are really two approaches for creating a 128-bit number from two 64-bit (or four 32-bit) random numbers. The "half answer" is that which method you choose depends in part on what you plan to do with the result. My goal is to present both methods as well as the guidance for selecting the correct one in your applications.

Note that the theoretical discussions below are not specific to the DRNG: they apply to any pseudorandom number generator. I reference RDRAND specifically because that is the focus of this article, but the methodologies presented below are universal.

Method #1: Concatenation

This is the simplest and fastest method: you simply concatenate two values together to form a longer number. Mathematically speaking, you fetch two random numbers, R1 and R2, and form R = R1 || R2, where || represents concatenation. For example, given the two 32-bit random numbers 0x807c837d and 0x58b6df9c, you can assemble the 64-bit number 0x807c837d58b6df9c. The order in which the values are arranged does not matter.
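
As a concrete sketch of Method #1 in C, assuming the RDRAND intrinsic from <immintrin.h> and using simplified retry loops (not robust production code):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Method #1: form a 128-bit value by concatenating two 64-bit RDRAND
 * results. The spin-until-success loops are a simplification; see the
 * DRNG Software Implementation Guide for robust retry logic.
 * Compile with -mrdrnd (gcc/clang). */
void rdrand128_concat(uint8_t out[16])
{
    unsigned long long r1, r2;

    while (!_rdrand64_step(&r1))
        ;                        /* retry until a value is available */
    while (!_rdrand64_step(&r2))
        ;

    memcpy(out, &r1, 8);         /* R = R1 || R2 */
    memcpy(out + 8, &r2, 8);
}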

The problem with concatenating values from a pseudorandom number generator is that the operation is additive when it comes to determining the total entropy in the resulting value. Entropy is a measure of what is unknown inside of a system. The amount of work required to brute-force predict a random value that has n bits of entropy is O(2^n). If you concatenate two values together, the work required to brute-force the result becomes only 2^n + 2^n = 2^(n+1). By combining two n-bit random values, each with n bits of entropy, in this manner you get a random number with effectively only one additional bit of entropy.

Where this gets confusing with respect to the DRNG is that it produces 128-bit values with 128 bits of entropy, but the RDRAND instruction can return at most 64 bits at a time. From a purely cryptographic standpoint, when you are given an n-bit random number you can have at most n bits of entropy. While it can be argued that the DRNG is in reality just splitting a 128-bit value into two pieces and handing them to you one piece at a time, from a theoretical viewpoint this does not matter. While the original value had 128 bits of entropy, the end result is that you are handed two 64-bit numbers one at a time, each of which only has 64 bits of entropy. Because they come from a pseudorandom number generator, the entropy is additive when the values are concatenated, and the resulting 128-bit number has only 65 bits of entropy.

This is a theoretical argument. In practice an attacker may need many more than 2^65 computations to brute-force predict the random values that come from the DRNG because of its construction, but in security applications it is best practice to design for the ideal attacker.

Does this mean that concatenation is bad and should never be done? No. It just means that there are scenarios where concatenation is inappropriate.

Method #2: Cryptographic Mixing

This method is more involved than concatenation, but it leads to a more robust random number. What you do is concatenate the two random values together as above, but then apply a cryptographic primitive to the concatenated value, discarding the originals. Mathematically, R = F(R1 || R2), where F is some cryptographic operation.

The advantage of this method over simple concatenation is that it completely destroys the original two random numbers, leaving no trace of them behind. The final value is a number that does not resemble the originals, cannot be identified as such, and the ideal attacker no longer has knowledge of the pseudorandom number sequence. When values are processed in this manner, the entropy is multiplicative. Returning to the amount of work required to brute-force predict a random value: with cryptographic mixing the result is of order 2^n * 2^n = 2^(2n). Specifically, given two 64-bit numbers from RDRAND that are mixed cryptographically, you end up with a 128-bit value that has 128 bits of entropy.

Which cryptographic primitive should you apply to the intermediate values R1 and R2? To completely eliminate all traces of the original value, you want a one-way function.

AES in CBC-MAC mode

This method is described in the Intel® Digital Random Number Generator (DRNG) Software Implementation Guide. 128-bit AES in CBC-MAC mode is used to create 128-bit random values from two 64-bit samples. One large advantage of this method is that it can be done entirely within the registers, which eliminates memory- and cache-based attacks and greatly streamlines execution. It has one disadvantage, though: it adds a circular dependency on RDRAND. Either the IV or the key has to be randomized (or, ideally, both), and those values must come from somewhere, with the logical choice being RDRAND.

SHA2 with HMAC

This method uses SHA-256 with HMAC (sometimes written as HMAC_SHA256) to generate a 256-bit digest from the original two values, returning the first 128 bits (or the last; it does not matter which). The reason for using HMAC_SHA256 instead of just plain SHA-2 is that HMAC is provably better at extracting randomness (see "Randomness Extraction and Key Derivation Using the CBC, Cascade and HMAC Modes" by Dodis, Gennaro, Hastad, Krawczyk, and Rabin).

Unlike using AES, the HMAC_SHA256 method does not introduce a dependency on RDRAND, but it cannot easily be implemented in a manner that keeps the intermediate values solely within the registers. This does, in theory, make it vulnerable to memory-based attacks.
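
As a sketch of the HMAC_SHA256 approach, here is one possible arrangement using OpenSSL's HMAC() routine. Note that the choice of which value serves as the HMAC key is not prescribed above; using one RDRAND value as the key and the other as the message is purely an illustrative assumption.

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

/* Method #2: mix two 64-bit RDRAND values with HMAC_SHA256 and keep the
 * first 128 bits of the digest. Using r1 as the key and r2 as the message
 * is an illustrative choice only. Link with -lcrypto (OpenSSL). */
void rdrand128_mix(uint64_t r1, uint64_t r2, uint8_t out[16])
{
    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len = 0;

    HMAC(EVP_sha256(),
         &r1, (int) sizeof(r1),                    /* key     */
         (const unsigned char *) &r2, sizeof(r2),  /* message */
         digest, &len);

    memcpy(out, digest, 16);   /* first 128 bits of the 256-bit digest */
}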

So when do I concatenate, and when do I mix cryptographically?

Determining which method to use is fairly straightforward, as what we are talking about here is the theoretical time required to brute-force predict random numbers. It stands to reason, then, that a usage with a short life is more forgiving than one with a long life. The concern is that, given sufficient time and resources, an attacker can brute-force the random values before their useful life has ended.

Specifically, that means that concatenation is sufficient for generating nonces, ephemeral keys such as session keys, and general random data. Where you really want to use cryptographic mixing is when you are generating static encryption keys. You can expect those keys to be in use for several years, and the data they protect is at risk if they can be exposed.

But isn't this all theoretical?

Well, yes, but that doesn’t mean we should ignore it. For one, it's always a good idea to follow best practices when programming security functions. This is not an area where we want to encourage sloppy programming. The second is that cryptographers are paranoid for a reason, and that reason is that sometimes theory meets practice. Sometimes it does so in unexpected ways.

Consider the following scenario: A developer creates software that generates 128-bit encryption keys using RDRAND. For simplicity, they use simple concatenation instead of cryptographic mixing, working under the assumption that because RDRAND is generating 128-bit random numbers, those two 64-bit values are really just a permutation of the original 128-bit value. They build their code and install it and it runs fine.

At some point in the future, however, that code may be executed on a machine that does not support the RDRAND instruction. Or, maybe the source code gets modified by someone who adds support for some other pseudorandom number generator. Or a coding error causes the program to choose a source other than RDRAND. The exact cause doesn't matter: instead of pulling from RDRAND, they instead pull 64-bit random numbers from some other source and concatenate them together. If that source has fewer than 128-bits of entropy, then they have measurably reduced the security of the system.

Another scenario: another programmer uses the original program as a reference. Maybe they are less versed in security programming, and blindly assume that concatenation is a valid means of combining multiple values from a PRNG because that's how it was done in the example they are following. The original programmer has inadvertently encouraged someone to follow a very, very bad practice.

Do things right, and follow the theory, and you protect yourself both in the present and the future.

Intel® Digital Random Number Generator (DRNG) Software Implementation Guide


Revision 2.0
May 15, 2014

Downloads

Download Intel® Digital Random Number Generator (DRNG) Software Implementation Guide [PDF 975KB]
Download Intel® Digital Random Number Generator software code examples

Related Software

Download Rdrand manual and library (Linux* and OS X* version)
Download Rdrand manual and library (Windows* version)

1. Introduction

Intel® Secure Key, code-named Bull Mountain Technology, is the Intel name for the Intel® 64 and IA-32 Architectures instructions RDRAND and RDSEED and the underlying Digital Random Number Generator (DRNG) hardware implementation. Among other things, the DRNG using the RDRAND instruction is useful for generating high-quality keys for cryptographic protocols, and the RDSEED instruction is provided for seeding software-based pseudorandom number generators (PRNGs).

This Digital Random Number Generator Software Implementation Guide is intended to provide a complete source of technical information on RDRAND usage, including code examples. This document includes the following sections:

Section 2: Random Number Generator (RNG) Basics and Introduction to the DRNG. This section describes the nature of an RNG and its pseudo- (PRNG) and true- (TRNG) implementation variants, including modern cascade construction RNGs. We then present the DRNG's position within this broader taxonomy.

Section 3: DRNG Overview. In this section, we provide a technical overview of the DRNG, including its component architecture, robustness features, manner of access, performance, and power requirements.

Section 4: RDRAND and RDSEED Instruction Usage. This section provides reference information on the RDRAND and RDSEED instructions and code examples showing their use. This includes platform support verification and suggestions on DRNG-based libraries.

Programmers who already understand the nature of RNGs may refer directly to section 4 for instruction references and code examples. RNG newcomers who need some review of concepts to understand the nature and significance of the DRNG can refer to section 2. Nearly all developers will want to look at section 3, which provides a technical overview of the DRNG.

2. RNG Basics and Introduction to the DRNG

The Digital Random Number Generator, using the RDRAND instruction, is an innovative hardware approach to high-quality, high-performance entropy and random number generation. To understand how it differs from existing RNG solutions, this section details some of the basic concepts underlying random number generation.

2.1       Random Number Generators (RNGs)

An RNG is a utility or device of some type that produces a sequence of numbers on an interval [min, max] such that values appear unpredictable. Stated a little more technically, we are looking for the following characteristics:

  • Each new value must be statistically independent of the previous value. That is, given a generated sequence of values, no particular value is more likely than any other to appear as the next value in the RNG's random sequence.
  • The overall distribution of numbers chosen from the interval is uniformly distributed. In other words, all numbers are equally likely and none are more "popular" or appear more frequently within the RNG’s output than others.
  • The sequence is unpredictable. An attacker cannot guess some or all of the values in a generated sequence. Predictability may take the form of forward prediction (future values) and backtracking (past values).

Since computing systems are by nature deterministic, producing quality random numbers that have these properties (statistical independence, uniform distribution, and unpredictability) is much more difficult than it might seem. Sampling the seconds value from the system clock, a common approach, may seem random enough, but process scheduling and other system effects may result in some values occurring far more frequently than others. External entropy sources like the time between a user's keystrokes or mouse movements may likewise, upon further analysis, show that values do not distribute evenly across the space of all possible values; some values are more likely to occur than others, and certain values almost never occur in practice.

Beyond these requirements, some other desirable RNG properties include:

  • The RNG is fast in returning a value (i.e., low response time) and can service a large number of requests within a short time interval (i.e., highly scalable).
  • The RNG is secure against attackers who might observe or change its underlying state in order to predict or influence its output or otherwise interfere with its operation.

2.2       Pseudo-Random Number Generators (PRNGs)

One widely used approach for achieving good RNG statistical behavior is to leverage mathematical modeling in the creation of a Pseudo-Random Number Generator. A PRNG is a deterministic algorithm, typically implemented in software, that computes a sequence of numbers that "look" random. A PRNG requires a seed value that is used to initialize the state of the underlying model. Once seeded, it can then generate a sequence of numbers that exhibit good statistical behavior.

PRNGs exhibit periodicity that depends on the size of the PRNG's internal state model. That is, after generating a long sequence of numbers, all variations in internal state will be exhausted and the sequence of numbers to follow will repeat an earlier sequence. The best PRNG algorithms available today, however, have a period that is so large this weakness can practically be ignored. For example, the Mersenne Twister MT19937 PRNG with 32-bit word length has a periodicity of 2^19937 - 1. (1)

A key characteristic of all PRNGs is that they are deterministic. That is, given a particular seed value, the same PRNG will always produce the exact same sequence of "random" numbers. This is because a PRNG is computing the next value based upon a specific internal state and a specific, well-defined algorithm. Thus, while a generated sequence of values exhibits the statistical properties of randomness (independence, uniform distribution), the overall behavior of the PRNG is entirely predictable.

In some contexts, the deterministic nature of PRNGs is an advantage. For example, in some simulation and experimental contexts, researchers would like to compare the outcome of different approaches using the same sequence of input data. PRNGs provide a way to generate a long sequence of random data inputs that are repeatable by using the same PRNG, seeded with the same value.
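
For example, the standard C library's rand() illustrates this determinism: seeded identically, it produces identical output. This is a toy illustration only, not a recommendation to use rand() for anything security-related.

#include <stdio.h>
#include <stdlib.h>

/* Seeding a PRNG with the same value always reproduces the same "random"
 * sequence, which is what makes PRNGs useful for repeatable simulations. */
int main(void)
{
    for (int run = 0; run < 2; run++) {
        srand(42);                       /* same seed on every run */
        printf("run %d:", run);
        for (int i = 0; i < 5; i++)
            printf(" %d", rand());       /* identical values each run */
        printf("\n");
    }
    return 0;
}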

In other contexts, however, this determinism is highly undesirable. Consider a server application that generates random numbers to be used as cryptographic keys in data exchanges with client applications over secure communication channels. An attacker who knew the PRNG in use and also knew the seed value (or the algorithm used to obtain a seed value) would quickly be able to predict each and every key (random number) as it is generated. Even with a sophisticated and unknown seeding algorithm, an attacker who knows (or can guess) the PRNG in use can deduce the state of the PRNG by observing the sequence of output values. After a surprisingly small number of observations (e.g., 624 for Mersenne Twister MT19937), each and every subsequent value can be predicted. For this reason, PRNGs are considered to be cryptographically insecure.

PRNG researchers have worked to solve this problem by creating what are known as Cryptographically Secure PRNGs (CSPRNGs). Various techniques have been invented in this domain, for example, applying a cryptographic hash to a sequence of consecutive integers, using a block cipher to encrypt a sequence of consecutive integers ("counter mode"), and XORing a stream of PRNG-generated numbers with plaintext ("stream cipher"). Such approaches improve the problem of inferring a PRNG and its state by greatly increasing its computational complexity, but the resulting values may or may not exhibit the correct statistical properties (i.e., independence, uniform distribution) needed for a robust random number generator. Furthermore, an attacker could discover any deterministic algorithm by various means (e.g., disassemblers, sophisticated memory attacks, a disgruntled employee). Even more commonly, attackers may discover or infer PRNG seeding by narrowing its range of possible values or snooping memory in some manner. Once the deterministic algorithm and its seed are known, the attacker may be able to predict each and every random number generated, both past and future.

2.3       True Random Number Generators (TRNGs)

For contexts where the deterministic nature of PRNGs is a problem to be avoided (e.g., gaming and computer security), a better approach is that of True Random Number Generators.

Instead of using a mathematical model to deterministically generate numbers that look random and have the right statistical properties, a TRNG extracts randomness (entropy) from a physical source of some type and then uses it to generate random numbers. The physical source is also referred to as an entropy source and can be selected among a wide variety of physical phenomenon naturally available, or made available, to the computing system using the TRNG. For example, one can attempt to use the time between user key strokes or mouse movements as an entropy source. As pointed out earlier, this technique is crude in practice and resulting value sequences generally fail to meet desired statistical properties with rigor. What to use as an entropy source in a TRNG is a key challenge facing TRNG designers.

Beyond statistical rigor, it is also desirable for TRNGs to be fast and scalable (i.e., capable of generating a large number of random numbers within a small time interval). This poses a serious problem for many TRNGs because sampling an entropy source external to the computing system typically requires device I/O and long delay times relative to the processing speeds of today's computer systems. In general, sampling an entropy source in TRNGs is slow compared to the computation required by a PRNG to simply calculate its next random value. For this reason, PRNGs characteristically provide far better performance than TRNGs and are more scalable.

Unlike PRNGs, however, TRNGs are not deterministic. That is, a TRNG need not be seeded, and its selection of random values in any given sequence is highly unpredictable. As such, an attacker cannot use observations of a particular random number sequence to predict subsequent values in an effective way. This property also implies that TRNGs have no periodicity. While repeats in random sequence are possible (albeit unlikely), they cannot be predicted in a manner useful to an attacker.

2.4       Cascade Construction RNGs

A common approach used in modern operating systems (e.g., Linux* (2)) and cryptographic libraries is to take input from an entropy source in order to supply a buffer or pool of entropy (refer to Figure 1). This entropy pool is then used to provide nondeterministic random numbers that periodically seed a cryptographically secure PRNG (CSPRNG). This CSPRNG provides cryptographically secure random numbers that appear truly random and exhibit a well-defined level of computational attack resistance.

Figure 1. Cascade Construction Random Number Generator

A key advantage of this scheme is performance. It was noted above that sampling an entropy source is typically slow since it often involves device I/O of some type and often additional waiting for a real-time sampling event to transpire. In contrast, CSPRNG computations are fast since they are processor-based and avoid I/O and entropy source delays. This approach offers improved performance: a slow entropy source periodically seeding a fast CSPRNG capable of generating a large number of random values from a single seed.

While this approach would seem ideal, in practice it often falls far short. First, since the implementation is typically in software, it is vulnerable to a broad class of software attacks. For example, considerable state requirements create the potential for memory-based attacks or timing attacks. Second, the approach does not solve the problem of what entropy source to use. Without an external source of some type, entropy quality is likely to be poor. For example, sampling user events (e.g., mouse, keyboard) may be impossible if the system resides in a large data center. Even with an external entropy source, entropy sampling is likely to be slow, making seeding events less frequent than desired.

2.5       Introducing the Digital Random Number Generator (DRNG)

The Digital Random Number Generator (DRNG) is an innovative hardware approach to high-quality, high-performance entropy and random number generation. It is composed of the new Intel 64 Architecture instructions RDRAND and RDSEED and an underlying DRNG hardware implementation.

With respect to the RNG taxonomy discussed above, the DRNG follows the cascade construction RNG model, using a processor resident entropy source to repeatedly seed a hardware-implemented CSPRNG. Unlike software approaches, it includes a high-quality entropy source implementation that can be sampled quickly to repeatedly seed the CSPRNG with high-quality entropy. Furthermore, it represents a self-contained hardware module that is isolated from software attacks on its internal state. The result is a solution that achieves RNG objectives with considerable robustness: statistical quality (independence, uniform distribution), highly unpredictable random number sequences, high performance, and protection against attack.

This method of digital random number generation is unique in its approach to true random number generation in that it is implemented in the processor’s hardware and can be utilized through instructions added to the Intel 64 instruction set. As such, response times are comparable to those of competing PRNG approaches implemented in software. The approach is scalable enough for even demanding applications to use it as an exclusive source of random numbers and not merely a high quality seed for a software-based PRNG. Software running at all privilege levels can access random numbers through the instruction set, bypassing intermediate software stacks, libraries, or operating system handling.

The use of RDRAND and RDSEED leverages a variety of cryptographic standards to ensure the robustness of its implementation and to provide transparency in its manner of operation. These include NIST SP800-90A, B, and C, FIPS-140-2, and ANSI X9.82. Compliance with these standards makes the Digital Random Number Generator a viable solution for highly regulated application domains in government and commerce.

Section 3 describes digital random number generation in detail. Section 4 describes use of RDRAND and RDSEED, the Intel instruction set extensions for using the DRNG.

2.6       Applications for the Digital Random Number Generator

Information security is a key application that utilizes the DRNG. Cryptographic protocols rely on RNGs for generating keys and fresh session values (e.g., a nonce) to prevent replay attacks. In fact, a cryptographic protocol may have considerable robustness but suffer from widespread attack due to weak key generation methods underlying it (e.g., the Debian*/OpenSSL* fiasco (3)). The DRNG can be used to fix this weakness, thus significantly increasing cryptographic robustness.

Closely related are government and industry applications. Due to information sensitivity, many such applications must demonstrate their compliance with security standards like FISMA, HIPAA, PCIAA, etc. RDRAND has been engineered to meet existing security standards like NIST SP800-90, FIPS 140-2, and ANSI X9.82, and thus provides an underlying RNG solution that can be leveraged in demonstrating compliance with information security standards.

Other uses of the DRNG include:

  • Communication protocols
  • Monte Carlo simulations and scientific computing
  • Gaming applications
  • Bulk entropy applications like secure disk wiping or document shredding
  • Protecting online services against RNG attacks
  • Seeding software-based PRNGs of arbitrary width

3         DRNG Overview

In this section, we describe in some detail the components of the DRNG using the RDRAND and RDSEED instructions and their interaction.

3.1       Processor View

Figure 2 provides a high-level schematic of the RDRAND and RDSEED Random Number Generators. As shown, the DRNG appears as a hardware module on the processor. An interconnect bus connects it with each core.

Figure 2. Digital Random Number Generator design

The RDRAND and RDSEED instructions (detailed in section 4) are handled by microcode on each core. This includes an RNG microcode module that handles interactions with the DRNG hardware module on the processor.

3.2       Component Architecture

As shown in Figure 3, the DRNG can be thought of as three logical components forming an asynchronous production pipeline: an entropy source (ES) that produces random bits from a nondeterministic hardware process at around 3 Gbps, a conditioner that uses AES (4) in CBC-MAC (5) mode to distill the entropy into high-quality nondeterministic random numbers, and two parallel outputs:

  1. A deterministic random bit generator (DRBG) seeded from the conditioner.
  2. An enhanced, nondeterministic random number generator (ENRNG) that provides seeds from the entropy conditioner.

Note that the conditioner does not send the same seed values to both the DRBG and the ENRNG. This pathway can be thought of as an alternating switch, with one seed going to the DRBG and the next seed going to the ENRNG. This construction ensures that a software application can never obtain the value used to seed the DRBG, nor can it launch a Denial of Service (DoS) attack against the DRBG through repeated executions of the RDSEED instruction.

The conditioner can be equated to the entropy pool in the cascade construction RNG described previously. However, since it is fed a high-quality, high-speed, continuous stream of entropy that arrives faster than downstream processes can consume it, it does not need to maintain an entropy pool. Instead, it is always conditioning fresh entropy, independent of past and future entropy.

Figure 3. DRNG Component Architecture

The final two stages are:

  1. A hardware CSPRNG that is based on AES in CTR mode and is compliant with SP800-90A. In SP800-90A terminology, this is referred to as a DRBG (Deterministic Random Bit Generator), a term used throughout the remainder of this document.
  2. An ENRNG (Enhanced Non-deterministic Random Number Generator) that is compliant with SP800-90B and C.

 

3.2.1     Entropy Source (ES)

The all-digital Entropy Source (ES), also known as a non-deterministic random bit generator (NRBG), provides a serial stream of entropic data in the form of zeroes and ones.

The ES runs asynchronously on a self-timed circuit and uses thermal noise within the silicon to output a random stream of bits at the rate of 3 GHz. The ES needs no dedicated external power supply to run, instead using the same power supply as other core logic. The ES is designed to function properly over a wide range of operating conditions, exceeding the normal operating range of the processor.

Bits from the ES are passed to the conditioner for further processing.

3.2.2     Conditioner

The conditioner takes pairs of 256-bit raw entropy samples generated by the ES and reduces them to a single 256-bit conditioned entropy sample using AES-CBC-MAC. This has the effect of distilling the entropy into more concentrated samples.

AES, Advanced Encryption Standard, is defined in the FIPS-197 Advanced Encryption Standard (4). CBC-MAC, Cipher Block Chaining - Message Authentication Code, is defined in NIST SP 800-38A Recommendation for Block Cipher Modes of Operation (5).

The conditioned entropy is output as a 256-bit value and passed to the next stage in the pipeline to be used as a DRBG seed value.

3.2.3     Deterministic Random Bit Generator (DRBG)

The role of the deterministic random bit generator (DRBG) is to "spread" a conditioned entropy sample into a large set of random values, thus increasing the number of random values made available by the hardware module. This is done by employing a standards-compliant DRBG and continuously reseeding it with the conditioned entropy samples.

The DRBG chosen for this function is the CTR_DRBG defined in section 10.2.1 of NIST SP 800-90A (6), using the AES block cipher. Values that are produced fill a FIFO output buffer that is then used in responding to RDRAND requests for random numbers.

The DRBG autonomously decides when it needs to be reseeded to refresh the random number pool in the buffer and is both unpredictable and transparent to the RDRAND caller. An upper bound of 511 128-bit samples will be generated per seed. That is, no more than 511*2=1022 sequential DRNG random numbers will be generated from the same seed value.

3.2.4     Enhanced Non-deterministic Random Number Generator

The role of the enhanced non-deterministic random number generator is to make conditioned entropy samples directly available to software for use as seeds to other software-based DRBGs. Values coming out of the ENRNG have multiplicative brute-force prediction resistance, which means that samples can be concatenated and the brute-force prediction resistance will scale with them. When two 64-bit samples are concatenated together, the resulting 128-bit value will have 128 bits of brute-force prediction resistance (2^64 * 2^64 = 2^128). This operation can be repeated indefinitely and can be used to easily produce random seeds of arbitrary size. Because of this property, these values can be used to seed a DRBG of any size.
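
As a brief illustration of this property (a sketch only; RDSEED usage and code examples are covered in section 4), a seed of arbitrary width can be assembled by concatenating successive 64-bit RDSEED values:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Build a seed of arbitrary width by concatenating 64-bit RDSEED values.
 * Per the text above, ENRNG samples may be safely concatenated and their
 * brute-force prediction resistance scales with the total length.
 * The retry/pause policy here is illustrative. Compile with -mrdseed. */
int rdseed_fill(uint64_t *seed, size_t nwords, unsigned int retries)
{
    for (size_t i = 0; i < nwords; i++) {
        unsigned long long v;
        unsigned int tries = retries;

        /* _rdseed64_step() returns the carry flag; 0 means no seed-grade
         * value was available yet, so pause briefly and try again. */
        while (!_rdseed64_step(&v)) {
            if (tries-- == 0)
                return 0;
            _mm_pause();
        }
        seed[i] = v;
    }
    return 1;
}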

3.2.5     Robustness and Self-Validation

To ensure the DRNG functions with a high degree of reliability and robustness, validation features have been included that operate in an ongoing manner as well as at system startup. These are the DRNG Online Health Tests (OHTs) and Built-In Self Tests (BISTs), respectively. Both are shown in Figure 4.

Figure 4. DRNG Self-Validation Components

3.2.6     Online Health Tests (OHTs)

Online Health Tests (OHTs) are designed to measure the quality of entropy generated by the ES using both per sample and sliding window statistical tests in hardware.

Per sample tests compare bit patterns against expected pattern arrival distributions as specified by a mathematical model of the ES. An ES sample that fails this test is marked "unhealthy." Using this distinction, the conditioner can ensure that at least two healthy samples are mixed into each seed. This defends against hardware attacks that might seek to reduce the entropic content of the ES output.

Sliding window tests look at sample health across many samples to verify they remain above a required threshold. The sliding window size is large (65536 bits) and mechanisms ensure that the ES is operating correctly overall before it issues random numbers. In the rare event that the DRNG fails during runtime, it would cease to issue random numbers rather than issue poor quality random numbers.

3.2.7     Built-In Self Tests (BISTs)

Built-In Self Tests (BISTs) are designed to verify the health of the ES prior to making the DRNG available to software. These include Entropy Source Tests (ES-BIST) that are statistical in nature and comprehensive test coverage of all the DRNG’s deterministic downstream logic through BIST Known Answer Tests (KAT-BIST).

ES-BIST involves running the DRNG for a probationary period in its normal mode before making the DRNG available to software. This allows the OHTs to examine ES sample health for a full sliding window (256 samples) before concluding that ES operation is healthy. It also fills the sliding window sample pipeline to ensure the health of subsequent ES samples, seeds the PRNG, and fills the output queue of the DRNG with random numbers.

KAT-BIST tests both OHT and end-to-end correctness using deterministic input and output validation. First, various bit stream samples are input to the OHT, including a number with poor statistical quality. Samples cover a wide range of statistical properties and test whether the OHT logic correctly identifies those that are "unhealthy." During the KAT-BIST phase, deterministic random numbers are output continuously from the end of the pipeline. The BIST Output Test Logic verifies that the expected outputs are received.

If there is a BIST failure during startup, the DRNG will not issue random numbers and will issue a BIST failure notification to the on-processor test circuitry. This BIST logic avoids the need for conventional on-processor test mechanisms (e.g., scan and JTAG) that could undermine the security of the DRNG.

3.3       Instructions

Software access to the DRNG is through the RDRAND and RDSEED instructions, documented in Chapter 3 of (7).

3.3.1     RDRAND

RDRAND retrieves a hardware-generated random value from the SP800-90A compliant DRBG and stores it in the destination register given as an argument to the instruction. The size of the random value (16-, 32-, or 64-bits) is determined by the size of the register given. The carry flag (CF) must be checked to determine whether a random value was available at the time of instruction execution.

Note that RDRAND is available to any system or application software running on the platform. That is, there are no hardware ring requirements that restrict access based on process privilege level. As such, RDRAND may be invoked as part of an operating system or hypervisor system library, a shared software library, or directly by an application.

To determine programmatically whether a given Intel platform supports RDRAND, developers can use the CPUID instruction to examine bit 30 of the ECX register. See Reference (7) for details.

3.3.2     RDSEED

RDSEED retrieves a hardware-generated random seed value from the SP800-90B and C compliant ENRNG and stores it in the destination register given as an argument to the instruction. Like the RDRAND instruction, the size of the random value is determined by the size of the given register, and the carry flag (CF) must be checked to determine whether or not a random seed was available at the time the instruction was executed.

Also like RDRAND, there are no hardware ring requirements that restrict access to RDSEED based on process privilege level.

To determine programmatically whether a given Intel platform supports the RDSEED instruction, developers can use the CPUID instruction to examine bit 18 of the EBX register. See Reference (8) for details.

3.4       DRNG Performance

Designed to be a high performance entropy resource shared between multiple cores/threads, the Digital Random Number Generator represents a new generation of RNG performance.

The DRNG is implemented in hardware as part of the Intel processor. As such, both the entropy source and the DRBG execute at processor clock speeds. Unlike other hardware-based solutions, there is no system I/O required to obtain entropy samples and no off-processor bus latencies to slow entropy transfer or create bottlenecks when multiple requests have been issued.

Random values are delivered directly through instruction level requests (RDRAND and RDSEED). This bypasses both operating system and software library handling of the request. The DRNG is scalable enough to support heavy server application workloads. Within the context of virtualization, the DRNG's stateless design and atomic instruction access mean that RDRAND and RDSEED can be used freely by multiple VMs without the need for hypervisor intervention or resource management.

3.4.1     RDRAND Performance

In current-generation Intel processors the DRBG runs on a self-timed circuit clocked at 800 MHz and can service a RDRAND transaction (1 Tx) every 8 clocks for a maximum of 100 MTx per second. A transaction can be for a 16-, 32-, or 64-bit RDRAND, and the greatest throughput is achieved with 64-bit RDRANDs, capping the throughput ceiling at 800 MB/sec. These limits are an upper bound on all hardware threads across all cores on the CPU.

Single thread performance is limited by the instruction latencies imposed by the bus infrastructure, which is also impacted in part by clock speed. On real-world systems, a single thread executing RDRAND continuously may see throughputs ranging from 70 to 200 MB/sec, depending on the CPU architecture.

If multiple threads are invoking RDRAND simultaneously, total RDRAND throughput (across all threads) scales approximately linearly with the number of threads until no more hardware threads remain, the bus limits of the processor are reached, or the DRNG interface is fully saturated. Past this point, the maximum throughput is divided equally among the active threads. No threads get starved.

Figure 5 shows the multithreaded RDRAND throughput plotted as a ratio to single thread throughput for six different CPU architectures. The dotted line represents linear scaling. This shows that total RDRAND throughput scales nearly linearly with the number of active threads on the CPU, prior to reaching saturation.

Figure 5. Multithreaded RDRAND throughput scaling

Figure 6 shows the multithreaded performance of a single system, also as a ratio, up to saturation and beyond. As in Figure 5, total throughput scales nearly linearly until saturation, at which point it reaches a steady state.

Figure 6. RDRAND throughput past saturation

Results have been estimated based on internal Intel® analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

3.5       Power Requirements

The DRNG hardware resides on the processor and, therefore, does not need a dedicated power supply to run. Instead, it simply uses the processor's local power supply. As described in section 3.2.1, the hardware is designed to function across a range of process voltage and temperature (PVT) levels, exceeding the normal operating range of the processor.

The DRNG hardware does not impact power management mechanisms and algorithms associated with individual cores. For example, ACPI-based mechanisms for regulating processor performance states (P-states) and processor idle states (C-states) on a per core basis are unaffected.

To save power, the DRNG clock gates itself off when queues are full. This idle-based mechanism results in negligible power requirements whenever entropy computation and post processing are not needed.

4         Instruction Usage

In this section, we provide instruction references for RDRAND and RDSEED and usage examples for programmers. All code examples in this guide are licensed under the new, 3-clause BSD license, making them freely usable within nearly any software context.

For additional details on RDRAND usage and code examples, see Reference (7).

4.1       Determining Processor Support for RDRAND and RDSEED

Before using the RDRAND or RDSEED instructions, an application or library should first determine whether the underlying platform supports the instruction, and hence includes the underlying DRNG feature. This can be done using the CPUID instruction. In general, CPUID is used to return processor identification and feature information stored in the EAX, EBX, ECX, and EDX registers. For detailed information on CPUID, refer to References (7) and (8).

To be specific, support for RDRAND can be determined by examining bit 30 of the ECX register returned by CPUID, and support for RDSEED can be determined by examining bit 18 of the EBX register. As shown in Table 1 (below) and in Reference (7), a value of 1 indicates processor support for the instruction while a value of 0 indicates no processor support.

Table 1. Feature information returned by CPUID

Leaf   Register   Bit   Mnemonic   Description
1      ECX        30    RDRAND     A value of 1 indicates that the processor supports the RDRAND instruction
7      EBX        18    RDSEED     A value of 1 indicates that the processor supports the RDSEED instruction

There are two options for invoking the CPUID instruction from a high-level programming language like C or C++:

  • An inline assembly routine
  • An assembly routine defined in an independent file.

The advantage of inline assembly is that it is straightforward and easily readable within its source code context. The disadvantage, however, is that conditional code is often needed to handle the possibility of different underlying platforms, which can quickly compromise readability. This guide describes a Linux implementation that should also work on OS X*. Please see the DRNG downloads for Windows* examples.

Code Example 1 shows the definition of the function get_drng_support for gcc compilation on 64-bit Linux. This is included in the source code module drng.c that is included in the DRNG samples source code download that accompanies this guide.

/* These are bits that are OR’d together */

#define DRNG_NO_SUPPORT	0x0	/* For clarity */
#define DRNG_HAS_RDRAND	0x1
#define DRNG_HAS_RDSEED	0x2

int get_drng_support ()
{
	static int drng_features= -1;

	/* So we don't call cpuid multiple times for
	 * the same information */

	if ( drng_features == -1 ) {
		drng_features= DRNG_NO_SUPPORT;

		if ( _is_intel_cpu() ) {
			cpuid_t info;

			cpuid(&info, 1, 0);

			if ( (info.ecx & 0x40000000) == 0x40000000 ) {
				drng_features|= DRNG_HAS_RDRAND;
			}

			cpuid(&info, 7, 0);

			if ( (info.ebx & 0x40000) == 0x40000 ) {
				drng_features|= DRNG_HAS_RDSEED;
			}
		}
	}

	return drng_features;
}

Code Example 1. Determining support for RDRAND and RDSEED on 64-bit Linux*

This function first determines if the processor is an Intel CPU by calling into the _is_intel_cpu() function, which is defined in Code Example 2. If it is, the function then checks the feature bits using the CPUID instruction to determine instruction support.

typedef struct cpuid_struct {
	unsigned int eax;
	unsigned int ebx;
	unsigned int ecx;
	unsigned int edx;
} cpuid_t;

int _is_intel_cpu ()
{
	static int intel_cpu= -1;
	cpuid_t info;

	if ( intel_cpu == -1 ) {
		cpuid(&info, 0, 0);

		if (
			memcmp((char *) &info.ebx, "Genu", 4) ||
			memcmp((char *) &info.edx, "ineI", 4) ||
			memcmp((char *) &info.ecx, "ntel", 4)
		) {
			intel_cpu= 0;
		} else {
			intel_cpu= 1;
		}
	}

	return intel_cpu;
}

void cpuid (cpuid_t *info, unsigned int leaf, unsigned int subleaf)
{
	asm volatile("cpuid"
	: "=a" (info->eax), "=b" (info->ebx), "=c" (info->ecx), "=d" (info->edx)
	: "a" (leaf), "c" (subleaf)
	);
}

Code Example 2. Calling CPUID on 64-bit Linux

The CPUID instruction is run using inline assembly via the cpuid() function. It is declared as “volatile” as a precautionary measure, to prevent the compiler from applying optimizations that might interfere with its execution.
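
As an alternative to hand-written inline assembly, the same checks can be expressed with the <cpuid.h> header shipped with gcc and compatible compilers. The sketch below is illustrative and not part of the sample code download; it reuses the DRNG_HAS_* flags from Code Example 1 and omits the vendor-string check performed by _is_intel_cpu().

#include <cpuid.h>

/* Illustrative alternative to the inline-assembly cpuid() wrapper above. */

int get_drng_support_via_cpuid_h ()
{
	unsigned int eax, ebx, ecx, edx;
	int features= DRNG_NO_SUPPORT;

	/* Leaf 1, ECX bit 30: RDRAND */

	__cpuid_count(1, 0, eax, ebx, ecx, edx);
	if ( ecx & (1u << 30) ) features|= DRNG_HAS_RDRAND;

	/* Leaf 7, subleaf 0, EBX bit 18: RDSEED */

	__cpuid_count(7, 0, eax, ebx, ecx, edx);
	if ( ebx & (1u << 18) ) features|= DRNG_HAS_RDSEED;

	return features;
}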

4.2       Using RDRAND to Obtain Random Values

Once support for RDRAND has been verified using CPUID, the RDRAND instruction can be invoked to obtain a 16-, 32-, or 64-bit random integer value. Note that this instruction is available at all privilege levels on the processor, so system software and application software alike may invoke RDRAND freely.

Reference (7) provides a table describing RDRAND instruction usage as follows:

Table 2. RDRAND instruction reference and operand encoding

Opcode/Instruction            Op/En   64/32-bit Mode Support   CPUID Feature Flag   Description
0F C7 /6 RDRAND r16           A       V/V                      RDRAND               Read a 16-bit random number and store in the destination register.
0F C7 /6 RDRAND r32           A       V/V                      RDRAND               Read a 32-bit random number and store in the destination register.
REX.W + 0F C7 /6 RDRAND r64   A       V/I                      RDRAND               Read a 64-bit random number and store in the destination register.

Op/En   Operand 1       Operand 2   Operand 3   Operand 4
A       ModRM:r/m (w)   NA          NA          NA

Essentially, developers invoke this instruction with a single operand: the destination register where the random value will be stored. Note that this register must be a general purpose register, and the size of the register (16, 32, or 64 bits) will determine the size of the random value returned.

After invoking the RDRAND instruction, the caller must examine the carry flag (CF) to determine whether a random value was available at the time the RDRAND instruction was executed. As Table 3 shows, a value of 1 indicates that a random value was available and placed in the destination register provided in the invocation. A value of 0 indicates that a random value was not available. In current architectures the destination register will also be zeroed as a side effect of this condition.

Note that a destination register value of zero should not be used as an indicator of random value availability. The CF is the sole indicator of the success or failure of the RDRAND instruction.

Table 3. Carry Flag (CF) outcome semantics

Carry Flag Value   Outcome
CF = 1             Destination register valid. Non-zero random value available at time of execution. Result placed in register.
CF = 0             Destination register all zeroes. Random value not available at time of execution. May be retried.

4.2.1     Retry Recommendations

It is recommended that applications attempt 10 retries in a tight loop in the unlikely event that the RDRAND instruction does not return a random number. This number is based on a binomial probability argument: given the design margins of the DRNG, the odds of ten failures in a row are astronomically small and would in fact be an indication of a larger CPU issue.

4.2.2     Simple RDRAND Invocation

The unlikely possibility that a random value may not be available at the time of RDRAND instruction invocation has significant implications for system or application API definition. While many random functions are defined quite simply in the form:

unsigned int GetRandom()

use of RDRAND requires wrapper functions that appropriately manage the possible outcomes based on the CF flag value.

One handling approach is to simply pass the instruction outcome directly back to the invoking routine. A function signature for such an approach may take the form:

int rdrand(unsigned int *therand)

Here, the return value of the function acts as a flag indicating to the caller the outcome of the RDRAND instruction invocation. If the return value is 1, the variable passed by reference will be populated with a usable random value. If the return value is 0, the caller understands that the value assigned to the variable is not usable. The advantage of this approach is that it gives the caller the option to decide how to proceed based on the outcome of the call.

Code Example 3 shows this implemented for 16-, 32-, and 64-bit invocations of RDRAND using inline assembly.

/* Fixed-width integer types (uint16_t, uint32_t, uint64_t) used below */
#include <stdint.h>

/* Retry limit recommended in section 4.2.1; used by the later examples */
#define RDRAND_RETRIES	10

int rdrand16_step (uint16_t *rand)
{
	unsigned char ok;

	asm volatile ("rdrand %0; setc %1"
		: "=r" (*rand), "=qm" (ok));

	return (int) ok;
}

int rdrand32_step (uint32_t *rand)
{
	unsigned char ok;

	asm volatile ("rdrand %0; setc %1"
		: "=r" (*rand), "=qm" (ok));

	return (int) ok;
}

int rdrand64_step (uint64_t *rand)
{
	unsigned char ok;

	asm volatile ("rdrand %0; setc %1"
		: "=r" (*rand), "=qm" (ok));

	return (int) ok;
}

Code Example 3. Simple RDRAND invocations for 16-bit, 32-bit, and 64-bit values
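
For illustration, a caller might use one of these step functions as follows; the variable name is arbitrary, and the retry handling described in the next section is typically layered on top:

uint64_t rand64;

if ( rdrand64_step(&rand64) ) {
	/* rand64 now holds a usable 64-bit random value */
} else {
	/* No value was available at execution time; retry or report an error */
}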

4.2.3     RDRAND Retry Loop

An alternate approach if random values are unavailable at the time of RDRAND execution is to use a retry loop. In this approach, an additional argument allows the caller to specify the maximum number of retries before returning a failure value. Once again, the success or failure of the function is indicated by its return value and the actual random value, assuming success, is passed to the caller by a reference variable.

Code Example 4 shows an implementation of RDRAND invocations with a retry loop.

int rdrand16_retry (unsigned int retries, uint16_t *rand)
{
	unsigned int count= 0;

	while ( count <= retries ) {
		if ( rdrand16_step(rand) ) {
			return 1;
		}

		++count;
	}

	return 0;
}

int rdrand32_retry (unsigned int retries, uint32_t *rand)
{
	unsigned int count= 0;

	while ( count <= retries ) {
		if ( rdrand32_step(rand) ) {
			return 1;
		}

		++count;
	}

	return 0;
}

int rdrand64_retry (unsigned int retries, uint64_t *rand)
{
	unsigned int count= 0;

	while ( count <= retries ) {
		if ( rdrand64_step(rand) ) {
			return 1;
		}

		++count;
	}

	return 0;
}

Code Example 4. RDRAND invocations with a retry loop

4.2.4     Initializing Data Objects of Arbitrary Size

A common function within RNG libraries is shown below:

int rdrand_get_bytes(unsigned int n, unsigned char *dest)

In this function, a data object of arbitrary size is initialized with random bytes. The size is specified by the variable n, and the data object is passed in as a pointer to unsigned char or void.

Implementing this function requires a loop control structure and iterative calls to the rdrand64_step() or rdrand32_step() functions shown previously. To simplify, let's first consider populating an array of unsigned int with random values in this manner using rdrand32_step().

unsigned int rdrand_get_n_uints (unsigned int n, unsigned int *dest)
{
	unsigned int i;
	uint32_t *lptr= (uint32_t *) dest;

	for (i= 0; i< n; ++i, ++lptr) {
		if ( ! rdrand32_step(lptr) ) {
			return i;
		}
	}

	return n;
}

Code Example 5. Initializing an array of 32-bit integers

The function returns the number of unsigned int values assigned. The caller would check this value against the number requested to determine whether assignment was successful. Other implementations are possible, for example, using a retry loop to handle the unlikely possibility of random number unavailability.
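
For example, a hypothetical caller requesting 16 values might verify the count like this:

unsigned int values[16];

if ( rdrand_get_n_uints(16, values) != 16 ) {
	/* Fewer values than requested were assigned; handle the shortfall */
}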

In the next example, we cut the number of RDRAND calls in half by using rdrand64_step() instead of rdrand32_step().

unsigned int rdrand_get_n_uints (unsigned int n, unsigned int *dest)
{
	unsigned int i;
	uint64_t *qptr= (uint64_t *) dest;
	unsigned int total_uints= 0;
	unsigned int qwords= n/2;

	for (i= 0; i< qwords; ++i, ++qptr) {
		if ( rdrand64_retry(RDRAND_RETRIES, qptr) ) {
			total_uints+= 2;
		} else {
			return total_uints;
		}
	}

	/* Fill the residual */

	if ( n%2 ) {
		uint32_t *uptr= (uint32_t *) qptr;

		if ( rdrand32_step(uptr) ) {
			++total_uints;
		}
	}

	return total_uints;
}

Code Example 6. Initializing an array of unsigned integers using 64-bit RDRAND invocations

Finally, we show how a loop control structure and rdrand64_step() can be used to populate a byte array with random values.

unsigned int rdrand_get_bytes (unsigned int n, unsigned char *dest)
{
	unsigned char *headstart, *tailstart;
	uint64_t *blockstart;
	unsigned int count, ltail, lhead, lblock;
	uint64_t i, temprand;

	/* Get the address of the first 64-bit aligned block in the
	 * destination buffer. */

	headstart= dest;
	if ( ( (uint64_t)headstart % (uint64_t)8 ) == 0 ) {

		blockstart= (uint64_t *)headstart;
		lhead= 0;
	} else {
		blockstart= (uint64_t *)
			( ((uint64_t)headstart & ~(uint64_t)7) + (uint64_t)8 );

		lhead= (unsigned int) ( (uint64_t)blockstart - (uint64_t)headstart );

		/* Guard against a buffer too small to reach the aligned block */

		if ( lhead > n ) lhead= n;
	}

	/* Compute the number of bytes covered by whole 64-bit blocks and the
	 * remaining number of bytes (the tail) */

	lblock= ((n - lhead)/8)*8;
	ltail= n - lblock - lhead;
	count= lblock/8;	/* The number of 64-bit rands needed */

	if ( ltail ) {
		tailstart= (unsigned char *)( (uint64_t) blockstart + (uint64_t) lblock );
	}

	/* Populate the starting, mis-aligned section (the head) */

	if ( lhead ) {
		if ( ! rdrand64_retry(RDRAND_RETRIES, &temprand) ) {
			return 0;
		}

		memcpy(headstart, &temprand, lhead);
	}

	/* Populate the central, aligned block */

	for (i= 0; i< count; ++i, ++blockstart) {
		if ( ! rdrand64_retry(RDRAND_RETRIES, blockstart) ) {
			return i*8+lhead;
		}
	}

	/* Populate the tail */

	if ( ltail ) {
		if ( ! rdrand64_retry(RDRAND_RETRIES, &temprand) ) {
			return count*8+lhead;
		}

		memcpy(tailstart, &temprand, ltail);
	}

	return n;
}

Code Example 7. Initializing a byte array of arbitrary size using RDRAND

4.2.5     Guaranteeing DRBG Reseeding

As a high-performance source of random numbers, the DRNG is both fast and scalable. It is directly usable as a sole source of random values underlying an application or operating system RNG library. Still, some software vendors will want to use the DRNG to seed and reseed their current software PRNG in an ongoing manner. Some may feel it necessary, for standards compliance, to demand an absolute guarantee that values returned by RDRAND reflect independent entropy samples within the DRNG.

As described in section 3.2.3, the DRNG uses a deterministic random bit generator, or DRBG, to "spread" a conditioned entropy sample into a large set of random values, thus increasing the number of random numbers made available by the hardware module. The DRBG autonomously decides when it needs to be reseeded, behaving in a way that is unpredictable and transparent to the RDRAND caller. There is an upper bound of 511 samples per seed in the implementation, where samples are 128 bits in size and can provide two 64-bit random numbers each. In practice, the DRBG is reseeded frequently, and it is generally the case that reseeding occurs long before the maximum number of samples can be requested by RDRAND.

There are two approaches to structuring RDRAND invocations such that DRBG reseeding can be guaranteed:

  • Iteratively execute RDRAND beyond the DRBG upper bound by executing more than 1022 64-bit RDRANDs
  • Iteratively execute 32 RDRAND invocations with a 10 us wait period per iteration.

The latter approach has the effect of forcing a reseeding event since the DRBG aggressively reseeds during idle periods.
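
The following is a minimal sketch of the second approach, assuming the rdrand64_step() function from Code Example 3 and using usleep() as one possible way to implement the 10 us wait; it is illustrative and not part of the sample code download.

#include <stdint.h>
#include <unistd.h>	/* usleep() */

/* Force at least one DRBG reseed by spacing out 32 RDRAND invocations
 * with a 10 us idle period between them. Returns 1 on success, 0 if any
 * RDRAND invocation failed to return a value. */

int rdrand_force_reseed_by_idle ()
{
	int i;
	uint64_t discard;

	for (i= 0; i< 32; ++i) {
		if ( ! rdrand64_step(&discard) ) {
			return 0;
		}

		usleep(10);	/* 10 microsecond wait per iteration */
	}

	return 1;
}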

4.2.6     Generating Seeds from RDRAND

Processors that do not support the RDSEED instruction can leverage the reseeding guarantee of the DRBG to generate random seeds from values obtained via RDRAND.

The program below takes the first approach to guaranteed reseeding (generating 512 128-bit random numbers) and mixes the intermediate values together using the CBC-MAC mode of AES. This method of turning 512 128-bit samples from the DRNG into a 128-bit seed value is sometimes referred to as the "512:1 data reduction" and results in a random value that is fully forward and backward prediction resistant, suitable for seeding a NIST SP800-90 compliant, FIPS 140-2 certifiable, software DRBG.

This program relies on libgcrypt from the GNU Project for the encryption routines.

#include "drng.h"
#include <stdio.h>
#include <string.h>
#include <gcrypt.h>

#define AES_BLOCK_SIZE	16	/* AES uses 128-bit blocks (16 bytes) */
#define AES_KEY_SIZE	16	/* AES with 128-bit key (AES-128) */
#define	RDRAND_SAMPLES	512	/* the DRNG reseeds after generating 511
				 * 128-bit (16-byte) values */
#define BUFFER_SIZE		16*RDRAND_SAMPLES

#define MIN_GCRYPT_VERSION "1.0.0"

int main (int argc, char *argv[])
{
	unsigned char rbuffer[BUFFER_SIZE];
	unsigned char aes_key[AES_KEY_SIZE];
	unsigned char aes_iv[AES_KEY_SIZE];
	unsigned char seed[16];
	static gcry_cipher_hd_t gcry_cipher_hd;
	gcry_error_t gcry_error;

	if ( ! ( get_drng_support() & DRNG_HAS_RDRAND ) ) {
		fprintf(stderr, "No RDRAND supportn");
		return 1;
	}

	/* Generate a random AES key */

	if ( rdrand_get_bytes(AES_KEY_SIZE, aes_key) < AES_KEY_SIZE ) {
		fprintf(stderr, "Random numbers not availablen");
		return 1;
	}

	/* Generate a random IV */

	if ( rdrand_get_bytes(AES_BLOCK_SIZE, aes_iv) < AES_BLOCK_SIZE ) {
		fprintf(stderr, "Random numbers not availablen");
		return 1;
	}

	/*
	 * Fill our buffer with 512 128-bit rdrands. This
	 * guarantees that /at least/ one reseed takes place.
	 */

	if ( rdrand_get_bytes(BUFFER_SIZE, rbuffer) < BUFFER_SIZE ) {
		fprintf(stderr, "Random numbers not availablen");
		return 1;
	}

	/* Initialize the cryptographic library */

	if (!gcry_check_version(MIN_GCRYPT_VERSION)) {
		fprintf(stderr,
			"gcry_check_version: have version %s, need version %s or newer",
			gcry_check_version(NULL), MIN_GCRYPT_VERSION
		);

		return 1;
	}

	gcry_error= gcry_cipher_open(&gcry_cipher_hd, GCRY_CIPHER_AES128,
		GCRY_CIPHER_MODE_CBC, 0);
	if ( gcry_error ) {
		fprintf(stderr, "gcry_cipher_open: %s", gcry_strerror(gcry_error));
		return 1;
	}

	gcry_error= gcry_cipher_setkey(gcry_cipher_hd, aes_key, AES_KEY_SIZE);
	if ( gcry_error ) {
		fprintf(stderr, "gcry_cipher_setkey: %s", gcry_strerror(gcry_error));
		gcry_cipher_close(gcry_cipher_hd);
		return 1;
	}

	gcry_error= gcry_cipher_setiv(gcry_cipher_hd, aes_iv, AES_BLOCK_SIZE);
	if ( gcry_error ) {
		fprintf(stderr, "gcry_cipher_setiv: %s", gcry_strerror(gcry_error));
		gcry_cipher_close(gcry_cipher_hd);
		return 1;
	}

	/*
	 * Do the encryption in-place. This has the nice side effect of
	 * erasing the original values.
	 */

	gcry_error= gcry_cipher_encrypt(gcry_cipher_hd, rbuffer, BUFFER_SIZE,
		NULL, 0);
	if ( gcry_error ) {
		fprintf(stderr, "gcry_cipher_encrypt: %sn",
			gcry_strerror(gcry_error));
		return 1;
	}

	gcry_cipher_close(gcry_cipher_hd);

	/* The last block of the cipher text is the MAC, and our seed value. */

	memcpy(seed, &rbuffer[BUFFER_SIZE-16], 16);

	/* 'seed' now holds the 128-bit value that can be used to seed a
	 * software DRBG */

	return 0;
}

Code Example 8. Generating random seeds from RDRAND

4.3       Using RDSEED to Obtain Random Seeds

Once support for RDSEED has been verified using CPUID, the RDSEED instruction can be used to obtain a 16-, 32-, or 64-bit random integer value. Again, this instruction is available at all privilege levels on the processor, so system software and application software alike may invoke RDSEED freely.

The RDSEED instruction is documented in (9). Usage is as follows:

Table 4. RDSEED instruction reference and operand encoding

Opcode/Instruction            Op/En   64/32-bit Mode Support   CPUID Feature Flag   Description
0F C7 /7 RDSEED r16           A       V/V                      RDSEED               Read a 16-bit random number and store in the destination register.
0F C7 /7 RDSEED r32           A       V/V                      RDSEED               Read a 32-bit random number and store in the destination register.
REX.W + 0F C7 /7 RDSEED r64   A       V/I                      RDSEED               Read a 64-bit random number and store in the destination register.

Op/En   Operand 1       Operand 2   Operand 3   Operand 4
A       ModRM:r/m (w)   NA          NA          NA

As with RDRAND, developers invoke the RDSEED instruction with the destination register where the random seed will be stored. This register must be a general purpose one whose size determines the size of the random seed that is returned.

After invoking the RDSEED instruction, the caller must examine the carry flag (CF) to determine whether a random seed was available at the time the RDSEED instruction was executed. As shown in Table 5, a value of 1 indicates that a random seed was available and placed in the destination register provided in the invocation. A value of 0 indicates that a random seed was not available. In current architectures the destination register will also be zeroed as a side effect of this condition.

Again, a destination register value of zero should not be used as an indicator of random seed availability. The CF is the sole indicator of the success or failure of the RDSEED instruction.

Table 5. Carry Flag (CF) outcome semantics

Carry Flag Value   Outcome
CF = 1             Destination register valid. Non-zero random seed available at time of execution. Result placed in register.
CF = 0             Destination register all zeroes. Random seed not available at time of execution. May be retried.

4.3.1     Retry Recommendations

Unlike the RDRAND instruction, RDSEED returns seed values that come directly from the entropy conditioner, and it is possible for callers to invoke RDSEED faster than those values are generated. This means that applications must be designed robustly and be prepared for calls to RDSEED to fail because seeds are not available (CF=0).

If only one thread is calling RDSEED infrequently, it is very unlikely that a random seed will not be available. Only during periods of heavy demand, such as when one thread is calling RDSEED in rapid succession or multiple threads are calling RDSEED simultaneously, are underflows likely to occur. Because the RDSEED instruction does not have a fairness mechanism built into it, however, there are no guarantees as to how often a thread should retry the instruction, or how many retries might be needed, in order to obtain a random seed. In practice, this depends on the number of hardware threads on the CPU and how aggressively they are calling RDSEED.

Since there is no simple procedure for retrying the instruction to obtain a random seed, follow these basic guidelines.

4.3.1.1     Synchronous applications

If the application is not latency-sensitive, then it can simply retry the RDSEED instruction indefinitely, though it is recommended that a PAUSE instruction be placed in the retry loop. In the worst-case scenario, where multiple threads are invoking RDSEED continually, the delays can be long, but the longer the delay, the more likely (with an exponentially increasing probability) that the instruction will return a result.

If the application is latency-sensitive, then applications should either sleep or fall back to generating seed values from RDRAND.

4.3.1.2     Asynchronous applications

The application should be prepared to give up on RDSEED after a small number of retries, where "small" is somewhere between 1 and 100, depending on the application's sensitivity to delays. As with synchronous applications, it is recommended that a PAUSE instruction be inserted into the retry loop.

Applications needing a more aggressive approach can alternate between RDSEED and RDRAND, pulling seeds from RDSEED as they are available and filling a RDRAND buffer for future 512:1 reduction when they are not.

4.3.2     Simple RDSEED Invocation

Code Example 9 shows inline assembly implementations for 16-, 32-, and 64-bit invocations of RDSEED.

int rdseed16_step (uint16_t *seed)
{
	unsigned char ok;

	asm volatile ("rdseed %0; setc %1"
		: "=r" (*seed), "=qm" (ok));

	return (int) ok;
}

int rdseed32_step (uint32_t *seed)
{
	unsigned char ok;

	asm volatile ("rdseed %0; setc %1"
		: "=r" (*seed), "=qm" (ok));

	return (int) ok;
}

int rdseed64_step (uint64_t *seed)
{
	unsigned char ok;

	asm volatile ("rdseed %0; setc %1"
		: "=r" (*seed), "=qm" (ok));

	return (int) ok;
}

Code Example 9. Simple RDSEED invocations for 16-bit, 32-bit, and 64-bit values
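
As a sketch of the retry guidance in section 4.3.1 (not part of the sample code download), the wrapper below retries RDSEED a bounded number of times with a PAUSE between attempts and reports failure so the caller can fall back to the RDRAND-based seeding of Code Example 8; the retry count and the use of the _mm_pause() intrinsic are illustrative choices.

#include <stdint.h>
#include <immintrin.h>	/* _mm_pause() */

/* Try RDSEED up to 'retries' times, pausing between attempts. Returns 1
 * and stores the seed on success, or 0 so the caller can fall back to an
 * RDRAND-based seeding method such as Code Example 8. */

int rdseed64_retry (unsigned int retries, uint64_t *seed)
{
	unsigned int count= 0;

	while ( count <= retries ) {
		if ( rdseed64_step(seed) ) {
			return 1;
		}

		_mm_pause();
		++count;
	}

	return 0;
}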

5         Summary

Intel Data Protection Technology with Secure Key represents a new class of random number generator. It combines a high-quality entropy source with a CSPRNG into a robust, self-contained hardware module that is isolated from software attacks. The resulting random numbers offer excellent statistical qualities, highly unpredictable random sequences, and high performance. Accessible via two simple instructions, RDRAND and RDSEED, the random number generator is also very easy to use. Random numbers are available to software running at all privilege levels, and require no special libraries or operating system handling.

 

References

1. Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator. Matsumoto, Makoto and Nishimura, Takuji. 1, January 1998, ACM Transactions on Modeling and Computer Simulation, Vol. 8.

2. Z. Gutterman, B. Pinkas, and T. Reinman. Analysis of the Linux Random Number Generator. [Online] March 2006. http://software.intel.com/sites/default/files/m/6/0/9/gpr06.pdf.

3. CVE-2008-0166. Common Vulnerabilities and Exposures. [Online] January 9, 2008. http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-0166.

4. Specification for the Advanced Encryption Standard (AES). [Online] November 26, 2001. http://software.intel.com/sites/default/files/m/4/d/d/fips-197.pdf.

5. Recommendation for Block Cipher Modes of Operation: Three Variants of Ciphertext Stealing for CBC Mode. [Online] October 2010. http://csrc.nist.gov/publications/nistpubs/800-38a/addendum-to-nist_sp800-38A.pdf.

6. Recommendation for Random Number Generation Using Deterministic Random Bit Generators (Revised). [Online] January 2012. http://csrc.nist.gov/publications/nistpubs/800-90A/SP800-90A.pdf.

7. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2: Instruction Set Reference, A-Z. [Online] http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html.

8. Intel® Processor Identification and the CPUID Instruction. [Online] April 2012. http://www.intel.com/content/www/us/en/processors/processor-identification-cpuid-instruction-note.html.

9. Intel® Architecture Instruction Set Extensions Programming Reference. [Online] https://software.intel.com/en-us/intel-isa-extensions.

 

 

 

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others
Copyright© 2014 Intel Corporation. All rights reserved.

How to use the rdrand engine in OpenSSL for random number generation


The OpenSSL* ENGINE API includes an engine specifically for Intel® Data Protection Technology with Secure Key. When this engine is enabled, the RAND_bytes() function will exclusively use the RDRAND instruction for generating random numbers and will not need to rely on the OS's entropy pool for reseeding. End applications can simply call RAND_bytes(), do not have to invoke RAND_seed() or RAND_add(), and the OpenSSL library will not call RAND_poll() internally.

Download the complete code sample at the bottom of the article.

Enabling the Engine

Enabling the ENGINE is a two-step process:

  1. Initialize the engine
  2. Set the engine as the default for random number generation

The following code snippets show how this is done. A complete sample program, distributed under the Intel Sample Source Code License, can also be downloaded using the link on this page. The sample code is designed to be compiled under Linux*, but can very easily be adapted to a Windows* or OS X* environment.

Initialization

The first step in initializing the engine is to call ENGINE_load_rdrand(). To avoid compile errors, it is up to the developer to either ensure that they have a version of OpenSSL with RDRAND support, or (if developing a software package for distribution as source code) determine prior to compilation whether or not this function exists in the OpenSSL library. On UNIX systems, the latter can be accomplished using the GNU* Autoconf autoconfiguration tool or similar utilities.

Once ENGINE_load_rdrand() has been called, the developer must then call ENGINE_by_id() to check for the existence of an engine with the name "rdrand". If this function returns NULL, then the rdrand engine was not able to load. The most common cause for this error would be lack of processor support for Intel Data Protection Technology with Secure Key.

ENGINE *engine;

ENGINE_load_rdrand();

engine= ENGINE_by_id("rdrand");
if ( engine == NULL ) {
	fprintf(stderr, "ENGINE_load_rdrand returned %lu\n", ERR_get_error());
	exit(1);
}

With the presence of the RDRAND engine verified, you then call ENGINE_init() to prepare it for use.

if ( ! ENGINE_init(engine) ) {
	fprintf(stderr, "ENGINE_init returned %lu\n", ERR_get_error());
	exit(1);
}

Using the Engine

With the engine properly initialized, you then set it as the default for random number generation. This is done with the ENGINE_set_default() function.

if ( ! ENGINE_set_default(engine, ENGINE_METHOD_RAND) ) {
	fprintf(stderr, "ENGINE_set_default returned %lu\n", ERR_get_error());
	exit(1);
}

Now all that is left is to call RAND_bytes() to generate random numbers.

RAND_bytes(buf, BUFFERSZ);

Cleanup

When your program is done with the engine, it is good form to clean up afterwards to free up any resources that the engine is using.

ENGINE_finish(engine);
ENGINE_free(engine);
ENGINE_cleanup();

Summary

Using the rdrand engine in OpenSSL is a straightforward process, easily accomplished in just a few lines of C/C++. With it, you can ensure that the output of RAND_bytes() is generated entirely by the RDRAND instruction on supported hardware.

§

*Other names and brands may be claimed as the property of others

Changes to RDRAND integration in OpenSSL


Beginning with the 1.0.1f release of OpenSSL the RDRAND engine is no longer loaded by default*.

The impact of this from the users' and developers' perspectives is that, for the near future, random numbers obtained from the RAND_bytes() function will come from OpenSSL's software-based PRNG rather than directly from the RDRAND instruction. This change was made in part due to concerns in the user community around OpenSSL's reliance on a single entropy source for random number generation.  Essentially, OpenSSL did not want to force end developers and users onto a single entropy source without them being aware that it was happening.

Unfortunately, this change also means that RDRAND is no longer OpenSSL's random number generator by default.

What does this change mean for OpenSSL and RDRAND in the future?

This change to the handling of the RDRAND engine is permanent. OpenSSL will not load the RDRAND engine by default from version 1.0.1f and on.

RDRAND still has a place in OpenSSL's future, however. In upcoming 1.0.2 releases, the PRNG within OpenSSL will actually call upon both the RDRAND and RDSEED instructions (if available) to augment its output. It will do this in two ways, by:

  1. XOR'ing random values obtained from the software PRNG with values obtained from RDRAND (in non-FIPS modes)
  2. Using RDSEED or RDRAND as an entropy source to seed the software PRNG (in both FIPS and non-FIPS modes)

This new behavior of the built-in PRNG has been committed into the 1.0.2-stable branch of the OpenSSL source tree. There's no release version or date targeted for these changes, but they will likely not make it into the initial 1.0.2 release, which was feature-frozen before these changes were implemented. It's also not known whether or not this will be back-ported into the 1.0.1 branch.

What should developers do in the mean time?

What's a developer to do given these changes in the OpenSSL architecture? Well, it depends on your goals, and how much work you want to do.

Explicitly enable the RDRAND engine

If you want to guarantee that the RAND_bytes() function returns numbers that come from the RDRAND instruction, then you can explicitly enable the RDRAND engine as described in the article, "How to use the rdrand engine in OpenSSL for random number generation". By doing this you are essentially returning to the pre-1.0.1f behavior of OpenSSL through manual intervention. You'll also ensure this behavior going forward, no matter which version of OpenSSL the end user has installed on their system.

Do nothing

Alternatively, you can do nothing, and accept the default behavior of OpenSSL's RAND_bytes() function. This means the source of the random numbers from RAND_bytes() will change with the OpenSSL version. If the end user upgrades from 1.0.1e to 1.0.1f then they will no longer benefit from the RDRAND instruction. When they upgrade to a 1.0.2 release that re-enables RDRAND support, they will still get the OpenSSL PRNG, but it will mix in values from RDRAND.

Fortunately, this interim period where RDRAND is not integrated into the RAND_bytes() function by default should not last for long.

 

§

* Strictly speaking, it's loaded but not enabled. It will still show up in the list of engines if you run "openssl engine".


Accelerating SSL Load Balancers with Intel® Xeon® v3 Processors


Examining the Impact of the MULX Instruction on HAProxy* Performance

One of the key components of a large datacenter or cloud deployment is the load balancer. Providing a service with high availability requires multiple, redundant servers, transparent failover, the ability to distribute load evenly across them, and of course the appearance of being a single server to the outside world, especially when negotiating SSL sessions. This is the role that the SSL-terminating load balancer is designed to fill, and it is a demanding one: every incoming session must be accepted, SSL-terminated, and transparently handed off to a back-end server system as quickly as possible since the load balancer is a concentration point and potential bottleneck for incoming traffic. This case study examines the impact of the Intel® Xeon® E5 v3 processor family and the Advanced Vector Extensions 2 instructions on the SSL handshake, and how the AVX2-optimized algorithms inside OpenSSL* can significantly increase the load capacity of the open source load balancer, HAProxy*.

Background

The goal of this case study is to examine the impact of code optimized for the Intel Xeon v3 line of processors on the performance of the haproxy load balancer.

One of the new features introduced with this processor family is the AVX2 instruction set, which expands the AVX integer commands to 256 bits. This is relevant to haproxy in SSL mode because public key cryptography algorithms make heavy use of large integer arithmetic, and the wider registers allow for more efficient execution since they can hold larger values. Accelerating this arithmetic on the server directly impacts the performance of the SSL handshake: the faster it can be performed, the more handshakes the server can handle, and the more connections per second that can be SSL-terminated and handed off to back-end servers.

The Test Environment

The performance limits of HAProxy were tested for various TLS cipher suites by generating a large number of parallel connection requests, and repeating those connections as fast as possible for a total of two minutes. At the end of those two minutes, the maximum latency across all requests was examined, as was the resulting connection rate that was sent to the HAProxy server. The number of simultaneous connections was adjusted between runs to find the maximum connection rate that HAProxy could sustain for the duration without session latencies exceeding 2 seconds. This latency limit was taken from the research paper “A Study on tolerable waiting time: how long are Web users willing to wait?”, which concluded that two seconds is the maximum acceptable delay in loading a small web page.

HAProxy v1.5.8 was installed on a pre-production, two-socket Intel Xeon server system populated with two pre-production E5-2697 v3 processors clocked at 2.60 GHz with Turbo on, running Ubuntu* Server 13.10. Each E5 processor had 14 cores, for a total of 28 physical cores, and with Hyper-Threading enabled the system was capable of 56 threads in total. Total system RAM was 64 GB.

HAProxy is a popular, feature-rich, and high-performance open source load balancer and reverse proxy for TCP applications, with specific features designed for handling HTTP sessions. Beginning with version 1.5, HAProxy includes native SSL support on both sides of the proxy. More information on HAProxy can be found at http://www.haproxy.org/.

The SSL capabilities for HAProxy were provided by the OpenSSL library. OpenSSL is an Open Source library that implements the SSL and TLS protocols in addition to general purpose cryptographic functions. The 1.0.2 branch, in beta as of this writing, is enabled for the Intel Xeon v3 processor and supports the MULX instruction in many of its public key cryptographic algorithms. More information on OpenSSL can be found at http://www.openssl.org/.

The server load was generated by up to six client systems as needed, a mixture of Xeon E5 and Xeon E5 v2 class hardware. Each system was connected to the HAProxy server with one or more 10 Gbit direct connect links. The server had two 4x10 Gbit network cards, and two 2x10 Gbit network cards. Two of the clients had 4x10 Gbit cards, and the remaining four had a single 10 Gbit NIC.

The network diagram for the test environment is shown in Figure 1.

Figure 1. Test network diagram.

The actual server load was generated using multiple instances of the Apache* Benchmark tool, ab, an open source utility that is included in the Apache server distribution. A single instance of Apache Benchmark was not able to create a load sufficient to reach the server’s limits, so it had to be split across multiple processors and, due to client CPU demands, across multiple hosts.

Because each Apache Benchmark instance is completely self-contained, however, there is no built-in mechanism for distributed execution. A synchronization server and client wrapper were written to coordinate the launching of multiple instances of ab across the load clients, their CPUs, and their network interfaces, and then collate the results.

The Test Plan

The goal of the test was to determine the maximum load in connections per second that HAProxy could sustain over 2 minutes of repeated, incoming connection requests, and to compare the Xeon v3 optimized code against previous generation code that does not contain the enhancements. For this purpose, two versions of HAProxy were built: one against the optimized 1.0.2-beta3 release, and one against the unoptimized 1.0.1g release.

To eliminate as many outside variables as possible, all incoming requests to HAProxy were for its internal status page, as configured by the monitor-uri parameter in its configuration file. This meant HAProxy did not have to depend on any external servers, networks or processes to handle the client requests. This also resulted in very small page fetches so that the TLS handshake dominated the session time.

To further stress the server, the keep-alive function was left off in Apache Benchmark, forcing all requests to establish a new connection to the server and negotiate their own sessions.

The key exchange algorithms that were tested are given in Table 1.

Table 1. Selected key exchange algorithms

Key Exchange    Certificate Type
RSA             RSA, 2048-bit
DHE-RSA         RSA, 2048-bit
ECDHE-RSA       RSA, 2048-bit
ECDHE-ECDSA     ECC, NIST P-256

Since the bulk encryption and cryptographic signing were not a significant part of the session, these were fixed at AES with a 128-bit key and SHA-1, respectively. Varying AES key size, AES encryption mode, or SHA hashing scheme would not have an impact on the results.

Tests for each cipher were run for the following hardware configurations:

  • 2 cores enabled (1 core per socket), Hyper-Threading off
  • 2 cores enabled (1 core per socket), Hyper-Threading on
  • 28 cores enabled (all cores, both sockets), Hyper-threading off
  • 28 cores enabled (all cores, both sockets), Hyper-Threading on

Reducing the system to one active core per socket, the minimum configuration in the test system, effectively simulates a low-core-count system and ensures that HAProxy performance is limited by the CPU rather than other system resources. These measurements can be used to estimate the overall performance per core, as well as estimate the performance of a system with many cores.

The all-core runs test the full capabilities of the system, show how well the performance scales to a many-core system and also introduces the possibility of system resource limits beyond just CPU utilization.

System Configuration and Tuning

HAProxy was configured to operate in multi-process mode, with one worker for each physical thread on the system. Because the tested key exchanges require either RSA or ECC certificates, HAProxy was configured to use one or the other certificate type as needed.

An excerpt from the configuration file, haproxy.conf, is shown in Figure 2.

global
       daemon
       pidfile /var/run/haproxy.pid
       user haproxy
       group haproxy
       crt-base /etc/haproxy/crt
       # Adjust to match the physical number of threads
       # including threads available via Hyper-Threading
       nbproc 56
       tune.ssl.default-dh-param 2048

defaults
       mode http
       timeout connect 10000ms
       timeout client 30000ms
       timeout server 30000ms

frontend http-in
       # Uncomment one or the other to choose your certificate type
       #bind :443 ssl crt rsa/combined-rsa.crt
       bind :443 ssl crt ecc/combined-ecc.crt
       monitor-uri /test
       default_backend servers

Figure 2. Excerpt from HAProxy configuration

To support the large number of simultaneous connections, some system and kernel tuning was necessary. First, the number of file descriptors was increased via /etc/security/limits.conf:

*               soft    nofile          150000
*               hard    nofile          180000

Figure 3. Excerpt from /etc/security/limits.conf

And several kernel parameters were adjusted (some of these settings are more relevant to bulk encryption):

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30

# Increase system IP port limits to allow for more connections

net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_window_scaling = 1

# number of packets to keep in backlog before the kernel starts
# dropping them
net.ipv4.tcp_max_syn_backlog = 3240000

# increase the maximum number of TIME-WAIT sockets held simultaneously
net.ipv4.tcp_max_tw_buckets = 1440000

# Increase TCP buffer sizes
net.core.rmem_default = 8388608
net.core.wmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 16777216 16777216 16777216
net.ipv4.tcp_wmem = 16777216 16777216 16777216

Figure 4. Excerpt from /etc/sysctl.conf

Some of these parameters are very aggressive, but the assumption is that this system is a dedicated load-balancer and SSL/TLS terminator.

No other adjustments were made to the stock Ubuntu 13.10 server image.

Results

Results for the 2-core and 28-core runs follow. Because all tests were run on the same hardware all performance improvements are due solely to the algorithmic optimizations for the Xeon v3 processor.

Two Cores

The raw two-core results are shown in Figure 5 and the performance deltas are in Figure 6.

Figure 5. HAProxy performance on a 2-core system with Hyper-Threading off

Figure 6. Performance gains due to Xeon v3 optimizations

These results show significant improvements across all ciphers tested, ranging from 26% to nearly 255%. They also make a compelling argument for using ECC ciphers on Xeon E5 v3 class systems: with a Xeon v3-enabled version of OpenSSL, just two cores are able to handle a staggering 6,500 connections per second using an ECDHE-ECDSA key exchange, which is more than 2.6x the performance of a straight RSA key exchange. ECDHE-RSA performs roughly on par with straight RSA. Both of these ciphers offer the cryptographic property of perfect forward secrecy.

Figure 7. Performance gains from enabling Hyper-Threading

Gains from enabling Hyper-Threading were very modest. The optimized algorithms are structured to use the execution resources as much as possible, which leaves little room for the additional threads.

Twenty-eight Cores

The 28-core tests look at the scalability of these performance gains to a many-core deployment. In an ideal world the values would scale linearly with the core count such that a 28-core system would have 14x the performance of a 2-core system.

The raw results are shown in Figure 8 and Figure 9. The ECDHE-ECDSA handshake once again leads the pack, with haproxy achieving over 38,000 connections per second. Of note, though, is that this maximum occurred with only a 67% average CPU utilization on the server, implying that the performance tests ran up against software or system resource limits that were not CPU-bound. For all other ciphers, the performance tests were able to achieve an average CPU utilization above 98%.

This is clearly shown in Table 2, where the performance scaling was between 8.0 and 8.8 for all handshake protocols except ECDHE-ECDSA, which was only 5.8.

Table 2. Performance scaling of the 28-core system over the 2-core system

Cipher                          Scaling
AES128-SHA (RSA)                8.6
DHE-RSA-AES128-SHA (RSA)        8.7
ECDHE-RSA-AES128-SHA (RSA)      8.8
ECDHE-ECDSA-AES128-SHA (ECC)    5.8

Figure 8. HAProxy performance on a 28-core system with Hyper-Threading off

Figure 9. Performance gains due to Xeon v3 optimizations

Effects of Hyper-Threading were more dramatic in the 28-core case. While the percentage gains were a little higher for the RSA and ECDHE-RSA key exchanges, DHE-RSA and ECDHE-ECDSA actually saw a performance penalty, with the ECDHE-ECDSA loss being rather significant. This is likely related to the same software or resource limit that impacts the non-Hyper-Threading case. In this run, the ECDHE-ECDSA cipher was only able to achieve an average CPU utilization of 41%. For all other ciphers, average CPU utilization was above 95%.

Figure 10. Performance impact from enabling Hyper-Threading

Conclusions

The optimizations for the Xeon E5 v3 processor result in significant performance gains for the haproxy load balancer using the selected ciphers. Each key exchange algorithm realized some benefit, ranging from 26% to nearly 255%. While these percentages are impressive, however, it’s probably the absolute performance figures relative to a straight RSA key exchange that are of greatest interest.

While straight RSA benefits from the Xeon v3 optimizations, the Elliptic curve Diffie-Hellman algorithms see much larger gains. ECDHE with an RSA certificate performs roughly on par with RSA, but move up to ECDHE with ECDSA signing and haproxy can handle over 2.6 times as many connections per second.

Since ECDHE ciphers provide perfect forward secrecy, there is simply no reason for a Xeon E5 v3 server to use a straight RSA key exchange. The ECDHE ciphers not only offer this added security, but the ECDHE-ECDSA cipher adds significantly higher performance. This does come at the cost of added load on the client, but the key exchange in TLS only takes place at session setup time so this is not a significant burden for the client to bear.

On a massively parallel installation with Hyper-Threading enabled, haproxy can maintain connection rates exceeding 53,000 connections/second using the ECDHE-ECDSA cipher, and do so without fully utilizing the CPU. This is on an out-of-the-box Linux distribution with only minimal system and kernel tuning, and it is conceivable that even higher connection rates could be achieved if the system could be optimized to remove the non-CPU bottlenecks. This task was beyond the scope of this study.

Configuring the Apache Web server to use RDRAND in SSL sessions


Starting with the 1.0.2 release of OpenSSL*, RDRAND has been temporarily removed as a random number source. Future releases of OpenSSL will re-incorporate RDRAND, but will employ cryptographic mixing with OpenSSL's own software-based PRNG. While OpenSSL's random numbers will benefit from the quality of RDRAND, the result will not have the same performance as RDRAND alone.

If you are running a high-volume SSL web server, the speed advantages of RDRAND are probably desirable. An earlier case study on OpenSSL performance when RDRAND was the sole RNG source showed that speedups to the SSL handshake can lead to up to a 1% increase in the number of connections/second that could be handled by an SSL concentrator. Internal testing on the Xeon v3 family of processors shows that RDRAND can give an additional boost to AES bulk encryption as well, since random numbers are used to generate IVs.

Fortunately, OpenSSL still provides access to RDRAND as a sole random number source via its engine API: you just have to turn it on. If you are running an Apache* 2.4 web server with mod_ssl, this is very easy to do. The configuration directive SSLCryptoDevice tells mod_ssl which engines to initialize inside of OpenSSL. To enable RDRAND as the sole random number source, you would use the following directive:

SSLCryptoDevice rdrand

Another advantage of doing this is that the digital random number generator that feeds RDRAND is autonomous and self-seeding, so you do not have to supply entropy to OpenSSL. This means you can use the 'builtin' entropy method in mod_ssl, which is the simplest and least CPU-intensive method, since the entropy generated by those sources is simply going to be ignored.

SSLRandomSeed startup builtin
SSLRandomSeed connect builtin

Depending on your system architecture, you might even see slightly higher performance from one of the special device files such as /dev/zero.
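
For example, mod_ssl's file method could be pointed at /dev/zero. This is only an illustration; the byte count below is arbitrary, and the seed data is effectively discarded when RDRAND is the engine's random number source:

SSLRandomSeed startup file:/dev/zero 512
SSLRandomSeed connect file:/dev/zero 512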

 


 

Bringing SSL to Arduino* on Galileo Through wolfSSL*


The Intel® Galileo development board is an Arduino*-certified development and prototyping board. Built on the Yocto 1.4 Poky Linux* release, Galileo merges an Arduino development environment with a complete Linux-based computer system, allowing enthusiasts to incorporate Linux system calls and OS-provided services in their Arduino sketches.

One long-standing limitation of the Arduino platform has been its lack of SSL support. Without SSL, Arduino-based devices are incapable of securely transmitting data using HTTPS and are thus forced to communicate insecurely over plain HTTP. To work around this limitation, devices that participate in the build-out of the Internet of Things must rely on secondary devices that serve as bridges to the internet: the Arduino device communicates over HTTP with the bridge, which in turn communicates with the internet-based service over HTTPS.

This solution works well for devices that have a fixed network location, but it does require additional hardware and introduces a concentration point for multiple devices that itself may be vulnerable to attack. For mobile devices that may occasionally rely on public wireless networks, this approach can be entirely impractical. The best level of protection for connected devices is achieved with SSL support directly on the device itself.

On Galileo an Arduino sketch is just a C++ program that is cross-compiled into machine code and executed as a process that is managed by the operating system. That means that it has access to the same system resources as any other compiled program, and specifically that program can be linked against arbitrary, compiled libraries. The implication here is that adding SSL support is as simple as linking the Arduino sketch to an existing SSL library.

This paper examines two methods for adding SSL support to Arduino sketches running on Galileo via the wolfSSL library from wolfSSL, Inc.* (formerly named the CyaSSL library). The wolfSSL library is a lightweight SSL/TLS library that is designed for resource-constrained environments and embedded applications, and is distributed under the GPLv2 license.

This paper looks at two methods for linking the wolfSSL library to an Arduino sketch, but both of them follow the same basic steps:

  1. Build wolfSSL for Yocto
  2. Install the wolfSSL shared library onto your Galileo image
  3. Modify the compile patterns for the Arduino IDE for Galileo
  4. Install the wolfSSL build files onto the system hosting the Arduino IDE

This procedure is moderately complex and does require a firm grasp of the Linux environment, shell commands, software packages and software build procedures, as well as methods of transferring files to and from a Linux system. While this paper does go into some detail on specific Linux commands, it is not a step-by-step instruction manual and it assumes that the reader knows how to manipulate files on a Linux system.

These procedures should work on both Galileo and Galileo 2 boards.

Method 1: Dynamic linking

In the dynamic linking method the Arduino sketch is dynamically linked with the shared object library, libwolfssl.so. This method is the easiest to program for since the sketch just calls the library functions directly.

There are disadvantages to this approach, however:

  • The Arduino IDE for Galileo uses a single configuration for compiling all sketches, so the linker will put a reference to libwolfssl.so in the resulting executable whether or not it’s needed by a sketch. This is not a problem if the target Galileo system has the wolfSSL library installed on it, but if any sketch is compiled for another system that does not have the library then those sketches will not execute.
  • The system hosting the Arduino IDE for Galileo must have the cross-compiled wolfSSL library installed into the Arduino IDE build tree.

Method 2: Dynamic loading

In the dynamic loading method the Arduino sketch is linked with the dynamic linking loader library, libdl. The wolfSSL library and its symbols are loaded dynamically during execution using dlopen() and dlsym(). This method is more tedious to program for since the function names cannot be resolved directly by the linker and must be explicitly loaded by the code and saved as function pointers.

The advantages over the dynamic linking method are:

  • libdl is part of the Galileo SD card image, so arbitrary sketches compiled by the modified IDE will still run on other Galileo systems.
  • The system hosting the Arduino IDE for Galileo only needs to have the wolfSSL header files installed into the build tree.
  • Any dynamic library is available to the Arduino sketch with just this single modification.

The first step in bringing SSL support to the Arduino environment is to build the wolfSSL library for Yocto using uClibc as the C library. This is accomplished using the cross compiler that is bundled with Intel’s Arduino IDE for Linux. This step must be performed on a Linux system.

There have been multiple releases of the IDE since the original Galileo release and any of them will do, but because path names have changed from release to release this document assumes that you will be using the latest build as of this writing, which is the Intel bundle version 1.0.4 with Arduino IDE version 1.6.0.

Software archive:

http://www.intel.com/content/www/us/en/do-it-yourself/downloads-and-documentation.html

Target file:

Arduino Software 1.6.0 - Intel 1.0.4 for Linux

Choose the 32-bit or 64-bit archive, whichever is correct for your Linux distribution.

Configuring the cross-compiler

If you have already used this version of the IDE to build sketches for your Galileo device then it has already been configured properly and you can skip this task.

If you have not built a sketch with it yet, then you will need to run the installation script in order to correctly set the path names in the package configuration files. This script, install_script.sh, is located in the hardware/tools/i586 directory inside the root of your IDE package. Run it with no arguments:

~/galileo/arduino-1.6.0+Intel/hardware/tools/i586$ ./install_script.sh
Setting it up.../tmp/tmp.7FGQfwEaNz/relocate_sdk.sh /nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/relocate_sdk.sh
link:/nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/x86_64-pokysdk-linux/lib/ld-linux-x86-64.so.2
link:/nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/x86_64-pokysdk-linux/lib/libpthread.so.0
link:/nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/x86_64-pokysdk-linux/lib/libnss_compat.so.2
link:/nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/x86_64-pokysdk-linux/lib/librt.so.1
link:/nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/x86_64-pokysdk-linux/lib/libresolv.so.2
…
SDK has been successfully set up and is ready to be used.

The cross-compiler is now ready for use.

Downloading the wolfSSL source

To build the wolfSSL library for Galileo you need to download the source code from wolfSSL, Inc. As of this writing, the latest version is 3.4.0 and is distributed as a Zip archive. Unzip the source into a directory of your choosing.
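
For example, assuming the archive was saved as wolfssl-3.4.0.zip under ~/src (adjust the file name and path to match your download):

~/src$ unzip wolfssl-3.4.0.zip
~/src$ cd wolfssl-3.4.0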

Building the library

In order to build the library, you must first set up your shell environment to reference the cross compiler. The environment setup files assume a Bourne shell environment so you must perform these steps in an appropriate and compatible shell such as sh or bash. Starting from a clean shell environment is strongly recommended.

First, source the environment setup file from the Intel Arduino IDE. Be sure to use the path to your Intel Arduino IDE instead of the path given in the example:

~/src/wolfssl-3.4.0$ . ~/galileo/arduino-1.6.0+Intel/hardware/tools/i586/environment-setup-i586-poky-linux-uclibc

This step will not generate any output.

Now, you are ready to run the configure script for wolfSSL. It is necessary to provide configure with a number of options in order to properly initialize it for a cross compile.

~/src/wolfssl-3.4.0$ ./configure --prefix=$HOME/wolfssl --host=i586-poky-linux-uclibc \
        --target=i586-poky-linux-uclibc

Note that you must supply absolute paths to the configure script, and cannot use ~ as a shortcut for your home directory. Use the $HOME shell variable instead.

The --prefix option tells the build system where to install the library. Since you won’t actually be installing the library on this system, any directory will do. This example shows it going in $HOME/wolfssl.

The --host and --target options tell the build system that this will be a cross-compile, targeting the architecture identified as i586-poky-linux-uclibc.

The configure script will generate a lot of output. When it finishes, assuming there are no errors, you can build the software using “make”.

~/src/wolfssl-3.4.0$ make
make[1]: Entering directory `/nfs/users/johnm/src/wolfssl-3.4.0'
  CC       wolfcrypt/test/testsuite_testsuite_test-test.o
  CC       examples/client/testsuite_testsuite_test-client.o
  CC       examples/server/testsuite_testsuite_test-server.o
  CC       examples/client/tests_unit_test-client.o
  CC       examples/server/tests_unit_test-server.o
  CC       wolfcrypt/src/src_libwolfssl_la-hmac.lo
  CC       wolfcrypt/src/src_libwolfssl_la-random.lo
…
  CCLD     examples/client/client
  CCLD     examples/echoserver/echoserver
  CCLD     testsuite/testsuite.test
  CCLD     tests/unit.test
make[1]: Leaving directory `/nfs/users/johnm/src/wolfssl-3.4.0'

And then install it to the local/temporary location via “make install”:

~/src/wolfssl-3.4.0$ make install

Your library will now be in the directory you specified to the --prefix option of configure, in the lib subdirectory:

~/src/wolfssl-3.4.0$ cd $HOME/wolfssl/lib
~/wolfssl/lib$ ls -CFs
total 188
  4 libwolfssl.la*    0 libwolfssl.so.0@        4 pkgconfig/
  0 libwolfssl.so@  180 libwolfssl.so.0.0.0*

You’re now ready to install the wolfSSL library onto Galileo.

There are two general approaches for installing the wolfSSL package onto Galileo: the first is to copy the files directly to the Galileo filesystem image, and the second is to copy the files onto a running Galileo system over a network connection. In either case, however, you do need to know which image you are running on your system: the SD-Card Linux image or the IoT Developer Kit image.

For Galileo running the SD-Card Linux image

The SD-Card Linux image is the original system image for Galileo boards. It is a very minimal system image which is less than 312 MB in size. It lacks development tools (e.g., there is no compiler) and advanced Linux utilities. As of this writing, the latest version of the SD-Card image is 1.0.4.

Software archive:

http://www.intel.com/content/www/us/en/do-it-yourself/downloads-and-documentation.html

Target file:

SD-Card Linux Image (SDCard.1.0.4.tar.bz2)

Both installation methods are discussed below, but installing directly to the Galileo filesystem image is preferred because you have more powerful utilities at your disposal.

Installing wolfSSL to the filesystem image

This method is easier and less error-prone than the other since you have file synchronization tools available to you, and you don’t have the added complexities of networking. All that is necessary is to mount the Galileo filesystem image as a filesystem on the build machine and then you can use rsync to copy the wolfSSL package into place. You can either copy this file to your build system, or mount the microSD card with the image directly on your Linux system using a card reader.

In the Galileo SD Card filesystem tree, the main Galileo filesystem image is called image-full-galileo-clanton.ext3 and it can be mounted using the loop device. Create a mount point (directory) on your build system—the example below uses /mnt/galileo—and then use the mount command to mount it:

~/wolfssl$ cd /mnt
/mnt$ sudo mkdir galileo
/mnt$ sudo mount -t ext3 -o loop /path/to/image-full-galileo-clanton.ext3 /mnt/galileo

The Galileo filesystem should now be visible as /mnt/galileo.

Use rsync to copy the shared library and its symlinks into place. They should be installed into /usr/lib on your Galileo system:

/mnt$ rsync -a $HOME/wolfssl/lib/lib* /mnt/galileo/usr/lib

Be sure to replace $HOME/wolfssl with the actual location of your local wolfSSL build.

Installing wolfSSL over the network

For this method, the Galileo system must be up and running with an active network connection and you will need to know its IP address. Because Galileo lacks file synchronization utilities such as rsync, files will have to be copied using tar to ensure that symbolic links are handled correctly.

First, use cd to switch to the lib subdirectory of your local wolfSSL build.

~/wolfssl$ cd $HOME/wolfssl/lib

Now use tar to create an archive of the shared library and its symlinks, and then copy it to Galileo with scp.

~/wolfssl/lib$ tar cf /tmp/wolfssl.tar lib*
~/wolfssl/lib$ cd /tmp
/tmp$ scp wolfssl.tar root@192.168.1.2:/tmp
root@192.168.1.2’s password:

Be sure to enter the IP address of your Galileo instead of the example.

Now log in to your Galileo device and untar the archive:

/tmp$ ssh root@192.168.1.2
root@192.168.1.2’s password:
root@clanton:~# cd /usr/lib
root@clanton:/usr/lib# tar xf /tmp/wolfssl.tar

For Galileo running the IoT Developer Kit image

The IoT Developer Kit image is a much larger and more traditional Linux system image which includes developer tools and many useful system utilities and daemons. It is distributed as a raw disk image which includes both FAT32 and ext3 disk partitions, and it must be direct-written to an SD card.

Software archive:

https://software.intel.com/en-us/iot/downloads

Target file:

iotdk-galileo-image.bz2

Both installation methods are discussed below.

As of this writing, you also need to replace the uClibc library on your Developer Kit image with the one bundled with your Intel Arduino IDE. Due to differences in the build procedures used for these two copies of the library, not all of the symbols exported by the IDE version are present in the Developer Kit version, which can lead to runtime crashes of Arduino sketches. The wolfSSL library, in particular, introduces a dependency on one of the symbols that is missing from the Developer Kit’s build of uClibc; if you do not replace the library on the Galileo system, attempts to use libwolfssl will fail.

Installing wolfSSL to the filesystem image

This method is easiest if you connect an SD card reader to your Linux system. Since the Developer Kit image contains an ext3 partition, most Linux distributions will automatically mount it for you, typically under /media or /mnt. Use the df command with the -T option to help you determine the mount point.

~$ df -T | grep ext3
/dev/sde2      ext3        991896  768032    172664  82% /media/johnm/048ce1b1-be13-4a5d-8352-2df03c0d9ed8

In this case, the mount point is /media/johnm/048ce1b1-be13-4a5d-8352-2df03c0d9ed8:

~$ /bin/ls -CFs /media/johnm/048ce1b1-be13-4a5d-8352-2df03c0d9ed8
total 96
 4 bin/   4 home/          4 media/          4 proc/    4 sys/   4 www/
 4 boot/  4 lib/           4 mnt/            4 run/     4 tmp/
 4 dev/   4 lib32/         4 node_app_slot/  4 sbin/    4 usr/
 4 etc/   16 lost+found/   4 opt/            4 sketch/  4 var/

The libraries used by Arduino sketches are kept in /lib32. Use cd to change to that directory and copy the wolfSSL shared libraries and their symlinks into this directory using rsync in order to preserve the symbolic links.

~/wolfssl$ cd /path-to-mountpoint/lib32
lib32$ rsync -a $HOME/wolfssl/lib/lib* .

Be sure to replace path-to-mountpoint with the actual mount point for your SD card’s Galileo filesystem.

Now, you need to replace the Developer Kit’s uClibc library with the one from your Intel Arduino IDE package. Instead of removing it or overwriting it, the following procedure will simply rename it, effectively disabling the original copy of the library but without permanently deleting it:

lib32$ mv libuClibc-0.9.34-git.so libuClibc-0.9.34-git.so.dist
lib32$ cp ~/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/i586-poky-linux-uclibc/lib/libuClibc-0.9.34-git.so .

Remember to use your actual path to your Intel Arduino IDE in place of the example one.

Installing wolfSSL over the network

For this method, the Galileo system must be up and running with an active network connection and you will need to know its IP address. Because Galileo lacks file synchronization utilities such as rsync, files will have to be copied using tar to ensure that symbolic links are handled correctly.

First, use cd to switch to the lib subdirectory of your local wolfSSL build.

~/wolfssl$ cd $HOME/wolfssl/lib

Now use tar to create an archive of the shared library and its symlinks, and then copy it to Galileo with scp.

~/wolfssl/lib$ tar cf /tmp/wolfssl.tar lib*
~/wolfssl/lib$ cd /tmp
/tmp$ scp wolfssl.tar root@192.168.1.2:/tmp
root@192.168.1.2’s password:

Be sure to enter the IP address of your Galileo instead of the example.

Now log in to your Galileo device and untar the archive:

/tmp$ ssh root@192.168.1.2
root@192.168.1.2’s password:
root@quark:~# cd /lib32
root@quark:/lib32# tar xf /tmp/wolfssl.tar

Next, you need to replace the Developer Kit’s uClibc library with the one from your Intel Arduino IDE package. Instead of removing it or overwriting it, the following procedure will simply rename it, effectively disabling the original copy of the library but without permanently deleting it (this will also prevent the actively running sketch from crashing):

root@quark:/lib32$ mv libuClibc-0.9.34-git.so libuClibc-0.9.34-git.so.dist

Log out of your Galileo system and use scp to copy the library from your Intel Arduino IDE to your Galileo:

~$ scp ~/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/i586-poky-linux-uclibc/lib/libuClibc-0.9.34-git.so root@192.168.1.2:/lib32

Remember to use your actual path to your Intel Arduino IDE in place of the example one, and your Galileo’s IP address.

To compile sketches that use the wolfSSL library you need to modify the compile patterns for the Arduino IDE for Galileo. The specific modification depends on the method you have chosen for linking to libwolfssl, but in either case the compile patterns live inside hardware/intel/i586-uclibc for Intel bundle 1.0.4 with Arduino IDE 1.5.3 and later.

Modifying the compile patterns

The file that holds your compile patterns is named platform.txt.

Locating the compile patterns file

You will be editing the line “recipe.c.combine.pattern”, which looks similar to this:

## Combine gc-sections, archives, and objects
recipe.c.combine.pattern="{compiler.path}{compiler.c.elf.cmd}" {compiler.c.elf.flags} -march={build.mcu} -o "{build.path}/{build.project_name}.elf" {object_files} "{build.path}/{archive_file}" "-L{build.path}" -lm -lpthread

Dynamic linking

If you are using the dynamic linking method, then you need to tell the linker to add libwolfssl to the list of libraries when linking the executable. Add -lwolfssl to the end of the line.

## Combine gc-sections, archives, and objects
recipe.c.combine.pattern="{compiler.path}{compiler.c.elf.cmd}" {compiler.c.elf.flags} -march={build.mcu} -o "{build.path}/{build.project_name}.elf" {object_files} "{build.path}/{archive_file}" "-L{build.path}" -lm -lpthread -lwolfssl

Be sure not to add any line breaks.

Dynamic loading

In the dynamic loading method, you need to tell the linker to add the dynamic loader library to the list of libraries. Add -ldl to the end of the line.

## Combine gc-sections, archives, and objects
recipe.c.combine.pattern="{compiler.path}{compiler.c.elf.cmd}" {compiler.c.elf.flags} -march={build.mcu} -o "{build.path}/{build.project_name}.elf" {object_files} "{build.path}/{archive_file}" "-L{build.path}" -lm -lpthread -ldl

Be sure not to add any line breaks.

The last step before you can compile sketches is to install the wolfSSL build files into the Arduino IDE for Galileo build tree. For the 1.6.0 release, the build tree is in hardware/tools/i586/i586-poky-linux-uclibc. In there you will find a UNIX-like directory structure containing directories etc, lib, usr, and var.

Installing the wolfSSL header files

Whether you are using the dynamic loading or dynamic linking method you will need to have the wolfSSL header files installed where the Arduino IDE can find them so that you can include them in your sketches with:

#include <wolfssl/ssl.h>

You can find the header files in the local installation of wolfSSL that you created in Step 1, in the include subdirectory. For backwards compatibility reasons, the wolfSSL distribution includes header files in both include/cyassl and include/wolfssl.

The wolfSSL header files must be installed into usr/include:

Copying the wolfssl header files into the includes directory
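
As a sketch of that copy, assuming the local wolfSSL install from Step 1 is in $HOME/wolfssl and the IDE was unpacked into ~/galileo as in the earlier examples:

$ cp -r $HOME/wolfssl/include/wolfssl ~/galileo/arduino-1.6.0+Intel/hardware/tools/i586/i586-poky-linux-uclibc/usr/include/
$ cp -r $HOME/wolfssl/include/cyassl ~/galileo/arduino-1.6.0+Intel/hardware/tools/i586/i586-poky-linux-uclibc/usr/include/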

Installing the wolfSSL libraries

If you are using the dynamic linking method, then you must also install the cross-compiled libraries into usr/lib. You can skip this step if you are using the dynamic loading method.

The libraries are in the local installation that was created in Step 1, inside the lib directory. From there copy:

libwolfssl.la
libwolfssl.so
libwolfssl.so.*

All but one of the shared object files will be symlinks, but it is okay for them to be copied as just regular files.

Installing the wolfssl library files into lib
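
A copy along the same lines should work here, under the same path assumptions as above; plain cp is sufficient since the symlinks may be copied as regular files:

$ cp $HOME/wolfssl/lib/libwolfssl.la $HOME/wolfssl/lib/libwolfssl.so* ~/galileo/arduino-1.6.0+Intel/hardware/tools/i586/i586-poky-linux-uclibc/usr/lib/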

The following example sketches show how to interact with the wolfSSL library using both the dynamic linking and dynamic loading methods. They perform the same function: connect to a target web server and fetch a web page using SSL. The page source is printed to the Arduino IDE for Galileo’s serial console.

These sketches are licensed under the Intel Sample Source Code license. In addition to browsing the source here, you can download them directly.

Dynamic linking example

/*
Copyright 2015 Intel Corporation All Rights Reserved.

The source code, information and material ("Material") contained herein is owned
by Intel Corporation or its suppliers or licensors, and title to such Material
remains with Intel Corporation or its suppliers or licensors. The Material
contains proprietary information of Intel or its suppliers and licensors. The
Material is protected by worldwide copyright laws and treaty provisions. No part
of the Material may be used, copied, reproduced, modified, published, uploaded,
posted, transmitted, distributed or disclosed in any way without Intel's prior
express written permission. No license under any patent, copyright or other
intellectual property rights in the Material is granted to or conferred upon
you, either expressly, by implication, inducement, estoppel or otherwise. Any
license under such intellectual property rights must be express and approved by
Intel in writing.

Include any supplier copyright notices as supplier requires Intel to use.

Include supplier trademarks or logos as supplier requires Intel to use,
preceded by an asterisk. An asterisked footnote can be added as follows: *Third
Party trademarks are the property of their respective owners.

Unless otherwise agreed by Intel in writing, you may not remove or alter this
notice or any other notice embedded in Materials by Intel or Intel's suppliers
or licensors in any way.
*/

#include <LiquidCrystal.h>
#include <dlfcn.h>
#include <wolfssl/ssl.h>
#include <Ethernet.h>
#include <string.h>

const char server[]= "www.example.com"; // Set this to a web server of your choice
const char req[]= "GET /Main_Page HTTP/1.0\r\n\r\n"; // The page to request from the server

int repeat;

int wolfssl_init ();
int client_send (WOLFSSL *, char *, int, void *);
int client_recv (WOLFSSL *, char *, int, void *);

LiquidCrystal lcd(8, 9, 4, 5, 6, 7);
void *handle;

EthernetClient client;


WOLFSSL_CTX *ctx= NULL;
WOLFSSL *ssl= NULL;
WOLFSSL_METHOD *method= NULL;

void setup() {
	Serial.begin(9600);
	Serial.println("Initializing");

	lcd.begin(16,2);
	lcd.clear();

	if ( wolfssl_init() == 0 ) goto fail;

	Serial.println("OK");

	// Set the repeat count to a maximum of 5 times so that we aren't
	// fetching the same URL over and over forever.

	repeat= 5;
	return;

fail:
	Serial.print("wolfSSL setup failed");
	repeat= 0;
}

int wolfssl_init ()
{
	char err[17];

	// Create our SSL context

	method= wolfTLSv1_2_client_method();
	ctx= wolfSSL_CTX_new(method);
	if ( ctx == NULL ) return 0;

	// Don't do certification verification
	wolfSSL_CTX_set_verify(ctx, SSL_VERIFY_NONE, 0);

	// Specify callbacks for reading to/writing from the socket (EthernetClient
	// object).

	wolfSSL_SetIORecv(ctx, client_recv);
	wolfSSL_SetIOSend(ctx, client_send);

	return 1;
}

int client_recv (WOLFSSL *_ssl, char *buf, int sz, void *_ctx)
{
	int i= 0;

	// Read a byte while one is available, and while our buffer isn't full.

	while ( client.available() > 0 && i < sz) {
		buf[i++]= client.read();
	}

	return i;
}

int client_send (WOLFSSL *_ssl, char *buf, int sz, void *_ctx)
{
	int n= client.write((byte *) buf, sz);
	return n;
}

void loop() {
	char errstr[81];
	char buf[256];
	int err;

	// Repeat until the repeat count is 0.

	if (repeat) {
		if ( client.connect(server, 443) ) {
			int bwritten, bread, totread;

			Serial.print("Connected to ");
			Serial.println(server);

			ssl= wolfSSL_new(ctx);
			if ( ssl == NULL ) {
				err= wolfSSL_get_error(ssl, 0);
				wolfSSL_ERR_error_string_n(err, errstr, 80);
				Serial.print("wolfSSL_new: ");
				Serial.println(errstr);
			}

			Serial.println(req);
			bwritten= wolfSSL_write(ssl, (char *) req, strlen(req));
			Serial.print("Bytes written= ");
			Serial.println(bwritten);

			if ( bwritten > 0 ) {
				totread= 0;

				while ( client.available() || wolfSSL_pending(ssl) ) {
					bread= wolfSSL_read(ssl, buf, sizeof(buf)-1);
					totread+= bread;

					if ( bread > 0 ) {
						buf[bread]= '\0';
						Serial.print(buf);
					} else {
						Serial.println();
						Serial.println("Read error");
					}
				}

				Serial.print("Bytes read= ");
				Serial.println(totread);
			}

			if ( ssl != NULL ) wolfSSL_free(ssl);

			client.stop();
			Serial.println("Connection closed");
		}

		--repeat;
	}

	// Be polite by sleeping between iterations

	delay(5000);
}

Dynamic loading example

/*
Copyright 2015 Intel Corporation All Rights Reserved.

The source code, information and material ("Material") contained herein is owned
by Intel Corporation or its suppliers or licensors, and title to such Material
remains with Intel Corporation or its suppliers or licensors. The Material
contains proprietary information of Intel or its suppliers and licensors. The
Material is protected by worldwide copyright laws and treaty provisions. No part
of the Material may be used, copied, reproduced, modified, published, uploaded,
posted, transmitted, distributed or disclosed in any way without Intel's prior
express written permission. No license under any patent, copyright or other
intellectual property rights in the Material is granted to or conferred upon
you, either expressly, by implication, inducement, estoppel or otherwise. Any
license under such intellectual property rights must be express and approved by
Intel in writing.

Include any supplier copyright notices as supplier requires Intel to use.

Include supplier trademarks or logos as supplier requires Intel to use,
preceded by an asterisk. An asterisked footnote can be added as follows: *Third
Party trademarks are the property of their respective owners.

Unless otherwise agreed by Intel in writing, you may not remove or alter this
notice or any other notice embedded in Materials by Intel or Intel's suppliers
or licensors in any way.
*/

#include <dlfcn.h>
#include <wolfssl/ssl.h>
#include <Ethernet.h>
#include <string.h>

/*
Set this to the location of your wolfssl shared library. By default you
shouldn't need to specify a path unless you put it somewhere other than
/usr/lib (SD-Card image) or /lib32 (IoT Developer Kit image).
*/
#define WOLFSSL_SHLIB_PATH "libwolfssl.so"

const char server[]= "www.example.com"; // Set this to a web server of your choice
const char req[]= "GET / HTTP/1.0\r\n\r\n"; // Get the root page
int repeat;

int wolfssl_dlload ();
int wolfssl_init ();
int client_send (WOLFSSL *, char *, int, void *);
int client_recv (WOLFSSL *, char *, int, void *);

void *handle;

EthernetClient client;


WOLFSSL_CTX *ctx= NULL;
WOLFSSL *ssl= NULL;
WOLFSSL_METHOD *method= NULL;

typedef struct wolfssl_handle_struct {
	WOLFSSL_METHOD *(*wolfTLSv1_2_client_method)();
	WOLFSSL_CTX *(*wolfSSL_CTX_new)(WOLFSSL_METHOD *);
	void (*wolfSSL_CTX_set_verify)(WOLFSSL_CTX *, int , VerifyCallback);
	int (*wolfSSL_connect)(WOLFSSL *);
	int (*wolfSSL_shutdown)(WOLFSSL *);
	int (*wolfSSL_get_error)(WOLFSSL *, int);
	void (*wolfSSL_ERR_error_string_n)(unsigned long, char *, unsigned long);
	WOLFSSL *(*wolfSSL_new)(WOLFSSL_CTX *);
	void (*wolfSSL_free)(WOLFSSL *);
	void (*wolfSSL_SetIORecv)(WOLFSSL_CTX *, CallbackIORecv);
	void (*wolfSSL_SetIOSend)(WOLFSSL_CTX *, CallbackIORecv);
	int (*wolfSSL_read)(WOLFSSL *, void *, int);
	int (*wolfSSL_write)(WOLFSSL *, void *, int);
	int (*wolfSSL_pending)(WOLFSSL *);
} wolfssl_t;

wolfssl_t wolf;

void setup() {
	Serial.begin(9600);
	Serial.println("Initializing");

	if ( wolfssl_dlload() == 0 ) goto fail;
	if ( wolfssl_init() == 0 ) goto fail;

	// Set the repeat count to a maximum of 5 times so that we aren't
	// fetching the same URL over and over forever.

	repeat= 5;
	return;

fail:
	Serial.print("wolfSSL setup failed");
	repeat= 0;
}

int wolfssl_init ()
{
	char err[17];

	// Create our SSL context

	method= wolf.wolfTLSv1_2_client_method();
	ctx= wolf.wolfSSL_CTX_new(method);
	if ( ctx == NULL ) return 0;

	// Don't do certification verification
	wolf.wolfSSL_CTX_set_verify(ctx, SSL_VERIFY_NONE, 0);

	// Specify callbacks for reading to/writing from the socket (EthernetClient
	// object).

	wolf.wolfSSL_SetIORecv(ctx, client_recv);
	wolf.wolfSSL_SetIOSend(ctx, client_send);

	return 1;
}

int wolfssl_dlload ()
{
	// Dynamically load our symbols from libwolfssl.so

	char *err;

	// goto is useful for constructs like this, where we need everything to succeed or
	// it's an overall failure and we abort. If just one of these fails, print an error
	// message and return 0.

	handle= dlopen(WOLFSSL_SHLIB_PATH, RTLD_NOW);
	if ( handle == NULL ) {
		err= dlerror();
		goto fail;
  }

	wolf.wolfTLSv1_2_client_method= (WOLFSSL_METHOD *(*)()) dlsym(handle, "wolfTLSv1_2_client_method");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_CTX_new= (WOLFSSL_CTX *(*)(WOLFSSL_METHOD *)) dlsym(handle, "wolfSSL_CTX_new");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_CTX_set_verify= (void (*)(WOLFSSL_CTX* , int , VerifyCallback)) dlsym(handle, "wolfSSL_CTX_set_verify");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_connect= (int (*)(WOLFSSL *)) dlsym(handle, "wolfSSL_connect");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_get_error= (int (*)(WOLFSSL *, int)) dlsym(handle, "wolfSSL_get_error");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_ERR_error_string_n= (void (*)(unsigned long, char *, unsigned long)) dlsym(handle, "wolfSSL_ERR_error_string_n");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_new= (WOLFSSL *(*)(WOLFSSL_CTX *)) dlsym(handle, "wolfSSL_new");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_free= (void (*)(WOLFSSL *)) dlsym(handle, "wolfSSL_free");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_SetIORecv= (void (*)(WOLFSSL_CTX *, CallbackIORecv)) dlsym(handle, "wolfSSL_SetIORecv");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_SetIOSend= (void (*)(WOLFSSL_CTX *, CallbackIORecv)) dlsym(handle, "wolfSSL_SetIOSend");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_read= (int (*)(WOLFSSL *, void *, int)) dlsym(handle, "wolfSSL_read");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_write= (int (*)(WOLFSSL *, void *, int)) dlsym(handle, "wolfSSL_write");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_pending= (int (*)(WOLFSSL *)) dlsym(handle, "wolfSSL_pending");
	if ( (err= dlerror()) != NULL ) goto fail;

	Serial.println("OK");

	return 1;

fail:
	Serial.println(err);
	return 0;
}

int client_recv (WOLFSSL *_ssl, char *buf, int sz, void *_ctx)
{
	int i= 0;

	// Read a byte while one is available, and while our buffer isn't full.

	while ( client.available() > 0 && i < sz) {
		buf[i++]= client.read();
	}

	return i;
}

int client_send (WOLFSSL *_ssl, char *buf, int sz, void *_ctx)
{
	int n= client.write((byte *) buf, sz);
	return n;
}

void loop() {
	char errstr[81];
	char buf[256];
	int err;

	// Repeat until the repeat count is 0.

	if (repeat) {
		if ( client.connect(server, 443) ) {
			int bwritten, bread, totread;

			Serial.print("Connected to ");
			Serial.println(server);

			ssl= wolf.wolfSSL_new(ctx);
			if ( ssl == NULL ) {
				err= wolf.wolfSSL_get_error(ssl, 0);
				wolf.wolfSSL_ERR_error_string_n(err, errstr, 80);
				Serial.print("wolfSSL_new: ");
				Serial.println(errstr);
			}

			Serial.println(req);
			bwritten= wolf.wolfSSL_write(ssl, (char *) req, strlen(req));
			Serial.print("Bytes written= ");
			Serial.println(bwritten);

			if ( bwritten > 0 ) {
				totread= 0;

				while ( client.available() || wolf.wolfSSL_pending(ssl) ) {
					bread= wolf.wolfSSL_read(ssl, buf, sizeof(buf)-1);
					totread+= bread;

					if ( bread > 0 ) {
						buf[bread]= '\0';
						Serial.print(buf);
					} else {
						Serial.println();
						Serial.println("Read error");
					}
				}

				Serial.print("Bytes read= ");
				Serial.println(totread);
			}

			if ( ssl != NULL ) wolf.wolfSSL_free(ssl);

			client.stop();
			Serial.println("Connection closed");
		}

		--repeat;
	}

	// Be polite by sleeping between iterations

	delay(5000);
}

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to:  http://www.intel.com/design/literature.htm

Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others

Copyright© 2012 Intel Corporation. All rights reserved.


AES-GCM Encryption Performance on Intel® Xeon® E5 v3 Processors


This case study examines the architectural improvements made to the Intel® Xeon® E5 v3 processor family in order to improve the performance of the Galois/Counter Mode of AES block encryption. It looks at the impact of these improvements on the nginx* web server when backed by the OpenSSL* SSL/TLS library. With this new generation of Xeon processors, web servers can obtain significant increases in maximum throughput by switching from AES in CBC mode with HMAC+SHA1 digests to AES-GCM.

Background

The goal of this case study is to examine the impact of the microarchitecture improvements made in the Intel Xeon v3 line of processors on the performance of an SSL web server. Two significant enhancements relating to encryption performance were latency reductions in the Intel® AES New Instructions (Intel® AES-NI) and in the PCLMULQDQ instruction. These changes were designed specifically to increase the performance of the Galois/Counter Mode of AES, commonly referred to as AES-GCM.

One of the key features of AES-GCM is that the Galois field multiplication used for message authentication can be computed in parallel with the block encryption. This permits a much higher level of parallelization than is possible with chaining modes of AES, such as the popular Cipher Block Chaining (CBC) mode. The performance gain of AES-GCM over AES-CBC with HMAC+SHA1 digests was significant even on older-generation CPUs such as the Xeon v2 family, but the architectural improvements in the Xeon v3 family further widen the performance gap.

Figure 1 shows the throughput gains realized in OpenSSL’s speed tests by choosing the aes-128-gcm EVP over aes-128-cbc-hmac-sha1 on both Xeon E5 v2 and Xeon E5 v3 systems. The hardware and software configuration behind these tests is given in Table 1. What this shows is that AES-GCM outperforms AES-CBC with HMAC+SHA1 on Xeon E5 v2 by as much as 2.5x, but on Xeon E5 v3 that advantage jumps to nearly 4.5x. The performance gap between GCM and CBC nearly doubles from Xeon E5 v2 to v3.


Table 1. Hardware and software configurations for OpenSSL speed tests.

In order to assess how this OpenSSL raw performance translates to SSL web server throughput, this case study looks at the maximum throughput achievable by the nginx web server when using these two encryption ciphers.


Figure 1. Relative OpenSSL 1.0.2a speed results for the aes-128-gcm and aes-128-cbc-hmac-sha1 EVPs on Xeon E5 v2 and v3 processors

The Test Environment

The performance limits of nginx were tested for the two ciphers by generating a large number of parallel connection requests, and repeating those connections as fast as possible for a total of two minutes. At the end of those two minutes, the maximum latency across all requests was examined along with the resulting throughput. The number of simultaneous connections was adjusted between runs to find the maximum throughput that nginx could achieve for the duration without connection latencies exceeding 2 seconds. This latency limit was taken from the research paper “A Study on tolerable waiting time: how long are Web users willing to wait?”, which concluded that two seconds is the maximum acceptable delay in loading a small web page.

Nginx was installed on a pre-production, two-socket Intel Xeon server system populated with two production E5-2697 v3 processors clocked at 2.60 GHz, with Turbo Boost on and Hyper-Threading off. The system was running Ubuntu* Server 13.10. Each E5 processor had 14 cores, for a total of 28 hardware threads. Total system RAM was 64 GB.

The SSL capabilities for nginx were provided by the OpenSSL library. OpenSSL is an Open Source library that implements the SSL and TLS protocols in addition to general-purpose cryptographic functions, and the 1.0.2 branch is optimized for the Intel Xeon v3 processor. More information on OpenSSL can be found at http://www.openssl.org/. The tests in this case study were made using 1.0.2-beta3, as the production release was not yet available at the time these tests were run.

The server load was generated by up to six client systems as needed; a mixture of Xeon E5 and Xeon E5 v2 class hardware. Each system was connected to the nginx server with multiple 10 Gbit direct connect links. The server had two 4x10 Gbit network cards, and two 2x10 Gbit network cards. Two of the clients had 4x10 Gbit cards, and the remaining four had a single 10 Gbit NIC.

The network diagram for the test environment is shown in Figure 2.


Figure 2. Test network diagram.

The actual server load was generated using multiple instances of the Apache* Benchmark tool, ab, an Open Source utility that is included in the Apache server distribution. A single instance of Apache Benchmark was not able to create a load sufficient to reach the server’s limits, so it had to be split across multiple processors and, due to client CPU demands, across multiple hosts.
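
As an illustration only (the target URL, request count, concurrency, and cipher below are placeholder values, not the actual test parameters), an individual Apache Benchmark instance can pin the negotiated cipher with its -Z option:

$ ab -n 100000 -c 200 -Z AES128-GCM-SHA256 https://192.168.1.10/1M.bin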

Because each Apache Benchmark instance is completely self-contained, however, there is no built-in mechanism for distributed execution. A synchronization server and client wrapper were written to coordinate the launching of multiple instances of ab across the load clients, their CPU’s, and their network interfaces, and then collate the results.

The Test Plan

The goal of the tests was to determine the maximum throughput that nginx could sustain over 2 minutes of repeated, incoming connection requests for a target file, and to compare the results for the AES128-SHA cipher to those for the AES128-GCM-SHA256 cipher on the Xeon E5-2697 v3 platform. Note that in the GCM cipher suites, the SHA suffix refers to the SHA hashing function used as the Pseudo Random Function algorithm in the cipher, in this case SHA-256.


Table 2. Selected TLS Ciphers

Each test was repeated for a fixed target file size, starting at 1 MB and increasing by powers of four up to 4 GB, where 1 GB = 1024 MB, 1 MB = 1024 KB, and 1 KB = 1024 bytes. The use of files 1 MB and larger minimized the impact of the key exchange on the session throughput. Keep-alives were disabled so that each connection resulted in fetching a single file.
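
Static target files of the required sizes can be generated with a tool such as dd; for example (the document-root path below is an assumption, not the path used in these tests):

$ dd if=/dev/urandom of=/var/www/html/4M.bin bs=1M count=4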

Tests for each cipher were run for the following hardware configurations:

  • 2 cores enabled (1 core per socket)
  • 4 cores enabled (2 cores per socket)
  • 8 cores enabled (4 cores per socket)
  • 16 cores enabled (8 cores per socket)

Hyper-threading was disabled in all configurations. Reducing the system to one active core per socket, the minimum configuration in the test system, effectively simulates a low-core-count system and ensures that nginx performance is limited by the CPU rather than other system resources. These measurements can be used to estimate the overall performance per core, as well as estimate the projected performance of a system with many cores.

The many-core runs test the scalability of the system and also introduce the possibility of system resource limits beyond just CPU utilization.

System Configuration and Tuning

Nginx was configured to operate in multi-process mode, with one worker for each physical thread on the system.

An excerpt from the configuration file, nginx.conf, is shown in Figure 3.

worker_processes 16; # Adjust this to match the core count

events {
        worker_connections 8192;
        multi_accept on;
}

Figure 3. Excerpt from nginx configuration

To support the large number of simultaneous connections that might occur at the smaller target file sizes, some system and kernel tuning was necessary. First, the number of file descriptors was increased via /etc/security/limits.conf:

*               soft    nofile          150000
*               hard    nofile          180000

Figure 4. Excerpt from /etc/security/limits.conf

And several kernel parameters were adjusted (some of these settings are more relevant to bulk encryption):

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30

# Increase system IP port limits to allow for more connections

net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_window_scaling = 1

# number of packets to keep in backlog before the kernel starts
# dropping them
net.ipv4.tcp_max_syn_backlog = 3240000

# increase the maximum number of TIME-WAIT sockets
net.ipv4.tcp_max_tw_buckets = 1440000

# Increase TCP buffer sizes
net.core.rmem_default = 8388608
net.core.wmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 16777216 16777216 16777216
net.ipv4.tcp_wmem = 16777216 16777216 16777216

Figure 5. Excerpt from /etc/sysctl.conf

Some of these parameters are very aggressive, but the assumption is that this system is a dedicated SSL/TLS web server.

No other adjustments were made to the stock Ubuntu 13.10 server image.

Results

The maximum throughput in Gbps achieved for each cipher by file size is shown in Figure 6. At the smallest file size, 1 MB, the differences between the GCM and CBC ciphers are modest because the SSL handshake dominates the transaction, but at the larger file sizes the GCM cipher outperforms the CBC cipher by 2 to 2.4x. Raw GCM performance is roughly 8 Gbps/core. This holds true up until 8 cores, when the maximum throughput is no longer CPU-limited. This is the point where other system limitations prevent the web server from achieving higher transfer rates, revealed more dramatically in the 16-core case. Here, both ciphers see only a modest increase in throughput, though the CBC cipher realizes the larger benefit.

This is illustrated more clearly in Figure 7, which plots the maximum CPU utilization of nginx during the 2-minute run for each case. In the 2- and 4-core cases, %CPU for both ciphers is in the high 90s, and in the 8-core case it ranges from 80% for large files to 98% for smaller ones.


Figure 6. Maximum nginx throughput by file size for given core counts

It is the 16-core case where system resource limits begin to show significantly, along with large differences in the performance of the ciphers themselves. Here, the total throughput has only increased incrementally from the 8-core case, and it’s apparent that this is because the additional cores simply cannot be put to use. The GCM cipher is using only 50 to 70% of the available CPU. It’s also clear that the GCM cipher is doing more—specifically, providing a great deal more throughput than the CBC cipher—with significantly less compute power.


Figure 7. Max %CPU utilization at maximum nginx throughput

Conclusions

The architectural changes to the Xeon v3 family of processors have a significant impact on the performance of the AES-GCM cipher, and they provide a very compelling argument for choosing it over AES-CBC with HMAC+SHA1 digests for SSL/TLS web servers.

In the raw OpenSSL speed tests, the performance gap between GCM and CBC nearly doubles from the Xeon E5 v2 family. In the web server tests, the use of the AES-GCM cipher led to roughly 2 to 2.4x the throughput of the AES-CBC cipher, and absolute data rates of about 8 Gbps/core. The many-core configurations are able to achieve total data transfer rates in excess of 50 Gbps before hitting system limits. This level of throughput was achieved on an off-the-shelf Linux installation with very minimal system tuning.

It may be necessary to continue to support AES-CBC with HMAC+SHA1 digests due to the large number of clients that cannot take advantage of AES-GCM, but AES-GCM should certainly be enabled on web servers running on Xeon v3 family processors in order to provide the best possible performance, not to mention the added security, that this cipher offers.

 

Performance of Multibuffer AES-CBC on Intel® Xeon® Processors E5 v3


This paper examines the impact of the multibuffer enhancements to OpenSSL* on the Intel® Xeon® processor E5 v3 family when performing AES block encryption in CBC mode. It focuses on the performance gains seen by the Apache* web server when managing a large number of simultaneous HTTPS requests using the AES128-SHA and AES128-SHA256 ciphers, and how they stack up against the more modern AES128-GCM-SHA256 cipher. With the E5 v3 generation of processors, web servers such as Apache can obtain significant increases in maximum throughput when using multibuffer-enabled algorithms for CBC mode encryption.

 

Background

One of the performance-limiting characteristics of the CBC mode of AES encryption is that it is not parallelizable. Each block of plaintext in the stream to be encrypted depends upon the encryption of the previous block as an input, as shown in Figure 1. Only the first block has no such dependency and substitutes an initialization vector, or IV, in its place.

Figure 1. CBC mode encryption

Mathematically, this is defined as:

C_i = E_k(P_i XOR C_(i-1))

where

C_0 = IV

From this definition, it is clear that there are no opportunities for parallelization within the algorithm for the encryption of a single data stream. To encrypt any given data block, P_n, it must first be XOR'd with the previous cipher block, which means that all previous blocks must be encrypted, in order, from 1 to n-1. The CBC mode of encryption is a classically serial operation.
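
To make that serial dependency concrete, the following minimal sketch expresses the CBC recurrence in code. The "block cipher" here is a trivial stand-in rather than AES, and the function names are invented for illustration; the point is only that each iteration consumes the ciphertext produced by the previous one.

#include <cstdint>
#include <cstddef>

static const size_t BLOCK = 16;

// Placeholder "block cipher": not real AES, just a fixed byte-wise transform
// so that the chaining structure below is easy to see.
static void toy_encrypt_block (const uint8_t *in, const uint8_t *key, uint8_t *out)
{
	for (size_t i= 0; i < BLOCK; ++i) out[i]= (uint8_t) ((in[i] ^ key[i]) + 1);
}

// CBC chaining: C_0 = IV, C_i = E_k(P_i XOR C_(i-1)).
// Each block's input depends on the previous block's output, so the loop over
// a single data stream cannot be parallelized.
static void toy_cbc_encrypt (const uint8_t *plain, uint8_t *cipher, size_t nblocks,
	const uint8_t *key, const uint8_t *iv)
{
	uint8_t prev[BLOCK], tmp[BLOCK];

	for (size_t i= 0; i < BLOCK; ++i) prev[i]= iv[i];

	for (size_t b= 0; b < nblocks; ++b) {
		for (size_t i= 0; i < BLOCK; ++i) tmp[i]= plain[b*BLOCK + i] ^ prev[i]; // P_i XOR C_(i-1)
		toy_encrypt_block(tmp, key, &cipher[b*BLOCK]);                          // E_k(...)
		for (size_t i= 0; i < BLOCK; ++i) prev[i]= cipher[b*BLOCK + i];         // carry C_i forward
	}
}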

The multibuffer approach introduced in “Processing Multiple Buffers in Parallel for Performance” describes a procedure for parallelizing algorithms such as CBC that are serial in nature. Operations are interleaved such that the latencies incurred while processing one data block are masked by active operations on another, independent data block. Through careful ordering of the machine instructions, the multiple execution units within a CPU core can be utilized to process more than one data stream in parallel within a single thread.

Multibuffer solutions generally require a job scheduler and an asynchronous application model, but the OpenSSL library is a synchronous framework so a job scheduler is not an option. The solution in this case is to break down the application buffer into TLS records of equal size that can be processed in parallel due to the explicit IV as described in “Improving OpenSSL Performance”. For a web server, the implication is that only file downloads from server to client—page fetches, media downloads, etc.—will see a performance boost. File uploads to the server will not.

 

The Test Environment

The performance limits of Apache were tested by generating a large number of parallel connection requests and repeating those connections as rapidly as possible for a total of two minutes. At the end of those two minutes, the maximum connection latency across all requests was examined along with the resulting throughput. The number of simultaneous connections was adjusted between runs to find the maximum throughput that Apache could achieve for the duration without connection latencies exceeding two seconds. This latency limit was taken from the research paper “A Study on tolerable waiting time: how long are Web users willing to wait?”, which concluded that two seconds is the maximum acceptable delay in loading a small web page.

Apache was installed on a pre-production, two-socket Intel Xeon processor-based server system populated with two production E5-2697 v3 processors clocked at 2.60 GHz with Intel® Turbo Boost Technology on and Intel® Hyper-Threading Technology (Intel® HT Technology) off. The system was running SUSE Linux* Enterprise Server 12. Each E5 processor had 14 cores for a total of 28 hardware threads. Total system RAM was 64 GB. Networking for the server load was provided by a pair of Intel® Ethernet Converged Network Adapters, XL710-QDA2 (Intel® Ethernet CNA XL710-QDA2).

The SSL capabilities for Apache were provided by the OpenSSL library. OpenSSL is an open source library that implements the SSL and TLS protocols in addition to general-purpose cryptographic functions. The 1.0.2 release is optimized for the Intel Xeon processor v3 and contains the multibuffer enhancements. For more information on OpenSSL see http://www.openssl.org/. The tests in this case study were made using 1.0.2a.

Two versions of OpenSSL 1.0.2a were built so that the performance of the multibuffer enhancements could be compared to unenhanced code on the same release. Multibuffer support was forcibly removed by defining the preprocessor symbol OPENSSL_NO_MULTIBLOCK:

$ ./Configure -DOPENSSL_NO_MULTIBLOCK options

The server load was generated by up to six client systems as needed, a mixture of Intel Xeon processor E5 v2 and E5 v3 class hardware. Load generators were connected to the Apache server through 40 Gbps links. Two of the clients had a single Intel Ethernet CNA XL710-QDA2 card and were connected to one of the dual-port Intel Ethernet CNA XL710-QDA2 cards on the server. The remaining four load clients each had a single port 10 Gbit card and their bandwidth was aggregated via a 40 Gbit switch.

The network diagram for the test environment is shown in Figure 2.

Figure 2. Test network diagram.

Edit: While the Intel Ethernet CNA XL710-QDA2 cards have two ports, the second port is mostly present for redundancy. As a PCIe 3.0 x8 card, it is limited by the bus to a maximum theoretical bandwidth of about 60 Gbps. In practice, actual throughput tends to be closer to 50 Gbps. Total theoretical maximum bandwidth of this network configuration is about 100 Gbps.

The actual server load was generated using multiple instances of the Apache* Benchmark tool, ab, an open source utility included in the Apache server distribution. A single instance of Apache Benchmark was not able to create a load sufficient to reach the server’s limits, so it had to be split across multiple processors and, due to network bandwidth and client CPU limitations, across multiple hosts.

Because each Apache Benchmark instance is completely self-contained, however, there is no built-in mechanism for distributed execution. A synchronization server and client wrapper were written to coordinate the launching of multiple instances of ab across the load clients, their CPUs, and their network interfaces, and then collate the results. Loads were distributed based on a simple weighting system that accounted for an individual client’s network bandwidth and processing power.

 

The Test Plan

The goal of the tests was to determine the maximum throughput that Apache could sustain throughout two minutes of repeated, incoming connection requests for a target file, and to compare the results for the multibuffer-enabled version of OpenSSL against the unenhanced version. Multibuffer benefits CBC mode encryption, so the AES128-SHA and AES128-SHA256 ciphers were chosen for analysis.

The secondary goal was to compare the multibuffer results against the more modern GCM mode of block encryption. For that comparison the AES128-GCM-SHA256 cipher was chosen.

This resulted in the following cases:

  • AES128-SHA, multibuffer ON
  • AES128-SHA, multibuffer OFF
  • AES128-SHA256, multibuffer ON
  • AES128-SHA256, multibuffer OFF
  • AES128-GCM-SHA256

For each case, performance tests were repeated for a fixed target file size, starting at 1 MB and increasing by powers of four up to 4 GB, where 1 GB = 1024 MB, 1 MB = 1024 KB, and 1 KB = 1024 bytes. The use of 1 MB files and larger minimized the impact of the key exchange on the session throughput. Keep-alives were disabled so that each connection resulted in fetching a single file.

Tests for each cipher were run for the following hardware configurations:

  • 2 cores enabled (1 core per socket)
  • 4 cores enabled (2 cores per socket)
  • 8 cores enabled (4 cores per socket)
  • 16 cores enabled (8 cores per socket)
  • 28 (all) cores enabled (14 cores per socket)

Intel HT Technology was disabled in all configurations. Reducing the system to one active core per socket, the minimum configuration in the test system, effectively simulates a low-core-count system and ensures that Apache performance is limited by the CPU rather than other system resources. These measurements can be used to estimate the overall performance per core, as well as estimate the projected performance of a system with many cores.

The many-core runs test the scalability of the system and introduce the possibility of system resource limits beyond just CPU utilization.

 

System Configuration and Tuning

Apache was configured to use the event Multi-Processing Module (MPM), which implements a hybrid multi-process, multi-threaded server. This is Apache’s highest performance MPM and the default on systems that support both multiple threads and thread-safe polling.

To support the large number of simultaneous connections that might occur at the smaller target file sizes, some system and kernel tuning was necessary. First, the number of file descriptors was increased via /etc/security/limits.conf:

Excerpt from /etc/security/limits.conf
Figure 3. Excerpt from /etc/security/limits.conf

And several kernel parameters were adjusted (some of these settings are more relevant to bulk encryption):

Excerpt from /etc/sysctl.conf
Figure 4. Excerpt from /etc/sysctl.conf

Some of these parameters are very aggressive, but the assumption is that this system is a dedicated TLS web server.

No other adjustments were made to the stock SLES 12 server image.

 

System Performance Limits

Before running the tests, the throughput limit of the server system was explored using unencrypted HTTP. Tests on the same target file sizes, with all cores active and the same constraint of a 2-second maximum connection latency, saw a maximum achievable throughput of just over 77 Gbps (with very little CPU utilization).

Edit: A cursory investigation suggested that there may have been a configuration issue with the dual-port NIC, leading to less than the expected maximum throughput for the adapter, which should have been closer to about 50 Gbps. In these tests, the bandwidth of this card reached a maximum of about 42 Gbps. In-depth debugging was not done, however, due to time constraints.

Results

The maximum throughputs in Gbps achieved for the AES128-SHA and AES128-SHA256 ciphers by file size are shown in Figure 5 and Figure 6. At the smallest file size, 1 MB, the multibuffer enhancements result in about a 44% gain on average, and at the larger file sizes this gain is as high as 115%. This holds true up through 8 cores. At 16 cores, the gains begin to drop off as the throughput reaches the ceiling of 77 Gbps. In the 28-core case, the unenhanced code has nearly reached the throughput ceiling, but with significantly higher CPU utilization as shown in Figure 7.

Maximum throughput on Apache* server using AES128-SHA cipher
Figure 5. Maximum throughput on Apache* server by file size for given core counts using the AES128-SHA cipher

The AES128-SHA256 cipher shows even larger gains for the multibuffer enhanced code, with about a 65% improvement for 1 MB files and jumping to 130% at larger file sizes. Because the SHA256 hashing is more CPU intensive, the overall throughput is significantly lower than the SHA1-based cipher. A side effect of this lower performance is that the multibuffer code scales through the 16-core case, and the unenhanced code never reaches the throughput ceiling even when all 28 cores are active.

Maximum throughput on Apache* using the AES128-SHA256 cipher
Figure 6. Maximum throughput on Apache* server by file size for given core counts using the AES128-SHA256 cipher

 AES128-SHA cipher and 28-cores
Figure 7. Maximum CPU utilization for Apache* server: AES128-SHA cipher and 28-cores

The performance of the multibuffer-enhanced ciphers is compared to the AES128-GCM-SHA256 cipher in Figure 8. The GCM cipher outperforms both of the multibuffer-enhanced ciphers, though AES128-SHA stays within about 20% of the GCM throughput. The AES128-SHA256 cipher is the lowest performer due to the larger CPU demands of the SHA256 hashing.

Maximum throughput on Apache* server comparing CBC + multibuffer to GCM encryption
Figure 8. Maximum throughput on Apache* server by file size for given core counts, comparing CBC + multibuffer to GCM encryption

 

Conclusions

The multibuffer enhancements to AES CBC encryption in OpenSSL 1.0.2 provide a significant performance boost, yielding over 2x performance in some cases. Web sites that need to retain these older ciphers in their negotiation list can achieve performance that is nearly on par with GCM for page and file downloads.

Web site administrators considering moving to AES128-SHA256 to obtain the added security from the SHA256 hashing will certainly see a significant performance boost from multibuffer, but if at all possible they should switch to GCM, which offers significantly higher performance due to its design.

The DRNG Library and Manual


Download

Download the static binary libraries, source code, and documentation:

This software is distributed under the Intel Sample Source Code license.

About

This is the DRNG Library, a project designed to bring the rdrand and rdseed instructions to customers who would traditionally not have access to them.

The "rdrand" and "rdseed" instructions are available at all privilege levels to any programmer, but tool chain support on some platforms is limited. One goal of this project is to provide access to these instructions via pre-compiled, static libraries. A second level goal is to provide a consistent, small, and very easy-to-use API to access the "rdrand" and "rdseed" instruction for various sizes of random data.

The source code and build system are provided for the entire library allowing the user to make any needed changes, or build dynamic versions, for incorporation into their own code.

Getting Started

For ease of use, this library is distributed with static libraries for Microsoft* Windows* and Microsoft* Visual Studio*, Linux Ubuntu* 14.10, and OS X* Yosemite*. The library can also be built from source, which requires Visual Studio with the Intel(r) C++ Compiler (or Visual Studio 2013 on its own) on Windows, or GNU* gcc* on Linux and OS X*. See the Building section for more details.

Once the static library is compiled, using it is simply a matter of linking the library with your code and adding the header to the header search path. Linking the static library is beyond the scope of this documentation, but for demonstration a simple Microsoft* Visual Studio* project named test is included, as well as a simple project with a Makefile for Linux and OS X. Source for the test is in main.c; on Linux the test project uses the top-level Makefile, and on Windows the rdrand.sln solution includes the test project.

Rdrand is only supported on 3rd generation Intel(r) Core processors and beyond, and Rdseed is only supported on 5th generation Intel(r) Core and Core M processors and beyond. It makes sense to determine whether or not these instructions are supported by the CPU, and this is done by examining the appropriate feature bits after calling cpuid. To ease use, the library handles this automatically and stores the result internally, transparently to the end user. This result is stored in global data and is thread-safe, since if one thread of execution supports rdrand, they all will. Users may find it more practical, however, to call the RdRand_isSupported() and RdSeed_isSupported() functions when managing multiple potential code paths in an application.
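A minimal sketch of that approach follows. The rdrand.h header name is an assumption here; use the header distributed with the library, and confirm the exact return conventions of the support functions against its declarations:

#include <stdio.h>
#include "rdrand.h"   /* header name assumed; use the header shipped with the library */

int main(void)
{
    /* Pick a code path once, up front, rather than probing on every call. */
    if (RdSeed_isSupported()) {
        puts("rdseed available: seed values can come straight from the hardware");
    } else if (RdRand_isSupported()) {
        puts("rdrand available: use it for random values");
    } else {
        puts("neither instruction available: fall back to another entropy source");
    }
    return 0;
}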

The API was designed to be as simple and easy-to-use as possible, and consists of these functions:

int rdrand_16(uint16_t* x, int retry);
int rdrand_32(uint32_t* x, int retry);
int rdrand_64(uint64_t* x, int retry);

int rdseed_16(uint16_t* x, int retry_count);
int rdseed_32(uint32_t* x, int retry_count);
int rdseed_64(uint64_t* x, int retry_count);

int rdrand_get_n_64(unsigned int n, uint64_t* x);
int rdrand_get_n_32(unsigned int n, uint32_t* x);

int rdseed_get_n_64(unsigned int n, uint64_t* x, unsigned int skip, int max_retries);
int rdseed_get_n_32(unsigned int n, uint32_t* x, unsigned int skip, int max_retries);

int rdrand_get_bytes(unsigned int n, unsigned char *buffer);
int rdseed_get_bytes(unsigned int n, unsigned char *buffer, unsigned int skip, int max_retries);

Each function calls rdrand or rdseed internally for a specific data-size of random data to return to the caller.

The return value of these functions indicates success, that the hardware was not ready (when retries are not specified), or that the host hardware does not support the desired instruction at all.
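A short usage sketch follows. The rdrand.h header name is an assumption, and the raw status codes are simply printed rather than compared against the library's named constants, which are defined in that header:

#include <stdio.h>
#include <stdint.h>
#include "rdrand.h"   /* header name assumed; use the header shipped with the library */

int main(void)
{
    uint64_t value = 0;
    unsigned char buffer[32];
    int status;

    /* One 64-bit random value, retrying if the hardware is momentarily not ready */
    status = rdrand_64(&value, 1);
    printf("rdrand_64 status %d, value 0x%016llx\n",
           status, (unsigned long long)value);

    /* Fill a small buffer with random bytes in a single call */
    status = rdrand_get_bytes((unsigned int)sizeof(buffer), buffer);
    printf("rdrand_get_bytes status %d\n", status);

    /* Compare status against the success / not-ready / unsupported codes
       defined in the library header before trusting the output. */
    return 0;
}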

Building

Building the Rdrand Library is supported under Microsoft* Visual Studio 2013*, Linux* and OS X*. Use of the Intel(r) C++ Compiler is optional unless you are using a version of Visual Studio earlier than 2013 on Microsoft Windows*.

To build the library on Windows, open the rdrand Microsoft Visual Studio solution (rdrand.sln) and build the project as normal from the build menu. Two projects are included: rdrand, the actual library, and test, the demonstration program.

On Linux and OS X the build is wrapped with GNU* autoconf*. To build:

$ ./configure
$ make

Release Notes

The DRNG Library is simple and as of this release is functionally complete. There are no known issues.

© 2015 Intel Corporation. All rights reserved.

Presenting at USENIX LISA15 in November


I'll be holding a mini-tutorial session at the USENIX LISA15 conference in Washington, D.C. this coming November. My class, entitled Fundamentals of Data Visualization: Building more Effective Charts and Business Intelligence Dashboards, is targeted at anyone who has to present numerical data in static form, whether it be part of a presentation to management and co-workers or in a business intelligence dashboard. The emphasis is on creating clear graphs and displays that can be parsed quickly and accurately, and which do not mislead the reader. It also covers some visual theory and the physiology of vision, which serve as the foundation for the recommendations and best practices when creating charts and dashboards. I've taught a version of this material internally here at Intel over the past three years.

The idea for this class grew out of my nearly two decades spent as a systems administrator, with the last several of those involved in developing systems that gather and report performance and utilization metrics to internal customers-- design engineers-- and management. I had some formal training on the topic as part of my GIS education, but it is largely an enthusiast pursuit and I am very passionate about the subject.

If you are attending the LISA conference this year, the tutorial is scheduled for Thursday, November 14th from 11-12:30pm.


Presenting at ToorCon 17 in San Diego on October 24th


I'll be speaking at the ToorCon 17 security conference which is taking place at the end of this month, from October 24-25th. My specific talk will be on Saturday at 1pm, and will cover some of the algorithm enhancements that were made to OpenSSL in order to increase the performance of AES-CBC and AES-GCM encryption. These enhancements, function stitching and multibuffer, were mentioned briefly in a white paper on OpenSSL performance that was published here on Developer Zone earlier this year, but I'll go into more technical details on the techniques and what makes them work. These are two very novel approaches to cryptographic optimization, and they have applications beyond just OpenSSL.

I'll also talk about how these enhancements affect real-world, or at least "real-world-like", performance of an SSL/TLS web server, and the methodology used to generate our throughput measurements using freely available tools.

Slides from ToorCon 17 talk on OpenSSL Performance


I've sent the slides from my talk on Improving OpenSSL Performance to the ToorCon staff for distribution, but I am also making them available for download here in PDF format. I chose PDF since many of the slides have extensive notes, which go into much greater detail than the slides could, and in many cases even cover detail that I could not put into the 50-minute session on Saturday.

Enabling Multibuffer for AES-CBC Encryption in Apache* and nginx*


The OpenSSL* library v1.0.2 introduces the multibuffer enhancements for AES. Multibuffer significantly increases the performance of CBC mode encryption by processing multiple TLS records in parallel on a single hardware thread. As shown in the white paper Improving OpenSSL Performance, the multibuffer enhancements to AES-CBC encryption can result in a 50 to 100% increase in throughput. This makes it an attractive feature for web site administrators using these ciphers, and this article explains how to enable it under two popular, open source, Linux*-based web servers: Apache* HTTP Server from the Apache Software Foundation, and nginx* from Nginx, Inc.

Multibuffer Basics

Multibuffer works by encrypting the TLS records created during a file transfer in parallel. Because of the overhead involved, however, multibuffer only works when the SSL buffers are at least 16 KB in size (where 1 KB is defined as 1024 bytes). Since the current multibuffer implementations process four data streams in parallel this means that the minimum file size required for multibuffer to engage is 64 KB.

Each TLS record is encrypted separately with CBC encryption, which means each record also requires a random IV. Since those TLS records are also processed simultaneously, the multibuffer solution demands random numbers at a higher frequency: all else being equal, multibuffer will consume the same number of random values as the non-multibuffer code path, but do so in less time. In OpenSSL 1.0.2, random numbers are supplied by the RAND_bytes() function, which uses OpenSSL's software-based pseudorandom number generator (PRNG), mixed with values obtained from the RDRAND instruction if Intel® Data Protection Technology with Intel® Secure Key is supported by the processor. Multibuffer throughput might benefit from using Secure Key as the sole random number provider, freeing up CPU cycles for the AES encryption. This can be accomplished by enabling the rdrand engine in OpenSSL, though this will also affect other areas of OpenSSL that depend on random numbers, such as the SSL/TLS handshake.
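Web servers expose this through their own configuration, as described in the sections below, but an application that links OpenSSL 1.0.2 directly can select the rdrand engine for RAND through the ENGINE API. A rough sketch is shown here, with error handling trimmed; verify the behavior against your own OpenSSL build:

#include <openssl/engine.h>
#include <openssl/rand.h>

/* Make the rdrand engine the default RAND provider; returns 1 on success. */
static int use_rdrand_for_rand(void)
{
    ENGINE *e;

    ENGINE_load_rdrand();               /* register the built-in rdrand engine */
    e = ENGINE_by_id("rdrand");
    if (e == NULL)
        return 0;                       /* engine not available on this CPU/build */

    if (!ENGINE_init(e) || !ENGINE_set_default(e, ENGINE_METHOD_RAND)) {
        ENGINE_free(e);
        return 0;
    }

    ENGINE_free(e);                     /* release the structural reference */
    return 1;                           /* RAND_bytes() now draws from rdrand */
}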

Configuring Apache for Multibuffer

Apache uses dynamically sized SSL buffers that adapt to the file being transferred. As a result, Apache will automatically make use of multibuffer if the target file is 64 KB or larger. No configuration needs to be done.

To configure OpenSSL to use the rdrand engine for Apache, use the SSLCryptoDevice directive:

SSLCryptoDevice rdrand

More information can be found in the article "Configuring the Apache Web server to use RDRAND in SSL sessions".

Configuring nginx for Multibuffer

nginx uses a fixed SSL buffer size which is configured in the nginx.conf file. In order to enable multibuffer the configuration parameter ssl_buffer_size must be set to 64k or larger:

ssl_buffer_size 64k;

Larger sizes will result in greater throughput, but unfortunately they can also increase latency no matter which cipher is used. For small files (ones that would not benefit from multibuffer), this means incurring a small performance penalty.

Unfortunately, there is no single magic number that works well for all file sizes. This value should be chosen based on the workloads that are common and appropriate for the server, and administrators are encouraged to test different values to find the one that works best for their environment.
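For reference, a minimal sketch of where the directive sits in nginx.conf is shown below; the server name and certificate paths are placeholders, and 64k is simply the smallest size at which multibuffer engages:

http {
    server {
        listen              443 ssl;
        server_name         www.example.com;            # placeholder
        ssl_certificate     /etc/nginx/ssl/server.crt;  # placeholder
        ssl_certificate_key /etc/nginx/ssl/server.key;  # placeholder

        # 64k or larger enables the multibuffer code path; tune for your workload
        ssl_buffer_size     64k;
    }
}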

At the current time, nginx does not provide a way to configure specific engines within OpenSSL so there is no way to force the rdrand engine for random number generation.

 


 

How to Verify That OpenSSL is Using Multibuffer With Your Web Server


The article "Enabling Multibuffer for AES-CBC Encryption in Apache* and nginx*" describes the procedures for enabling OpenSSL's* multibuffer capability with ciphers that use AES in CBC mode for the block cipher. This article describes how to check if the multibuffer code path is being used so that systems administrators can verify that their web server is operating as expected. Note that this procedure assumes that the web server is running on a Linux*-based system.

What you'll need

  1. The Linux "perf" package. It is available for all major Linux distributions and should be easy to install using the distro's package manager.
  2. A utility that can make numerous HTTPS requests in succession, and which allows you to choose the SSL cipher for the session. This article specifically recommends the Apache Benchmark tool, ab, which is included in the Apache HTTP Server distribution.

Step 1: Run the web server in single-process debug mode

A very simple way to determine which code path OpenSSL is using (which functions are being called) is to use a trace utility to analyze the process. Servers which use a forking model to spawn multiple worker processes make tracing more difficult since it's not known which of the child processes is actively handling the workload during the test run. There are ways to work around this problem, but they involve gathering more data which lowers the overall signal-to-noise ratio. It's easiest to simply put the server into a single process mode and eliminate the ambiguity.

The Apache web server can be run in single process mode using the -X switch:

apachectl start -X

To put the nginx web server in a single process mode you'll need two directives in the configuration file:

master_process off;
worker_processes 1;

You can optionally specify "daemon off" as well if you don't want nginx to detach from the controlling terminal.

Step 2: Determine the process ID of the web server process

This can be done by using ps and grepping for the server process name, or by looking in the pid file created by the web server.
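For example (process names and pid-file locations vary by server and distribution):

$ ps -ef | grep -E '[n]ginx|[h]ttpd'
$ cat /var/run/nginx.pid     # or the PidFile path set in the server configuration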

Step 3: Run perf to trace the process

Use the perf record utility to begin tracing the process using the PID obtained from step 2. You may need to run this as root.

# perf record -p pid

perf works by periodically sampling the process and examining its call stack. The higher the sample rate, the more accurate the results will be, but this comes at the expense of application performance. The defaults are sufficient for the purposes of this procedure since the goal is only to determine whether or not the multibuffer functions are being called, not to attempt performance analysis.

Note that perf will run in the foreground. Use ^C to quit.

Step 4: Fetch a large file multiple times

With perf record still running, fetch a file from the web server that is large enough to trigger the multibuffer code. Multibuffer only engages when the file being transferred is 64 KB in size or larger. It is important to fetch the file multiple times to ensure that perf record can catch the encryption functions during its sample periods. Be sure to specify a cipher that uses AES in CBC mode for the block encryption, and that it is one that is accepted by your server.

This command runs the Apache Benchmark utility to fetch the target file 1000 times in rapid succession, negotiating with the server to use the AES128-SHA cipher defined in OpenSSL. This corresponds to the cipher suite TLS_RSA_WITH_AES_128_CBC_SHA in the RFCs. (A mapping of OpenSSL cipher names to RFC cipher suites can be found at https://testssl.sh/openssl-rfc.mappping.html).

% ab -n 1000 -Z AES128-SHA https://servername/bigfile

Make sure the file requests are successful, and then stop the performance trace by hitting ^C.

Step 5: Generate the trace report

With the trace completed, use the perf report command to generate a report of the process trace:

# perf report

By default, this will send the output to your terminal using a pager such as less or more. If you had to run perf record as root, then you'll need to run perf report as root as well.
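If you prefer to save the report rather than page through it interactively, perf can write plain text to stdout, which can then be redirected (the output file name here is just an example):

# perf report --stdio > perf-report.txt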

Multibuffer Functions

If you see these functions listed in the report, then OpenSSL is using multibuffer for the CBC encryption:

     3.92%    nginx  nginx               [.] sha1_multi_block_avx
     2.19%    nginx  nginx               [.] aesni_multi_cbc_encrypt

Though they only make up a small percentage of the CPU utilization, they should appear near the top of the report (within the first dozen lines).

Stitched Functions

If you see functions similar to these then OpenSSL is using the stitched code path for CBC encryption:

     2.03%    nginx  nginx               [.] sha256_block_data_order_avx2
     0.69%    nginx  nginx               [.] sha1_block_data_order_avx2             

Note that the exact function names will depend on your server's architecture (this example shows the AVX2 code path).

 


Accelerating SSL Load Balancers with Intel® Xeon® E5 v4 Processors


Examining the impact of the ADCX, ADOX, and MULX instructions on haproxy performance

One of the key components of a large datacenter or cloud deployment is the load balancer. A service provider aiming for high availability and low latency requires multiple redundant servers, a transparent failover mechanism between them, and some form of performance monitoring to direct traffic to the best available server. This infrastructure must present itself as a single server to the outside world.

This is the role that the SSL-terminating load balancer is designed to fill in a secure web service deployment: every incoming session must be accepted, SSL-terminated, and transparently handed off to a back-end server system as quickly as possible. Unfortunately, this means that the load balancer is a concentration point, and potential bottleneck, for incoming traffic.

This case study examines the impact of the Intel® Xeon® E5 v4 processor family and the ADCX, ADOX, and MULX instructions on the SSL handshake. By using the optimized algorithms inside OpenSSL*, this processor can significantly increase the load capacity of the Open Source load balancer, haproxy*.

Background

The goal of this case study is to examine the impact of code optimized for the Intel Xeon v4 line of processors on the performance of the haproxy load balancer.

The Xeon v4 line of processors adds two new instructions, ADCX and ADOX. These are extensions of the ADC instruction but are designed to support two separate carry chains. They are defined as follows:

adcx dest/src1, src2
adox dest/src1, src2

They differ from the ADC instruction in how they make use of the flags. Both instructions compute the sum of src1 and src2 plus a carry-in, and generate an output sum dest and a carry-out; however, the ADCX instruction uses the CF flag for carry-in and carry-out (leaving the OF flag unchanged), while the ADOX instruction uses the OF flag for carry-in and carry-out (leaving the CF flag unchanged).

These instructions allow the developer to maintain two independent carry chains, which the CPU can process and optimize within a single hardware thread in order to increase instruction-level parallelism.
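The effect of the two independent chains is easiest to see in a fragment of multi-precision arithmetic. The following is an illustrative sketch only, not code taken from OpenSSL: it adds one multiplier word (in rdx) times a two-word operand into a running accumulator, keeping the low halves of the products on the CF chain (ADCX) and the high halves on the OF chain (ADOX), so the two sequences can be interleaved without serializing on a single flag:

    ; Illustrative fragment only. rdx = multiplier word b, rsi -> two-word
    ; operand a, r12/r13/r14 = accumulator words acc0/acc1/acc2, rbx = 0.
    xor   eax, eax            ; clears both CF and OF, starting both chains at zero
    mulx  r9,  r8,  [rsi]     ; r9:r8 = b * a[0]; MULX does not touch the flags
    adcx  r12, r8             ; acc0 += lo(b*a[0])        (CF chain)
    adox  r13, r9             ; acc1 += hi(b*a[0])        (OF chain)
    mulx  r11, r10, [rsi+8]   ; r11:r10 = b * a[1]
    adcx  r13, r10            ; acc1 += lo(b*a[1]) + CF   (CF chain)
    adox  r14, r11            ; acc2 += hi(b*a[1]) + OF   (OF chain)
    adcx  r14, rbx            ; fold the final CF into acc2 (rbx is zero)
    ; in a full row the remaining OF would carry into the next accumulator word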

Combined with the MULX instruction, which was introduced with the Xeon v3 line of processors, certain algorithms for large integer arithmetic can be greatly accelerated. For more information, see the white paper “New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors”.

Public key cryptography is one application that benefits significantly from these enhancements. This case study looks at the impact on the RSA and ECDSA algorithms in the SSL/TLS handshake: the faster the handshake can be performed, the more handshakes the server can handle, and the more connections per second that can be SSL-terminated and handed off to back-end servers.

The Test Environment

The performance limits of haproxy were tested for various TLS cipher suites by generating a large number of parallel connection requests, and repeating those connections as fast as possible for a total of two minutes. At the end of those two minutes, the maximum latency across all requests was examined, as was the resulting connection rate to the haproxy server. The number of simultaneous connections was adjusted between runs to find the maximum connection rate that haproxy could sustain for the duration without session latencies exceeding 2 seconds. This latency limit was taken from the research paper “A Study on tolerable waiting time: how long are Web users willing to wait?”, which concluded that two seconds is the maximum acceptable delay in loading a small web page.

In order to make a comparison between the current and previous generation of Xeon processors, haproxy v1.6.3 was installed on the following server systems:

Table 1. Test server configurations

                   Server 1                     Server 2
CPU                Intel® Xeon® E5-2697 v3      Intel® Xeon® E5-2699 v4
Sockets            2                            2
Cores/Socket       14                           22
Frequency          2.10 GHz                     2.10 GHz
Memory             64 GB                        64 GB
Hyper-Threading    Off                          Off
Turbo Boost        Off                          Off

haproxy is a popular, feature-rich, and high-performance Open Source load balancer and reverse proxy for TCP applications, with specific features designed for handling HTTP sessions. More information on haproxy can be found at http://www.haproxy.org.

The SSL capabilities for haproxy were provided by the OpenSSL library. OpenSSL is an Open Source library that implements the SSL and TLS protocols in addition to general purpose cryptographic functions. The 1.0.2 branch is enabled for the Intel Xeon v4 processor and supports the ADCX, ADOX, and MULX instructions in selected public key cryptographic algorithms. More information on OpenSSL can be found at http://www.openssl.org.

The server load was generated by up to six client systems as needed, on a mixture of Xeon E5 v2 and Xeon E5 v3 class hardware. All systems were connected together using a 40 Gbps switch.

The high-level network diagram for the test environment is shown in Figure 1.


Figure 1. Test network diagram.

The actual server load was generated using multiple instances of the Apache* Benchmark tool, ab, an Open Source utility that is included in the Apache server distribution. A single instance of Apache Benchmark was not able to create a load sufficient to reach the server’s limits, so it had to be split across multiple processors and, due to client CPU demands, across multiple hosts.

Because each Apache Benchmark instance is completely self-contained, however, there is no built-in mechanism for distributed execution. A synchronization server and client wrapper were written to coordinate the launching of multiple instances of ab across the load clients, their CPUs, and their network interfaces, and then collate the results.

The Test Plan

The goal of the test was to determine the maximum load in connections per second that haproxy could sustain over two minutes of repeated, incoming connection requests, and to compare the Xeon v4-optimized code (which makes use of the ADCX, ADOX, and MULX instructions) against previous-generation code that used a pure AVX2 implementation, running on Xeon v3.

To eliminate as many outside variables as possible, all incoming requests to haproxy were for its internal status page, as configured by the monitor-uri parameter in its configuration file. This meant haproxy did not have to depend on any external servers, networks or processes to handle the client requests. This also resulted in very small page fetches so that the TLS handshake dominated the session time.

To further stress the server, the keep-alive function was left off in Apache Benchmark, forcing all requests to establish a new connection to the server and negotiate their own sessions.

The key exchange algorithms that were tested are given in Table 2.

Table 2. Selected key exchange algorithms

Key Exchange        Certificate Type
ECDHE-RSA-2048      RSA, 2048-bit
ECDHE-RSA-4096      RSA, 4096-bit
ECDHE-ECDSA         ECC, NIST P-256

Since the bulk encryption and cryptographic signing were not a significant part of the session, they were fixed at AES with a 128-bit key and SHA-1, respectively. Varying AES key size, AES encryption mode, or SHA hashing scheme would not have an impact on the results.
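For illustration, a single client invocation for one of these runs would look like the following, where the host name is a placeholder and the -c value is the concurrency that was tuned between runs to stay under the 2-second latency limit:

% ab -t 120 -c 400 -Z ECDHE-ECDSA-AES128-SHA https://haproxy-host/test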

Tests for each cipher were run with only one active core per socket (two cores active per server). Reducing the systems in this manner, the minimum configuration allowed in the test systems, effectively simulates low-core-count systems and ensures that haproxy performance is limited by the CPU rather than other system resources. These measurements can be used to estimate the overall performance per core, as well as estimate the performance of a system with many cores.

The ECDHE-ECDSA cipher was tested both at 2 cores and with multiple cores, doubling the core count at the conclusion of each test, topping out at 44 cores (the maximum core count supported by the system). This tested the full capabilities of the system, examining how well the performance scaled to a many-core deployment and also introduced the possibility of system resource limits beyond just CPU utilization.

System Configuration and Tuning

Haproxy was configured to operate in multi-process mode, with one worker for each physical thread on the system.

An excerpt from the configuration file, haproxy.conf, is shown in Figure 2.

global
       daemon
       pidfile /var/run/haproxy.pid
       user haproxy
       group haproxy
       crt-base /etc/haproxy/crt
       # Adjust to match the physical number of threads
       # including threads available via Hyper-Threading
       nbproc 44
       tune.ssl.default-dh-param 2048

defaults
       mode http
       timeout connect 10000ms
       timeout client 30000ms
       timeout server 30000ms

frontend http-in
       bind :443 ssl crt combined-rsa_4096.crt
       # Uncomment to use the ECC certificate
       # bind :443 ssl crt combined-ecc.crt
       monitor-uri /test
       default_backend servers


Figure 2. Excerpt from haproxy configuration

To support the large number of simultaneous connections, some system and kernel tuning was necessary. First, the number of file descriptors was increased via /etc/security/limits.conf:

* soft nofile 200000
* hard nofile 300000

Figure 3. Excerpt from /etc/security/limits.conf

And several kernel parameters were adjusted (some of these settings are more relevant to bulk encryption):

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30

# Increase system IP port limits to allow for more connections

net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_window_scaling = 1

# number of packets to keep in backlog before the kernel starts
# dropping them
net.ipv4.tcp_max_syn_backlog = 3240000

# increase socket listen backlog
net.ipv4.tcp_max_tw_buckets = 1440000

# Increase TCP buffer sizes
net.core.rmem_default = 8388608
net.core.wmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 16777216 16777216 16777216
net.ipv4.tcp_wmem = 16777216 16777216 16777216

Figure 4. Excerpt from /etc/sysctl.conf

Some of these parameters are very aggressive, but the assumption is that this system is a dedicated load-balancer and SSL/TLS terminator.

No other adjustments were made to the stock SLES 12 server image.

Results

The results for the 2-core case are shown in Figure 5 and Figure 6.


Figure 5. Maximum connection rate for haproxy by cipher (2 cores)


Figure 6. haproxy performance increase for Intel® Xeon® E5 v4 over Intel® Xeon® E5 v3 by cipher

All three of these ciphers show significant gains over the previous generation of Xeon processors. The largest increase comes from using RSA signatures with 4096-bit keys. This is because the AVX algorithm for RSA in OpenSSL contains a special-case code path when the key size is 2048 bits. At other key sizes, a generic algorithm is used, and the 4096-bit key results in an over 60% gain in performance when moving from the Xeon v3 family to the Xeon v4 family of processors.

The ECDSA-signed cipher sees a performance boost of over 11%.

In Figure 7, which uses logarithmic scales (base 2), we see how the performance of haproxy scales with the number of CPU cores when using the ECDHE-ECDSA cipher.


Figure 7. Scaling of maximum haproxy performance on Intel® Xeon® E5-2699 v4 for the ECDHE+ECDSA cipher

Performance scales linearly with the core count up to about 16 cores, and starts leveling off shortly after that. Above 16 cores, the CPUs are no longer 100% utilized, indicating that our overall system performance is not CPU-limited. With this many cores active, we have exceeded the performance limits of the interrupt-based network stack. The maximum sustainable connection rate reached with all 44 cores active is just under 54,000 connections/sec, but the maximum CPU utilization never exceeds 70%.

Conclusions

The optimizations for the Xeon E5 v4 processor result in significant performance gains for the haproxy load balancer using the selected ciphers. Each key exchange algorithm realized some benefit, ranging from 9% to over 60%.

Arguably most impressive, however, is the raw performance achieved by the ECDHE-ECDSA cipher. Since it provides perfect forward secrecy, there is simply no reason for a Xeon E5 v4 server to use a straight RSA key exchange. The ECDHE-ECDSA cipher not only offers this added security, it also delivers significantly higher performance. This does come at the cost of added load on the client, but the key exchange in TLS only takes place at session setup time, so this is not a significant burden for the client to bear.

On a massively parallel scale, haproxy can maintain connection rates exceeding 53,000 connections/second using the ECDHE-ECDSA cipher, and do so without fully utilizing the CPU. This is on an out-of-the-box Linux distribution with only minimal system and kernel tuning. It is conceivable that even higher connection rates could be achieved if the system could be optimized to remove the non-CPU bottlenecks, a task beyond the scope of this study.

Additional Resources

Accelerating SSL Load Balancers with Intel® Xeon® v3 Processors

About the Author

John Machalas has worked for Intel since 1994 and spent most of those years as a UNIX systems administrator and systems programmer. He is now an application engineer working primarily with security technologies.
