Introduction to data compression on the web

Optimize the size and speed of your web application thanks to the correct use of data compression.

Introduction

Web compression is a small and often misunderstood feature that we rely on constantly to send and receive data faster while transferring fewer bytes.

In this article we will see how it works and the different options available to us.

Negotiation

The client and server agree on which compression method to use through negotiation.

Accept-Encoding

When the client requests a web resource from the server, it can tell which compression formats it wants to use. It does this through the Accept-Encoding header (RFC 7231, section 5.3.4), indicating the compression algorithms it accepts. Example:

GET /encrypted-area HTTP/1.1
Host: www.example.com
Accept-Encoding: gzip, deflate

In this case the client tells the server that it accepts the gzip and deflate compression methods. Strictly speaking, the order of the list carries no weight in the specification: without q-values both codings are equally acceptable, and the server picks the one it prefers.

To express the preference explicitly, the ;q= modifier can be used, indicating a quality value (q-value). This expresses the priority as a weight ranging from 0 to 1, where 0 means the coding is not acceptable. Example:

GET /encrypted-area HTTP/1.1
Host: www.example.com
Accept-Encoding: deflate;q=0.8, gzip;q=1.0

In this case, even though gzip is second in the list, it would be first in priority.
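To make the weighting concrete, here is a minimal sketch of how such a header could be read. The parseAcceptEncoding helper is hypothetical (it is not part of Node.js or of any library) and it ignores malformed input; it only illustrates that the q-value, not the position, decides the ranking.

```javascript
// Hypothetical helper: turn an Accept-Encoding value into a list of codings
// sorted by q-value, highest first.
function parseAcceptEncoding(header) {
  return header
    .split(',')
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => {
      const [name, ...params] = part.split(';').map((s) => s.trim());
      let q = 1; // a coding without an explicit q-value weighs 1
      for (const param of params) {
        const match = /^q=([0-9.]+)$/i.exec(param);
        if (match) q = parseFloat(match[1]);
      }
      return { name: name.toLowerCase(), q };
    })
    .sort((a, b) => b.q - a.q); // sort by weight, highest first
}

console.log(parseAcceptEncoding('deflate;q=0.8, gzip;q=1.0'));
// gzip (q = 1) is listed before deflate (q = 0.8)
```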

Content-Encoding

After the request, the response arrives. In this case the server decides which compression method to use. Actually, the server can do as it pleases, but it would normally respect the client's wishes and use the compression algorithm with the highest priority specified by the client. In the above example it would be gzip. If gzip were not available, deflate would be used, and so on. If none of them is available, the server should not compress the content.
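The selection described above could look roughly like this on the server side. This is a sketch under assumptions: chooseEncoding is a hypothetical name, the supported list stands for whatever the server can produce, and a missing header is treated as "do not compress" for simplicity.

```javascript
// Hypothetical helper: pick the response coding from what the client accepts
// and what the server supports (listed in the server's own order of preference).
function chooseEncoding(acceptEncoding, supported) {
  if (!acceptEncoding) return 'identity'; // simplification: no header, no compression

  // Collect the weight the client gave to each coding.
  const weights = new Map();
  for (const part of acceptEncoding.split(',')) {
    const [name, ...params] = part.split(';').map((s) => s.trim().toLowerCase());
    if (!name) continue;
    const qParam = params.find((p) => p.startsWith('q='));
    weights.set(name, qParam ? parseFloat(qParam.slice(2)) : 1);
  }

  let best = 'identity'; // fall back to an uncompressed response
  let bestQ = 0;
  for (const coding of supported) {
    // "*" stands for any coding the client did not list explicitly
    const q = weights.has(coding) ? weights.get(coding) : (weights.get('*') || 0);
    if (q > bestQ) {
      best = coding;
      bestQ = q;
    }
  }
  return best;
}

console.log(chooseEncoding('deflate;q=0.8, gzip;q=1.0', ['br', 'gzip', 'deflate']));
console.log(chooseEncoding('deflate;q=0.8, gzip;q=1.0', ['deflate']));
console.log(chooseEncoding('compress', ['gzip']));
```

With the header from the example, the first call settles on gzip, the second falls back to deflate and the third returns identity, that is, an uncompressed response.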

Apart from compressing the response body, the server adds the Content-Encoding header (RFC 7231, section 3.1.2.2) in which it specifies the compression algorithm that was finally used. Example:

HTTP/1.1 200 OK
Content-Length: 309
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip

Do not confuse Content-Type with Content-Encoding. Even if the response is compressed, it still has a format, such as text/html, video/mp4 or image/jpeg. Therefore, do not forget to specify the Content-Type header to indicate the content of the message, whether it is compressed or not.
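As an illustration of how the two headers live side by side, the following sketch compresses an HTML body with Node's zlib module and builds the matching headers. The gzipHtmlResponse function is a hypothetical helper written for this example, not an existing API.

```javascript
const zlib = require('zlib');

// Hypothetical helper: compress an HTML body and build the matching headers.
function gzipHtmlResponse(html) {
  const body = zlib.gzipSync(html);
  return {
    headers: {
      'Content-Type': 'text/html; charset=UTF-8', // what the content is
      'Content-Encoding': 'gzip',                 // how it was compressed
      'Content-Length': body.length,              // size of the compressed bytes
    },
    body,
  };
}

const response = gzipHtmlResponse('<h1>Hello</h1>');
console.log(response.headers);
```

Inside a handler of Node's http module, the result could be sent with res.writeHead(200, response.headers) followed by res.end(response.body). Note that Content-Length refers to the compressed size, not to the original one.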

zlib

Jean-loup Gailly and Mark Adler created the zlib compression library in 1995. The zlib format (specified in RFC 1950) was born with the same name. It consists of a header, a body compressed by a compression algorithm (usually deflate, although the format leaves room for others) and, at the end, an Adler-32 checksum.

Thus, the zlib library produced files in zlib format.
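The three parts of the format can be observed directly from Node.js. The sketch below splits a zlib stream into header, body and checksum, and recomputes the Adler-32 value by hand. The adler32 function is written here for illustration (two running sums modulo 65521, as RFC 1950 describes); it is not an export of the zlib module.

```javascript
const zlib = require('zlib');

// Adler-32 as described in RFC 1950: two running sums modulo 65521.
function adler32(buf) {
  let a = 1;
  let b = 0;
  for (const byte of buf) {
    a = (a + byte) % 65521;
    b = (b + a) % 65521;
  }
  return ((b << 16) | a) >>> 0; // force an unsigned 32-bit result
}

const input = Buffer.from('foo');
const stream = zlib.deflateSync(input);

const header = stream.subarray(0, 2);                    // first 2 bytes
const body = stream.subarray(2, stream.length - 4);      // deflate data
const checksum = stream.readUInt32BE(stream.length - 4); // last 4 bytes, big-endian

console.log('header:  ', header.toString('hex'));
console.log('body:    ', body.toString('hex'));
console.log('checksum:', checksum.toString(16));
console.log('matches: ', checksum === adler32(input));
```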

This brief information will be useful to understand the various compression algorithms.

Compression formats

There are a few compression formats but without a doubt gzip dominates the web.

Possible values in the Accept-Encoding and Content-Encoding headers are: deflate, gzip, br, identity, compress and *.

Let's take a look at them.

deflate

The case of deflate is curious as well as confusing.

RFC 2616 defines deflate as:

The "zlib" format defined in RFC 1950 in combination with the "deflate" compression mechanism described in RFC 1951.

Let's take it step by step to try to understand it better.

It turns out that according to RFC 2616, deflate is equivalent to the zlib format, compressing the body using the deflate algorithm (i.e. no other algorithm can be used).

Confusing, isn't it? It turns out that deflate is both a format and an algorithm used within that format.

This also confused the Microsoft engineers who implemented the format, so for years browsers could not know whether deflate referred to the format or to the compression algorithm and had to work by elimination: if it wasn't one, it was the other.

This ambiguity in defining the zlib/deflate format caused the gzip format to spread rapidly and become the king of web compression.

To understand it better, let's look at the Node.js implementation of the zlib library.

It turns out that we have these two classes:

  • zlib.Deflate
  • zlib.DeflateRaw

You may have guessed it by now, but zlib.Deflate refers to the deflate format (including the header and checksum), while zlib.DeflateRaw refers to the compression algorithm itself, without any header or wrapper of any kind.

How to differentiate Deflate and DeflateRaw

First, let's see what happens when we compress the foo string using both methods:

zlib.deflateSync('foo')
<Buffer 78 9c 4b cb cf 07 00 02 82 01 45>

zlib.deflateRawSync('foo')
<Buffer 4b cb cf 07 00>

The result of deflate as a format produces a header with bytes 78 9c, a body containing bytes 4b cb cf 07 00 (in common with deflate as an algorithm) and a checksum with the bytes 02 82 01 45.

It is in the header bytes, more specifically in the first byte (78 in our example) where the distinction lies.

A byte is made up of two nibbles. Since 78 is a hexadecimal value, the 7 is one nibble and the 8 is the other: the first is the high (more significant) nibble and the second is the low (less significant) one.

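If the low nibble (the second one) is an 8, then we are most likely dealing with the deflate/zlib format, because in a zlib header that nibble holds the compression method and 8 stands for deflate. If it is not an 8, we are dealing with the raw compression algorithm. As a stricter check, RFC 1950 also requires the first two bytes, read as a 16-bit big-endian number, to be a multiple of 31 (0x789c is 30876, that is 31 × 996).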

Therefore, in our example we can clearly see which is a deflate format and which is a deflate compression algorithm (known in the zlib implementation of Node.js as deflateRaw).
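The nibble rule can be turned into a small detection routine, roughly what a client has to do when it receives deflate content of unknown flavour. Both function names are hypothetical. The additional multiple-of-31 test comes from the header check defined in RFC 1950 and is included here as an extra safeguard.

```javascript
const zlib = require('zlib');

// Hypothetical helper: guess whether a buffer starts with a zlib (RFC 1950) header.
function looksLikeZlib(buf) {
  if (buf.length < 2) return false;
  const lowNibbleIs8 = (buf[0] & 0x0f) === 8;              // compression method: deflate
  const checkBitsOk = ((buf[0] << 8) | buf[1]) % 31 === 0; // header check from RFC 1950
  return lowNibbleIs8 && checkBitsOk;
}

// Hypothetical helper: inflate with the decoder that matches the guess.
function inflateEither(buf) {
  return looksLikeZlib(buf) ? zlib.inflateSync(buf) : zlib.inflateRawSync(buf);
}

const asFormat = zlib.deflateSync('foo');       // starts with 78 9c
const asAlgorithm = zlib.deflateRawSync('foo'); // starts with 4b

console.log(looksLikeZlib(asFormat), inflateEither(asFormat).toString());
console.log(looksLikeZlib(asAlgorithm), inflateEither(asAlgorithm).toString());
```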

gzip

Built on the deflate algorithm (a combination of LZ77 and Huffman coding), gzip is an open format that, like the zlib format, wraps the compressed message in a header and a checksum, although the checksum is CRC-32.

The gzip wrapper is larger than the zlib one: a header of at least 10 bytes plus an 8-byte trailer, against 2 and 4 bytes in zlib (12 bytes more in total). In addition, the CRC-32 checksum is slower to generate than Adler-32, but these drawbacks are hardly noticeable with today's powerful devices. This, plus the fact that it avoids the deflate format/algorithm confusion, has made gzip the ideal choice to adopt when it comes to web compression.
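The size difference between the wrappers is easy to check with the same foo string used earlier. This is only a quick measurement with Node's zlib module and default options. The last line reads the final four bytes of the gzip stream, which hold the length of the original data.

```javascript
const zlib = require('zlib');

const raw = zlib.deflateRawSync('foo'); // deflate body only
const zl = zlib.deflateSync('foo');     // zlib wrapper around the same body
const gz = zlib.gzipSync('foo');        // gzip wrapper around the same body

console.log('raw deflate:', raw.length, 'bytes');
console.log('zlib:       ', zl.length, 'bytes');
console.log('gzip:       ', gz.length, 'bytes');
console.log('gzip starts with:', gz.subarray(0, 2).toString('hex')); // magic number 1f 8b
console.log('extra bytes versus zlib:', gz.length - zl.length);
console.log('original length stored in trailer:', gz.readUInt32LE(gz.length - 4));
```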

Brotli

Brotli is a compression format created by Google in 2015 to compress web fonts in the WOFF2 format, but it quickly found its place in the web compression field because of its advantages. It uses its own compression algorithm based on LZ77.

Its browser compatibility is currently at 62%, more than enough considering that if Brotli were not available, the next one on the list would be used.

Update 2023: Today its compatibility is practically total, standing at 96.45%.

In the Accept-Encoding and Content-Encoding HTTP headers it is identified by the value br.

As for the advantages over gzip, it promises around 20% extra compression while being practically as fast in compression/decompression.
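Node.js ships Brotli in its zlib module (brotliCompressSync and brotliDecompressSync), so the comparison can be tried locally. The sample text below is made up for the example and highly repetitive, so the numbers it prints say little about real pages; actual gains depend on the content and on the compression level chosen.

```javascript
const zlib = require('zlib');

// Made-up sample text, only to have something to compress.
const text = Buffer.from('Web compression makes responses smaller. '.repeat(200));

const gz = zlib.gzipSync(text);
const br = zlib.brotliCompressSync(text);

console.log('original:', text.length, 'bytes');
console.log('gzip:    ', gz.length, 'bytes');
console.log('brotli:  ', br.length, 'bytes');

// Round trip: decompressing gives back the original bytes.
console.log('round trip ok:', zlib.brotliDecompressSync(br).equals(text));
```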

* (asterisk)

When you use the * value you are indicating to the server that data compression is accepted but no format preference is specified.

This is the default value when we omit the Accept-Encoding header.

identity

The function of identity is to explicitly tell the server that no compression is desired.

compress

This format based on the LZW algorithm fell into disuse due to patent issues and there are hardly any browsers left that support it.

Zstandard

Facebook has an open source project called Zstandard (zstd), which promises better results than Brotli, both in compression and speed. Unfortunately, at the time of writing no browser supported it, so it was not yet a practical option for web compression. Still, it deserves a mention.

Compressing the compressed

There are file formats that may already be compressed such as some PNG and GIF images or web fonts in WOFF format.

In case this data was already compressed, compressing it again would be a waste of resources, since the final size would be even bigger (adding an unnecessary header and checksum) and also the server would be spending CPU cycles compressing something that isn't going to bring any benefit.

It is also not favorable to compress the same file twice. If something is already compressed, better to leave it that way.
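The effect is easy to reproduce. In the sketch below, random bytes act as a stand-in for data that is already compressed (an assumption made for the example, since both are hard to shrink further), and shouldCompress is a hypothetical helper with a deliberately short list of content types taken from the formats mentioned above.

```javascript
const zlib = require('zlib');
const crypto = require('crypto');

// Random bytes as a stand-in for already compressed data.
const noise = crypto.randomBytes(4096);
const gzippedNoise = zlib.gzipSync(noise);

console.log('before:', noise.length, 'bytes');
console.log('after: ', gzippedNoise.length, 'bytes'); // larger than the input

// Hypothetical helper: skip content types that are compressed by design.
const ALREADY_COMPRESSED = new Set(['image/png', 'image/gif', 'font/woff', 'font/woff2']);

function shouldCompress(contentType) {
  // Drop parameters such as "; charset=UTF-8" before comparing.
  const type = contentType.split(';')[0].trim().toLowerCase();
  return !ALREADY_COMPRESSED.has(type);
}

console.log(shouldCompress('text/html; charset=UTF-8'));
console.log(shouldCompress('image/png'));
```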

Conclusion

Smartly applying data compression in our web applications can save us a lot of data transfer while getting extra speed for our users.

Regarding the formats, it would be optimal to make use of new options such as Brotli to optimize resources as much as possible. Since web compression degrades gracefully, there is no reason to leave out older browsers, so we can use gzip as a second option and let the client choose.
