Web compression: that small, often misunderstood feature we constantly rely on to send and receive data faster while transferring less of it.
In this article we will see how it works and which options are available to us.
The client and server agree on which compression method to use through negotiation.
When the client requests a web resource from the server, it can state which compression formats it is willing to accept. It does this through the Accept-Encoding header (RFC 7231, section 5.3.4), listing the compression algorithms it supports. Example:
Accept-Encoding: gzip, deflate
In this case the client tells the server that it accepts the gzip and deflate compression methods. The order is usually read as the client's preference (gzip would be the first option to use and deflate the second).
To state the preference explicitly, the ;q= modifier can also be used to assign a quality value (q-value). It expresses the priority as a weight between 0 and 1. Example (illustrative values):
Accept-Encoding: deflate;q=0.5, gzip;q=1.0
In this case, even though gzip is second in the list, it would be first in priority.
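As a rough illustration of how q-values translate into an order of preference, here is a minimal sketch of a parser. The function name parseAcceptEncoding is made up for this example (it is not part of Node.js), and a real server would need more careful validation.

```javascript
// Hypothetical helper: turn an Accept-Encoding header into a list of
// encoding names sorted by q-value, highest first.
function parseAcceptEncoding(header) {
  return header
    .split(',')
    .map((part, index) => {
      const [name, ...params] = part.trim().split(';');
      // A missing q parameter means q=1.
      let q = 1;
      for (const param of params) {
        const [key, value] = param.trim().split('=');
        if (key.toLowerCase() === 'q') q = Number.parseFloat(value);
      }
      return { name: name.trim().toLowerCase(), q, index };
    })
    // q=0 means "not acceptable"; malformed q-values are dropped as well.
    .filter((entry) => entry.name !== '' && entry.q > 0)
    // Higher q first; ties keep the order in which the client listed them.
    .sort((a, b) => b.q - a.q || a.index - b.index)
    .map((entry) => entry.name);
}

console.log(parseAcceptEncoding('deflate;q=0.5, gzip;q=1.0')); // [ 'gzip', 'deflate' ]
console.log(parseAcceptEncoding('gzip, deflate'));             // [ 'gzip', 'deflate' ]
```

Keeping the listed order for ties is a choice made in this sketch, not something the header format requires.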
After the request, the response arrives, and here it is the server that decides which compression method to use. Strictly speaking, the server can do as it pleases, but it would normally respect the client's wishes and use the compression algorithm with the highest priority specified by the client. In the above example it would be gzip. If gzip were not available, deflate would be used, and so on. If none of them is available, the server should not compress the content.
Apart from compressing the response body, the server adds the Content-Encoding header (RFC 7231, section 3.1.2.2), which specifies the compression algorithm that was finally used. Example:
Content-Encoding: gzip
Do not confuse Content-Type with Content-Encoding. Even if the response is compressed, it still has a format, such as image/jpeg. Therefore, do not forget to specify the Content-Type header to indicate the type of content in the message, whether it is compressed or not.
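To make the server side more concrete, here is a small sketch of how a response could be built once the client's preferences are known. The helper buildResponse and its parameters are hypothetical names for this example; only the node:zlib calls are real Node.js APIs.

```javascript
const zlib = require('node:zlib');

// Compressors supported by this sketch, keyed by Content-Encoding value.
const compressors = new Map([
  ['gzip', zlib.gzipSync],
  ['deflate', zlib.deflateSync], // zlib format, which is what HTTP calls "deflate"
]);

// Hypothetical helper: `preferences` is the client's list, best option first.
function buildResponse(body, contentType, preferences) {
  // Content-Type describes the content itself, compressed or not.
  const headers = { 'Content-Type': contentType };
  const encoding = preferences.find((name) => compressors.has(name));
  if (!encoding) {
    // Nothing in common: send the body as is, without Content-Encoding.
    return { headers, body: Buffer.from(body) };
  }
  headers['Content-Encoding'] = encoding;
  return { headers, body: compressors.get(encoding)(body) };
}

const response = buildResponse('<h1>Hello</h1>', 'text/html; charset=utf-8', ['gzip', 'deflate']);
console.log(response.headers);
```

Note that both headers travel together: Content-Encoding says how the body was compressed, while Content-Type keeps describing what the body is.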
Jean-loup Gailly and Mark Adler created the zlib compression library in 1995. Along with it came the zlib format of the same name (specified in RFC 1950), which consists of a header, a body compressed by a compression algorithm (usually deflate, although the format leaves room for others) and, finally, an Adler-32 checksum.
Thus, the zlib library produced files in the zlib format.
This brief background will be useful for understanding the various compression algorithms.
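There are several compression formats, but without a doubt gzip dominates the web.
The values covered in this article, which can appear in the Accept-Encoding or Content-Encoding headers, are: deflate, gzip, br, *, identity and compress.
Let's take a look at them.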
The case of deflate is as curious as it is confusing.
RFC 2616 defines it as:
The "zlib" format defined in RFC 1950 in combination with the "deflate" compression mechanism described in RFC 1951.
Let's take it step by step to understand it better.
According to RFC 2616, deflate is equivalent to the zlib format with the body compressed using the deflate algorithm (i.e. no other algorithm can be used).
Confusing, isn't it? It turns out that deflate is both the name of a format and the name of the algorithm used within that format.
This also confused the Microsoft engineers who implemented the format, so for years browsers could not know whether deflate referred to the format or to the compression algorithm, and they worked by elimination: if it wasn't one, it was the other.
This ambiguity in the definition of the deflate format led the gzip format to spread rapidly and become the king of web compression.
To understand it better, let's look at the zlib module of Node.js, which is built on the zlib library.
Among others, it offers these two classes:
- zlib.Deflate: Compress data using deflate.
- zlib.DeflateRaw: Compress data using deflate, and do not append a zlib header.
You may have guessed it by now: the zlib.Deflate class corresponds to the deflate format (including the header and checksum), while the zlib.DeflateRaw class corresponds to the compression algorithm itself, without any header or wrapper of any kind.
First, let's see what happens when we compress the string foo using both.
The result of deflate as a format is a header with bytes 78 9c, a body with bytes 4b cb cf 07 00 (which is also the entire output of deflate as an algorithm) and a checksum with bytes 02 82 01 45.
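The comparison can be reproduced with a few lines of Node.js. This is a sketch that uses the synchronous convenience functions zlib.deflateSync and zlib.deflateRawSync instead of the classes, with default settings; the hex helper is only there for display.

```javascript
const zlib = require('node:zlib');

const asFormat = zlib.deflateSync('foo');       // zlib format: header + body + Adler-32
const asAlgorithm = zlib.deflateRawSync('foo'); // bare deflate stream, no wrapper

// Print a buffer as space-separated hex bytes.
const hex = (buffer) => buffer.toString('hex').match(/../g).join(' ');

console.log('deflate (format):   ', hex(asFormat));
console.log('deflate (algorithm):', hex(asAlgorithm));

// The zlib output is the raw stream wrapped in 2 header bytes and 4 checksum bytes.
console.log(asFormat.subarray(2, -4).equals(asAlgorithm)); // true
```

The last line strips the wrapper from the zlib output and checks that what remains is exactly the raw deflate stream.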
The distinction lies in the header bytes, more specifically in the first byte (78 in our example).
A byte is made up of two nibbles. Since 78 is a hexadecimal value, the 7 is one nibble and the 8 is the other. The first is the high (more significant) nibble and the second is the low (less significant) one.
If the low nibble (the second one) is an 8, we are dealing with the zlib format, while if it is not an 8, we are dealing with the bare compression algorithm. In practice this rule is reliable, because in a zlib header that nibble holds the compression method and 8 is the value assigned to deflate (RFC 1950).
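The nibble rule can be sketched as a small check. The function name looksLikeZlib is invented for this example, and the extra test on the first two bytes goes beyond the rule described above: it comes from RFC 1950, which requires them, read as a 16-bit number, to be a multiple of 31.

```javascript
// Hypothetical helper: guess whether a buffer starts with a zlib header.
function looksLikeZlib(buffer) {
  if (buffer.length < 2) return false;
  // Low nibble of the first byte: compression method, 8 means deflate.
  const lowNibble = buffer[0] & 0x0f;
  // RFC 1950 check: the first two bytes, read big-endian, must be a multiple of 31.
  const checkOk = ((buffer[0] << 8) | buffer[1]) % 31 === 0;
  return lowNibble === 8 && checkOk;
}

// The two outputs from the foo example.
console.log(looksLikeZlib(Buffer.from('789c4bcbcf070002820145', 'hex'))); // true
console.log(looksLikeZlib(Buffer.from('4bcbcf0700', 'hex')));             // false
```

This is only a heuristic for telling the two apart, which is roughly what browsers had to do for years.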
Therefore, in our example we can clearly tell which output is the deflate format and which is the deflate compression algorithm (known in the zlib implementation of Node.js as zlib.DeflateRaw).
Based on the deflate algorithm, gzip is an open format that compresses the message using deflate (a combination of LZ77 and Huffman coding) and adds, like the zlib format, a header and a checksum, although the checksum is a CRC-32.
The gzip wrapper is larger than the zlib one: at least 18 bytes (a 10-byte header plus an 8-byte trailer) compared with 6 bytes (a 2-byte header plus a 4-byte checksum), that is, 12 bytes more in total. In addition, a CRC-32 checksum is slower to compute than Adler-32, but these drawbacks are hardly noticeable on today's powerful devices. This, plus the fact that it avoids the deflate format/algorithm confusion, has made gzip the ideal choice for web compression.
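The size difference between the two wrappers is easy to check. The following sketch compresses the same foo string with both formats using Node's default settings, where gzip writes its minimal 10-byte header.

```javascript
const zlib = require('node:zlib');

const asZlib = zlib.deflateSync('foo'); // zlib format (HTTP "deflate")
const asGzip = zlib.gzipSync('foo');    // gzip format

// Every gzip stream starts with the magic number 1f 8b.
console.log(asGzip.subarray(0, 2).toString('hex')); // 1f8b

console.log('zlib:', asZlib.length, 'bytes');
console.log('gzip:', asGzip.length, 'bytes');

// Same compressed body, so the difference is pure wrapper overhead.
console.log(asGzip.length - asZlib.length); // 12
```

For a three-byte input the overhead dominates, but on a real page those 12 extra bytes are negligible.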
Brotli is a compression format created by Google, initially to compress web fonts in the WOFF2 format, and published for general-purpose use in 2015. It quickly found its place in the web compression field because of its advantages. It uses its own compression algorithm based on LZ77 and Huffman coding.
At the time of writing, its browser compatibility was 62%, more than enough considering that if Brotli is not available, the next format on the list is used.
Update 2023: Today its compatibility is practically total, standing at 96.45%.
In the HTTP headers Accept-Encoding and Content-Encoding, it must be specified with the value br.
As for the advantages over gzip, it promises around 20% extra compression while being practically as fast in compression/decompression.
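If you want to check the difference on your own content, Node.js ships Brotli support in the same zlib module. The sketch below uses a made-up, very repetitive sample, so the sizes it prints say little about real pages; measure with your own assets.

```javascript
const zlib = require('node:zlib');

// Sample input invented for this sketch; real content gives different ratios.
const input = Buffer.from('<li class="item">Hello, web compression!</li>\n'.repeat(200));

const gzip = zlib.gzipSync(input);
const brotli = zlib.brotliCompressSync(input);

console.log('original:', input.length, 'bytes');
console.log('gzip:    ', gzip.length, 'bytes');
console.log('br:      ', brotli.length, 'bytes');

// Both formats are lossless, so decompressing returns the original bytes.
console.log(zlib.brotliDecompressSync(brotli).equals(input)); // true
```

Both calls use the library defaults. Brotli's default quality is its highest, which is slower to compress, so servers that compress on the fly often pick a lower level.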
When you use the * value, you are telling the server that data compression is accepted but that you have no format preference.
This is the default behaviour when the Accept-Encoding header is omitted.
The function of identity is to explicitly tell the server that no compression is desired.
The compress format, based on the LZW algorithm, fell into disuse due to patent issues, and hardly any browsers still support it.
Facebook has an open source project called Zstandard (zstd), which promises better results than Brotli in both compression ratio and speed. Unfortunately, at the time of writing no browser supported it, so this format was not really an option for web compression. Still, it deserves a mention.
Some file formats are already compressed, such as GIF images or web fonts in WOFF format.
When the data is already compressed, compressing it again is a waste of resources: the final size can even be bigger (an unnecessary header and checksum are added) and the server spends CPU cycles on something that brings no benefit.
For the same reason, it does not pay to compress the same file twice. If something is already compressed, it is better to leave it that way.
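A server can avoid this wasted work by looking at the content type before compressing. The sketch below is only an illustration: the list of media types is an assumed, non-exhaustive example, shouldCompress is a hypothetical helper, and random bytes stand in for already compressed content, since neither has any redundancy left to remove.

```javascript
const crypto = require('node:crypto');
const zlib = require('node:zlib');

// Assumed, non-exhaustive list of media types that are already compressed.
const ALREADY_COMPRESSED = new Set([
  'image/jpeg', 'image/png', 'image/gif', 'font/woff', 'font/woff2',
]);

// Hypothetical helper: decide from the Content-Type whether compressing is worthwhile.
function shouldCompress(contentType) {
  const mediaType = contentType.split(';')[0].trim().toLowerCase();
  return !ALREADY_COMPRESSED.has(mediaType);
}

// Random bytes as a stand-in for data that is already compressed.
const data = crypto.randomBytes(4096);
const gzipped = zlib.gzipSync(data);
console.log(data.length, gzipped.length); // the gzipped version is larger

console.log(shouldCompress('text/html; charset=utf-8')); // true
console.log(shouldCompress('image/jpeg'));               // false
```

The first output shows the effect described above: the gzip wrapper is added but nothing is saved, so the result grows.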
Applying data compression wisely in our web applications can save a lot of data transfer while giving our users extra speed.
Regarding the formats, the best option is to use newer ones such as Brotli to save as many resources as possible. Since web compression degrades gracefully, there is no reason to leave out older browsers: we can offer gzip as a second option and let the client choose.