There are many file compression utilities, but the one you're guaranteed to find on every Linux distribution is gzip. If you only learn to use one compression tool, it should be gzip.

Algorithms and Trees

The gzip data compression tool was written in the early 1990s, and it's still found in every Linux distribution. There are other compression tools available, but no matter which Linux computer you find yourself needing to work on, you'll find gzip on it. So if you know how to use gzip, you're good to go without the need to install anything.

gzip is an implementation of the DEFLATE algorithm, which was invented (and patented) by Phil Katz of PKZIP fame. The DEFLATE algorithm improved on earlier compression algorithms, which all operated on variations of a theme: the data to be compressed is scanned, and unique strings are identified and added to a binary tree.

The unique strings are allocated a unique ID token by virtue of their position in the tree. The tokens are used to replace the strings in the data and, because the tokens are smaller than the data they replaced, the file is compressed. Substituting the tokens for the original strings re-inflates the data back to its uncompressed state.

The DEFLATE algorithm added the twist that the most frequently encountered strings were allocated the smallest tokens and the least frequently encountered strings were allocated larger ones. It does this by combining two earlier compression methods: LZ77 compression, which replaces repeated strings with references to earlier occurrences, and Huffman coding, which assigns the shortest codes to the most frequent symbols.

At the time of writing, the DEFLATE algorithm is nearly three decades old. Three decades ago data storage costs were high and transmission speeds were slow. Data compression was vitally important.

Data storage is much cheaper today, and transmission speeds are orders of magnitude faster. But we have so much more data to store, and the world over people are accessing cloud storage and streaming services. Data compression is still vitally important, even if all you're doing is shrinking something that you need to upload or transmit, or you're trying to claw back some space on a local hard drive.

The gzip Command

The bigger a file is, the better the compression can be, for two reasons. First, a large file is likely to contain many repeated, identical sequences of bytes. Second, the list of strings and tokens needs to be stored in the compressed file so that decompression can take place, and with a very small file that overhead can wipe out the benefits of the compression. But even with a fairly small file, there's likely to be some reduction in size.
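If you'd like to see the effect of file size for yourself, the following is a minimal sketch. The file names and contents are invented for the illustration, and it assumes a gzip recent enough to support the -k option. It compresses a three-byte file and a large, highly repetitive one, then prints the before and after sizes.

```shell
#!/bin/sh
# Sketch: compare gzip's effect on a tiny file and a large repetitive file.
# tiny.txt and big.txt are made-up names used only for this illustration.
workdir=$(mktemp -d)

# A tiny file: the fixed gzip header and trailer outweigh any savings.
printf 'hi\n' > "$workdir/tiny.txt"

# A larger file made of one line repeated 10,000 times.
yes 'the same line of text, over and over' | head -n 10000 > "$workdir/big.txt"

# -k keeps the originals so both sizes can be measured afterwards.
gzip -k "$workdir/tiny.txt" "$workdir/big.txt"

tiny_before=$(wc -c < "$workdir/tiny.txt")
tiny_after=$(wc -c < "$workdir/tiny.txt.gz")
big_before=$(wc -c < "$workdir/big.txt")
big_after=$(wc -c < "$workdir/big.txt.gz")

echo "tiny.txt: $tiny_before -> $tiny_after bytes"
echo "big.txt:  $big_before -> $big_after bytes"

# Remove the scratch directory when you're finished: rm -rf "$workdir"
```

The tiny file should come out larger than it went in, while the repetitive file shrinks dramatically. Real-world files usually land somewhere between those two extremes.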

Compressing a File

To compress (or zip) a file, all you need to do is pass the name of the file to the gzip command. We'll check the original size of the file, compress it, and then check the size of the compressed file.

ls -lh calc-sheet.ods

gzip calc-sheet.ods

ls -lh calc-*

Compressing a spreadsheet

The original file, a spreadsheet called "calc-sheet.ods", is 11 KB, and the compressed file (also known as an archive file) is 9.3 KB. Note that the name of the archive file is the name of the original file with ".gz" appended to it.

The first use of the ls command targets a specific file, the spreadsheet. The second use of ls looks for all files beginning with "calc-" but it only finds the compressed file. That's because, by default, gzip creates the archive file and deletes the original file.

That's not an issue. If you need the original file you can retrieve it from the archive file. But if you prefer to retain the original file, you can use the -k (keep) option.

gzip -k calc-sheet.ods

ls -lh calc-sheet.*

Compressing a file and retaining the original file

This time the original ODS file is retained.
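Older gzip releases don't have the -k option. On such a system, one common workaround is to write the compressed data to standard output with -c and redirect it to a file yourself. The sketch below uses a made-up file called notes.txt in a scratch directory.

```shell
#!/bin/sh
# Sketch: keep the original file without relying on the -k option.
# notes.txt is a hypothetical file created just for this example.
workdir=$(mktemp -d)
printf 'some example text\n' > "$workdir/notes.txt"

# -c sends the compressed data to stdout and leaves the input file untouched;
# the shell redirection decides where the archive is written.
gzip -c "$workdir/notes.txt" > "$workdir/notes.txt.gz"

# Both the original and the archive should now be listed.
ls -l "$workdir"

# Remove the scratch directory when you're finished: rm -rf "$workdir"
```

Because the redirection names the output file, this approach also lets you put the archive in a different directory or give it a different name.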

Decompressing a File

To decompress (or unzip) a GZ archive file, use the -d (decompress) option. This will extract the compressed file from the archive and decompress it so that it is indistinguishable from the original file.

ls calc-sheet.*

gzip -d calc-sheet.ods.gz

ls calc-sheet.*

Decompressing a file with gzip

This time, we can see that gzip has deleted the archive file after extracting the original file. To retain the archive file, we need to use the -k (keep) option again, as well as the -d (decompress) option.

ls calc-sheet.*

gzip -dk calc-sheet.ods.gz

ls calc-sheet.*

Decompressing a file and retaining the archive file

This time, gzip doesn't delete the archive file.
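To convince yourself that a decompressed file really is identical to the original, you can compare the two byte for byte. This is a small sketch using invented file names; cmp -s compares silently and reports the result through its exit status.

```shell
#!/bin/sh
# Sketch: verify that a compress/decompress round trip restores the file.
# original.txt and copy.txt are hypothetical names for this illustration.
workdir=$(mktemp -d)
printf 'line one\nline two\n' > "$workdir/original.txt"
cp "$workdir/original.txt" "$workdir/copy.txt"

# Compress the copy (copy.txt is replaced by copy.txt.gz) ...
gzip "$workdir/copy.txt"

# ... then decompress it again, keeping the archive with -k.
gzip -dk "$workdir/copy.txt.gz"

# cmp -s exits with status 0 only if the two files are byte-for-byte equal.
if cmp -s "$workdir/original.txt" "$workdir/copy.txt"; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "Round trip result: $roundtrip"

# Remove the scratch directory when you're finished: rm -rf "$workdir"
```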

Decompressing and Overwriting

If you try to extract a file in a directory where the original file, or a different file with the same name, exists, gzip will prompt you to choose to abandon the extraction or to overwrite the existing file.

gzip -d text-file.txt.gz

Overwrite prompt from gzip when the file in the archive already exists in the directory

If you know in advance that you're happy to have the file in the directory overwritten by the file from the archive, use the -f (force) option.

gzip -df text-file.txt.gz

Forcing overwriting of an existing file

The file is overwritten and you're silently returned to the command line.

Compressing Directory Trees

The -r (recursive) option causes gzip to compress the files in an entire directory tree. But the result might not be what you expect.

Here's the directory tree we're going to use in this example. The directories each contain a text file.

tree level1

Test directory tree structure

Let's use gzip on the directory tree and see what happens.

gzip -r level1/

tree level1

Directory structure after running gzip on it

The result is gzip has created an archive file for each text file in the directory structure. It didn't create an archive of the entire directory tree. In fact, gzip can only put a single file in an archive.
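If you want to reproduce this behavior without touching your own files, here's a rough sketch that builds a throwaway directory tree and runs gzip -r on it. The directory and file names are made up, and it uses find instead of tree because tree isn't installed everywhere.

```shell
#!/bin/sh
# Sketch: show that gzip -r compresses each file individually.
# The level1/level2/level3 layout and file names are invented for this example.
workdir=$(mktemp -d)
mkdir -p "$workdir/level1/level2/level3"
printf 'one\n'   > "$workdir/level1/file1.txt"
printf 'two\n'   > "$workdir/level1/level2/file2.txt"
printf 'three\n' > "$workdir/level1/level2/level3/file3.txt"

# Recurse through the tree, replacing every file with its own .gz archive.
gzip -r "$workdir/level1"

# List what is left: one .gz per original file, and no single combined archive.
find "$workdir/level1" -type f | sort

gz_count=$(find "$workdir/level1" -type f -name '*.gz' | wc -l)
txt_count=$(find "$workdir/level1" -type f -name '*.txt' | wc -l)
echo "archives: $gz_count, uncompressed text files remaining: $txt_count"

# Remove the scratch directory when you're finished: rm -rf "$workdir"
```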

We can create an archive file that contains a directory tree and all of its files, but we need to bring another command into play. The tar program is used to create archives of many files, but it doesn't have its own compression routines. However, with the appropriate options, tar pushes the archive file through gzip. That way we get a single compressed archive that can hold many files and directories.

tar -czvf level1.tar.gz level1

The tar options are:

  • c: Create an archive.
  • z: Push the files through gzip.
  • v: Verbose mode. Print in the terminal window what tar is up to.
  • f level1.tar.gz: Filename to use for the archive file.

Output from tar working its way through the directory tree

This archives the directory tree structure and all files within the directory tree.
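Once you have a compressed tar archive, you'll probably want to look inside it and unpack it again. The sketch below assumes a scratch directory with an invented level1 tree. It uses tar's t (list) and x (extract) options together with z, plus -C to tell tar which directory to work in.

```shell
#!/bin/sh
# Sketch: create, list, and extract a gzip-compressed tar archive.
# The directory layout and the "restore" directory are made up for this example.
workdir=$(mktemp -d)
mkdir -p "$workdir/level1/level2"
printf 'one\n' > "$workdir/level1/file1.txt"
printf 'two\n' > "$workdir/level1/level2/file2.txt"

# -C switches to $workdir first, so the archive stores paths relative to it.
tar -czf "$workdir/level1.tar.gz" -C "$workdir" level1

# t lists the contents of the archive without extracting anything.
tar -tzf "$workdir/level1.tar.gz"

# x extracts; here the tree is unpacked into a separate directory.
mkdir "$workdir/restore"
tar -xzf "$workdir/level1.tar.gz" -C "$workdir/restore"

find "$workdir/restore" -type f | sort

# Remove the scratch directory when you're finished: rm -rf "$workdir"
```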

Getting Information About Archives

The -l (list) option provides some information about an archive file. It shows you the compressed and uncompressed sizes of the file in the archive, the compression ratio, and the name of the file.

gzip -l level1.tar.gz

gzip -l text-file.txt.gz

Using the -l list option to see compression statistics for an archive

You can check the integrity of an archive file with the -t (test) option.

gzip -t level1.tar.gz

Testing an archive with the -t option

If all is well, you're silently returned to the command line. No news is good news.

If the file is corrupt or isn't a gzip archive, gzip tells you about it.

gzip -t not-an-archive.gz

Using the -t option to test a file that isn't an archive
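Because gzip -t reports its verdict through its exit status as well as through any error message, it's easy to use in a script. Here's a minimal sketch with two invented files: a genuine archive and a plain text file that merely has a ".gz" extension.

```shell
#!/bin/sh
# Sketch: use the exit status of gzip -t to check files in a script.
# good.txt.gz and not-an-archive.gz are hypothetical test files.
workdir=$(mktemp -d)
printf 'valid data\n' > "$workdir/good.txt"
gzip "$workdir/good.txt"
printf 'this is plain text, not gzip data\n' > "$workdir/not-an-archive.gz"

# A valid archive gives exit status 0.
gzip -t "$workdir/good.txt.gz"
good_status=$?

# An invalid file gives a non-zero status; the error message is discarded here.
gzip -t "$workdir/not-an-archive.gz" 2>/dev/null
bad_status=$?

for f in "$workdir/good.txt.gz" "$workdir/not-an-archive.gz"; do
    if gzip -t "$f" 2>/dev/null; then
        echo "OK:  $f"
    else
        echo "BAD: $f"
    fi
done

# Remove the scratch directory when you're finished: rm -rf "$workdir"
```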

Speed Versus Compression

You can choose to prioritize the speed of creation of the archive or the degree of compression. You do this by providing a number as an option, from -1 through to -9. The -1 option gives the fastest speed at the sacrifice of compression and -9 gives the highest compression at the sacrifice of speed.

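Unless you provide one of these options, gzip uses -6. In the commands below, the -k (keep) option preserves the original file so that each level can be run against it, and the -f (force) option overwrites the archive left by the previous run.

gzip -k -1 calc-sheet.ods

ls -lh calc-sheet.ods.gz

gzip -kf -9 calc-sheet.ods

ls -lh calc-sheet.ods.gz

gzip -kf -6 calc-sheet.ods

ls -lh calc-sheet.ods.gz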

Using gzip with different priorities for speed and compression

With a file as small as this, we didn't see any significant difference in speed of execution, but there was a small difference in compression.

Interestingly, there is no difference between using level 9 compression and level 6 compression. You can only wring so much compression out of any given file and in this case, that limit was reached with level 6 compression. Cranking it up to 9 brought no further reduction in file size. With bigger files, the difference between level 6 and level 9 is likely to be more pronounced.
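To try the comparison on a bigger file, you can generate some sample data and measure the output of each level without creating any archive files at all. This is a sketch under stated assumptions: sample.txt and its contents are invented, and the exact sizes you see will depend on your data and your version of gzip.

```shell
#!/bin/sh
# Sketch: compare compression levels 1, 6, and 9 on a generated text file.
# sample.txt and its record format are made up for this illustration.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"

# Build 20,000 lines that share structure but differ in their numbers.
i=1
while [ "$i" -le 20000 ]; do
    echo "record $i: value $((i * 7919 % 1000)) status ok"
    i=$((i + 1))
done > "$sample"

original=$(wc -c < "$sample")

# -c writes to stdout, so wc -c can count the compressed bytes directly.
size1=$(gzip -1 -c "$sample" | wc -c)
size6=$(gzip -6 -c "$sample" | wc -c)
size9=$(gzip -9 -c "$sample" | wc -c)

echo "original: $original bytes"
echo "level 1:  $size1 bytes"
echo "level 6:  $size6 bytes"
echo "level 9:  $size9 bytes"

# Remove the scratch directory when you're finished: rm -rf "$workdir"
```

Prefixing each gzip command with time is one way to see what the extra compression costs in speed.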

Compressed, Not Protected

Don't mistake compression for encryption or any form of protection. Compressing a file doesn't give it any security or enhanced privacy. Anyone with access to your file can use gzip to decompress it.
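A quick demonstration makes the point. In this sketch, which uses an invented file with made-up contents, the "secret" text is recovered from the archive with a single command and no password or key of any kind.

```shell
#!/bin/sh
# Sketch: show that gzip output can be read back by anyone with the file.
# secrets.txt and its contents are invented for this illustration.
workdir=$(mktemp -d)
printf 'account: alice\npassword: hunter2\n' > "$workdir/secrets.txt"
gzip "$workdir/secrets.txt"

# -d decompresses and -c writes to stdout; nothing else is needed.
recovered=$(gzip -dc "$workdir/secrets.txt.gz")
echo "$recovered"

# Remove the scratch directory when you're finished: rm -rf "$workdir"
```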
