How to Correctly Parse File Names in Bash

Bash file naming conventions are very rich, and it is easy to create a script or one-liner which incorrectly parses file names. Learn to parse file names correctly, and thereby ensure your scripts work as intended!

The Problem With Correctly Parsing File Names in Bash

If you have been using Bash for a while, and have been scripting in its rich language, you will likely have run into some file name parsing issues. Let's take a look at a simple example of what can go wrong:

touch 'a
b'

Setting up a file with a newline character in the filename

Here we created a file which has an actual newline (line feed, LF; the character produced by pressing Enter) in its name, introduced by pressing Enter after the a. Bash file naming conventions are very rich, and whilst it is in some ways cool that we can use special characters like these in a filename, let's see how this file fares when we try to take some actions on it:

ls | xargs rm

The problem trying to handle a filename which includes a newline

That did not work. xargs will take the input from ls (via the | pipe), and pass it to rm, but something went amiss in the process!

What went amiss is that xargs reads the output from ls as plain text, and the newline ('enter') within the filename is seen by xargs as a delimiter between two separate items, not as part of a single filename to be passed on to rm as it should be.

Let's exemplify this in another way:

ls | xargs -I{} echo '{}|'

Showing how xargs will see the newline character as a delimiter and split data upon it
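The same splitting can be reproduced safely, without deleting anything. The following is a minimal sketch: the scratch directory created with mktemp and the variable names are assumptions made for this illustration only.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Create exactly one file, whose name contains a newline.
touch 'a
b'

# xargs -n1 runs echo once per item it parsed from its input,
# so counting the output lines counts the items xargs believes it received.
items=$(ls | xargs -n1 echo | wc -l | tr -d ' ')
echo "files created: 1, items seen by xargs: $items"

# Clean up the scratch directory.
cd - >/dev/null || exit 1
rm -rf "$scratch"
```

With a single file present, xargs should report two items here, one for each half of the name.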

It is clear: xargs is processing the input as two individual lines, splitting the original filename in two! Even if we were to fix the newline issue with some fancy parsing using sed, we would soon run into other issues when we start using other special characters like spaces, backslashes, quotes and more!

touch 'a
b'
touch 'a b'
touch 'a\b'
touch 'a"b'
touch "a'b"

All sorts of special characters in filenames

Even if you are a seasoned Bash developer, you may shiver at seeing filenames like this, as it would be very complex, for most common Bash tools, to parse these files correctly. You would have to do all sorts of string modifications to make this work. That is, unless you have the secret recipe.

Before we dive into that, there is one more thing - a must-know - which you can run into when parsing ls output. If you use color coding for directory listings, which is enabled by default on Ubuntu, it is easy to run into another set of ls parsing issues.

These are not really related to how files are named, but rather to how the files are presented in the output of ls. The ls output will contain escape sequences which tell your terminal which color to use.

To avoid running into these, simply use --color=never as an option to ls:

ls --color=never

In Mint 20 (a great Ubuntu derivative operating system) this issue seems fixed, though the issue may still be present in many other or older versions of Ubuntu etc. I have seen this issue as recently as mid August 2020 on Ubuntu.

Even if you do not use color coding for your directory listings, it is possible that your script will run on other systems not owned or managed by you. In such a case, you will also want to use this option to prevent users of such machines from running into the issue described.

Returning to our secret recipe, let's look at how we can make sure that we won't have any issues with special characters in Bash filenames. The solution provided avoids all use of ls, which one would do well to avoid in general, so the color coding issues are not applicable either.

There are still times where ls parsing is quick and handy, but it will always be tricky and likely 'dirty' as soon as special characters are introduced - not to mention insecure (special characters can be used to introduce all sorts of issues).
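Where the goal is only to loop over the files in a single directory, a shell glob is a commonly used alternative to parsing ls, because the shell expands each match into exactly one word regardless of the characters it contains. The sketch below illustrates that idea and is not part of the recipe described in the next section; the scratch directory and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Three awkward names: one with a newline, one with a space, one with a quote.
touch 'a
b' 'a b' 'a"b'

count=0
for f in ./a*; do
  # If nothing matches, the pattern itself is returned; skip it in that case.
  [ -e "$f" ] || continue
  count=$((count + 1))
  # Brackets make the exact extent of each name visible.
  printf 'name %d: [%s]\n' "$count" "$f"
done
echo "looped over $count files"

# Clean up the scratch directory.
cd - >/dev/null || exit 1
rm -rf "$scratch"
```

The leading ./ in the pattern also keeps names that start with a dash from being mistaken for options by whatever command receives them.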

The Secret Recipe: NULL Termination

Bash tool developers recognized this same problem many years ago, and have provided us with a solution: NULL termination!

What is NULL termination, you ask? Consider how in the examples above, the newline (or literally enter) was the main termination character.

We also saw how special characters like quotes, whitespace and backslashes can be used in filenames, even though they have special functions when it comes to other Bash text parsing and modification tools like sed. Now compare this with the -0 option to xargs, from man xargs:

-0, --null Input items are terminated by a null character instead of by white space, and the quotes and backslash are not special (every character is taken literally). Disables the end of file string, which is treated like any other argument. Useful when input items might contain white space, quote marks, or backslashes. The GNU find -print0 option produces input suitable for this mode.

And the -print0 option to find, from man find:

-print0 True; print the full file name on the standard output, followed by a null character (instead of the newline character that -print uses). This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find output. This option corresponds to the -0 option of xargs.

The True; here means that this action always evaluates to true within the find expression. Also interesting are the two clear warnings given elsewhere in the same manual page:

If you are piping the output of find into another program and there is the faintest possibility that the files which you are searching for might contain a newline, then you should seriously consider using the -print0 option instead of -print. See the UNUSUAL FILENAMES section for information about how unusual characters in filenames are handled.
If you are using find in a script or in a situation where the matched files might have arbitrary names, you should consider using -print0 instead of -print.

These clear warnings remind us that parsing filenames in Bash can be, and is, tricky business. However, with the right options to find, namely -print0, and xargs, namely -0, all our filenames containing special characters can be parsed correctly:

find . -name 'a*' -print0

find . -name 'a*' -print0 | xargs -0 ls

find . -name 'a*' -print0 | xargs -0 rm

First we check our directory listing. All our filenames containing special characters are there. We next do a simple find ... -print0 to see the output. We note that the strings are NULL terminated (with the NULL character, \0, not being visible).

We also note that there is a single newline in the output, which matches the single newline we had introduced into the first filename, comprised of a followed by enter followed by b.

Finally, the output does not end with a newline before returning to the $ terminal prompt, as the strings were NULL terminated and not newline terminated. We press enter at the $ terminal prompt to make things a bit clearer.
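Because the NULL characters are invisible in a terminal, counting them is one way to confirm they are really there. The following is a minimal sketch that assumes a scratch directory with two test files; the variable names are made up for the illustration.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Two files: one with a newline in its name, one with a space.
touch 'a
b' 'a b'

# tr -cd deletes everything except the listed byte, so wc -c counts that byte.
# \000 is the NUL byte: there should be one per file name.
nuls=$(find . -name 'a*' -print0 | tr -cd '\000' | wc -c | tr -d ' ')

# The only newline left in the output is the one inside the first file name.
newlines=$(find . -name 'a*' -print0 | tr -cd '\n' | wc -c | tr -d ' ')

echo "NUL terminators: $nuls, newlines: $newlines"

# Clean up the scratch directory.
cd - >/dev/null || exit 1
rm -rf "$scratch"
```

For a more detailed view, the same find output can be piped into od -c, which prints each NUL byte as \0.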

Next we add xargs with the -0 option, which enables xargs to handle the NULL terminated input correctly. We see that the input passed to and received from ls looks clear and there is no mangling or transformation of text happening.
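The difference the -0 option makes can also be checked without any files at all, by feeding xargs hand-made input with printf. This is only a sketch: the two sample names (a, newline, b and c, space, d) and the variable names are assumptions chosen for the illustration.

```shell
#!/usr/bin/env bash
# Each run of sh -c 'echo item' prints exactly one line,
# so wc -l counts how many items xargs parsed from its input.

# Newline-terminated input, default mode: blanks and newlines both split items.
default_items=$(printf 'a\nb\nc d\n' | xargs -n1 sh -c 'echo item' | wc -l | tr -d ' ')

# NUL-terminated input with -0: only the NUL byte (\000) separates items.
null_items=$(printf 'a\nb\000c d\000' | xargs -0 -n1 sh -c 'echo item' | wc -l | tr -d ' ')

echo "default mode: $default_items items, with -0: $null_items items"
```

The two names are torn into four items in default mode, whereas with -0 they arrive as the two items that were intended.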

Finally we re-attempt our rm command, and this time for all the files including the original one containing the newline which we had issues with. The rm works perfectly, and no errors or parsing issues are observed. Great!
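When each file needs more than a single command, the same NULL terminated stream can be consumed in a loop instead of being handed to xargs. The following sketch goes beyond what is shown above and relies on Bash-specific features (read -d '' and process substitution); the scratch directory and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Three awkward names: newline, space, single quote.
touch 'a
b' 'a b' "a'b"

count=0
# IFS= preserves leading and trailing whitespace in each name,
# -r stops backslashes from being treated as escapes,
# -d '' makes read stop at each NUL byte instead of at a newline.
while IFS= read -r -d '' file; do
  count=$((count + 1))
  # %q prints the name in a shell-quoted form, making special characters visible.
  printf 'found: %q\n' "$file"
done < <(find . -name 'a*' -print0)

echo "processed $count files"

# Clean up the scratch directory.
cd - >/dev/null || exit 1
rm -rf "$scratch"
```

Feeding the loop through process substitution, rather than a pipe, keeps it in the current shell, so the count variable is still set after the loop ends.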

Wrapping up

We have seen how it is important, in many instances, to correctly parse and handle file names in Bash. Whereas learning how to use find correctly is a bit more challenging than simply using ls, the benefits it provides may pay off in the end: increased security, and no issues with special characters.

If you enjoyed this article, you may also want to read How to Bulk Rename Files to Numeric File Names in Linux which shows an interesting and somewhat complex find -print0 | xargs -0 statement. Enjoy!