Quick Links

If you want to merge data from two text files by matching a common field, you can use the Linux join command. It adds a sprinkle of dynamism to your static data files. We'll show you how to use it.

Matching Data Across Files

Data is king. Corporations, businesses, and households alike run on it. But data stored in different files and collated by different people is a pain. In addition to knowing which files to open to find the information you want, the layout and format of the files are likely to be different.

You also have to deal with the administrative headache of which files need to be updated, which need to be backed up, which are legacy, and which can be archived.

Plus, if you need to consolidate your data or conduct some analysis across an entire data set, you've got an additional problem. How do you rationalize the data across the different files before you can do what you need to do with it? How do you approach the data preparation phase?

The good news is if the files share at least one common data element, the Linux join command can pull you out of the mire.

The Data Files

All the data we'll use to demonstrate the use of the join command is fictional, starting with the following two files:

cat file-1.txt

cat file-2.txt

cat file-1.txt in a terminal window

The following is the contents of file-1.txt:

1 Adore Varian avarian0@newyorker.com Female 192.57.150.231

2 Nancee Merrell nmerrell1@ted.com Female 22.198.121.181

3 Herta Friett hfriett2@dagondesign.com Female 33.167.32.89

4 Torie Venmore tvenmore3@gmpg.org Female 251.9.204.115

5 Deni Sealeaf dsealeaf4@nps.gov Female 210.53.81.212

6 Fidel Bezley fbezley5@lulu.com Male 72.173.218.75

7 Ulrikaumeko Standen ustanden6@geocities.jp Female 4.204.0.237

8 Odell Jursch ojursch7@utexas.edu Male 1.138.85.117

We have a set of numbered lines, and each line contains all the following information:

  • A number
  • A first name
  • A surname
  • An email address
  • The person's sex
  • An IP Address

The following is the contents of file-2.txt:

1 Varian avarian0@newyorker.com Female Western New York $535,304.73

2 Merrell nmerrell1@ted.com Female Finger Lakes $309,033.10

3 Friett hfriett2@dagondesign.com Female Southern Tier $461,664.44

4 Venmore tvenmore3@gmpg.org Female Central New York $175,818.02

5 Sealeaf dsealeaf4@nps.gov Female North Country $126,690.15

6 Bezley fbezley5@lulu.com Male Mohawk Valley $366,733.78

7 Standen ustanden6@geocities.jp Female Capital District $674,634.93

8 Jursch ojursch7@utexas.edu Male Hudson Valley $663,821.09

Each line in file-2.txt contains the following information:

  • A number
  • A surname
  • An email address
  • The person's sex
  • A region of New York
  • A dollar value

The join command works with "fields," which, in this context, means a section of text surrounded by whitespace, the start of a line, or the end of a line. For join to match up lines between the two files, each line must contain a common field.

Therefore, we can only match a field if it appears in both files. The IP address only appears in one file, so that's no good. The first name only appears in one file, so we can't use that either. The surname is in both files, but it would be a poor choice, as different people have the same surname.

You can't tie the data together with the male and female entries, either, because they're too vague. The regions of New York and the dollar values only appear in one file, too.

However, we can use the email address because it's present in both files, and each is unique to an individual. A quick look through the files also confirms the lines in each correspond to the same person, so we can use the line numbers as our field to match (we'll use a different field later).

Note there are a different number of fields in the two files, which is fine---we can tell join which field to use from each file.

However, watch out for fields like the regions of New York; in a space-separated file, each word in the name of a region looks like a field. Because some regions have two- or three-word names, you've actually got a different number of fields within the same file. This is okay, as long as you match on fields that appear in the line before the New York regions.

The join Command

First, the field you're going to match must be sorted. We've got ascending numbers in both files, so we meet that criteria. By default, join uses the first field in a file, which is what we want. Another sensible default is that join expects the field separators to be whitespace. Again, we've got that, so we can go ahead and fire up join.

As we're using all the defaults, our command is simple:

join file-1.txt file-2.txt

join file-1.txt file-2.txt in a terminal window

join considers the files to be "file one" and "file two" according to the order in which they're listed on the command line.

The output is as follows:

1 Adore Varian avarian0@newyorker.com Female 192.57.150.231 Varian avarian0@newyorker.com Female Western New York $535,304.73

2 Nancee Merrell nmerrell1@ted.com Female 22.198.121.181 Merrell nmerrell1@ted.com Female Finger Lakes $309,033.10

3 Herta Friett hfriett2@dagondesign.com Female 33.167.32.89 Friett hfriett2@dagondesign.com Female Southern Tier $461,664.44

4 Torie Venmore tvenmore3@gmpg.org Female 251.9.204.115 Venmore tvenmore3@gmpg.org Female Central New York $175,818.02

5 Deni Sealeaf dsealeaf4@nps.gov Female 210.53.81.212 Sealeaf dsealeaf4@nps.gov Female North Country $126,690.15

6 Fidel Bezley fbezley5@lulu.com Male 72.173.218.75 Bezley fbezley5@lulu.com Male Mohawk Valley $366,733.78

7 Ulrikaumeko Standen ustanden6@geocities.jp Female 4.204.0.237 Standen ustanden6@geocities.jp Female Capital District $674,634.93

8 Odell Jursch ojursch7@utexas.edu Male 1.138.85.117 Jursch ojursch7@utexas.edu Male Hudson Valley $663,821.09

The output is formatted in the following way: The field the lines were matched on is printed first, followed by the other fields from file one, and then the fields from file two without the match field.

Unsorted Fields

Let's try something we know won't work. We'll put the lines in one file out of order so join won't be able to process the file correctly. The contents of file-3.txt are the same as file-2.txt, but line eight is between lines five and six.

The following is the contents of file-3.txt:

1 Varian avarian0@newyorker.com Female Western New York $535,304.73

2 Merrell nmerrell1@ted.com Female Finger Lakes $309,033.10

3 Friett hfriett2@dagondesign.com Female Southern Tier $461,664.44

4 Venmore tvenmore3@gmpg.org Female Central New York $175,818.02

5 Sealeaf dsealeaf4@nps.gov Female North Country $126,690.15

8 Jursch ojursch7@utexas.edu Male Hudson Valley $663,821.09

6 Bezley fbezley5@lulu.com Male Mohawk Valley $366,733.78

7 Standen ustanden6@geocities.jp Female Capital District $674,634.93

We type the following command to try to join file-3.txtto file-1.txt:

join file-1.txt file-3.txt

join file-1.txt file-3.txt in a terminal window

join reports that the seventh line in file-3.txt is out of order, so it's not processed. Line seven is the one that begins with the number six, which should come before eight in a correctly sorted list. The sixth line in the file (which begins with "8 Odell") was the last one processed, so we see the output for it.

You can use the --check-order option if you want to see whether join is happy with the sort order of a files---no merging will be attempted.

To do so, we type the following:

join --check-order file-1.txt file-3.txt

join --check-order file-1.txt file-3.txt in a terminal window

join tells you in advance there's going to be a problem with line seven of file file-3.txt.

Files with Missing Lines

In file-4.txt, the last line has been removed, so there isn't a line eight. The contents are as follows:

1 Varian avarian0@newyorker.com Female Western New York $535,304.73

2 Merrell nmerrell1@ted.com Female Finger Lakes $309,033.10

3 Friett hfriett2@dagondesign.com Female Southern Tier $461,664.44

4 Venmore tvenmore3@gmpg.org Female Central New York $175,818.02

5 Sealeaf dsealeaf4@nps.gov Female North Country $126,690.15

6 Bezley fbezley5@lulu.com Male Mohawk Valley $366,733.78

7 Standen ustanden6@geocities.jp Female Capital District $674,634.93

We type the following and, surprisingly, join doesn't complain and processes all the lines it can:

join file-1.txt file-4.txt

join file-1.txt file-4.txt in a terminal window

The output lists seven merged lines.

The -a (print unpairable) option tells join to also print the lines that couldn't be matched.

Here, we type the following command to tell join to print the lines from file one that can't be matched to lines in file two:

join -a 1 file-1.txt file-4.txt

join -a 1 file-1.txt file-4.txt in a terminal window

Seven lines are matched, and line eight from file one is printed, unmatched. There isn't any merged information because file-4.txt didn't contain a line eight to which it could be matched. However, at least it still appears in the output so you know it doesn't have a match in file-4.txt.

We type the following -v (suppress joined lines) command to reveal any lines that don't have a match:

join -v file-1.txt file-4.txt

join -v file-1.txt file-4.txt in a terminal window

We see that line eight is the only one that doesn't have a match in file two.

Matching Other Fields

Let's match two new files on a field that isn't the default (field one). The following is the contents of file-7.txt:

avarian0@newyorker.com Female 192.57.150.231

dsealeaf4@nps.gov Female 210.53.81.212

fbezley5@lulu.com Male 72.173.218.75

hfriett2@dagondesign.com Female 33.167.32.89

nmerrell1@ted.com Female 22.198.121.181

ojursch7@utexas.edu Male 1.138.85.117

tvenmore3@gmpg.org Female 251.9.204.115

ustanden6@geocities.jp Female 4.204.0.237

And the following is the contents of file-8.txt:

Female avarian0@newyorker.com Western New York $535,304.73

Female dsealeaf4@nps.gov North Country $126,690.15

Male fbezley5@lulu.com Mohawk Valley $366,733.78

Female hfriett2@dagondesign.com Southern Tier $461,664.44

Female nmerrell1@ted.com Finger Lakes $309,033.10

Male ojursch7@utexas.edu Hudson Valley $663,821.09

Female tvenmore3@gmpg.org Central New York $175,818.02

Female ustanden6@geocities.jp Capital District $674,634.93

The only sensible field to use for joining is the email address, which is field one in the first file and field two in the second. To accommodate this, we can use the -1 (file one field) and -2 (file two field) options. We'll follow these with a number that indicates which field in each file should be used for joining.

We type the following to tell join to use the first field in file one and the second in file two:

join -1 1 -2 2 file-7.txt file-8.txt

join -1 1 -2 2 file-7.txt file-8.txt in a terminal window

The files are joined on the email address, which is displayed as the first field of each line in the output.

Using Different Field Separators

What if you have files with fields that are separated by something other than whitespace?

The following two files are comma-delimited---the only whitespace is between the multiple-word place names:

cat file-5.txt

cat file-6.txt

cat file-1.txt in a terminal window

We can use the -t (separator character) to tell join which character to use as the field separator. In this case, it's the comma, so we type the following command:

join -t, file-5.txt file-6.txt

cat file-1.txt in a terminal window

All the lines are matched, and the spaces are preserved in the place names.

Ignoring Letter Case

Another file, file-9.txt, is almost identical to file-8.txt. The only difference is some of the email addresses have a capital letter, as shown below:

Female avarian0@newyorker.com Western New York $535,304.73

Female dsealeaf4@nps.gov North Country $126,690.15

Male Fbezley5@lulu.com Mohawk Valley $366,733.78

Female hfriett2@dagondesign.com Southern Tier $461,664.44

Female nmerrell1@ted.com Finger Lakes $309,033.10

Male Ojursch7@utexas.edu Hudson Valley $663,821.09

Female tvenmore3@gmpg.org Central New York $175,818.02

Female ustanden6@geocities.jp Capital District $674,634.93

When we joined file-7.txt and file-8.txt, it worked perfectly. Let's see what happens with file-7.txt and file-9.txt.

We type the following command:

join -1 1 -2 2 file-7.txt file-9.txt

cat file-1.txt in a terminal window

We only matched six lines. The differences in upper- and lowercase letters prevented the other two email addresses from being joined.

However, we can use the -i (ignore case) option to force join to ignore those differences and match fields that contain the same text, regardless of case.

We type the following command:

join -1 1 -2 2 -i file-7.txt file-9.txt

cat file-1.txt in a terminal window

All eight lines are matched and joined successfully.

Mix and Match

In join, you have a powerful ally when you're wrestling with awkward data preparation. Perhaps you need to analyze the data, or maybe you're trying to massage it into shape to perform an import to a different system.

No matter what the situation is, you'll be glad you have join in your corner!

Linux Commands

Files

tar · pv · cat · tac · chmod · grep ·  diff · sed · ar · man · pushd · popd · fsck · testdisk · seq · fd · pandoc · cd · $PATH · awk · join · jq · fold · uniq · journalctl · tail · stat · ls · fstab · echo · less · chgrp · chown · rev · look · strings · type · rename · zip · unzip · mount · umount · install · fdisk · mkfs · rm · rmdir · rsync · df · gpg · vi · nano · mkdir · du · ln · patch · convert · rclone · shred · srm · scp · gzip · chattr · cut · find · umask · wc · tr

Processes

alias · screen · top · nice · renice · progress · strace · systemd · tmux · chsh · history · at · batch · free · which · dmesg · chfn · usermod · ps · chroot · xargs · tty · pinky · lsof · vmstat · timeout · wall · yes · kill · sleep · sudo · su · time · groupadd · usermod · groups · lshw · shutdown · reboot · halt · poweroff · passwd · lscpu · crontab · date · bg · fg · pidof · nohup · pmap

Networking

netstat · ping · traceroute · ip · ss · whois · fail2ban · bmon · dig · finger · nmap · ftp · curl · wget · who · whoami · w · iptables · ssh-keygen · ufw · arping · firewalld

RELATED: Best Linux Laptops for Developers and Enthusiasts