When you first start learning how domain names, IP addresses, web servers, and websites all fit and work together, it can be a little confusing or overwhelming at times. How is it all set up to work so smoothly? Today’s SuperUser Q&A post has the answers to a curious reader’s questions.
Today’s Question & Answer session comes to us courtesy of SuperUser, a subdivision of Stack Exchange, a community-driven grouping of Q&A websites.
Photo courtesy of Rosmarie Voegtli (Flickr).
SuperUser reader user3407319 wants to know if web servers only hold one website each:
Based on what I understand about DNS and linking a domain name with the IP address of the web server a website is stored on, does that mean each web server can only hold one website? If web servers do hold more than one website, then how does it all get resolved so that I can access the website I want without any problems or mix ups?
Do web servers only hold one website each, or do they hold more?
SuperUser contributor Bob has the answer for us:
Basically, the browser includes the domain name in the HTTP request so the web server knows which domain was requested and can respond accordingly.
Here is how your typical HTTP request happens:
1. The user provides a URL, in the form http://host:port/path.
2. The browser extracts the host (domain) part of the URL and translates it into an IP address (if necessary) in a process known as name resolution. This translation can occur via DNS, but it does not have to (for example, the local hosts file on common operating systems bypasses DNS).
3. The browser opens a TCP connection to that IP address on the port specified in the URL, or on port 80 if no port was given.
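4. The browser sends an HTTP request. For HTTP/1.1, this consists of a request line such as GET /path HTTP/1.1, followed by headers that include a Host header such as Host: example.com.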
The Host header is standard and required in HTTP/1.1. It was not part of the HTTP/1.0 specification, but some servers support it anyway.
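As a rough illustration of steps 1 through 4, the following Python sketch builds the kind of request a browser might send for a plain-HTTP URL. The URL http://example.com/path and the function name build_request are placeholders chosen for this example, and a real browser sends many more headers than the two shown here.

```python
from urllib.parse import urlsplit


def build_request(url: str) -> tuple[str, int, bytes]:
    """Return (host, port, raw request bytes) for a plain-HTTP URL."""
    parts = urlsplit(url)
    host = parts.hostname        # step 2 would resolve this name to an IP address
    port = parts.port or 80      # step 3: fall back to port 80 if none was given
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # The Host header carries the name (and any non-default port) to the server.
    host_header = host if port == 80 else f"{host}:{port}"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request.encode("ascii")


host, port, raw = build_request("http://example.com/path")
print(raw.decode("ascii"))
```

Note that the IP address never appears in the request itself. Only the Host header tells the server which name the user typed.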
From here, the web server has several pieces of information that it can use to decide what the response should be. Note that it is possible for a single web server to be bound to multiple IP addresses.
- The requested IP address, from the TCP socket (the IP address of the client is also available, but it is rarely used, except occasionally for blocking or filtering)
- The requested port, from the TCP socket
- The requested host name, as specified in the Host header by the browser in the HTTP request
- The requested path
- Any other headers (cookies, etc.)
As you seem to have noticed, the most common shared hosting setup these days puts multiple websites on a single IP address:port combination, leaving just the host to differentiate between websites.
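The selection logic described above could be sketched roughly as follows. This is only an illustration, not how any particular web server implements it: the SITES table, the addresses, and the site names are all hypothetical, and the host name normalization is deliberately simplified.

```python
from typing import Optional

# Hypothetical configuration: (local IP, local port) -> {host name: site}
SITES = {
    ("203.0.113.10", 80): {
        "example.com": "site-a",
        "example.org": "site-b",
    },
    ("203.0.113.11", 80): {
        "example.net": "site-c",
    },
}


def normalize_host(host_header: str) -> str:
    """Lowercase the Host header value and drop an optional :port suffix."""
    host = host_header.strip().lower()
    if host.startswith("["):          # bracketed IPv6 literal, e.g. [::1]:8080
        return host[: host.find("]") + 1]
    return host.split(":", 1)[0]


def select_site(local_ip: str, local_port: int, host_header: str) -> Optional[str]:
    """Return the site configured for this socket and host name, or None."""
    hosts = SITES.get((local_ip, local_port), {})
    return hosts.get(normalize_host(host_header))


print(select_site("203.0.113.10", 80, "example.com"))
print(select_site("203.0.113.10", 80, "example.org"))
```

Both calls arrive on the same IP address and port, so the Host header is the only thing that distinguishes them.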
What About HTTPS?
HTTPS is a bit different. Everything is identical up to the establishment of the TCP connection, but after that an encrypted TLS tunnel must be established. The goal is to not leak any information about the request.
In order to verify that the web server actually owns this domain, the web server must send a certificate signed by a trusted third party. The browser will then compare this certificate with the domain it requested.
This presents a problem. How does the web server know which host/website’s certificate to send if it needs to do this before the HTTP request is received?
Traditionally, this was solved by having a dedicated IP address (or port) for every website requiring HTTPS. Obviously, this has become problematic as we are running out of IPv4 addresses.
Enter SNI (Server Name Indication). The browser now passes the host name during the TLS negotiations, so the web server has this information early enough to send the correct certificate. On the web server side, configuration is very similar to how HTTP virtual hosts are configured.
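On the server side, Python's ssl module exposes this through the sni_callback attribute of an SSLContext. The sketch below shows one way such a callback might pick a per-site context. The helper name make_sni_callback, the host name, and the certificate file names in the commented usage are placeholders, and no certificates are actually loaded.

```python
import ssl


def make_sni_callback(contexts: dict[str, ssl.SSLContext]):
    """Build a callback that switches to the SSLContext matching the SNI name."""

    def on_server_name(ssl_socket, server_name, default_context):
        # server_name is None when the client sent no SNI extension.
        chosen = contexts.get(server_name) if server_name else None
        if chosen is not None:
            ssl_socket.context = chosen   # swap in that site's certificate
        return None                       # returning None lets the handshake continue

    return on_server_name


# Hypothetical usage; the certificate paths are placeholders:
# site_a = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
# site_a.load_cert_chain("site-a.crt", "site-a.key")
# default = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
# default.load_cert_chain("default.crt", "default.key")
# default.sni_callback = make_sni_callback({"example.com": site_a})
```

If the client sends no name, or a name the server does not recognize, the context the connection started with stays in place, which mirrors the default-site behavior of plain HTTP virtual hosts.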
The downside is that the host name is now passed as plain text before encryption, so it is essentially leaked information. This is usually considered an acceptable trade-off, though, since the host name is normally exposed in a DNS query anyway.
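To see the plain-text host name for yourself, the following sketch uses Python's ssl module with in-memory buffers, so nothing is sent over the network. It starts a client handshake and captures the first message the client would send. The function name client_hello_bytes and the host example.com are placeholders, and this assumes a standard TLS client without any newer extension that encrypts the server name.

```python
import ssl


def client_hello_bytes(server_hostname: str) -> bytes:
    """Return the first handshake bytes a client would send, without any network."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming = ssl.MemoryBIO()   # bytes from the server (none in this sketch)
    outgoing = ssl.MemoryBIO()   # bytes the client wants to send
    tls = context.wrap_bio(incoming, outgoing, server_hostname=server_hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass                     # expected: the client is now waiting for a reply
    return outgoing.read()


hello = client_hello_bytes("example.com")
# Check whether the host name is readable in the unencrypted first message.
print(b"example.com" in hello)
```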
What If You Request a Website by IP Address Only?
What the web server does when it does not know which specific host you requested depends on the web server’s implementation and configuration. Typically, there is a “default”, “catch-all”, or “fallback” website specified that will provide responses to all requests that do not explicitly specify a host.
This default website can be its own independent website (often showing an error message), or it could be any of the other websites on the web server depending on the preferences of the web server admin.
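A minimal sketch of that fallback behavior, assuming a hypothetical table of name-based sites and a hypothetical default site, might look like this. Real web servers have their own rules for choosing the default, so treat it as an illustration only.

```python
from typing import Optional

# Hypothetical name-based sites plus a catch-all.
SITES = {"example.com": "site-a", "example.org": "site-b"}
DEFAULT_SITE = "default-site"


def pick_site(host_header: Optional[str]) -> str:
    """Return the matching site, or the default when no configured name matches."""
    if not host_header:          # e.g. an HTTP/1.0 client that sent no Host header
        return DEFAULT_SITE
    return SITES.get(host_header.strip().lower(), DEFAULT_SITE)


# A browser asked for http://203.0.113.10/ sends "Host: 203.0.113.10",
# which matches none of the configured names.
print(pick_site("203.0.113.10"))   # prints "default-site"
```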