Project 3

Out: Monday, Nov. 10
Due: Wednesday, Dec. 3, 11:59pm

In this project you will create your own Web server. The server will implement a subset of the HTTP protocol only, but it will be functional enough to be used for simple browsing using your favorite Web browser.

The server must be written in C or C++.

When done, please use

  ~cs352/bin/turnin Project3

to turn in your project. Make sure you include all relevant files.


Hypertext Transfer Protocol (HTTP) in brief

HTTP stands for Hypertext Transfer Protocol. It is used for transferring most of the files on the Web, including text files, PDF documents, images in different formats, etc. In principle the objects transferred (called resources) can be anything, including dynamically generated content that is produced on the fly by a program or script. In this project we will transfer existing files only.

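The HTTP protocol is an example of client-server communication. At the highest level it works as follows: the client opens a connection to the server and sends a request message; the server answers with a response message and closes the connection.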

An HTTP message (either request or response) is a text string consisting of a header and a body. The header has the following format:

  INITIAL LINE
  HEADER1: VALUE1
  HEADER2: VALUE2
  ...
  HEADERn: VALUEn

The initial line consists of 3 words: the method word, a resource, and the protocol being used, which includes the protocol version. It is terminated by a CRLF. A CRLF is a carriage return character (1 byte, ASCII code 13 decimal), followed by a linefeed character (1 byte, ASCII code 10 decimal).

The remainder of the header contains information about the message. Each additional line consists of a word referring to a property, followed by a colon, followed by whitespace, followed by the value for that property. Property values can contain spaces. An example property of a request is the User-Agent property, whose value includes the program name and version of the client making the request. An example of a property of a response is the Content-Type property, which describes the type of the content of the server response (e.g., text/plain, image/png, video/mpeg, etc.). A property that can occur in either the request or the response is the Content-Length property, whose value is the length in bytes of the message body (if any).
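As an illustration of how a server might choose the value of the content type property, here is a minimal sketch that derives it from the file name extension. The function name content_type_for and the small table of extensions are assumptions made for this example; the assignment does not prescribe them, and a real table would be longer.

```c
#include <string.h>

/* Hypothetical helper: pick a Content-Type value from a file name's
 * extension. The table is illustrative and far from complete. */
const char *content_type_for(const char *filename)
{
    const char *dot = strrchr(filename, '.');   /* last '.' in the name */
    if (dot == NULL)
        return "application/octet-stream";      /* no extension: raw bytes */
    if (strcmp(dot, ".html") == 0 || strcmp(dot, ".htm") == 0)
        return "text/html";
    if (strcmp(dot, ".txt") == 0)
        return "text/plain";
    if (strcmp(dot, ".png") == 0)
        return "image/png";
    if (strcmp(dot, ".jpg") == 0 || strcmp(dot, ".jpeg") == 0)
        return "image/jpeg";
    if (strcmp(dot, ".gif") == 0)
        return "image/gif";
    return "application/octet-stream";          /* unknown extension */
}
```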

An example request is

  GET /A/B/C/file.html HTTP/1.1
  Host: www.cs.iastate.edu
  User-Agent: mywebclient/1.0
  [empty line here]

In this example the client request consists of the GET command, which is used to retrieve the resource http://www.cs.iastate.edu/A/B/C/file.html. This corresponds to a file named file.html which resides at path A/B/C within the web server's file space. The client also specifies that this request follows the version 1.1 HTTP protocol specification. The remaining header lines contain additional information about the request. In this project we will ignore all request header lines sent by the client except the initial one.

GET is the only command that you will implement for this project.
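Splitting the initial line into its three words can be sketched as follows. The function name parse_request_line and the buffer sizes are assumptions made for this example, and overlong words are not handled specially; treat it as a starting point rather than a complete parser.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: split "METHOD RESOURCE PROTOCOL" into three words.
 * Returns 0 if the line holds exactly three whitespace-separated words,
 * -1 otherwise. Overlong words are not handled specially in this sketch. */
int parse_request_line(const char *line,
                       char method[16], char resource[1024], char protocol[16])
{
    char extra;
    /* The widths bound each word; the trailing %c only matches if a
     * fourth word is present, which makes sscanf return 4. */
    int n = sscanf(line, "%15s %1023s %15s %c",
                   method, resource, protocol, &extra);
    return (n == 3) ? 0 : -1;
}

/* Returns 1 if the method word is GET, the only command in this project. */
int is_get(const char *method)
{
    return strcmp(method, "GET") == 0;
}
```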

The request body is optional and is separated from the header by an empty line (just a CRLF with no text). In the previous example the body is empty.
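Because a request may arrive in several pieces, a server typically keeps reading until it has seen the empty line. The following sketch looks for the CRLF CRLF sequence in the bytes received so far; the function name header_length is an assumption made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first byte after the empty line (CRLF CRLF)
 * that ends the header, or 0 if the empty line has not arrived yet. */
size_t header_length(const char *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return i + 4;
    }
    return 0;
}
```

A caller would append each chunk returned by read() to a buffer and call header_length() after every read until it returns a nonzero value.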

After the request is received, the server sends back a response. The response follows the same template, i.e., it consists of a header and a body, with the header consisting of an initial line, a number of lines in the same format as in the request, and an empty line that acts as a terminator.

An example response is

  HTTP/1.1 200 OK
  Date: Thu, 05 Apr 2007 05:09:59 GMT
  Server: Apache/2.0.40 (Red Hat Linux)
  Last-Modified: Thu, 26 Jan 2006 08:44:44 GMT
  Content-Length: 5806
  Connection: close
  Content-Type: text/html; charset=ISO-8859-1
  [empty line here]
  [body containing a 5806-byte HTML document]

In this example the initial line indicates success. The first word on this line is the HTTP version that this response adheres to. The second word is a numerical response code meant to be easily parsable by the client. The remaining words are a human-readable informational message explaining the response, and may vary from server to server. Depending on their numeric value, response codes can be classified as follows:

  1xx: informational message
  2xx: success
  3xx: redirection to another location
  4xx: error on the client's part
  5xx: error on the server's part

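The response codes we'll be using in this project for our server are:

  200 OK: the request succeeded and the requested file follows in the body
  403 Forbidden: the file exists but the server cannot open it for reading
  404 Not Found: the requested file does not exist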

The remaining lines in the example response above contain information about the message and server.
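A response header in this format can be assembled with snprintf(), as in the sketch below. The function name format_response_header is hypothetical, and the particular set of header lines it emits (Content-Type, Content-Length, Connection) is an assumption modeled on the example response above, not a statement of what your server is required to send.

```c
#include <stdio.h>

/* Hypothetical helper: write a response header, including the terminating
 * empty line, into buf. Returns the number of bytes written (not counting
 * the trailing '\0'), or -1 if the header does not fit in size bytes. */
int format_response_header(char *buf, size_t size, int code, const char *reason,
                           const char *content_type, long content_length)
{
    int n = snprintf(buf, size,
                     "HTTP/1.1 %d %s\r\n"
                     "Content-Type: %s\r\n"
                     "Content-Length: %ld\r\n"
                     "Connection: close\r\n"
                     "\r\n",                 /* empty line ends the header */
                     code, reason, content_type, content_length);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```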


Sockets

To connect to a web server, both the client and the server must create a socket. In this project you will have to familiarize yourselves with socket programming. Sockets are software constructs that allow two processes to communicate, either on the same machine or across the Internet. They were described in class; here are the notes we used in PDF format.

You may use the following C++ code as a starting point for Project 3. It shows how to establish a TCP/IP connection between two processes (a client and server). Once the connection is established, data is sent across the connection by writing to and reading from file descriptors.

tcp-server.cc
A simple TCP/IP server.
tcp-client.cc
A simple TCP/IP client.

The system calls that you will use for this project are the following. Note that you might not need to use all of them, depending on your implementation. Also note that the list is not exhaustive, i.e., you may use other system calls as well if you like.

All these calls have corresponding man pages.
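One detail worth knowing when writing to a socket descriptor: write() may accept fewer bytes than you asked it to send, so the call usually sits inside a loop. The sketch below shows one way to do that; the name write_all is an assumption made for this example.

```c
#include <errno.h>
#include <unistd.h>

/* Write exactly len bytes to fd, retrying after short writes and after
 * interruption by a signal. Returns 0 on success, -1 on error. */
int write_all(int fd, const void *data, size_t len)
{
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            return -1;
        }
        p += n;                    /* skip the bytes already written */
        len -= (size_t)n;
    }
    return 0;
}
```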


Web server overview

The server that you'll implement for this project has a simple structure that can be summarized as follows: Continuously wait for an incoming connection, accept it (if the client is not listed in the "forbidden" list, see below), service it (if possible), disconnect, and go back to waiting for another connection. In more detail the steps of your server are as follows, together with relevant system calls that you can use to implement them:

  1. Create socket (using socket()).
  2. Name it (using bind()).
  3. Specify the maximum size of the queue of pending connection requests (using listen()). For a single-threaded server a queue of size 1 is enough. If you choose to implement the optional multi-threaded server you may use a limit of 5.
  4. Wait for and accept any incoming connection (using the accept() system call). This creates a new socket that can be used for communication with the client.
  5. Check if the client address is in the "forbidden" list. Disconnect if so (using close()).
  6. Obtain the request from the client (using read()).
  7. Service the request if possible, sending back the resource requested (using write()).
  8. Log the connection into the log file (see below).
  9. Disconnect (use close() to delete the new socket).
  10. Go to 4.
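Steps 1 through 3 above can be sketched as a single function. This is a minimal illustration, not the contents of tcp-server.cc: the function name create_listening_socket is hypothetical, and setting SO_REUSEADDR is an extra convenience (it lets you restart the server on the same port right away) that the assignment does not ask for.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Steps 1-3: create, name, and listen on a TCP socket.
 * Returns the listening descriptor, or -1 on error. */
int create_listening_socket(unsigned short port, int backlog)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);            /* step 1 */
    if (fd < 0)
        return -1;

    /* Assumption: allow quick restarts on the same port. */
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);            /* any local address */
    addr.sin_port = htons(port);                         /* network byte order */

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||   /* step 2 */
        listen(fd, backlog) < 0) {                               /* step 3 */
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned descriptor is the one you would pass to accept() in step 4.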

Note: The accept() system call gives you a new file descriptor to communicate with the client. File descriptor I/O is generally considered low-level because one uses read() and write(), which lack the advanced formatting facilities of fprintf() or the input facilities of fscanf(). To use fprintf() or fscanf() one must obtain a FILE pointer (the first argument of fprintf() is a "FILE *"). This conversion can be easily accomplished through the fdopen() call. For details on how to use it see the man page of fdopen().


Required features

5 points: makefile

Create a functional makefile. Name your Web server executable webserv and make sure that typing make will build it.

10 points: Documentation

Create a file named README that describes the functionality of each of your source files and each function within them. You may use any number of source files and/or functions but you must describe them in the README file in sufficient detail so that a technical person (i.e., another programmer) can understand your code.

40 points: GET command

Implement the GET command of the HTTP protocol. This includes accepting an incoming connection, reading the request line and the entire header (up to the first empty line), returning the file requested in the request line, and closing the connection. If the file does not exist, you must return the appropriate error response as described above. If it exists but cannot be opened for reading, you must return a "forbidden" response, again as described above. You can determine the cause of a failed open() call by examining the system-defined variable errno; see "man errno" and "man 2 open".
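The errno check can be sketched as follows. The function names are hypothetical, and the mapping is one reasonable reading of the requirement: a missing file becomes 404, and every other failure to open the file is treated as 403 (that last choice is an assumption of this sketch).

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper: choose a response code from the errno value left
 * behind by a failed open(). */
int status_for_open_error(int err)
{
    switch (err) {
    case ENOENT:    /* no such file or directory */
    case ENOTDIR:   /* a component of the path is not a directory */
        return 404;
    case EACCES:    /* permission denied */
        return 403;
    default:        /* assumption: any other failure is reported as 403 */
        return 403;
    }
}

/* Opens path for reading. On success returns the descriptor and sets
 * *status to 200; on failure returns -1 and sets *status to 403 or 404. */
int open_resource(const char *path, int *status)
{
    int fd = open(path, O_RDONLY);
    *status = (fd >= 0) ? 200 : status_for_open_error(errno);
    return fd;
}
```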

When responding to a request, your server should return at least the following header lines:

The web server executable should take two arguments: (1) A directory name under which the web server's files and sub-directories are stored. All file names requested using the GET command are relative to this directory. (2) The port that the web server listens on. Note that ports below 1024 are privileged and cannot be used (you'll get a "permission denied" message if you try to use one of them).
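The two arguments might be handled as in the sketch below. Both function names are hypothetical. The port check rejects privileged ports (below 1024) and anything above 65535. The path helper joins the directory argument with the resource from the request line; refusing resources that contain ".." is an extra safety check added for this example, not something the assignment requires.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns the port number, or -1 if text is not a decimal number in the
 * range 1024..65535. */
int parse_port(const char *text)
{
    char *end;
    long port = strtol(text, &end, 10);
    if (end == text || *end != '\0' || port < 1024 || port > 65535)
        return -1;
    return (int)port;
}

/* Joins the server directory and the requested resource into out.
 * Returns 0 on success, -1 if the resource does not start with '/',
 * contains "..", or the result does not fit in size bytes. */
int build_file_path(const char *root, const char *resource,
                    char *out, size_t size)
{
    if (resource[0] != '/' || strstr(resource, "..") != NULL)
        return -1;
    int n = snprintf(out, size, "%s%s", root, resource);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```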

20 points: Connection logging

Log each incoming connection in a log file named webserv.log. Each connection entry should occupy one line, and must be in the format of the following example:

Connection from host 129.186.67.3 (pyrite-m.cs.iastate.edu), port 47282 on Sun, 01 Apr 2007 01:24:18 PM, file "index.html", status 200 OK

You can see an example of how to produce the current date and time in a string in this example C program.
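A log entry in the format shown above can be produced with strftime() and snprintf(), as in this sketch. The function name format_log_line and its parameter list are assumptions made for this example; the strftime() format string was chosen to match the date in the sample entry (12-hour clock with AM/PM) and assumes the default "C" locale.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: build one log line (ending in '\n') in buf.
 * Returns the line length, or -1 if it does not fit in size bytes. */
int format_log_line(char *buf, size_t size, const char *ip, const char *host,
                    int port, const struct tm *when,
                    const char *file, const char *status)
{
    char stamp[64];
    /* Produces e.g. "Sun, 01 Apr 2007 01:24:18 PM" */
    if (strftime(stamp, sizeof stamp, "%a, %d %b %Y %I:%M:%S %p", when) == 0)
        return -1;

    int n = snprintf(buf, size,
                     "Connection from host %s (%s), port %d on %s, "
                     "file \"%s\", status %s\n",
                     ip, host, port, stamp, file, status);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

To log the current time, you could obtain the struct tm argument with time() followed by localtime().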

25 points: Reject forbidden addresses

Read a list of addresses from the file forbidden.txt in the server's directory. The file contains the human-readable addresses of hosts that must be rejected, one forbidden address per line. Any request from these hosts, and any connection from a host that does not resolve to a human-readable address (and therefore cannot be checked against the forbidden list), should be rejected. This can be done by comparing the human-readable host name of each incoming connection request against each of the forbidden addresses. For an example of how to obtain the human-readable Internet address of an incoming connection see the tcp-server.cc program.
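The comparison against the forbidden list can be sketched as a simple line-by-line scan. The function name is_forbidden is hypothetical, and comparing names without regard to letter case (using strcasecmp()) is a choice made for this example, since host names are not case-sensitive; obtaining the host name itself is not shown here.

```c
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Returns 1 if host matches a line of the already opened list file,
 * 0 otherwise. Reading starts at the file's current position. */
int is_forbidden(FILE *list, const char *host)
{
    char line[256];
    while (fgets(line, sizeof line, list) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';    /* strip the line ending */
        if (strcasecmp(line, host) == 0)       /* case-insensitive match */
            return 1;
    }
    return 0;
}
```

A server could open forbidden.txt once per connection with fopen(), call is_forbidden(), and close the file again, or read the list into memory at startup.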


Optional feature

30 points: Multiple concurrent connections and multithreaded server

Implement a multithreaded server that can service multiple connections simultaneously. Of course, this requires the use of appropriate mechanisms for concurrent access to shared data e.g., the connection log (remember Project 2?).

The general structure of the main thread will now be:

  1. Wait for an incoming connection in the main thread.
  2. Accept the incoming connection request using accept().
  3. Create a new thread that handles the request coming through the newly created socket returned by accept().
  4. Service the request in the new thread.
  5. Log the connection request and response in the log.
  6. Terminate the thread (return from it).
  7. Join the main thread with the terminated thread. One way to do this is for the child thread to set a flag just before it returns, indicating that it is about to terminate, so that the main thread knows it may join with it.
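The part of this structure that touches shared data can be sketched with POSIX threads as follows. All names here (log_line, struct conn, handle_connection) are assumptions made for this example, and the handler only logs; a real handler would first read the request from the client descriptor and send the response. The sketch does not show the flag-based joining described in step 7.

```c
#include <pthread.h>
#include <stdio.h>

/* One mutex protects the shared log. */
static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Appends one line to the log. Holding the mutex keeps entries written
 * by different threads from interleaving. */
void log_line(FILE *log, const char *line)
{
    pthread_mutex_lock(&log_mutex);
    fputs(line, log);
    fputc('\n', log);
    fflush(log);
    pthread_mutex_unlock(&log_mutex);
}

/* Hypothetical per-connection argument handed to each thread. */
struct conn {
    FILE *log;
    int   client_fd;    /* would be the descriptor returned by accept() */
};

/* Thread entry point (steps 4-6). Servicing the request is omitted. */
void *handle_connection(void *arg)
{
    struct conn *c = arg;
    char line[64];
    snprintf(line, sizeof line, "handled descriptor %d", c->client_fd);
    log_line(c->log, line);
    return NULL;        /* returning terminates the thread */
}
```

The main thread would call pthread_create() with handle_connection after each accept() and later call pthread_join() on the thread. Depending on your system you may need to compile and link with the -pthread option.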

If you choose to implement this extra feature, make sure you include a SEPARATE EMPTY FILE called EXTRACREDIT in the same directory as your code. The TA MAY NOT go through your source files to try to understand if you did implement the extra feature, so it is very important that you indicate this fact by the existence of an EXTRACREDIT file.


Testing your Web server

To test your web server you can use a standard browser e.g., Firefox. You should be able to load a regular page containing text and images. To debug the server the browser alone is probably insufficient as you cannot see the actual messages between the client and the server. A better way for debugging is to connect to the server using the telnet program. Telnet takes an address and optionally the port on that host to connect to. Here's an example exchange with a web server using telnet.

You can use pyrite to develop and compile your server. However, pyrite is firewalled and disallows connections to arbitrary ports from any host other than itself. A better choice for testing your server is the set of lab machines lin141a through lin141t, which are not firewalled among themselves, so you can connect to any listening port between any two of them. Note that these machines are accessible remotely only by logging in to pyrite first and using ssh to connect to them.


Relevant resources on the Internet

A simple and easy-to-read description of the HTTP protocol:

If you would like to read all the gory details of the HTTP protocol, here are the relevant RFCs (Requests For Comments):

Here are some more thorough on-line references on sockets: