Disclosure: And-httpd is an HTTP server I wrote.
You've all seen HTTP requests, right? They're simple and look like:
    GET / HTTP/1.1
    Host: foo.example.com
    Accept: text/plain;q=0.1,text/*;q=1,*/*;q=0
    Range: bytes=8-
...right? WRONG. This is the exact same request, and is completely rfc2616 compliant:
    GET / HTTP / 1 . 1
    Accept: text/plain ; q = 0 .1 ,,,, ,, ,,, ,,, ,,, ,,, ,,,,,,,, ,,,, ,,
      text/* ; q = 1 . 00
    Range: bytes = 8 -
    Host: foo.example.com
    Accept: , */* ; q = 0 ,
If you too wish to explore the joy of simplicity that is the HTTP/1.1 rfc, you can look at either the w3 html version of rfc2616 or the ietf text version of rfc2616. The rest of this page includes tips, and "opinions".
This is a list of the std. HTTP/1.1 headers a read-only server MAY, SHOULD, or MUST understand (see rfc4229 for a list of all headers).
This is a list of "features" in the HTTP/1.1 "standard". Some are just bad standardization of current practice, some are features at the wrong layer and some are bits the std. didn't cover well. Almost all servers fail some of these (even apache-httpd doesn't pass them all).
I could forgive a lot of the following if the w3c, or any of the people listed at the top of rfc2616, wrote some simple unit tests. It really can't be that hard: if you want return code xxx from request Y, then send that request down a network connection and see if you got it. Then any sane person could have a quick look through the rfc, implement what they think is required, and when they're finished run the tests to see. But no.
The problem is that a valid request can look like:
    GET / HTTP / 1 . 1
    Host: foo.example.com
    Accept: text/html , text/* ; q = 1 . 00
This is complete crack. This is hidden at the bottom of section 2.1 "Augmented BNF" ... which you're almost guaranteed to skip. It feels like they threw their hands up at all the crap implementations adding space where it shouldn't be and said "Ok, you can now have ((\r\n)?[ \t]+)* between any two tokens". Joy.
apache-httpd completely fails to honor this in a lot of cases, but sometimes it does ... and which cases are completely undocumented and have no rationale that I can see. So you just have to clutter all the parsing code in the server with checks for implied LWS. Joy.
Note that the Request-Line doesn't come under implied LWS, but is explicitly allowed to have more than one space between tokens in section 19.3 ... and apache-httpd does parse the above request line.
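To make the pain concrete, here's a minimal sketch (in Python, purely for illustration -- And-httpd itself is C) of what tolerating implied LWS means for a header-value lexer. The token character class and the toy one-character separator handling are my assumptions, not code from any real server:

```python
import re

# Implied LWS per rfc2616 section 2.1: any amount of
# ( CRLF? 1*( SP | HT ) ) may appear between adjacent tokens,
# which also covers header line folding.
LWS = re.compile(r'(?:\r\n)?[ \t]+')

# Rough rfc2616 "token" character class (assumption for this sketch).
TOKEN = re.compile(r"[!#$%&'*+.^_`|~0-9A-Za-z-]+")

def skip_lws(s, pos):
    """Advance pos past any run of implied LWS (folding included)."""
    while True:
        m = LWS.match(s, pos)
        if not m:
            return pos
        pos = m.end()

def tokens_ignoring_lws(value):
    """Lex a header value into tokens and single-char separators,
    discarding implied LWS everywhere -- so "q = 1" and "q=1"
    produce the same token stream."""
    out = []
    pos = 0
    while pos < len(value):
        pos = skip_lws(value, pos)
        if pos >= len(value):
            break
        m = TOKEN.match(value, pos)
        if m:
            out.append(m.group(0))
            pos = m.end()
        else:
            out.append(value[pos])  # separator: ; , = / ( ) etc.
            pos += 1
    return out
```

Note even this toy shows why the rule is crack: whether "0 . 1" and "0.1" lex the same depends on how greedy your token rule is.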
This also leads to people assuming implied LWS where LWS seems to be explicitly disallowed, as in the following request:
    GET / HTTP/1.1
    HOST : foo.example.com
...now from my reading of section 4.2 it doesn't say you can follow the header name with anything other than a colon (":"), but yet because implied LWS is allowed everywhere else people/clients do requests like the above ... and apache-httpd allows it. Joy.
Ok, so they wanted to make sure all clients passed a host header, fine. But why make the server respond with a 400, and not a 404 if the "host" passed doesn't exist? apache-httpd just defaults to the "main" host, as does Tux. And-httpd responds with 400, by default.
And then there is the problem that if you pass an absolute URI it is supposed to override the host header ... so does that mean you don't need to pass the host header? apache-httpd says no, And-httpd assumes that's fine ... who can it hurt?
Most of the list headers allow you to specify a "quality", so the server can work out which of a number of choices the clients wants to make without the client having to do a request-response-request, and incur the latency of a round trip. However for accept-encoding there is also an "identity" option, which refers to the entity without any encoding. So given the request:
    GET / HTTP/1.1
    Host: foo
    Accept-Encoding: gzip;q=0.1, identity
...the server has to respond without any content-encoding. Also, given the request:
    GET / HTTP/1.1
    Host: foo
    Accept-Encoding: identity;q=0, *
...the entity should only be sent with some content-encoding or the server SHOULD respond with error 406. apache-httpd does neither of these things. In fact apache-httpd doesn't even obey the request:
    GET / HTTP/1.1
    Host: foo
    Accept-Encoding: gzip;q=0, *
...if it is configured with mod_deflate. If that wasn't screwed up enough, the quality parameter is a fraction "up to 3 decimal places" between 0 and 1 inclusive. Or as a sane person might say: "It's a number between 0 and 1000, with opportunity for parsing errors at every turn". I mean, who is this possibly helping? Is it possible some clients just do sprintf(buf, "%s: %s;q=%s", ...) and present the user with the 0.3 number? Why would it be harder for them to see a number from 0 to 1000? Is there some style thing I don't know about that makes 0.2 and 0.02 better than 200 and 20?
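If you do treat qvalues as what they really are -- an integer from 0 to 1000 -- the parser stays small. A hedged sketch (Python for illustration; the function name and the None-on-error convention are mine):

```python
def parse_qvalue(s):
    """Parse an rfc2616 qvalue into an integer 0..1000, or None on error.
    Grammar: ( "0" [ "." 0*3DIGIT ] ) | ( "1" [ "." 0*3("0") ] )."""
    s = s.strip()
    whole, _, frac = s.partition(".")
    if whole not in ("0", "1"):
        return None
    if len(frac) > 3 or (frac and not frac.isdigit()):
        return None
    # Pad the fraction to exactly three digits: "0.1" -> 100, "0.02" -> 20.
    val = int(whole) * 1000 + int(frac.ljust(3, "0"))
    return val if val <= 1000 else None  # reject e.g. "1.5"
```

(This assumes any implied LWS inside the value has already been stripped, which -- as above -- is its own adventure.)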
The Range specifies a content range within the encoded entity. What happens if the content-encoding disappears? I assume that to use accept-encoding and range the server has to give out an ETag? Or the client just assumes if it worked before, it works again and if that assumption fails it tries again? Using if-unmodified-since probably helps ... but this just seems broken.
You MUST ignore any header name in the connection header of a HTTP/1.0 request, to work around buggy HTTP/1.0 proxies. So given the request:
    GET / HTTP/1.0
    Host: foo
    Range: bytes=1-8
    Connection: host, range, close
...the server is supposed to act as if it got no headers. apache-httpd gets this wrong, so does almost everyone else ... woo woo stds.
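The rule itself is easy to implement once you know it exists. A sketch (Python, illustrative only; the dict-of-lowercased-names representation, and dropping the Connection header itself, are my assumptions):

```python
def strip_connection_headers(version, headers):
    """For an HTTP/1.0 request, ignore every header named in the
    Connection header (rfc2616 section 14.10's buggy-proxy defence).
    `headers` maps lower-cased header names to values; returns a new dict."""
    if version != "HTTP/1.0":
        return dict(headers)
    conn = headers.get("connection", "")
    # Connection is a comma-separated list of header names (tokens).
    doomed = {tok.strip().lower() for tok in conn.split(",") if tok.strip()}
    doomed.add("connection")  # assumption: drop the Connection header too
    return {k: v for k, v in headers.items() if k not in doomed}
```

Run it on the request above and the Host and Range headers vanish -- which is exactly the behaviour almost nobody implements.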
Massive layer violation. 99.9999% of the time no one wants to do more than continue a failed download, and if-range and range solve that problem. Having multipart/byteranges returns for multiple range requests is crack: almost nothing needs it and almost no servers support multi-range responses (so they all implement a custom unstandardized "works if you only pass one range, doesn't otherwise" model). Also most servers don't seem to limit the number of ranges you can specify, apart from the generic header size limits.
If they'd done the layering over the entire HTTP request, this could have worked much better. Just say, on a keep alive, the server MUST keep the return code of the last request and it would allow the client to send a set of requests like...
    GET / HTTP/1.1
    Host: foo.example.com
    Range: bytes=1-8
    If-Range: Sun, 10 Oct 2004 07:02:24 GMT

    GET / HTTP/1.1
    Host: foo.example.com
    Range: bytes=12-16
    If-Previous-Return-Code: 206

    GET / HTTP/1.1
    Host: foo.example.com
    Range: bytes=32-128
    If-Previous-Return-Code: 206
...I'd bet all servers could easily handle the extra burden to support that single header and the response wouldn't have to look completely different for those 0.0001% of requests that wanted to do things like the above.
Sure, it means if you want multiple range requests you need a server that supports keep-alive; all I can say to that is ... "Here's two cents kid, go buy yourself a real webserver". And-httpd can now be configured to respond to a specified number of ranges in a multipart/byteranges response.
So a header that takes a list can have any number of empty entries. Ie.
    GET / HTTP/1.1
    Host: foo.example.com
    Accept: text/*;q=0.1 ,, ,, text/html
...because, you know, it's more fun that way. Maybe the idea is whoever does the best ASCII art wins something ... a lobotomy probably.
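Discarding the empty elements is at least trivial; a naive sketch (Python for illustration -- note this simple split would break on commas inside quoted strings, which real list headers can contain):

```python
def split_list_header(value):
    """Split a #rule list header value on commas, discarding the empty
    elements rfc2616 permits ("a ,, ,, b" is the two-element list a, b)."""
    return [elem.strip() for elem in value.split(",") if elem.strip()]
```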
So instead of having to do...
    GET / HTTP/1.1
    Host: foo.example.com
    Accept: text/*;q=0.1,text/html
    GET / HTTP/1.1
    Host: foo.example.com
    Accept: text/*;q=0.1, text/html
...you can also do...
    GET / HTTP/1.1
    Accept: text/*;q=0.1
    Host: foo.example.com
    Accept: text/html
...and you have to join the headers to parse the latter identically. Not only is it pointless random crap, it means you can't take a subsection of the request for each header.
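The joining rule from section 4.2 looks something like this sketch (Python, illustrative; the list of (name, value) pairs as the wire representation is my assumption):

```python
def join_headers(raw_headers):
    """Combine repeated header fields per rfc2616 section 4.2: multiple
    fields with the same name are equivalent to a single field whose value
    is the comma-joined list of values, in the order they were received.
    `raw_headers` is a list of (name, value) pairs as read off the wire."""
    joined = {}
    for name, value in raw_headers:
        key = name.lower()  # header names are case-insensitive
        if key in joined:
            joined[key] = joined[key] + "," + value
        else:
            joined[key] = value
    return joined
```

So the split-Accept request above and the single-line versions all collapse to the same dict -- after you've done this pass over every header, for every request.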
After all the hacks with accept* header quality values and the multipart/byteranges hack, just so clients only needed to send a single request, one of the most common operations a webserver has to do requires a request/response/request: 301 redirects, an example of which is:
    GET /bar HTTP/1.1
    Host: foo.example.com
...when http://foo.example.com/bar is a directory, the server responds with something like:
    HTTP/1.1 301 Moved Permanently
    Date: Mon, 14 Feb 2005 06:04:27 GMT
    Server: foo/1.0.0
    Last-Modified: Mon, 14 Feb 2005 06:04:27 GMT
    Content-Type: text/html
    Content-Length: 500
    Location: http://foo.example.com/bar/

    [...snip content...]
...but if you respond with:
    HTTP/1.1 200 OK
    Date: Mon, 14 Feb 2005 06:04:27 GMT
    Server: foo/1.0.0
    Last-Modified: Mon, 14 Feb 2005 06:04:27 GMT
    Content-Type: text/html
    Content-Length: 500
    Content-Location: http://foo.example.com/bar/

    [...snip content...]
...you might expect that all the links within the document would be relative to http://foo.example.com/bar/ and not http://foo.example.com/bar ... but you'd be wrong.
In fact the standard was specifically worded to enable this behaviour, saying "The value of Content-Location also defines the base URI for the entity." ... however none of the browser authors implemented this feature, due to a fear that some people somewhere might have misconfigured their servers.
There is a general argument against termination-based protocols in DJB's netstrings documentation. However choosing CRLF (two characters), and then having the wording that rfc2616 does (search for LF, then remove the CR if it's there just before it -- and please don't randomly insert one without the other), causes even more problems, and they are security related (response splitting being the obvious one).
And-httpd just ignores this "advice", always looks for CRLF to delineate lines and treats a single CR or LF as a parse error ... either by returning 400 responses to data it needs to repeat (like host or URI), or by ignoring the header (in the same way it would ignore it if it contained NIL values).
All major clients send the correct CRLF encoding, and while it's possible some minor clients may be sending just LF I have no sympathy for accommodating them.
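The strict approach described above is simple to state in code. A sketch (Python, illustrative only -- this models the policy, not And-httpd's actual implementation):

```python
def split_crlf_lines(data):
    """Split a header block (bytes) strictly on CRLF, treating a bare CR
    or bare LF as a parse error and returning None rather than guessing
    -- the opposite of rfc2616's "just search for LF" advice."""
    lines = []
    pos = 0
    while pos < len(data):
        idx = data.find(b"\r\n", pos)
        if idx < 0:
            return None  # unterminated line
        line = data[pos:idx]
        if b"\r" in line or b"\n" in line:
            return None  # stray CR or LF inside a line
        lines.append(line)
        pos = idx + 2
    return lines
```

A request terminated with a bare LF parses to nothing, so the caller can 400 it (or ignore the offending header) rather than open the door to splitting tricks.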