Disclosure: And-httpd is an HTTP server I wrote.
You've all seen HTTP requests, right? They're simple and look like:
    GET / HTTP/1.1
    Host: foo.example.com
    Accept: text/plain;q=0.1,text/*;q=1,*/*;q=0
    Range: bytes=8-
...right? WRONG. This is the exact same request, and is completely rfc2616 compliant:
    GET / HTTP / 1 . 1
    Accept: text/plain ; q = 0 .1 ,,,, ,, ,,, ,,, ,,, ,,, ,,,,,,,, ,,,, ,,
      text/* ; q = 1 . 00
    Range: bytes = 8 -
    Host: foo.example.com
    Accept: , */* ; q = 0 ,
If you too wish to explore the joy of simplicity that is the HTTP/1.1 rfc, you can look at either the w3 html version of rfc2616 or the ietf text version of rfc2616. The rest of this page includes tips, and "opinions".
This is a list of the std. HTTP/1.1 headers a read-only server MAY, SHOULD, or MUST understand (see rfc4229 for a list of all headers).
This is a list of "features" in the HTTP/1.1 "standard". Some are just bad standardization of current practice, some are features at the wrong layer and some are bits the std. didn't cover well. Almost all servers fail some of these (even apache-httpd doesn't pass them all).
I could forgive a lot of the following if the w3c, or any of the people listed at the top of rfc2616, wrote some simple unit tests. It really can't be that hard: if you want return code xxx from request Y, then send that request down a network connection and see if you got it. Then any sane person could have a quick look through the rfc, implement what they think is required, and when they're finished run the tests to see. But no.
The problem is that a valid request can look like:
    GET / HTTP / 1 . 1
    Host: foo.example.com
    Accept: text/html , text/* ; q = 1 . 00
This is complete crack. This is hidden at the bottom of section 2.1 "Augmented BNF" ... which you're almost guaranteed to skip. It feels like they threw their hands up at all the crap implementations adding space where it shouldn't be and said "Ok, you can now have ((\r\n)?[ \t]+)* between any two tokens". Joy.
apache-httpd completely fails to honor this in a lot of cases, but sometimes it does ... and which cases are completely undocumented and have no rationale that I can see. So you just have to clutter all the parsing code in the server with checks for implied LWS. Joy.
Note that the Request-Line doesn't come under implied LWS, but is explicitly allowed to have more than one space between tokens in section 19.3 ... and apache-httpd does parse the above request line.
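To make the pain concrete, here's a minimal sketch (in Python, purely for illustration -- And-httpd itself is C) of what tolerating implied LWS means for a header-value lexer. The token character class and the toy one-character separator handling are my assumptions, not code from any real server:

```python
import re

# Implied LWS per rfc2616 section 2.1: any amount of
# ( CRLF? 1*( SP | HT ) ) may appear between adjacent tokens,
# which also covers header line folding.
LWS = re.compile(r'(?:\r\n)?[ \t]+')

# Rough rfc2616 "token" character class (assumption for this sketch).
TOKEN = re.compile(r"[!#$%&'*+.^_`|~0-9A-Za-z-]+")

def skip_lws(s, pos):
    """Advance pos past any run of implied LWS (folding included)."""
    while True:
        m = LWS.match(s, pos)
        if not m:
            return pos
        pos = m.end()

def tokens_ignoring_lws(value):
    """Lex a header value into tokens and single-char separators,
    discarding implied LWS everywhere -- so "q = 1" and "q=1"
    produce the same token stream."""
    out = []
    pos = 0
    while pos < len(value):
        pos = skip_lws(value, pos)
        if pos >= len(value):
            break
        m = TOKEN.match(value, pos)
        if m:
            out.append(m.group(0))
            pos = m.end()
        else:
            out.append(value[pos])  # separator: ; , = / ( ) etc.
            pos += 1
    return out
```

Note even this toy shows why the rule is crack: whether "0 . 1" and "0.1" lex the same depends on how greedy your token rule is.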
This also leads to people assuming implied LWS where LWS seems to be explicitly disallowed, as in the following request:
    GET / HTTP/1.1
    HOST : foo.example.com
...now from my reading of section 4.2 it doesn't say you can follow the header name with anything other than a colon (":"), but yet because implied LWS is allowed everywhere else people/clients do requests like the above ... and apache-httpd allows it. Joy.
Ok, so they wanted to make sure all clients passed a host header, fine. But why make the server respond with a 400, and not a 404 if the "host" passed doesn't exist? apache-httpd just defaults to the "main" host, as does Tux. And-httpd responds with 400, by default.
And then there is the problem that if you pass an absolute URI it is supposed to override the host header ... so does that mean you don't need to pass the host header? apache-httpd says no, And-httpd assumes that's fine ... who can it hurt?
Most of the list headers allow you to specify a "quality", so the server can work out which of a number of choices the clients wants to make without the client having to do a request-response-request, and incur the latency of a round trip. However for accept-encoding there is also an "identity" option, which refers to the entity without any encoding. So given the request:
    GET / HTTP/1.1
    Host: foo
    Accept-Encoding: gzip;q=0.1, identity
...the server has to respond without any content-encoding. Also, given the request:
    GET / HTTP/1.1
    Host: foo
    Accept-Encoding: identity;q=0, *
...the entity should only be sent with some content-encoding or the server SHOULD respond with error 406. apache-httpd does neither of these things. In fact apache-httpd doesn't even obey the request:
    GET / HTTP/1.1
    Host: foo
    Accept-Encoding: gzip;q=0, *
...if it is configured with mod_deflate. If that wasn't screwed up enough, the quality parameter is a fraction "up to 3 decimal places" between 0 and 1 inclusive. Or as a sane person might say: "It's a number between 0 and 1000, with opportunity for parsing errors at every turn". I mean, who is this possibly helping? Is it possible some clients just do sprintf(buf, "%s: %s;q=%s", ...) and present the user with the 0.3 number? Why would it be harder for them to see a number from 0 to 1000? Is there some style thing I don't know about that makes 0.2 and 0.02 better than 200 and 20?
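If you do treat qvalues as what they really are -- an integer from 0 to 1000 -- the parser stays small. A hedged sketch (Python for illustration; the function name and the None-on-error convention are mine):

```python
def parse_qvalue(s):
    """Parse an rfc2616 qvalue into an integer 0..1000, or None on error.
    Grammar: ( "0" [ "." 0*3DIGIT ] ) | ( "1" [ "." 0*3("0") ] )."""
    s = s.strip()
    whole, _, frac = s.partition(".")
    if whole not in ("0", "1"):
        return None
    if len(frac) > 3 or (frac and not frac.isdigit()):
        return None
    # Pad the fraction to exactly three digits: "0.1" -> 100, "0.02" -> 20.
    val = int(whole) * 1000 + int(frac.ljust(3, "0"))
    return val if val <= 1000 else None  # reject e.g. "1.5"
```

(This assumes any implied LWS inside the value has already been stripped, which -- as above -- is its own adventure.)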
The Range specifies a content range within the encoded entity. What happens if the content-encoding disappears? I assume that to use accept-encoding and range the server has to give out an ETag? Or the client just assumes if it worked before, it works again and if that assumption fails it tries again? Using if-unmodified-since probably helps ... but this just seems broken.
You MUST ignore any header name in the connection header of a HTTP/1.0 request, to work around buggy HTTP/1.0 proxies. So given the request:
    GET / HTTP/1.0
    Host: foo
    Range: bytes=1-8
    Connection: host, range, close
...the server is supposed to act as if it got no headers. apache-httpd gets this wrong, so does almost everyone else ... woo woo stds.
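The rule itself is easy to implement once you know it exists. A sketch (Python, illustrative only; the dict-of-lowercased-names representation, and dropping the Connection header itself, are my assumptions):

```python
def strip_connection_headers(version, headers):
    """For an HTTP/1.0 request, ignore every header named in the
    Connection header (rfc2616 section 14.10's buggy-proxy defence).
    `headers` maps lower-cased header names to values; returns a new dict."""
    if version != "HTTP/1.0":
        return dict(headers)
    conn = headers.get("connection", "")
    # Connection is a comma-separated list of header names (tokens).
    doomed = {tok.strip().lower() for tok in conn.split(",") if tok.strip()}
    doomed.add("connection")  # assumption: drop the Connection header too
    return {k: v for k, v in headers.items() if k not in doomed}
```

Run it on the request above and the Host and Range headers vanish -- which is exactly the behaviour almost nobody implements.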
Massive layer violation. 99.9999% of the time no one wants to do more than continue a failed download, and if-range and range solve that problem. Having multipart/byteranges returns for multiple range requests is crack: almost nothing needs it and almost no servers support multi-range responses (so they all implement a custom unstandardized "works if you only pass one range, doesn't otherwise" model). Also most servers don't seem to limit the number of ranges you can specify, apart from the generic header size limits.
If they'd done the layering over the entire HTTP request, this could have worked much better. Just say, on a keep alive, the server MUST keep the return code of the last request and it would allow the client to send a set of requests like...
    GET / HTTP/1.1
    Host: foo.example.com
    Range: bytes=1-8
    If-Range: Sun, 10 Oct 2004 07:02:24 GMT

    GET / HTTP/1.1
    Host: foo.example.com
    Range: bytes=12-16
    If-Previous-Return-Code: 206

    GET / HTTP/1.1
    Host: foo.example.com
    Range: bytes=32-128
    If-Previous-Return-Code: 206
...I'd bet all servers could easily handle the extra burden to support that single header and the response wouldn't have to look completely different for those 0.0001% of requests that wanted to do things like the above.
Sure, it means if you want multiple range requests you need a server that supports keep-alive; all I can say to that is ... "Here's two cents kid, go buy yourself a real webserver". And-httpd can now be configured to respond to a specified number of ranges in a multipart/byteranges response.
So a header that takes a list can have any number of empty entries. Ie.
    GET / HTTP/1.1
    Host: foo.example.com
    Accept: text/*;q=0.1 ,, ,, text/html
...because, you know, it's more fun that way. Maybe the idea is whoever does the best ASCII art wins something ... a lobotomy probably.
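Discarding the empty elements is at least trivial; a naive sketch (Python for illustration -- note this simple split would break on commas inside quoted strings, which real list headers can contain):

```python
def split_list_header(value):
    """Split a #rule list header value on commas, discarding the empty
    elements rfc2616 permits ("a ,, ,, b" is the two-element list a, b)."""
    return [elem.strip() for elem in value.split(",") if elem.strip()]
```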
So instead of having to do...
    GET / HTTP/1.1
    Host: foo.example.com
    Accept: text/*;q=0.1,text/html
    GET / HTTP/1.1
    Host: foo.example.com
    Accept: text/*;q=0.1, text/html
...you can also do...
    GET / HTTP/1.1
    Accept: text/*;q=0.1
    Host: foo.example.com
    Accept: text/html
...and you have to join the headers to parse the latter identically. Not only is it pointless random crap, it means you can't take a subsection of the request for each header.
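The joining rule from section 4.2 looks something like this sketch (Python, illustrative; the list of (name, value) pairs as the wire representation is my assumption):

```python
def join_headers(raw_headers):
    """Combine repeated header fields per rfc2616 section 4.2: multiple
    fields with the same name are equivalent to a single field whose value
    is the comma-joined list of values, in the order they were received.
    `raw_headers` is a list of (name, value) pairs as read off the wire."""
    joined = {}
    for name, value in raw_headers:
        key = name.lower()  # header names are case-insensitive
        if key in joined:
            joined[key] = joined[key] + "," + value
        else:
            joined[key] = value
    return joined
```

So the split-Accept request above and the single-line versions all collapse to the same dict -- after you've done this pass over every header, for every request.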
After all the hacks with accept* header quality values and the multipart/byteranges hack, just so clients only needed to send a single request, one of the most common operations a webserver has to do requires a request/response/request: 301 redirects, an example of which is:
    GET /bar HTTP/1.1
    Host: foo.example.com
...when http://foo.example.com/bar is a directory, the server responds with something like:
    HTTP/1.1 301 Moved Permanently
    Date: Mon, 14 Feb 2005 06:04:27 GMT
    Server: foo/1.0.0
    Last-Modified: Mon, 14 Feb 2005 06:04:27 GMT
    Content-Type: text/html
    Content-Length: 500
    Location: http://foo.example.com/bar/

    [...snip content...]
...but if you respond with:
    HTTP/1.1 200 OK
    Date: Mon, 14 Feb 2005 06:04:27 GMT
    Server: foo/1.0.0
    Last-Modified: Mon, 14 Feb 2005 06:04:27 GMT
    Content-Type: text/html
    Content-Length: 500
    Content-Location: http://foo.example.com/bar/

    [...snip content...]
...you might expect that all the links within the document would be relative to http://foo.example.com/bar/ and not http://foo.example.com/bar ... but you'd be wrong.
In fact the standard was specifically worded to enable this behaviour, saying "The value of Content-Location also defines the base URI for the entity." ... however none of the browser authors implemented this feature, due to a fear that some people somewhere might have misconfigured their servers.
There is a general argument against termination-based protocols in DJB's netstrings documentation. However choosing CRLF (two characters), and then having the wording that rfc2616 does (search for LF, then remove the CR if it's there just before it -- and please don't randomly insert one without the other), causes even more problems, and they are security related (response splitting being the obvious one).
And-httpd just ignores this "advice", always looks for CRLF to delineate lines and treats a single CR or LF as a parse error ... either by returning 400 responses to data it needs to repeat (like host or URI), or by ignoring the header (in the same way it would ignore it if it contained NIL values).
All major clients send the correct CRLF encoding, and while it's possible some minor clients may be sending just LF I have no sympathy for accommodating them.
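The strict approach described above is simple to state in code. A sketch (Python, illustrative only -- this models the policy, not And-httpd's actual implementation):

```python
def split_crlf_lines(data):
    """Split a header block (bytes) strictly on CRLF, treating a bare CR
    or bare LF as a parse error and returning None rather than guessing
    -- the opposite of rfc2616's "just search for LF" advice."""
    lines = []
    pos = 0
    while pos < len(data):
        idx = data.find(b"\r\n", pos)
        if idx < 0:
            return None  # unterminated line
        line = data[pos:idx]
        if b"\r" in line or b"\n" in line:
            return None  # stray CR or LF inside a line
        lines.append(line)
        pos = idx + 2
    return lines
```

A request terminated with a bare LF parses to nothing, so the caller can 400 it (or ignore the offending header) rather than open the door to splitting tricks.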