Debugging HTTP Headers

While writing the skreemr shell the other week I ran into an issue that required me to dig down to one of the lowest levels of web communication… HTTP Headers. I’m always shocked to learn that so many web developers don’t know much about headers. So I thought I would try to reenforce the point that HTTP Headers matter… and that knowing your stuff can help you debug and solve problems. Here is the story of my real world example.

I’m not going to go over HTTP Headers. thats already been done. Instead I’m going to focus on debugging and working with them a little bit. I’m going to assume you have already the general concepts.

The Problem

I wanted to add pagination to the skreemr shell. So you could run a search, then get the next page of results, etc. So a little experimentation with skreemr in my browser produced the following URLs:

http://skreemr.com/results.jsp?q=test

http://skreemr.com/results.jsp?q=test&l=10&s=10

http://skreemr.com/results.jsp?q=test&l=10&s=20

A simple pattern! “q” is the query string, “l” is the number per page, “s” is the number to start at, indexed from 0. So that last URL would produce results 21-30. This all worked well in my browsers, but it wasn’t working in Ruby:

require 'open-uri'

# Grab the pages and read the content
str1 = open( 'http://skreemr.com/results.jsp?q=test'      ).read  # 1-10
str2 = open( 'http://skreemr.com/results.jsp?q=test&s=10' ).read  # 11-20

#=> Should print false... but its printing true!
puts str1==str2

What this was saying was that the content being returned from both of those urls is EXACTLY the same. That couldn’t be… could it?

More Investigation

At this point I thought it was a problem in Ruby. I figured Ruby was doing some caching in the background that I was going to have to disable or work around. (I now know this is not true, but that was my first guess). To test that hypothesis I turned to my trusty friend curl and checked to see if that showed the proper behavior:

shell> curl 'http://skreemr.com/results.jsp?q=test'      -o 1.html
shell> curl 'http://skreemr.com/results.jsp?q=test&s=10' -o 2.html
shell> diff -q -s 1.html 2.html
Files 1.html and 2.html are identical

What!?! That stunned me. Curl was getting the exact same results as Ruby. I took a look at the html files, and indeed 2.html, which should have contained results 11-20 held 1-10. I opened both urls in my browser… they showed the correct results. Something weird was happening!

Take A Step Back

At this point you’ve got to know what is happening. Curl and Ruby (using Ruby’s Net:HTTP under the hood) are just making a simple GET request. Both my browsers Safari and Firefox are sending far more then just a GET request. They are sending a bunch of other headers. Lets take a look at what Firebug says Firefox sent:

firebug

Thats a mighty long list of headers! Its entirely possible that one of those might be influencing the server. Onto the drama!

Time to Act – Literally

The only difference between what curl sent in its request and what Firefox was sending is that header information and possibly any information contained in those cookies (unlikely in this case). So lets make curl act as if it is Firefox by send the same headers with curl! I took a quick peek at my curl reference for the proper switches/formatting and I was ready:

shell> curl 'http://skreemr.com/results.jsp?q=test' -o 1.html
shell> curl 'http://skreemr.com/results.jsp?q=test&s=10'            \
            -H 'Host: skreemr.com'                                  \
            -H 'User-Agent: Mozilla/5.0 (...) Firefox/3.0.6'        \
            -H 'Accept: text/html,application/xml;q=0.9,*/*;q=0.8'  \
            -H 'Accept-Language: en-us,en;q=0.5'                    \
            -H 'Accept-Encoding: gzip,deflate'                      \
            -H 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7'     \
            -H 'Keep-Alive: 300'                                    \
            -H 'Connection: keep-alive'                             \
            -H 'Cache-Control: max-age=0'                           \
            -o 2.html
shell> diff -q -s 1.html 2.html
Files 1.html and 2.html differ

Jackpot! Some subset of those headers is indeed fixing my problem, because now it properly fetched the second page of results! Translating that back into Ruby works as well:

require 'open-uri'

# Headers
HEADERS = {
  'Host'            => 'skreemr.com',
  'User-Agent'      => 'Mozilla/5.0 (...) Firefox/3.0.6',
  'Accept'          => 'text/html,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language' => 'en-us,en;q=0.5',
  'Accept-Encoding' => 'gzip,deflate',
  'Accept-Charset'  => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
  'Keep-Alive'      => '300',
  'Connection'      => 'keep-alive',
  'Cache-Control'   => 'max-age=0'
}

# Get the Pages
str1 = open( 'http://skreemr.com/results.jsp?q=test', HEADERS ).read
str2 = open( 'http://skreemr.com/results.jsp?q=test&s=10', HEADERS ).read

# Correctly Prints false
puts str1==str2

Problem Solved

Sure, we have a solution but its not pretty. How much is really needed? Finding that out is just a matter of eliminating the headers that do nothing. Remove one, test, remove another, test, remove another test… As long as it still works after you’ve removed an individual header you know that header wasn’t needed. It took me a few minutes but I narrowed it down to just a single header:

MAGIC = { 'Accept-Language' => 'en-us' }

This was a rather unique problem. It is rare for me to have to dip down to the HTTP Headers to see what is actually happening for such a high level problem. However, as this problem shows, its important to know what is going on under the hood. If I didn’t know about headers, I would not have been able to solve it.

AtomPub Overview and Curl Reference

Not long ago I had to learn about the Atom Publishing Protocol for my job. I spent about a week learning on my own time all about XML, AtomPub, and even the basics of HTTP. After that week I decided to write down my own personal overview and example code to try and “visually” explain AtomPub as best I could. The result was (and is):

My Visual Guide to AtomPub

Now keep in mind that I wrote that only a few weeks after learning it. The process of writing that guide forced myself to study it in greater detail than normal, actually run tests, and produce realistic output and examples. I know its not perfect (I’d probably be slaughtered for my definition of REST) but over time I’ll be happy to improve and update it. I think the design really improves the content making it readable, fun, and useful to refer to.

I’m linking to it now because I’ve done a number of projects like this (my Unix Tutorial) because I like sites that are strictly focused on one thing and do that one thing very well. I’ll probably spend a little bit of time on remainder of my break from school by cleaning up these small “brain dump” websites. I wanted to make sure they were mentioned and linked to from my blog. Clearly they will be of no use to anyone if they are never linked to!

I decided to include a small `curl` reference on my AtomPub guide. This is because its a very nice tool when working with HTTP requests and an overall generally useful shell program. I think people might find the curl reference useful.

I hope you enjoy this. I’ll be linking to these occasionally as they grow.

search