Information Assurance

Zico Kolter



Bugtraq Analysis

Buffer Overflow Vulnerabilities in Gaim

Bugtraq Email: 12 x Gaim remote overflows
Link: http://www.securityfocus.com/archive/1/351235

Gaim is an AOL Instant Messenger client originally written for Linux that has since been ported to several other operating systems. This Bugtraq advisory was of particular interest to me because I use Gaim for instant messaging on my Linux machine. The sheer number of vulnerabilities presented was quite disconcerting. In a program as widely used as Gaim, a single buffer overflow is already a serious potential problem, so the existence of 12 such bugs seems unthinkable. In addition, the email described the problems very well, so it was easy to see why each case was a potential problem.

For my presentation, I focused on just one of the vulnerabilities disclosed, which occurs in the gaim_url_parse() function. I will present the problem, show how it could be exploited, explain how it could be fixed, and discuss how it might be possible to prevent such problems in the future.

Here is the relevant code snippet as presented in the email:

gboolean
gaim_url_parse(const char *url, char **ret_host, int
               *ret_port, char **ret_path)
{
   char scan_info[255];
   char port_str[5];
   int f;
   const char *turl;
   char host[256], path[256];
   int port = 0;
   /*hyphen at end includes it in control set */
   static char addr_ctrl[] = "A-Za-z0-9.-";
   static char port_ctrl[] = "0-9";
   static char page_ctrl[] = "A-Za-z0-9.~_/:*!@&%%?=+^-";
   
   ...
   g_snprintf(scan_info, sizeof(scan_info),
              "%%[%s]:%%[%s]/%%[%s]", addr_ctrl,
              port_ctrl, page_ctrl);

   f = sscanf(url, scan_info, host, port_str, path); <-- [10]
   ...

The two lines of interest here are the calls to g_snprintf() and sscanf(), and the real problem is the call to sscanf(). The call to g_snprintf() is just setup for the call to sscanf(), so it only needs to be understood in terms of how it feeds that call.

The sscanf() function is a standard C library routine that reads formatted input from a string. Let's say that I have a string that contains some input, such as "Zico 20", and I want to parse this into two data items, one with the string "Zico" and one with the number 20. I would write code like this:

char name[256];
int age;
char *data;
...
sscanf(data, "%s %d", name, &age);

This may look a little complicated, but it's really quite simple. The first argument to sscanf() is the input string to parse, and the second is the format of that input: "%s" denotes a string (a run of non-whitespace characters), while "%d" denotes an integer. There are several other conversion specifiers that control how the function parses the input. This call is similar to the C++ iostream statement

cin >> name >> age;

except that it reads input from a string, not from the user's input at the terminal.

Because a complex formatting string for sscanf() can be unwieldy, the Gaim authors chose to break up the work into two parts: one call to generate the formatting string, and a second call to parse the input. In this case, the call to g_snprintf() always generates the same formatting string and does not rely on any user input. To be precise, the formatting string it generates will always be:

"%[A-Za-z0-9.-]:%[0-9]/%[A-Za-z0-9.~_/:*!@&%%?=+^-]"

When used with sscanf(), this format string reads one or more letters, digits, dots, or hyphens, followed by a colon, followed by one or more digits, followed by a slash, followed by one or more alphanumeric characters or certain special symbols. So if it read the string "www.cs.georgetown.edu:80/~clay", it would parse this into three strings: "www.cs.georgetown.edu", "80", and "~clay".
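
To make this concrete, here is a small sketch of the two-step parse. It is only an illustration: I use the standard snprintf() in place of g_snprintf(), the helper names build_scan_info() and parse_sample() are my own, and the unbounded format is only safe here because the input is a short, fixed string.

```c
#include <stdio.h>

/* Builds the same scanf-style format string as the code in the advisory.
 * Assumption: standard snprintf() stands in for g_snprintf(). */
int build_scan_info(char *scan_info, size_t size)
{
    static const char addr_ctrl[] = "A-Za-z0-9.-";
    static const char port_ctrl[] = "0-9";
    static const char page_ctrl[] = "A-Za-z0-9.~_/:*!@&%%?=+^-";

    /* "%%" becomes a literal '%', and each "%s" is replaced by a character set. */
    return snprintf(scan_info, size, "%%[%s]:%%[%s]/%%[%s]",
                    addr_ctrl, port_ctrl, page_ctrl);
}

/* Parses one known-short URL with the generated format.
 * Returns the number of fields sscanf() matched (3 on success).
 * The format has no field widths, so this must never see untrusted input. */
int parse_sample(char host[256], char port_str[5], char path[256])
{
    char scan_info[255];

    build_scan_info(scan_info, sizeof(scan_info));
    return sscanf("www.cs.georgetown.edu:80/~clay", scan_info,
                  host, port_str, path);
}
```

Calling parse_sample() with three suitably sized arrays fills them with the host, port, and path pieces described above. Note that character ranges such as "A-Z" inside the brackets are a common library behavior (glibc supports them) rather than something the C standard guarantees.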

The problem with this, however, occurs when the input is too big to fit in the corresponding strings. In the gaim_url_parse() function, the host variable is declared as an array of 256 characters, which leaves room for 255 characters plus the terminating null. So what happens if a malicious person manipulates the IM protocol so as to send Gaim a url with a host that is longer than 255 characters? The sscanf() function will continue reading the string and write past the end of the array. This will then start to overwrite other data on the stack, such as other variables or strings declared in the function. At the very least, a malformed string would probably corrupt the values of other variables in the function.
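
One way to see when the overflow happens is to measure how many characters the host field would consume, without writing anything. The following is only a sketch: HOST_CHARS, host_field_length(), and host_overflows() are names I made up, and the character set is spelled out by hand because strspn() does not understand ranges.

```c
#include <string.h>

/* The characters accepted by the "%[A-Za-z0-9.-]" host field, listed
 * explicitly (hypothetical helper constant, not from Gaim). */
static const char HOST_CHARS[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789.-";

/* Number of leading characters of url that the host field would read. */
size_t host_field_length(const char *url)
{
    return strspn(url, HOST_CHARS);
}

/* Returns 1 if an unbounded sscanf() would write past a host buffer of
 * buf_size bytes, 0 otherwise. sscanf() stores every matched character
 * plus a terminating '\0', so the field needs length + 1 bytes. */
int host_overflows(const char *url, size_t buf_size)
{
    return host_field_length(url) + 1 > buf_size;
}
```

With a 256-byte buffer, a 255-character host still fits, while a 256-character host already writes one byte past the end of the array.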

The real risk of this kind of buffer overflow, however, is much more serious. When you declare a local variable or array in a function, the program reserves room for these local variables on the stack, a section of memory at the end of a program's usable memory space. But the stack is also used for storing the return addresses of functions. When you call a function in C, you jump to the new function, but you also push the address of the instruction that follows the call onto the stack, so that when you return from the function, the processor knows where to continue execution. However, this also means that if you write past the end of an array on the stack, you're not only overwriting data, you may also be overwriting the return address for the current function. So when the function finishes execution, it won't return to the correct instruction. At the very least, this can cause the program to jump to an incorrect address and crash. But if a clever sequence of bytes is written, it is possible to make the program jump to an address that we've overwritten with our own instructions. We could then execute any code on the remote machine with the same access privileges as the program. And, now that such exploits are well known, we don't even have to be particularly clever in writing the code ourselves, as there is exploit code freely available that will, for example, execute a shell on the remote machine. This would effectively give us full access to that machine, or at least as much access as the person running Gaim has.

(For a better description of buffer overflows, see the link on the IA website to the "Smashing the Stack for Fun and Profit" article.)

There is some good news about this particular buffer overflow, however. The formatting string only allows, at the most, the characters "A-Za-z0-9.~_/:*!@&%%?=+^-" to be written to the path variable, and an even smaller set to the host variable. This means that even if we can overwrite the stack, we have to overwrite it with just these characters, making it difficult to exploit this overflow. However, even if we couldn't execute arbitrary code, we could still easily crash the program, and the potential exists that someone could devise malicious code that uses only these characters.

So, now that we know the problem, what can be done about it? The solution in this case is very simple. The sscanf() function, in addition to specifying how to read the input, can also specify the maximum number of characters to read into any particular field of the input. After the % character in the formatting string, we simply write the maximum width of that field. Because sscanf() also stores a terminating null after the characters it reads, the width must be one less than the size of the array. So when we generate the formatting string with the g_snprintf() call, the problem would be solved by using the following code:

g_snprintf(scan_info, sizeof(scan_info),
           "%%255[%s]:%%4[%s]/%%255[%s]", addr_ctrl,
           port_ctrl, page_ctrl);

Or, if we didn't want to hardcode values into the code, we could just use:

g_snprintf(scan_info, sizeof(scan_info),
           "%%%d[%s]:%%%d[%s]/%%%d[%s]",
           (int)sizeof(host) - 1, addr_ctrl,
           (int)sizeof(port_str) - 1, port_ctrl,
           (int)sizeof(path) - 1, page_ctrl);

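Now, if a malicious person sends an oversized url to Gaim, sscanf() stops writing once a field reaches its maximum width, so no data on the stack is overwritten. For an oversized host or port, the next character then fails to match the expected separator and sscanf() reports fewer matched fields; an oversized path is simply truncated.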
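
Here is a sketch of how a bounded parse along these lines might look as a standalone function. It is not Gaim's actual fix: the struct parsed_url type and the parse_url_bounded() name are my own, standard snprintf() stands in for g_snprintf(), and the buffer sizes simply mirror the ones in the advisory.

```c
#include <stdio.h>

/* Hypothetical container; sizes match the arrays in gaim_url_parse(). */
struct parsed_url {
    char host[256];
    char port_str[5];
    char path[256];
};

/* Parses "host:port/path" with a field width on every conversion.
 * Returns sscanf()'s result: the number of fields matched (0 to 3),
 * or EOF if the input ends before the first field. */
int parse_url_bounded(const char *url, struct parsed_url *out)
{
    static const char addr_ctrl[] = "A-Za-z0-9.-";
    static const char port_ctrl[] = "0-9";
    static const char page_ctrl[] = "A-Za-z0-9.~_/:*!@&%%?=+^-";
    char scan_info[255];

    /* Start from empty strings so unmatched fields are well defined. */
    out->host[0] = out->port_str[0] = out->path[0] = '\0';

    /* Each width is the buffer size minus one, leaving room for '\0'. */
    snprintf(scan_info, sizeof(scan_info), "%%%d[%s]:%%%d[%s]/%%%d[%s]",
             (int)sizeof(out->host) - 1, addr_ctrl,
             (int)sizeof(out->port_str) - 1, port_ctrl,
             (int)sizeof(out->path) - 1, page_ctrl);

    return sscanf(url, scan_info, out->host, out->port_str, out->path);
}
```

A caller would treat any return value other than 3 as a malformed url. With a 300-character host, for example, only the first 255 characters are stored, the colon fails to match, and the function returns 1.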

So, in a broader sense, what can be done about this problem? These kinds of buffer overflows have been well understood for more than 10 years, yet people still make the same programming mistakes that allow for such exploitation. The only real solution is educating programmers to know that when you make any call to scanf() or sscanf() that reads input into a string, you have to specify a maximum size for the string.

On the plus side, it does seem like progress is being made in this direction. Look at the call to g_snprintf(). The function snprintf() (in this case g_snprintf(), which is just the GLib library's version of the function) is well known as a safe replacement for sprintf(). The sprintf() function is kind of the opposite of the sscanf() function: it writes formatted data to a string. But, just like before, if the string we're going to write is longer than the array it is being written to, it can overflow the array and lead to the same kind of exploit. So it has become a somewhat well-established principle of good programming that instead of using the sprintf() function, you should always use the snprintf() function, which lets you specify a maximum length for the output string, such that it will never write data beyond this length. And that is exactly what the authors of Gaim did. The irony is that the call to g_snprintf() is not based on user input, and therefore there is no potential to overwrite the stack. The call will always generate the same string, which will always have the same length, and a malicious person could not exploit this to execute arbitrary code on the system. But despite this, the Gaim authors still use g_snprintf(). And although it's unnecessary here, I would not say this is a bad thing, as it just reinforces the habit that programmers should use this function instead of its less-safe counterparts. All that needs to happen, then, is that programmers need to develop the same mindset about specifying maximum string lengths when using scanf() or sscanf(). If we can do this, then we go a long way towards decreasing the number of buffer overflow exploits present in our applications.
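
As a small illustration of the snprintf() habit, here is a sketch using the standard C99 snprintf() rather than g_snprintf(); the format_greeting() function is a made-up example, not anything from Gaim.

```c
#include <stdio.h>

/* Writes "Hello, <name>!" into buf without ever exceeding size bytes.
 * Returns 1 if the whole string fit, 0 if it had to be truncated. */
int format_greeting(char *buf, size_t size, const char *name)
{
    /* C99 snprintf() returns the length the full output would have had,
     * so a result >= size means the output was cut short. */
    int needed = snprintf(buf, size, "Hello, %s!", name);

    return needed >= 0 && (size_t)needed < size;
}
```

With an 8-byte buffer and the name "Zico", the buffer ends up holding just "Hello, " and the function reports truncation, whereas sprintf() would have written past the end of the array.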