Threads, slow and buggy

When a lot of people speak about "multi-threading" there is often this black and white view of the world where on one side you create mutliple threads that share all state and use mutexes etc. and on the other every single instruction from your application to the device driver is completely synchronous. Real life is not black and white like that, so we need some words we can use to describe the greyness:

Terminology

task
A context that the OS needs to run instructions
process
A task that doesn't implicitly share any state with any other tasks
thread
A task that does implicitly share state with a group of other tasks
multi-tasking
Running multiple tasks at the same time for a common goal
blocking interface
Something that can stop a task for a significant amount of time
asyncronous interface
An interface that has a "start phase" and an "end phase" with a magical "middle phase" that occurs while the current task does something else. Can be thought of as:
task1 task2
start phase ...
send data recv data
... middle phase
recv data send data
end phase ...

What most people want to think about is an asyncronous interface or multi-tasking, neither imply threading and indeed it is my argument that threading is the worst initial approach to solve either problem.

Mis-conceptions of complexity

Threading is often seen as being "simple", the main reason being because you are encouraged not to think about how data moves between X and Y, as against when using explicit sharing (Ie. processes) where someone has to define how data moves between X to Y. The mental model for this is to assume that an async interface is easily created by doing:

task1 task2
pthread_create
... middle phase
... CHANGE DATA in task1

...this mental model has a major flaw, in that what happens in "..." in task1 is ignorant of "middle phase" in task2, even though task2 must "CHANGE DATA in task1".

Real life example

A somewhat common problem in Unix is that there is no good asyncronous interface to change a network address into a hostname. The blocking code looks like:


/* this is the simple "blocking" code: */
void dns_resolve_block(dns_t *dns, const char *addr, int len, int type)
{
  struct in_addr address;

  inet_aton(dns->ip, &addr);
  if ((ret = gethostbyaddr(addr, len, type)) == NULL)
      strncpy(dns->dns, dns->ip,        sizeof(dns->dns));
   else
      strncpy(dns->dns, result->h_name, sizeof(dns->dns));
  dns->dns[sizeof(dns->dns) - 1] = 0;
}

The naive asyncronous way to solve this problem is to use threading and goes something like (and I've seen people propose this type of code "for simplicity" more than once):


/* NOTE: very buggy,
   simplistic threading "solution" to blocking DNS interface */
*/

static void *dns_resolve__thread1_ip2host(void *arg)
{
  dns_t *dns = arg;
  struct in_addr address = NULL;
  struct hostent *ret = NULL;

  assert(dns);

  inet_aton(dns->ip, &addr);

  if ((ret = gethostbyaddr(addr, len, type)) == NULL)
    strncpy(dns->dns, dns->ip,        sizeof(dns->dns));
  else
    strncpy(dns->dns, result->h_name, sizeof(dns->dns));
  dns->dns[sizeof(dns->dns) - 1] = 0;

  list_push(dns_list, dns); /* cache */
  pthread_exit(NULL);
}

void dns_resolve_thread1(dns_t *dns, const char *addr, int len, int type)
{
  pthread_t thread;
  pthread_attr_t attr;
  struct in_addr address;
  node_t *node = NULL;

  for (list_next(dns_list, &node); node != NULL; list_next(dns_list, &node))
  {
    dns_t *scan = node->data;

    assert(scan);

    if (!strcmp(dns->ip, scan->ip))
    { /* cache */
      memcpy(dns->dns, scan->dns, sizeof(dns->dns));
      return;
    }
  }

  strncpy(dns->dns, dns->ip, sizeof(dns->dns));
  dns->dns[sizeof(dns->dns) - 1] = 0;
  
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

  if (pthread_create(&thread, &attr, dns_resolve__thread1_ip2host, dns) != 0)
  {
    /* error msg ... */
  }
}

Even assuming the list_*() functions do all the appropriate locking, there are severe design problems with the above. Starting with the system problems, SuS3 says:

The gethostbyaddr() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe.
...
The gethostbyaddr() and gethostbyname() functions may return pointers to static data, which may be overwritten by subsequent calls to any of these functions.

So not only is that call not safe is multiple queries happen at once, but it isn't safe if any other part of task1 calls gethostbyname(). Then all the writes to dns are thread-unsafe as array copies are not atomic, and if task2 is running on a different processor than task1 the writes to memory are not even required to be seen in the same order as they happened (in fact the implicit lock in list_push() is the only thing that guarantees that task1 sees the updated memory at all).

This is often then argued to be a "problem" outside of the threading code, as everyone else should be helping the threaded programer not have to design anything. The result is then imagined to look something like:


/* NOTE: slightly less buggy,
   simplistic threading "solution" to blocking DNS interface */
*/

static pthread_mutex_t dns_resolve__thread2_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *dns_resolve__thread2_ip2host(void *arg)
{
  dns_t *dns = arg;
  struct in_addr address = NULL;
  struct hostent *ret = NULL;

  assert(dns);

  if (pthread_mutex_lock(&dns_resolve__thread2_mutex))
    abort(); /* can't handle errors */

  safe_threaded_inet_aton(dns->ip, &addr);

  if ((ret = gethostbyaddr(addr, len, type)) == NULL)
      safe_threaded_strncpy(dns->dns, dns->ip,        sizeof(dns->dns));
   else
      safe_threaded_strncpy(dns->dns, result->h_name, sizeof(dns->dns));

  if (pthread_mutex_unlock(&dns_resolve__thread2_mutex))
    abort(); /* can't handle errors */
  
  list_push(dns_list, dns);
  pthread_exit(NULL);
}

void dns_resolve_thread2(dns_t *dns, const char *addr, int len, int type)
{
  pthread_t thread;
  pthread_attr_t attr;
  struct in_addr address;
  node_t *node = NULL;

  for (list_next(dns_list, &node); node != NULL; list_next(dns_list, &node))
  {
    dns_t *scan = node->data;

    assert(scan);

    if (!strcmp(dns->ip, scan->ip))
    {
      safe_threaded_memcpy(dns->dns, scan->dns, sizeof(dns->dns));
      return;
    }
  }

  /* this is fine to not use the safe_threaded_*() variant */
  strncpy(dns->dns, dns->ip, sizeof(dns->dns));
  dns->dns[sizeof(dns->dns) - 1] = 0;
  
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

  if (pthread_create(&thread, &attr, dns_resolve__thread2_ip2host, dns) != 0)
  {
    /* error... */
  }
}

Now although, in theory, the problem of the multiple writes to the string is solved and calls to gethostbyaddr() are serialized. However there are almost as many problems still present.

Due to how gethostbyaddr() is defined any other call to it outside of the above can still cause problems and any calls to gethostbyname() can also affect it. And although the safe_threaded_strncpy() etc. calls protect multiple threads that want to change those strings, all the readers are competing with the writers and so can still see partial data.

This leads to:


/* NOTE: just contains subtle bugs,
   "simplistic" threading "solution" to blocking DNS interface */
*/

static void *dns_resolve__thread3_ip2host(void *arg)
{
  dns_t *dns = arg;
  struct in_addr address = NULL;
  struct hostent *ret = NULL;

  assert(dns);

  safe_threaded_str_lock(dns->ip);
  safe_threaded_str_lock(dns->dns);

  inet_aton(dns->ip, &addr);

  safe_threaded_lock_dns();
  
  if ((ret = gethostbyaddr(addr, len, type)) == NULL)
      strncpy(dns->dns, dns->ip,        sizeof(dns->dns));
   else
      strncpy(dns->dns, result->h_name, sizeof(dns->dns));
  dns->dns[sizeof(dns->dns) - 1] = 0;
  
  safe_threaded_unlock_dns();
  
  safe_threaded_str_unlock(dns->ip);
  safe_threaded_str_unlock(dns->dns);

  list_push(dns_list, dns);
  pthread_exit(NULL);
}

void dns_resolve_thread3(dns_t *dns, const char *addr, int len, int type)
{
  pthread_t thread;
  pthread_attr_t attr;
  struct in_addr address;
  node_t *node = NULL;

  for (list_next(dns_list, &node); node != NULL; list_next(dns_list, &node))
  {
    dns_t *scan = node->data;

    assert(scan);

    safe_threaded_str_lock(dns->ip);
    safe_threaded_str_lock(scan->ip);
    if (!strcmp(dns->ip, scan->ip))
    {
      safe_threaded_str_lock(dns->dns);
      safe_threaded_str_lock(scan->dns);
      memcpy(dns->dns, scan->dns, sizeof(dns->dns));
      safe_threaded_str_unlock(dns->dns);
      safe_threaded_str_unlock(scan->dns);

      safe_threaded_str_unlock(dns->ip);
      safe_threaded_str_unlock(scan->ip);
      return;
    }
    
    safe_threaded_str_unlock(dns->ip);
    safe_threaded_str_unlock(scan->ip);
  }

  safe_threaded_str_lock(dns->ip);
  safe_threaded_str_lock(scan->ip);
  strncpy(dns->dns, dns->ip, sizeof(dns->dns));
  dns->dns[sizeof(dns->dns) - 1] = 0;
  safe_threaded_str_unlock(dns->ip);
  safe_threaded_str_unlock(scan->ip);
  
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

  if (pthread_create(&thread, &attr, dns_resolve__thread3_ip2host, dns) != 0)
  {
    /* error msg ... */
  }
}

Which now looks really ugly, and requires every single access to the dns data to take locks when the majority of the time they are completely unneeded. This also imposes a massive maintenance burden, as there is nothing stopping anyone from not taking the locks and all code paths have to drop them again ... this is also really hard to test for either of those failure modes. The same is true of the "dns" lock. But at least it's now correct, right?

No, it still contains at least one subtle bug. Even though all the list_*() functions have locking there is no mechanism to say you are walking the dns_list so it is impossible to remove something from dns_list in a safe way without adding even more locking (thus, again, making the list_*() function locking worthless.

More Real life

So all the above was obvious, can you spot the "obvious" error(s) in the below then?


/* real life problem1, spot the (at least one) threading error */

static void *prob1__func(void *arg)
{
  const char *ptr = arg;
  
  if (!ptr)
  {
    fprintf(stderr, "strdup(): %s\n", strerror(ENOMEM));
    abort();
  }

  if (thread_safe_func(ptr) == -1)
    fprintf(stderr, "thread_safe_func(%s): %s\n", ptr, strerror(errno));
  
  pthread_exit(NULL);
}

void prob1(const char *ptr)
{
  pthread_t thread;
  pthread_attr_t attr;
  const char *tmp = strdup(ptr);
  
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
  
  if (pthread_create(&thread, &attr, prob1__func, tmp) != 0)
  {
    /* error msg ... */
  }
}

/* real life problem2, spot the (at least one) threading error */

static void *prob2__func(void *arg)
{
  Thread_safe_str *s1 = arg;  
  unsigned long num = 0;
  
  if (!s1)
    abort();

  num = thread_safe_blocking_func();
  
  thread_safe_str_app_cstr(s1, "foo-");
  thread_safe_str_app_num(s1, num);
  thread_safe_str_app_cstr(s1, ".html");

  pthread_exit(NULL);
}

void prob2(const char *ptr)
{
  Thread_safe_str *s1 = NULL;
  pthread_t thread;
  pthread_attr_t attr;
  const char *tmp = strdup(ptr);

  if (!(s1 = thread_safe_str_make()))
    abort();
  
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
  FILE *fp = NULL;
  
  if (pthread_create(&thread, &attr, prob2__func, tmp) != 0)
  {
    /* error msg ... */
  }

  /* ... do stuff ... */

  fp = fopen(thread_safe_str_export_cstr(s1), "r");
  /* read/process file */
  fclose(fp);
  thread_safe_str_free(s1);
}

James Antill
Last modified: Wed May 17 03:07:44 EDT 2006