body { width 85% }; A { background: #FFF; color: #44F; } A:hover { color: #20b2aa; } A.anchor { color: #000; } A.anchor:hover { color: #000; } P { text-indent: 2em; } ul li { list-style-type: lower-roman; }

Design of ustr, the micro string API - for C

Non-Magic version of a ustr

/* As the saying goes:
 *   "you can tell a lot about the code from the data structure".
 *
 * So this is the non-magic version of struct Ustr:
 *
 * struct Ustr
 * {
 *   const char *ptr;
 *   size_t reference_count;
 *   size_t length;
 *   size_t size;
 *
 *   unsigned char flag_is_readonly            : 1;
 *   unsigned char flag_is_fixed               : 1;
 *   unsigned char flag_exact_byte_allocations : 1;
 *   unsigned char flag_has_enomem_error       : 1;
 *   unsigned char flag_fail_on_alloc          : 1;
 * };
 *
 * ...however this "eats" memory for "small" strings, which is one reason given
 * why people don't use a real string ADT.
 *
 */

Magic/real version of ustr

struct Ustr 
{
 unsigned char data[1];
 /* 0b_wxzy_nnoo =>
  * 
  *    w         = allocated
  *     x        = has size
  *      y       = round up allocations (off == exact allocations)
  *       z      = memory error
  *         nn   = refn
  *           oo = lenn
  *
  * Eg.
  *
  * 0b_0000_0000 ==
  *           "" ==
  * not allocated, [no size], [no round up allocs], no mem err, refn=0, lenn=0
  *
  * 0b1000_0001 ==
  *        0xA5 ==
  * allocated, no size, no round up allocs,         no mem err, refn=0, lenn=1
  *
  * 0b1010_0101 ==
  *        0xA5 ==
  * allocated, no size,    round up allocs,         no mem err, refn=1, lenn=1
  */
};

struct Ustrp
{ /* This is just used as a way to catch errors at compile time with static
   * typing. You might think we'd define Ustrp to be identical to Ustr, and
   * just cast however aliasing rules screw us over if we do that. */
  struct Ustr s;
};

Unused bit patterns for a ustr

/* Unused bit patterns for data[0]:
 *
 * 0bx0xx_x100 == lenn = 0
 * 0bx0xx_1x00
 * 0bx0x1_xx00
 * 0bx01x_xx00
 *              0bx1xx_xx00 => is used in sized
 * 0b10xx_xx00
 *
 * 0bx1xx_xx11 == 128bit size_t, yeh!
 * 0bx1xx_11xx
 *
 * 0b0xxx_x1xx == refn values for const/fixed
 * 0b0xxx_1xxx
 *
 * 0b00x1_xxxx == mem error for const
 * 0b001x_xxxx == not exact for const
 *
 */

English description of struct Ustr

This strucure is very magic, as an optimization "" is a valid "structure" for an empty string, ustr_len("") == 0, ustr_size("") == 0, ustr_cstr("") == "".

This also works for other "static data strings" if they start with the first four bits as 0 ... although for C const strings using this, the length has to be manually defined correctly at compile time.

However it is also a "space efficient" String ADT, with a managed length, _either_ with or without reference counts (so that ustr_dup() doesn't copy) and with or without a stored size (so growing/shrinking a lot doesn't cause lots of malloc/realloc usage). Note that for lengths of strings, or reference counts that need >= 8 bytes of storage the Ustr must contain a stored size.

It can also track memory allocation errors over multiple function calls.


If the first byte == 0, then it's "" as above.

Otherwise the first byte contains four flags, and 2 numbers (2 bits each).

The 1st flag says if the string is allocated. The 2nd flag says if it has a stored size. The 3rd flag says if it isn't using "exact allocations", if it is ustr_len() + ustr_overhead() == ustr_size_alloc(), otherwise there might be room to grow without doing a memory allocation call[1]. The 4th flag says if this Ustr has had a memory allocation error. The two numbers are the mappings for the length of the reference count, and the length of the length. Note that this mapping changes if the 2nd flag is set.

[1] There is an implied "size" of the string, based an how big it is. This is a concession for speed efficiency, although it only grows at 1.5x not 2x which is the norm. (again, to reduce memory waste). Again if this is too much overhead, just use exact sized allocations.

Also NOTE if the 1st and 2nd flags are set, this means that the Ustr is in fixed storge (like an auto array). Also if the 1st, 2nd and 3rd flags are set, this means that the Ustr is limited, Ie. it's in fixed storge and cannot grow (all allocation attempts will fail).

If there is any more data after the above declared one they have been allocated via. the "struct hack" method (search for more info).

Next, possibly, comes the reference count (of the given length[2]).

Next, if not "", comes the length of the data (of the given length[2]).

Next, if non-zero length, comes the data, which can include NIL bytes.

And finally, if not "", a NIL byte, to make sure it's always a valid C-Style string (although obviously any embeded NIL bytes will make it look shorter if something else just uses strlen(ustr_cstr())).

[2] The sizes can currently be 1, 2, 4 or 8 bytes. Depending on the mapping of the value in the first byte (length changes dynamically, as you add data).

Examples

The allocated string "a", using exact allocations and no reference counting, is the 4 bytes:

     {0x81, 0x01, 'a', 0x00}

The allocated string "ab", using non-exact allocations and 1 byte reference counting (shared by 13 users), is the 6 bytes (there is no waste due to non-exact allocations):

     {0xA5, 0x0D, 0x02, 'a', 'b', 0x00}

The allocated string "abcdefghijklmnop" (16 bytes, and a NIL), with non-exact allocations 2 byte reference counting (shared by 1,003 users), is the 24 bytes (with 3 bytes of "unused" space at the end):

     {0xA9, 0x03, 0xEB, 0x10, 'a', 'b', [...], 'o', 'p', 0x00, <x>, <x>, <x>}

Number of options for strings

This design allows 83 different combinations of options for your string types:


James Antill
Last modified: Tue Aug 28 01:06:50 EDT 2007