Vstr documentation -- overview

Contents

  • Download - A list of places and formats to download the library.
  • About - A description of the library, and its features, speed, testing, usability and standards compliance.
  • Subscription - If you wish to subscribe to new releases, you should go here.
  • Tutorial - A tutorial of how you can use the library, and its features.
  • Cost Tutorial - A tutorial of the costs associated with the different interfaces for manipulating data.
  • Testing - A description of the testing done for the library.
  • Debugging - A description of how to debug code when using the library.
  • String library comparison - Other libraries, and how they compare in features, speed, usability etc.
  • printf() comparison - Other implementations of printf(), and how they compare in standards compliance and ability to register custom formatters.
  • Security of string APIs - The security argument for using a string library, also see this page which shows which security problems wouldn't have existed if the programmers had used a real string API.
  • Speed of string APIs - An analysis of the assumption that a "real" string API has to be slower and use more memory than working directly with the strcpy()/memcpy() functions.
  • API Reference - The API reference documentation.
  • API layout - An introduction to the API, how to think of it so you can easily use it.
  • Reentrancy - How to use the library API with threads and/or signals.
  • Custom Formatters - How to use the custom formatters ability of the printf() like function, while keeping warnings enabled in gcc, or similar static printf() checkers.
  • Examples - Programs using the API, some of which are heavily commented.
  • License - The license that comes with the Vstr string library is the LGPL.

Download Information

Current version is 1.0.15

  • Tarballs are available via ftp and http.
  • An up-to-date YUM repository is available (containing only the non-debug versions of the rpms)...
    [and-org-james]
    name=And.org James' packages
    baseurl=ftp://ftp.and.org/yum/fc$releasever-james
    
    ...at which point you can just "yum install vstr".
  • I build RPMs for i386; they are available via ftp and http.
    Pbone.net seem to generate rpms for other architectures (although they are often older versions).
    Freshports seems to have at least a FreeBSD version.
  • If you want to look at the arch repository that vstr is now developed in, then it is available using
    tla register-archive \
        james@and.org--2004-code ftp://ftp.and.org/pub/james/ARCH-2004-code
    tla get james@and.org--2004-code/vstr--main--1
    
    here.
  • ChangeLog file can be found here
  • NEWS file can be found here
  • TODO file can be found here
  • BUGS file can be found here
  • A set of gdb functions, to help with debugging, can be found here

About

Vstr is a string library designed so you can work optimally with readv()/writev() for input/output. This means that, for instance, you can readv() data to the end of the string and writev() data from the beginning of the string without having to allocate or move memory. It also means that the library is completely happy with data that has multiple zero bytes in it.

This design constraint means that, unlike most string libraries, Vstr doesn't have an internal representation where the whole string can be accessed from a single (char *) pointer in C. Instead, the internal representation is of multiple "blocks" or nodes, each carrying some of the data for the string. This model also means that as a string gets bigger the Vstr memory usage only goes up linearly and has no inherent copying (other string libraries increase space for the string via realloc(), so their memory usage can be triple the required size and growing can require a complete copy of the string).

It also means that adding, substituting or moving data anywhere in the string can be optimized a lot, to require O(1) copying instead of O(n). Speaking of O(1), it's worth remembering that if you have a Vstr string with caching it is O(1) to get all the data to the writev() system call (the cat example below shows this: the write call is always constant time).
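As a rough sketch of the block idea (this is not the Vstr internal layout, and the chunk type and chunks_* functions are hypothetical names made up for illustration), a chain of nodes can be handed to a single writev() call without first copying the data into one contiguous buffer. Note that the data contains a zero byte, which is fine because every node carries its own length.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical node type for illustration: NOT the Vstr internals. */
struct chunk {
    const char *data;
    size_t len;
    struct chunk *next;
};

/* Point iov[] at each node's data; returns the number of entries used. */
int chunks_to_iovec(const struct chunk *head, struct iovec *iov, int max)
{
    int n = 0;

    for (; head && n < max; head = head->next) {
        iov[n].iov_base = (void *)head->data; /* writev() only reads it */
        iov[n].iov_len = head->len;
        n++;
    }
    return n;
}

/* Write (up to 16 nodes of) the chain with one writev() call. */
ssize_t chunks_write(int fd, const struct chunk *head)
{
    struct iovec iov[16];
    int n = chunks_to_iovec(head, iov, 16);

    return writev(fd, iov, n);
}

/* Send "abc", "\0de", "f" through a pipe; returns 0 if 7 bytes arrive intact. */
int chunks_demo(void)
{
    static const char part2[] = { '\0', 'd', 'e' };
    struct chunk c3 = { "f", 1, NULL };
    struct chunk c2 = { part2, sizeof part2, &c3 };
    struct chunk c1 = { "abc", 3, &c2 };
    char out[16];
    int fds[2];
    ssize_t wr, rd;

    if (pipe(fds) != 0)
        return -1;
    wr = chunks_write(fds[1], &c1);
    rd = read(fds[0], out, sizeof out);
    close(fds[0]);
    close(fds[1]);
    if (wr != 7 || rd != 7)
        return -1;
    return memcmp(out, "abc\0def", 7) == 0 ? 0 : -1;
}
```

Calling chunks_demo() from main() exercises the whole path. The only per-node work is filling in one struct iovec entry; the data itself is never copied in user space.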

As well as having features directly related to doing IO well it contains functions for:
  • formatting data with a printf() like function that is fully ISO 9899:1999 (C99) compliant, and also has %m as standard and POSIX i18n parameter number modifiers. It also allows gcc warning compatible custom format specifiers (and includes pre-written custom format specifiers for ipv4 and ipv6 addresses, Vstr strings and more).
  • splitting of strings into parameter/record chunks (a la perl).
  • substituting data in a Vstr string
  • moving data from one Vstr string to another (or within a Vstr string).
  • comparing strings (without regard for case, or taking into account version information)
  • searching for data in strings (with or without regard for case).
  • counting spans of data in a string (the equivalent of strspn() in ISO C).
  • converting data in a Vstr string (e.g. deleting/substituting unprintable characters or making a Vstr string lowercase/uppercase).
  • parsing data from a Vstr string (e.g. numbers, or ipv4 addresses).
  • easily parsing and wrapping outgoing data in netstrings, for fast and simple (and hence less error prone) network communication
  • the ability to cache aspects of data about a Vstr string, to both simplify and speedup use of the string.
  • the ability to have empty data as part of the string, this is somewhat useful for representing file transfers as a string as you can represent the file data as empty data in the string.
It also has a number of functions for exporting data from a Vstr string so you can easily use data generated with the Vstr outside of the library.
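The "POSIX i18n parameter number modifiers" mentioned in the list above are the %n$ forms, which let a format string pick its arguments by position. A minimal sketch, using the standard snprintf() rather than the Vstr API (the fmt_* function names are hypothetical), of how the same arguments can come out in a different order:

```c
#include <stddef.h>
#include <stdio.h>

/* Arguments are used in the order they are passed. */
int fmt_default_order(char *out, size_t len, const char *a, const char *b)
{
    return snprintf(out, len, "%1$s, %2$s", a, b);
}

/* Same arguments, but the format string selects them in reverse order. */
int fmt_swapped_order(char *out, size_t len, const char *a, const char *b)
{
    return snprintf(out, len, "%2$s, %1$s", a, b);
}
```

With a = "hello" and b = "world", the first function produces "hello, world" and the second "world, hello". A translated format string can therefore reorder output without the calling code changing.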

The other unusual aspect of the Vstr string library is that it attaches a notion of a locale to the string configuration and not globally (as POSIX, and pretty much everything else, do). This means that you can do network I/O in the C locale and user I/O in the user's locale.

For a look at the internal design of the Vstr string library, you can read this. For a look at the main security problems I wanted to solve you can read this.

Testing

The Vstr string library comes with a "make check" test suite with almost twelve thousand lines of code in it. This is over a third of the size of the library itself, and more lines of code than some string library implementations.

The test suite has at least one test for each function call, and at least one usage of each constant. This is automatically checked using the "scripts/tst_coverage_diff.sh" script included in the distribution (note that you need to compile without inline support, or inline functions won't be seen to be part of the test suite).

Using the coverage analysis available with gcc, the test suite has coverage of 100% of the source, i.e. every single line of code in the library is run by at least one unit test.

Note that this still doesn't mean that there are no bugs in the code (you need a test for every code path, for that).

Debugging Vstr APIs

While I've tried to make the API simple enough that you don't have to do anything complicated to get things done, there might still be times when you do a bunch of calls that you aren't sure are ok, or maybe you get some memory management wrong and pass invalid/NULL pointers to the Vstr API functions. The easiest way to find out what is going wrong is to compile without inline support and with debug support (e.g. the --enable-tst-noinline and --enable-debug options to ./configure). Then, as you call the functions, almost all calls check input values for validity, and all calls that modify a Vstr will check the Vstr both before and after their operations. Finally, if you call vstr_exit(), the number of memory allocations/deallocations and mmap()/munmap() operations will be counted and assert() calls will trigger if data hasn't been freed. NOTE: If you are using rpms, then there are already rpms for the debug build ... and they should be accepted as "newer" than the normal rpms so you can just install them while you develop.

As well as that, in all builds gcc attribute support is checked for the following attributes: nonnull, pure, const, format and malloc. The const and pure attributes let the optimiser do some things that can be surprising if you are trying to debug something (for example if you call vstr_cmp() and don't use the return value, gcc will never even do the call in the first place) ... so you need to watch for that. The nonnull attribute should catch errors if you obviously pass NULL pointers to functions that don't take them, and the format attribute will catch errors in the calls to printf() like functions. However you may want to temporarily disable attributes due to the optimising problems (if so, define VSTR_COMPILE_ATTRIBUTES to be 0 before including the vstr.h header).
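To make the attribute discussion concrete, here is a small sketch of how the pure, nonnull and format attributes are attached to ordinary functions. The demo_* names and DEMO_* macros are hypothetical and are not part of vstr.h; only the attribute names come from the text above.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical macros: expand to nothing on compilers without attributes. */
#if defined(__GNUC__)
# define DEMO_PURE      __attribute__((pure))
# define DEMO_NONNULL_1 __attribute__((nonnull(1)))
# define DEMO_FMT_3_4   __attribute__((format(printf, 3, 4)))
#else
# define DEMO_PURE
# define DEMO_NONNULL_1
# define DEMO_FMT_3_4
#endif

/* pure: no side effects, so the optimiser may drop a call whose
 * return value is never used. */
DEMO_PURE DEMO_NONNULL_1 size_t demo_count(const char *s, char c);

/* format: the compiler checks the arguments against the format string. */
DEMO_NONNULL_1 DEMO_FMT_3_4
int demo_fmt(char *buf, size_t len, const char *fmt, ...);

/* Count how many times c occurs in the NUL terminated string s. */
size_t demo_count(const char *s, char c)
{
    size_t n = 0;

    for (; *s; ++s)
        if (*s == c)
            ++n;
    return n;
}

/* Thin wrapper around vsnprintf(), so it can carry the format attribute. */
int demo_fmt(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int ret;

    va_start(ap, fmt);
    ret = vsnprintf(buf, len, fmt, ap);
    va_end(ap);
    return ret;
}
```

If you set a breakpoint on a function like demo_count() and it never fires, an unused return value in an optimised build is one thing to check. A call such as demo_fmt(buf, sizeof buf, "%d", "text") should produce a format warning with gcc when format warnings are enabled.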

API Reference Documentation

All exported interfaces are documented, anything which isn't documented isn't guaranteed by the API or the ABI of the library ... so don't use it. There is a script as part of the distribution "scripts/diff_symbols.sh" which checks this. So I haven't just forgotten, if it isn't documented it's undefined what it does and it might change type signature or disappear completely.

  • functions - List, and explanation, of all public functions exported by the Vstr string library. This is also available as a Unix man page.
  • constants - List, and explanation, of all public constants exported by the Vstr string library. This is also available as a Unix man page.
  • structs - List, and explanation, of all publicly readable or writable struct members exported by the Vstr string library.
  • namespace - An explanation of the namespace used by the Vstr string library, anything exported that is not in this namespace is a bug (so report it if you see it, but I don't think there are any).

Mental model of the layout of the API

At first glance the Vstr API looks huge, as there are over two hundred and eighty functions. However the API was designed so that you can mentally build functions from an API template ... so instead of having to remember 280 functions you just need to remember 10 to 20 pieces of the API template.

Vstr functions try to obey a template where each part alternates between an object name and an action like...

<namespace> "_" <verb>
<namespace> "_" <verb> "_" <noun>
<namespace> "_" <verb> "_" <noun> "_" <verb>

...or...

<namespace> "_" <noun>
<namespace> "_" <noun> "_" <verb>
<namespace> "_" <noun> "_" <verb> "_" <noun>

...a good example is searching for data in a Vstr string, here is a list of the functions that you can use to search for data in a vstr...

vstr_csrch_chrs_fwd()
vstr_csrch_chrs_rev()
vstr_csrch_cstr_chrs_fwd()
vstr_csrch_cstr_chrs_rev()
vstr_srch_buf_fwd()
vstr_srch_buf_rev()
vstr_srch_case_buf_fwd()
vstr_srch_case_buf_rev()
vstr_srch_case_chr_fwd()
vstr_srch_case_chr_rev()
vstr_srch_case_cstr_buf_fwd()
vstr_srch_case_cstr_buf_rev()
vstr_srch_case_vstr_fwd()
vstr_srch_case_vstr_rev()
vstr_srch_chr_fwd()
vstr_srch_chr_rev()
vstr_srch_chrs_fwd()
vstr_srch_chrs_rev()
vstr_srch_cstr_buf_fwd()
vstr_srch_cstr_buf_rev()
vstr_srch_cstr_chrs_fwd()
vstr_srch_cstr_chrs_rev()
vstr_srch_vstr_fwd()
vstr_srch_vstr_rev()
vstr_cspn_chrs_fwd()
vstr_cspn_chrs_rev()
vstr_cspn_cstr_chrs_fwd()
vstr_cspn_cstr_chrs_rev()
vstr_spn_chrs_fwd()
vstr_spn_chrs_rev()
vstr_spn_cstr_chrs_fwd()
vstr_spn_cstr_chrs_rev()

...which is a lot of functions just to search for some data. However that can be broken up into...

<namespace>  <verb>     <noun>     <verb>
"vstr_"      cspn       buf        fwd
             csrch      chr        rev
             spn        chrs
             srch       cstr_buf
             srch_case  cstr_chrs
                        vstr

...which is much less information to remember. It is also consistent in that the same object names are used everywhere and are prefixed by cstr_ when their length is assumed by looking for a NUL terminator.
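The template can be sketched as a tiny helper that glues the pieces together. The api_name() function below is hypothetical and only builds the name as text. Not every combination of pieces names a real function (the list above has no vstr_cspn_buf_fwd(), for example), so check the API reference before relying on a composed name.

```c
#include <stddef.h>
#include <stdio.h>

/* Build "vstr_<verb>_<noun>_<direction>" into out.
 * Returns what snprintf() returns: the length the full name needs. */
int api_name(char *out, size_t len,
             const char *verb, const char *noun, const char *direction)
{
    return snprintf(out, len, "vstr_%s_%s_%s", verb, noun, direction);
}
```

For instance the pieces "srch_case", "cstr_buf" and "rev" give vstr_srch_case_cstr_buf_rev, which is in the list of search functions above.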

Threads and signals

All operations are local to the object(s) they are manipulating, and no locking is done inside the library. Synchronization belongs above simple data type primitives like strings. That said, if you want to use the Vstr string library from multiple threads, then everything should mostly just work if you have a separate Vstr configuration for each thread and operate on strings created by those configurations local to that thread. Using vstr_conf_swap() you could have a pool of objects using Vstr strings and then localize them to a thread's configuration as you want to operate on those objects.

For all data that you wish to move between two Vstr strings that are "owned" by different threads you will need to do some higher level locking around the copying. One caveat is if you have a Vstr_ref node inside a Vstr string, and then copy that to a string owned by another thread (or do a VSTR_TYPE_ADD_BUF_REF or VSTR_TYPE_ADD_ALL_REF copy of any data) there will be unlocked reference counting on the Vstr_ref ... so basically you can't do that unless you really know what you are doing.


For Vstr string operations you wish to do from a signal handler, life is more complicated, unless you're using a malloc() implementation that is guaranteed to be reentrant safe (this is generally not the case, and not the same as a thread-safe malloc() ... as you can be inside malloc() when you get a signal). The obvious way to get around this is to pre-allocate enough storage in the Vstr configuration to be used in the signal handler, i.e. call vstr_make_spare_nodes(). If you absolutely need to use a Vstr string in a signal handler, that is also used outside a signal handler, you would need to block the signals it could be accessed in around each manipulation of it (or each access to it, if you manipulate it inside a signal handler). Yes, this will be slow; the solution is: do not do that.
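Blocking a signal around each manipulation looks roughly like the sketch below. It uses a plain fixed size buffer as a stand-in for the shared string (struct shared_buf and shared_append_blocked() are hypothetical names, not Vstr API) and assumes a single threaded program; with threads you would use pthread_sigmask() instead of sigprocmask().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Stand-in for a string that a signal handler also touches: NOT a Vstr type. */
struct shared_buf {
    char data[64];
    size_t len;
};

/* Append one byte with signal sig blocked for the duration of the change.
 * Returns 0 on success, -1 if the buffer is full or masking failed. */
int shared_append_blocked(struct shared_buf *b, char c, int sig)
{
    sigset_t block, old;
    int ret = -1;

    if (sigemptyset(&block) != 0 || sigaddset(&block, sig) != 0)
        return -1;
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    /* The handler for sig cannot run here, so it never sees a half update. */
    if (b->len < sizeof b->data) {
        b->data[b->len++] = c;
        ret = 0;
    }

    /* Restore the previous mask; a pending sig is delivered after this. */
    if (sigprocmask(SIG_SETMASK, &old, NULL) != 0)
        return -1;
    return ret;
}
```

Every manipulation pays for two extra system calls, which is the slowness mentioned above.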

For most sane uses of signals, the only time you want to do things with strings in the handler is from the SIGSEGV handler, so you can create some debugging information etc. At which point you can probably just do it.

Custom formatters

If you want to write a number to a string in C, you would normally write code such as...

 sprintf(my_str, "%d", num);
	  

...and to append the same to a Vstr string it's a simple API change to...

vstr_add_fmt(my_vstr, my_vstr->len, "%d", num);
	  

...however if you want to write an IPv4 address, a Vstr string or any other type that isn't in ISO 9899:1999 to a string you have to resort to doing it by hand. And if you want to format that output you have to either convert it to a C style string and use the "%s" option to the *printf() like function, or do all the formatting yourself. This is all pretty ugly, often unreliable, slow and takes significant programmer resources.

This is where custom formatters can help and give you back code clarity, reliability, speed and ease of use. Assuming you want to print an IPv4 address, then you can initialize the Vstr configuration like so...

vstr_sc_fmt_add_all(my_vstr->conf);
vstr_cntl_conf(my_vstr->conf, VSTR_CNTL_CONF_SET_FMT_CHAR_ESC, '%');
	  

...you can then write...

struct sockaddr_in sa;
struct in_addr ipv4;

vstr_add_fmt(my_vstr, my_vstr->len, "%-20{ipv4.p}", (void *)&ipv4);
vstr_add_fmt(my_vstr, my_vstr->len, "%*{ipv4.p}", 20,
             (void *)&sa.sin_addr.s_addr);
	  

...and to add the Vstr string you do...

vstr_add_fmt(my_vstr, my_vstr->len, "%*.*{vstr}", 50, 50,
             (void *)my_vstr, 1, my_vstr->len);
	  

...all normal printf() like formatting options work as you would expect them to, including being able to use i18n format specifiers to easily change the order of output for different locales. However if you try the above, you'll note that all of the calls to vstr_add_fmt() will produce warnings with gcc, because "%{" isn't the start of a valid formatting character under gcc's static printf() parsing rules. This deficiency makes custom formatters as used above mostly useless, as you have to either turn warnings off for format strings (which is basically insanity in C) or see at least one warning for every usage of a custom formatter.

To deal with this, the Vstr custom formatter code allows you to work around the static checkers by using the following initialization code...

vstr_sc_fmt_add_all(my_vstr->conf);
vstr_cntl_conf(my_vstr->conf, VSTR_CNTL_CONF_SET_FMT_CHAR_ESC, '$');
	  

...you can then call the custom formatters, using code like...

struct sockaddr_in sa;
struct in_addr ipv4;

vstr_add_fmt(my_vstr, my_vstr->len, "$-20{ipv4.p:%p}", (void *)&ipv4);
vstr_add_fmt(my_vstr, my_vstr->len, "$*{ipv4.p:%d%p}", 20,
             (void *)&sa.sin_addr.s_addr);

vstr_add_fmt(my_vstr, my_vstr->len, "$*.*{vstr:%d%d%p%zu%zu%u}", 50, 50,
             (void *)my_vstr, (size_t)1, my_vstr->len, VSTR_TYPE_ADD_DEF);
	  

...which, although it isn't quite as nice as true support for custom formatting in static analyzers like gcc, does make sure that custom formatters will not do anything obviously stupid (without producing spurious warnings) and provides complete protection for non-custom formatter calls. One final note is that in all sane environments you don't need the cast to (void *), however it is "in theory" required to be conforming ISO 9899:1999 C.

You may also want to look at the tutorial section on creating custom formatters.

Simple and heavily commented examples

Note that some of these are explained in much more detail in the tutorial. To get a rough overview of how to use the library you can see the following heavily commented examples:

  • hello world - A program that just does a "Hello World".
  • cat - A program that just transfers information from stdin to stdout ... mmap() is never used.
  • nl - This does something similar to the Unix "nl" program. Each input line is printed along with a number. This uses the split functions to get each line.
  • hexdump - An ASCII hexdump utility, it prints in the "programmer" format of 16 bytes per line with Hex first, and then ASCII. High ASCII characters are printed, or not depending on a configuration. This also uses mmap() on the input files, if possible.
  • lookup_ip - A program that will change a DNS domain name (e.g. www.and.org) into an IPv4 address (or print an error). This uses the std. custom formatter for printing IPv4 addresses.
  • gmp_factorials - A program that will output each factorial leading up to a number supplied on the command line. This creates a custom formatter for printing GMP mpz_t variables.

To get a better understanding, there are other example programs which aren't as heavily commented but should show how you can solve certain problems. They are:

  • rot13 - This does a rot13 transform (or Caesar cipher) on the input, and then outputs it. This shows the different methods you can use for iterating through a string. You can also use it to measure the differences between those methods.
  • yes - This just shows text repeatedly, it does show the difference between the different methods of copying data between Vstr strings. You can use this program to measure those copying effects. One minor note is that it does allow you to use the string "--version" as the output ... which the GNU version doesn't (as of fileutils-2.0.12).
  • csv - This is a csv parser I wrote for a thread on comp.lang.c. I'll probably turn it into a library function at some point, but for now it's useful on its own.
  • monitor copy - This program will monitor a file as it gets bigger, and report Bytes per second etc. statistics. This is done through Vstr string custom formatter API calls, and should show how much simpler you can make your code using those features.
  • slowcat (this also requires the Timer_q library, and getopt_long) - This is like the cat program, except that you can specify a delay between blocks of output ... this is especially useful for ASCII art programs.
  • C to html converter - This program implements something similar to , but just for C, this is used to turn the tutorial examples into html.
  • Simple Server Side Includes processor - This program implements something similar to Apache httpd SSI, but it only does the include statement. I use this to generate some documentation.
  • HTTP/1.1 server - Serves static files over HTTP/1.1, this shows off Vstr how it was designed to be used ... as a non-blocking IO server.
  • Scatter gather comparison (this also requires the OpenSSL library) - This program times the usage of the MD5 or SHA1 functions from OpenSSL, with all of: A single chunk of data, an iovec list of the data at the default buffer size, and an iovec list of the data at a configured buffer size. This was written to disprove the theory that iovec lists are always too slow to use; I believe I did this, assuming you have a chunk size that isn't tiny (and the default chunk size is probably good enough).
  • Adding data to Vstr comparison - This program times the overhead of adding data to a Vstr, in a few different ways. Note that the hand optimized way depends on internals that aren't guaranteed by the API, however the hand iovec method is basically as fast and only uses the documented API (this isn't surprising in my opinion, as the library was designed with the idea that the performance critical data would be coming in through iovecs).

All of the examples can be seen HERE.

For the truly adventurous the "make check" test suite root is HERE (NOTE: the test suite is written to try and break the Vstr string library, so although it uses all of the APIs it may not be code you want to copy and paste into your programs/libraries -- however given that everything in the test suite works, you know that those uses do work). However do note that a couple of the tests do use undocumented members of structs etc., and you still shouldn't use those.

There is also a "port" of the vsftpd FTP server to use the Vstr string library. It can be found here. This was mainly an experiment in how well/easily Vstr would work inside an application designed for a traditional String API model.


James Antill
Last modified: Mon Mar 6 19:13:24 EST 2006