Schema and Disk Dumps from C

  • Changes to the Schema package in V3.3
  • Examining Schema from C
  • Introduction to Diskio
  • How to Write Data to Disk
  • Dumping Data, the Whole Story
  • C Functions for Diskio
  • Details of Internals

    Changes to the Schema package in V3.3

    In version v3.3 of dervish, schema definitions are local to a package rather than being global. The definition of a `package' is up to you; it is all the types that you decided to lump together when you ran make_io. The name of the package is specified on the make_io command line; for example, dervish/src/Makefile contains the lines

        DISKIO_FILES = $(INC)/region.h $(INC)/shCList.h
        #
        diskio_gen.c : make_io $(DISKIO_FILES)
                rm -f diskio_gen.c
                $(DERVISH_DIR)/bin/make_io -v1 -m Dervish diskio_gen.c $(DISKIO_FILES)
                chmod 444 diskio_gen.c

    which defines a package called dervish (the -m flag) containing the schema defined in shCList.h and region.h. Usually all generated dump functions will start with the first two letters of the module name (e.g. shDumpIntRead), but you can specify a two-character prefix explicitly using the -p flag if the fancy takes you (e.g. -p xx).

    You load the schema from a package into your program by calling the function

        shSchemaLoadFrom<Package>()

    where <Package> is the name of your package; for example, shMainTcl_Declare calls shSchemaLoadFromDervish, and the test program dervish_foo, which declares a package called test, calls shSchemaLoadFromTest. There is currently no check that type names are unique; the first one will be used (but note that the name of a schema type is the C name of the type, so if you actually use two types of the same name in the same file the C compiler will get very upset). Still, this will be fixed in a future release.

    Things that you'll have to change in your code:

    1. Makefiles that use make_io need a number of changes:
      1. Remove all references to diskio_gen.h (or whatever you called it; the name diskio_gen.c is not hard coded into make_io but rather specified on the command line).
      2. Add a -m package_name flag to the make_io command.
      3. Remove all include files from the list used by make_io that don't belong in your package.
    2. The function shSchemaInit() should be removed; it has been replaced by one or more calls to shSchemaLoadFromPackage. In particular, you'll need to load your new schema (with the name specified to make_io with -m).
    3. You must remove lines that include types.h; the file is no longer needed and it no longer exists.
    4. Any reference in your code to things like TYPE_REGION must be replaced by calls like shTypeGetFromName("REGION"); as this is not a compile-time constant, you cannot use a switch statement to select code based on a type. Probably the easiest way to proceed is to use strcmp and the names of types; if the inefficiency of this approach makes you unhappy, and if you are prepared to do a little more work, you can still use types directly, but remember that they are no longer compile-time constants.
    5. The way that you annotate include files to change the behaviour of make_io has changed. Whereas you used to use commands like DUMP_SCHEMA we now use an explicit pragma comment; for example the MASK definition looks like

          typedef struct mask{
             char *name;                 /* identifying name */
             int nrow;                   /* number of rows in mask */
             int ncol;                   /* number of columns in mask */
             char **rows;                /* pointer to pointers to rows */
             int row0,col0;              /* location of LLH corner of child in parent */
             struct mask_p *prvt;        /* information private to the pipeline */
          } MASK;                        /* pragma NOCONSTRUCTOR pragma USER */

       The possible pragmas are:
      AUTO
      Use generic diskio (dump) code. Make handles automatically, that is, there is no need to write the code to simulate the TCL verbs typeNew and typeDel. Has precedence over USER
      CONSTRUCTOR
      Make handles automatically, that is, there is no need to write the code to generate the TCL verbs typeNew and typeDel. NO dumping of the associated structure is allowed.
      IGNORE
      Ignore this type/structure completely. Do not create schema for it, no handle making or dumping. Overrides any other pragmas.
      SCHEMA
      Instructs the schema generation code to generate schema for this type/structure. No dumping or handle making. Overrides AUTO, CONSTRUCTOR and USER.
      USER
      The user will provide read/write code (used for dumping to and reading back from disk) for the associated structure. This is necessary for complex types like REGIONS. No handle making. If not specified, AUTO is assumed.
      Any of these may be preceded by NO to request the opposite effect. Default values are AUTO CONSTRUCTOR NOIGNORE SCHEMA NOUSER.
    6. As implied by the above, you do not need to write a TCL interface to simple types, it is done for you.
    7. The functions to access SCHEMA and SCHEMA_ELEMs now return a pointer to const; you may have to modify your code.
    The example dervish_foo has been updated to use all of these features; it now supports a simple type, FOO, and a complicated one, BAR.
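
    Item 4 above says that types are now looked up at run time, so a switch on TYPE_REGION no longer compiles. The following is only a sketch of the two alternatives it mentions (comparing names with strcmp, or looking the types up once and comparing with if/else). It does not use the Dervish library: demoTypeGetFromName and demoNameGetFromType are hypothetical stand-ins for shTypeGetFromName and shNameGetFromType, and the three-entry name table and the int TYPE are assumptions made purely for illustration.

```c
#include <assert.h>
#include <string.h>

typedef int TYPE;   /* assumption: stand-in for the real TYPE in shCSchema.h */

/* Hypothetical name table; the real one is built from the loaded schema. */
static const char *const demo_names[] = { "UNKNOWN", "MASK", "REGION" };
#define DEMO_NTYPES ((int)(sizeof demo_names / sizeof demo_names[0]))

/* Hypothetical stand-in for shTypeGetFromName(): a run-time lookup. */
TYPE demoTypeGetFromName(const char *name)
{
    int i;
    assert(name != NULL);
    for (i = 0; i < DEMO_NTYPES; i++) {
        if (strcmp(name, demo_names[i]) == 0) return i;
    }
    return 0;   /* not found: UNKNOWN */
}

/* Hypothetical stand-in for shNameGetFromType(). */
const char *demoNameGetFromType(TYPE type)
{
    return (type >= 0 && type < DEMO_NTYPES) ? demo_names[type] : demo_names[0];
}

/* Option 1: select code by comparing type names. */
int handle_by_name(TYPE type)
{
    const char *name = demoNameGetFromType(type);
    if (strcmp(name, "REGION") == 0) return 2;
    if (strcmp(name, "MASK") == 0) return 1;
    return 0;
}

/* Option 2: look the types up once, then use if/else instead of switch. */
int handle_by_type(TYPE type)
{
    static TYPE t_mask = -1, t_region = -1;
    if (t_mask < 0) {               /* first call: cache the run-time values */
        t_mask = demoTypeGetFromName("MASK");
        t_region = demoTypeGetFromName("REGION");
    }
    if (type == t_region) return 2;
    if (type == t_mask) return 1;
    return 0;
}
```

    The second form does the string lookups once rather than on every call, which is the "little more work" that the text refers to.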

    Examining Schema from C

    Schema are described in terms of two structs, defined in shCSchema.h. The C functions that can be used to work with these are:
  • shSchemaGet
  • shSchemaElemGet
  • shElemGet
  • shElemSet
  • shPtrSprint
  • shDumpSchemaElemRead
  • shSchemaNew

    shSchemaGet

    Return a type's schema, given the name of the <type>.

        const SCHEMA *shSchemaGet(char *type);

    shSchemaElemGet

    Return the schema of a member <elem> of a <type>.

        const SCHEMA_ELEM *shSchemaElemGet(char *type, char *elem);

    shElemGet

    Return a pointer to the element described by <sch_el> of the object <thing>. The type is returned in <type> (if it isn't NULL).

        void *shElemGet(void *thing, SCHEMA_ELEM *sch_el, TYPE *type);

    shElemSet

    Set the element described by <sch_el> of the object <thing> to <value>.

        RET_CODE shElemSet(void *thing, SCHEMA_ELEM *sch_el, char *value);

    shPtrSprint

    Return a string containing a printed representation of <ptr>, taken to be of the given <type>. The string is stored in a buffer resident to shPtrSprint.

        char *shPtrSprint(void *ptr, TYPE type);

    shDumpSchemaElemRead

    Read the element described by <sch_el> of a dumped structure <thing> from the dump file pointer <fil>.

        RET_CODE shDumpSchemaElemRead(FILE *fil, void *thing, SCHEMA_ELEM *sch_el);

    shSchemaNew

    Allocate a new SCHEMA and <nelems> SCHEMA_ELEMs. The SCHEMA struct is filled with zeros or NULLs, except for the type, which is set to UNKNOWN, and the pointer to SCHEMA_ELEMs, which is set to point to the allocated SCHEMA_ELEMs. The SCHEMA_ELEMs are initialized to zero. Note that the nelem member of each SCHEMA_ELEM is set to zero; you must allocate memory and fill with an appropriate char string and set nelem to point to this string. The SCHEMA returned by shSchemaNew is guaranteed to be followed by a zero byte; this allows the structure to be passed to p_shLoadSchema to load into the system SCHEMA tables.

        SCHEMA *shSchemaNew(int nelems);
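
    To make the idea behind shSchemaElemGet and shElemGet concrete, here is a minimal sketch of schema-driven member access: a table records each member's name and byte offset, and a generic accessor turns (object, element description) into a pointer to the member. It is not the Dervish implementation. DEMO_ELEM, DEMO_MASK, demoSchemaElemGet and demoElemGet are all hypothetical names, and the real SCHEMA_ELEM in shCSchema.h may hold different fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified element description (not the real SCHEMA_ELEM). */
typedef struct {
    const char *name;     /* member name */
    const char *type;     /* C type of the member, as a string */
    size_t offset;        /* byte offset of the member within the struct */
} DEMO_ELEM;

/* A toy struct to describe; loosely modelled on the nrow/ncol members of MASK. */
typedef struct {
    int nrow;
    int ncol;
} DEMO_MASK;

/* The "schema" for DEMO_MASK; make_io generates the real tables. */
static const DEMO_ELEM demo_mask_elems[] = {
    { "nrow", "int", offsetof(DEMO_MASK, nrow) },
    { "ncol", "int", offsetof(DEMO_MASK, ncol) },
};

/* Find the description of member <elem>, or NULL if there is none. */
const DEMO_ELEM *demoSchemaElemGet(const char *elem)
{
    size_t i;
    assert(elem != NULL);
    for (i = 0; i < sizeof demo_mask_elems / sizeof demo_mask_elems[0]; i++) {
        if (strcmp(demo_mask_elems[i].name, elem) == 0) {
            return &demo_mask_elems[i];
        }
    }
    return NULL;
}

/* Return a pointer to the member of <thing> described by <sch_el>. */
void *demoElemGet(void *thing, const DEMO_ELEM *sch_el)
{
    assert(thing != NULL && sch_el != NULL);
    return (char *)thing + sch_el->offset;
}
```

    With a DEMO_MASK m, the expression *(int *)demoElemGet(&m, demoSchemaElemGet("ncol")) reads m.ncol without the calling code naming the member at compile time.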

    Introduction to Diskio

    The disk dump (`diskio') facility enables you to stop and restart Dervish, saving the variables to disk. The format is binary and complicated, and you'd never want to write code to read it yourself; fortunately there is a programmer's interface, and all of these routines are also available from TCL.

    If I may be permitted to boast for a few lines, these data dumps are machine and compiler independent (providing that the floating point format is IEEE and the type long is a 4-byte integer; both of these restrictions could easily be lifted); in particular they assume neither a byte order for integers nor the length of an int (I have successfully written dumps on a 16-bit PC and read them on a sun). Any pointers in the data are tracked down, and reinstated when the dumps are read. In some cases a request to dump a set of variables may not specify all the required data (for example, an object list may contain subREGIONs of undumped REGIONs); in this case a warning is issued and the missing data is appended to the dump as `anonymous' structures.

    How to Write Data to Disk

    It is essential to realise that pointers are not written to the dump file, instead smallish integers are used. This means that until and unless a file has been fully processed (i.e. closed without error) it is dangerous to dereference pointers in your newly read data structures.

    To use dump files you need to include the proper include files, namely photo.h and shCDiskio.h in that order. You will need standard C header files too, or at least <stdio.h>.

    Dump files are opened with

        FILE *shDumpOpen(char *name, char *mode);

    the return codes are SH_SUCCESS and SH_GENERIC_ERROR; the file is called name; and permitted modes are "a", "r", and "w" for append, read, and write. If you want to use "a" you'll have to read the next section too.

    Once a file is opened for append or write you can write data structures to it with the functions defined in shCDiskio.h, for example shRegWrite. When you have written all that you want, close the file with shDumpClose. Please note that you should not be lazy and use fclose, as shDumpClose cleans up various internal structures. If the file was opened for read it also initialises pointers within your data structures; if it was opened for write (or append), it checks that all the data structures referenced by the things that you wrote have actually been written, and writes any that you missed. If these activities fail, it returns SH_GENERIC_ERROR.

    How should you read back a dump? The simplest way is to use LIST *shDumpRead(FILE *fil, int shallow) which returns a list of the contents of the file. If shallow is 1 (true) the data in the file isn't actually read into your program; only as much as is needed to correctly parse the file is read so you should not attempt to dereference pointers inside the returned data structures (the only exceptions being name, nrow, and ncol in MASKs and REGIONs, and testing first against NULL for lists). If shallow is 0 (false) the whole dump is read into memory.

    The returned LIST is of THINGs (defined in shCDiskio.h):

        typedef struct struct_thing {
           TYPE ltype;                      /* used by LIST stuff */
           struct struct_thing *next, *prev;
           void *ptr;
           TYPE type;
        } THING;

    TYPE is defined in shCSchema.h. You can then go through the list examining what interests you:

        FILE *fil;
        LIST *list;
        THING *thing;

        fil = shDumpOpen(file,"r");
        list = shDumpRead(fil,1);
        shDumpClose(fil);

        thing = (THING *)list->first;
        printf("Date: %s\n",(char *)thing->ptr);
        for(thing = thing->next;thing != NULL;thing = thing->next) {
           printf("%s",shNameGetFromType(thing->type));
           switch(thing->type) {
            case TYPE_MASK:
              /* ... */
           }
        }

    (I have removed some error checking.) You'll see that the first THING is the date string (type TYPE_STR). The function shNameGetFromType is used to convert an enumerated type such as TYPE_OBJ1 into a string such as "OBJ1". There is a rather more complete example of dump-reading code in $DERVISH_DIR/examples/dump_list.c (it also prints out the schemas).

    Dumping Data, the Whole Story

    (Some of the details in this section are out of date. Please ask Robert Lupton for updated help.)

    If you don't want to read the whole dump you'll have to do a little more work.

    Firstly, I didn't tell you the whole story about opening dumps; if the mode letter is capitalised (e.g. "R") no cleanup is done when the file is closed. You'll need to remember this in a moment. If you haven't disabled this cleanup, shDumpClose will return SH_GENERIC_ERROR if unresolved pointers remain, and ignoring this return value is a short cut to a segmentation violation. Nothing irreversible is done when a dumpfile is closed, but when a file is opened some internal data structures are freed; you can avoid this by using shDumpReopen() which is otherwise identical to shDumpOpen. If you have been playing complicated games (e.g. with appending stuff with mode "A") you may need to use shDumpReopen.

    For every Dervish datatype (and most C datatypes) there are two functions defined in shCDiskio.h, for example:

        int shMaskRead(FILE *fil, MASK **thing)
        int shMaskWrite(FILE *fil, MASK *thing)

    The C datatypes supported are char, int, long, float, void * (called ptr), and char * (called str). The type names are capitalised (shMaskRead, shStrWrite).

    There is a function int shDumpTypeGet(FILE *fil, TYPE *type) that can be used to return the type of the next object written to the file (it'll return SH_GENERIC_ERROR at the end of the file); once you know what the type is you can call the proper read function, readPtr or whatever. There is a function shDumpNextRead that does this for you. If you want to be sneaky and only read some of the dump and skip over other parts, there will in general be pointers in the read data items that point to objects that you haven't read. Usually shDumpClose tries to read any remaining stuff in the dump file in an attempt to find them; you can disable this by opening the file with a mode of "R" (if you prefer dirty hacks, (void)fseek(fil,0L,2) should work too). After thus circumventing checks you may be left with invalid pointers in your data structures; caveat lector. If you know enough to be reading this section, you may know where these bad pointers are (e.g. you didn't read any REGIONs, so don't look at the REGION pointers in OBJ1s).

    C Functions for Diskio

    shDumpOpen
    Open a dump file
    shDumpReopen
    Reopen a dump (don't init data structures)
    shDumpClose
    Close a dump
    shDumpPtrsResolve
    Resolve pointer ids
    shDumpDateDel
    Replace a dump's date string with Xs
    shDumpDateGet
    Return a dump's date string
    shDumpTypeGet
    Return the TYPE of the next item in a dump
    Functions to read/write something in a dump:
    shDumpCharRead
    shDumpCharWrite
    Chars
    shDumpFloatRead
    shDumpFloatWrite
    Floats
    shDumpIntRead
    shDumpIntWrite
    Ints
    shDumpLongRead
    shDumpLongWrite
    Longs
    shDumpMaskRead
    shDumpMaskWrite
    MASKs
    shDumpPtrRead
    shDumpPtrWrite
    Pointers
    shDumpRegRead
    shDumpRegWrite
    REGIONs
    shDumpStrRead
    shDumpStrWrite
    Strings
    Extern functions that are really only for friends:
    p_shDumpStructsReset
    Reset the structs in a dump file
    p_shPtrIdSave

    Details of Internals

    The internals of Dervish structures are liable to change. In consequence, the dump package attempts to extract almost all the information that it needs from the header files defining them. There is a program, make_io, which reads them and finds all structs that are typedef'd to something, and all typedef'd enums. It then generates two functions for each type, shTypeRead and shTypeWrite, in a file called diskio_gen.c. The prototypes for these i/o functions may be found in shCDiskio.h. They could easily have been machine generated, but they change only when a new type is added, and being forced to copy two lines from shCDiskio.h into your type's header should remind you to check that there is nothing weird about the new type.

    The i/o functions for MASKs and REGIONs are not machine generated as they are significantly different from the various list and object types; at some point this may change, but they are still likely to be treated specially by make_io. The way that this is achieved is by adding special comments at the end of the structure definitions; those currently supported are:

    NODUMP
    Diskio code should entirely ignore this struct; an example would be LIST_ELEM
    NODUMP-R/W
    Diskio code should not generate Read/Write/Skip functions for this struct; an example is REGION.
    DUMP-SCHEMA
    Just dump the schema, but don't generate any i/o code; for example REGINFO.

    In addition to this struct i/o code, a few more functions are generated automatically, basically those that need to deal with every structure that can exist in a dump file (e.g. shDumpRead and shTypeGetFromName).

    Dump files have headers that include a version number for the format of the dump and a date. When you open a dump with shDumpOpen the header is written or checked, as appropriate. Once the file is opened, the file pointer is left just after the header information.

    There are two main problems in writing dump files such as these: machine independence and pointers. The first is solved reasonably easily; all ints are written as longs, and all integral types are written in network byte order (which is the same as that on a sun or sgi).
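
    As an illustration of the machine-independence rule just described, the sketch below writes a long as four bytes, most significant first (network byte order), and widens an int to long before writing it. It is not the Dervish code: demo_write_long, demo_read_long and demo_write_int are hypothetical names, and the use of 0 and -1 as return values is an assumption made for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Write the low 32 bits of value, most significant byte first.
 * Returns 0 on success, -1 on a write error. */
int demo_write_long(FILE *fil, long value)
{
    unsigned long u = (unsigned long)value & 0xFFFFFFFFUL;
    unsigned char buf[4];

    assert(fil != NULL);
    buf[0] = (unsigned char)(u >> 24);
    buf[1] = (unsigned char)(u >> 16);
    buf[2] = (unsigned char)(u >> 8);
    buf[3] = (unsigned char)u;
    return fwrite(buf, 1, 4, fil) == 4 ? 0 : -1;
}

/* Read four bytes written by demo_write_long and rebuild the signed value.
 * Returns 0 on success, -1 on a short read. */
int demo_read_long(FILE *fil, long *value)
{
    unsigned char buf[4];
    unsigned long u;

    assert(fil != NULL && value != NULL);
    if (fread(buf, 1, 4, fil) != 4) return -1;
    u = ((unsigned long)buf[0] << 24) | ((unsigned long)buf[1] << 16) |
        ((unsigned long)buf[2] << 8)  |  (unsigned long)buf[3];
    /* Sign-extend from 32 bits using arithmetic only, so the result does
     * not depend on the host's byte order or on the size of long. */
    *value = (u & 0x80000000UL) ? -(long)(0xFFFFFFFFUL - u) - 1 : (long)u;
    return 0;
}

/* An int is widened to long first, so the host's sizeof(int) is irrelevant. */
int demo_write_int(FILE *fil, int value)
{
    return demo_write_long(fil, (long)value);
}
```

    Because the bytes are produced by shifting rather than by copying the in-memory representation, the same file contents result on big-endian and little-endian hosts.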

    Pointers are much more of a problem. The same program may well use different addresses for the same objects after a trivial change to the source, and certainly the addresses written from a hardworked pipeline will be different from those suitable for reading into a newly-started one. The solution that I have adopted is to map each pointer to a unique id number (of type IPTR, typically long), and to write the id instead of the address. Each object is preceded by its type (the enum TYPE) and its id. By keeping tabs on what has been written, along with the type, it is possible to write all referenced pointers to disk (these are the anonymous structures referred to in the introduction).
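
    The write-side bookkeeping described above can be sketched as a table that hands out an id for each address and remembers which objects have themselves been written. This is an illustration only: DEMO_IDTAB and the demo_ functions are hypothetical, the fixed-size linear table is a simplification, and the convention that NULL maps to id 0 is an assumption that the text does not state.

```c
#include <assert.h>
#include <stddef.h>

typedef long IPTR;              /* the text says IPTR is typically long */

#define DEMO_MAX_IDS 64         /* arbitrary limit for this sketch */

typedef struct {
    const void *addr[DEMO_MAX_IDS];   /* address that owns id i+1 */
    int written[DEMO_MAX_IDS];        /* has the object itself been written? */
    int n;                            /* number of ids handed out */
} DEMO_IDTAB;

/* Return the id for ptr, allocating a new one the first time it is seen.
 * Assumption: NULL maps to 0 and real ids start at 1. Returns -1 if full. */
IPTR demo_id_for(DEMO_IDTAB *tab, const void *ptr)
{
    int i;
    assert(tab != NULL);
    if (ptr == NULL) return 0;
    for (i = 0; i < tab->n; i++) {
        if (tab->addr[i] == ptr) return (IPTR)(i + 1);
    }
    if (tab->n == DEMO_MAX_IDS) return -1;
    tab->addr[tab->n] = ptr;
    tab->written[tab->n] = 0;
    return (IPTR)(++tab->n);
}

/* Record that the object at ptr (not just a reference to it) was written. */
void demo_mark_written(DEMO_IDTAB *tab, const void *ptr)
{
    IPTR id = demo_id_for(tab, ptr);
    if (id > 0) tab->written[id - 1] = 1;
}

/* Count objects that were referenced but never written; at close time
 * these are the ones that would be appended as anonymous structures. */
int demo_count_missing(const DEMO_IDTAB *tab)
{
    int i, missing = 0;
    assert(tab != NULL);
    for (i = 0; i < tab->n; i++) {
        if (!tab->written[i]) missing++;
    }
    return missing;
}
```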

    Reading a dump is a little more complex. As we come to each data object, we first read the type, and then its original id. Knowing its type we can allocate space for it, and store the pair (address, id) in a safe place, currently an AVL tree. As each pointer is read its id is looked up in this tree, and if it has already been seen it's replaced by the proper address and all is well. If it hasn't been seen, we store the address of the desired pointer along with the id. When we've read the entire file (or more precisely, when shDumpClose is called) we go through this list and make another attempt to find the correct address; if we find it we can insert it into the proper place (that was why we stored the pointer's address). If this goes well we are almost done; all that remains is to deal with row pointers in submasks and subregions (those that we were unable to process as we read the file), and return SH_SUCCESS.
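
    The two-pass resolution described above can be sketched as follows. Each object read registers its (address, id) pair; each pointer read is either resolved at once or queued together with the address of the pointer itself, and a final pass (the analogue of what happens in shDumpClose) retries the queue. This is a sketch under stated assumptions, not the Dervish code: all names are hypothetical, and a linear array stands in for the AVL tree that the text mentions.

```c
#include <assert.h>
#include <stddef.h>

typedef long IPTR;
#define DEMO_MAX 64             /* arbitrary limit for this sketch */

typedef struct {
    struct { IPTR id; void *addr; } seen[DEMO_MAX];      /* objects read so far */
    struct { IPTR id; void **slot; } pending[DEMO_MAX];  /* unresolved pointers */
    int nseen, npending;
} DEMO_RESOLVER;

/* Called when the object with the given id has been allocated at addr. */
void demo_object_read(DEMO_RESOLVER *r, IPTR id, void *addr)
{
    assert(r != NULL && r->nseen < DEMO_MAX);
    r->seen[r->nseen].id = id;
    r->seen[r->nseen].addr = addr;
    r->nseen++;
}

/* Linear search; the real code is described as using an AVL tree. */
static void *demo_lookup(const DEMO_RESOLVER *r, IPTR id)
{
    int i;
    for (i = 0; i < r->nseen; i++) {
        if (r->seen[i].id == id) return r->seen[i].addr;
    }
    return NULL;
}

/* Called when a pointer member holding <id> is read; slot is where the
 * real address belongs. Unresolved pointers are queued for later. */
void demo_pointer_read(DEMO_RESOLVER *r, IPTR id, void **slot)
{
    void *addr;
    assert(r != NULL && slot != NULL);
    addr = demo_lookup(r, id);
    *slot = addr;               /* stays NULL until the target is seen */
    if (addr == NULL) {
        assert(r->npending < DEMO_MAX);
        r->pending[r->npending].id = id;
        r->pending[r->npending].slot = slot;
        r->npending++;
    }
}

/* Second pass, done at close time. Returns the number of pointers that
 * are still unresolved; non-zero corresponds to the error case. */
int demo_resolve_pending(DEMO_RESOLVER *r)
{
    int i, unresolved = 0;
    assert(r != NULL);
    for (i = 0; i < r->npending; i++) {
        void *addr = demo_lookup(r, r->pending[i].id);
        if (addr != NULL) *r->pending[i].slot = addr;
        else unresolved++;
    }
    r->npending = 0;
    return unresolved;
}
```

    Storing the address of the pointer, rather than the pointer's value, is what lets the second pass patch the data structure in place.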