Encoding data into integer handles for language interpreter

Question 1

As an integral part (pun intended) of my APL interpreter (previous questions: 1 2 3 4), this suite of functions converts any data into integer handles. APL operates on arrays of data cells, so with this encoding my cells are all int but may semantically contain other types. This enables me to have heterogeneous arrays with a homogeneous representation. Elements can even be arrays nested to arbitrary depths.

common.h contains definitions that are needed all over the place in the source. Some are not strictly relevant here, but the master enum of tags and the various implicit pointers, ie. typedef struct verb *verb, are important to be aware of. Hiding pointers in typedefs is questionable, to be sure. But I hope that the very short list of them compensates somewhat. array, verb, xverb, symtab, and magic are all (the) implicit pointers-to-struct types. I omit the contents of ../ppnarg.h because it is not relevant here (but common enough to be common).

common.h:

/*
 * The central concept of encoding data is the use of the basic `int` type
 * for "everything". We chop the 32 bits into an 8 bit tag[*] and 24 bit value.
 * So we can't deal with literal numbers that are larger than 16.7 million 
 * or so. 
 *
 * An `int` which contains one of our encoded-integer values should be 
 * declared `object` to convey this semantics to the reader.
 * Conversely, having determined that an object's tag is LITERAL,
 * code may feel free to treat it as a restricted-range integer value.
 *
 * [*] since we treat negative numbers as encoding to themselves, in essence
 * we only have a 7bit tag to play with.
 */
#ifndef COMMON_H_
#define COMMON_H_
#include <stdarg.h>
#include <stdint.h>
#include <stdlib.h>
#include "../ppnarg.h"
#define MODE1(x) (x+(1<<7)) //add hi bit of ascii char
typedef int object;
typedef union integer {
 uint32_t uint32;
 int32_t int32;
} integer;
enum tag {
 LITERAL, /* val is a 24-bit 2's comp integer */
 CHAR, /* val is a 21-bit Unicode code point padded with zeros */
 PCHAR, /* val is a an executable char */
 MARKOBJ, /* val is irrelevant (s.b. 0) */
 NULLOBJ, /* val is irrelevant (s.b. 0) */
 LABEL, /* the statement number, counting from 1 */
 LPAROBJ,
 RPAROBJ,
 SEMIOBJ,
 RBRACOBJ,
FIRST_INDEXED_TYPE,
 NUMBER = FIRST_INDEXED_TYPE, /* val is an index in the number table */
 PROG, /* val is an (index to an) executable code fragment (array of PCHAR)*/
 ARRAY, /* val is a(n index to a) boxed array */
 SYMTAB, /* val is a(n index to a) symbol table */
 LBRACOBJ, /* val is an (index to an) array of the bracket contents */
 ANALYSIS, /* del function header info */
 MAGIC, /* get/set function pair */
 VERB, /* val is a(n index to a) verb object */
 ADVERB, /* val is a(n index to a) verb object */
 XVERB, /* val is a(n index to a) struct containing a verb and adverb */
 EXPR, /* val is a(n index to a) vector of the expression contents */
 BLOCK, /* val is a(n index to a) vector of expressions, a PROGN */
LAST_INDEXED_TYPE = BLOCK,
};
typedef struct array *array;
typedef struct verb *verb; // also used for adverbs
typedef object nilad(verb v);
typedef object monad(object w,verb v);
typedef object dyad(object a,object w,verb v);
typedef struct xverb *xverb;
typedef struct symtab *symtab;
typedef struct magic *magic;
#ifdef DEBUGMODE
 #define DEBUG(LVL,...) if (LVL<=DEBUGMODE) fprintf(stderr, __VA_ARGS__)
 #define IFDEBUG(LVL,...) do if (LVL<=DEBUGMODE) { __VA_ARGS__; } while(0)
#else
 #define DEBUG(...)
 #define IFDEBUG(...)
#endif
#endif

encoding.h declares a few singleton objects and the functions that are exposed. init_en() must be called before any other encoding functions.

encoding.h:

#ifndef ENCODING_H_
#define ENCODING_H_
#include "common.h"
extern object null;
extern object mark;
extern object nil;
extern object blank;
void init_en();
int gettag(object d);
int getval(object d);
object newdata(int tag, int val);
object cache(int tag, void *ptr);
void *getptr(object d);
object getfill(object d);
#endif

And the implementation. For simple types, the newdata function constructs an encoded value by ORing the tag and value bits. For the pointer types, the cache function saves the pointer in a data bank structure (called memory_bank) and yields an index to be used as the value argument to newdata. The behavior depends upon the careful arrangement of the pointer type tags into a contiguous range so the tag number can be arithmetically manipulated to/from index values.

encoding.c:

/* Encoding
 *
 * this file defines the sub-typing of data atoms.
 * All data are packed into integer handles. The benefit for
 * array operations is all data atoms will have a uniform
 * size no matter what the content actually is. This replaces
 * the intptr_t hackery (ab)used in earlier versions 
 * (not portable to 64bit build). 
 *
 * the array data are always just straight 32bit integers.
 * but we treat as a 7bit tag and 24bit integer value.
 * An immediate integer value is indicated by a negative
 * sign-bit or all-zero tag. In essence, a 25bit sign/magnitude
 * rep with no -0. This also means that we're not really using
 * up all the available bits. Depending upon the final suite
 * of distinct types and the desired "word size", this arrangement
 * might be optimized further.
 *
 * Composite objects (boxed or reference objects) have
 * an associated pointer stored in an array associated
 * with the tag. Thus an array object can be enclosed
 * into a scalar (integer handle) with
 *
 * int x;
 * x = cache(ARRAY, array_new_dims(3,3)); //3x3 matrix
 *
 * To better convey the abstract use of this integer type,
 * we will make use of this typedef to designate such int-handles.
 *
 * commont.h:
 * typedef int object;
 *
 * the array data structure (which is implicitly a pointer
 * to its struct) can be retrived from the handle
 * with
 *
 * array a;
 * a = getptr(x);
 *
 * Most functions will need to check the types of their 
 * arguments in order to determine how to proceed.
 * This can be accomplished with `gettag()`.
 *
 * switch(gettag(x)){
 * case LITERAL: // handle atomic integer
 * break;
 * case ARRAY: {
 * array X = getptr(x); 
 * }
 * }
 */
#include <stdio.h>
#include <stdlib.h>
#include "common.h"
#include "encoding.h"
#include "array.h"
int gettag(object d){
 if (d<0) return 0; /* negatives are literals */
 integer int32;
 int32.int32 = d;
 return int32.uint32 >> 24;
}
int getval(object d){
 if (d<0) return d;
 integer int32;
 int32.int32 = d;
 return int32.uint32 & ((1U<<24)-1);
}
object newdata(int tag, int val){
 if (tag==LITERAL && val<0) return val;
 integer int32;
 int32.uint32 = ((unsigned)tag << 24) | ((unsigned)val & ((1U<<24)-1));
 int x = int32.int32;
 DEBUG(3,"newdata %x(%d %d)\n", x, tag, val);
 return x;
}
integer nulldata;// = { .data = { .tag = NULLOBJ, .val = 0 } };
object null /* = nulldata.int32 */;
integer markdata;// = { .data = { .tag = MARKOBJ, .val = 0 } };
object mark /* = markdata.int32 */;
object nil;
object blank;
void init_en(void){
 nulldata.uint32 = newdata(NULLOBJ, 0);
 null = nulldata.int32;
 markdata.uint32 = newdata(MARKOBJ, 0);
 mark = markdata.int32;
 cache(LBRACOBJ, array_new_rank_dims(0));
 blank = newdata(CHAR, ' ');
}
int addnewtocache(size_t *used, size_t *max, void ***data, void *ptr){
 if (*used == *max){
 *max = *max * 7 + 11;
 void *tmp = realloc(*data, *max * sizeof(void*));
 if (!tmp) return null;
 *data = tmp;
 }
 int z = (*used)++;
 (*data)[z] = ptr;
 DEBUG(3,"addnew %d %p %p\n", z, ptr, (*data)[z]);
 return z;
}
struct memory_bank {
 size_t used, max;
 void **tab;
} memory_bank[LAST_INDEXED_TYPE - FIRST_INDEXED_TYPE + 1];
object cache(int tag, void *ptr){
 if (tag < FIRST_INDEXED_TYPE || tag > LAST_INDEXED_TYPE)
 return null;
 int idx = tag - FIRST_INDEXED_TYPE;
 return newdata(tag,
 addnewtocache(&memory_bank[idx].used,
 &memory_bank[idx].max,
 &memory_bank[idx].tab,
 ptr));
}
void *getptr(object d){
 if (d<0) return NULL;
 int tag = gettag(d);
 if (tag < FIRST_INDEXED_TYPE || tag > LAST_INDEXED_TYPE)
 return NULL;
 int idx = tag - FIRST_INDEXED_TYPE;
 return memory_bank[idx].tab[getval(d)];
}
// fill returns a "blank" value for any type
// and identity elements for verbs
object getfill(object d){
 switch(gettag(d)){
 case PCHAR:
 switch(getval(d)){
 case '+':
 return 0;
 case 0x00d7: // Times
 case 0x00f7: // Divided-By
 case '*':
 return 1;
 } /*fallthru*/
 default:
 case LITERAL:
 return newdata(CHAR, 0x2300); //null
 return newdata(CHAR, 0x2316); //position
 return newdata(CHAR, 0x2218); //jot
 //return newdata(LITERAL, (1<<24)-1);
 case CHAR: return newdata(CHAR, ' ');
 }
}

Additional referenced files from the same commit: array.h ../ppnarg.h. See my SO Question for an in-depth explanation of the array code. ../ppnarg.h implements a macro which counts how many arguments it was called with. This is used to provide variadic constructors for the array type. It's a wonderous little macro that should be part of everyone's cpp arsenal, but I didn't write it.

Any problem spots or infelicities? Any "too clever" spots?

I fear I may have made it difficult to design a garbage collector since my allocations are scattered all over this "data bank". It doesn't lend itself to a generational collector or a copying collector, and even mark-sweep will leave unused handles unless there's an extra free-list just for handles. Is there a better way to design my data bank to make it easier to scan for garbage?

Question 2

For `reference: Wiki note APL

Question 3

comp.lang.c meta-discussion of my GC obsession with this code.

Question 4

Any problem spots or infelicities? Any "too clever" spots?

Overall: Recommend to not causally mix int/unsigned, int32_t/uint32_t and size_t. Too many cases where code would fail by using both pairs. Suggest using int/unsigned instead of int32_t/uint32_t with a #if INT_MAX >= 2147483647 test. (or fix/test the casual conversions.)

Disagree with not detailing ../ppnarg.h. If the contents of ../ppnarg.h are common enough to be common, why is not common.h also considered common enough to be not posted? Lacking this tile, a review cannot completely compile the code. Without compilation, automated tools are prevented from use. Without automated tools, this review request becomes unnecessarily extra burdensome.
common.h is too common. Consider your successful code being used on a larger effort. Perhaps APL_common.h.?
What/where is #include "array.h"?
"retrived" --> "retrieved"
Unclear double fallthru with the comment /*fallthru*/ at the end of a switch statement. It appears to apply to the case PCHAR: which then falls though to the default: Strange to see a case after default. Code uses a /*fallthru*/ comment there but lacks 2 such comments after case 0x00d7: and case 0x00f7:. This inconsistent style confuses. Top level switch/case indents the case. Nested switch/case does not indent the case. Inconsistent style unnecessarily confuses the review. This implies OP is not using automated formatting. Coding styles that depend on manual formatting are high maintenance. Use an automated formatter.
```
switch(gettag(d)){
 case PCHAR:
 switch(getval(d)){
 case '+':
 return 0;
 case 0x00d7: // Times
 case 0x00f7: // Divided-By
 case '*':
 return 1;
 } /*fallthru*/
 default:
 case LITERAL:
```
Incomplete prototype declaration. Using () allows following code to wrongly pass parameters and the compiler does not warn.
```
// void init_en();
void init_en(void);
```
Defeat of data abstraction. integer does not imply size, bit of some size. int32 implies valuable is 32-bit. I'd expect either 1) integer to be integer32 if it is always going to be 32-bit or 2) if integer is abstract, int32 have a name without an implied size.
```
int gettag(object d) {
 integer int32;
 int32.int32 = d;
```
Possible data loss. d is of type object (presently int) and int32.int32 is of type int32_t. This inconsistency should be resolved. Recommend typedef union integer { unsigned u32; int i32; } integer;
```
int32.int32 = d; // Potential narrowing of type.
int getval(object d) {
 if (d < 0) return d; // Potential narrowing of type.
```
Even discussing the type array and struct array is confusing in context with code that uses arrays. Consider a different name.
```
// typedef struct array *array;
// Suggest
typedef struct APL_array *APL_array;
```
Not a fan of using the same name for a type and struct/union. Although these exist in different name spaces in C, they readily collide in a reviewers understanding. Suggest typedef struct APL_array_S *APL_array; or drop the typedef and only use struct APL_array.

Pedantic: Not portable when int < 24 bit.

// int32.uint32 = ((unsigned) tag << 24) | ((unsigned) val & ((1U << 24) - 1));
// more portable
int32.uint32 = (unsigned) tag;
int32.uint32 = (int32.uint32 << 24) | ((unsigned) val & ((1LU << 24) - 1));

Pedantic: Printing int values with %x outside the [0...INT_MAX] range is UB.

int x = ...
// DEBUG(3,"newdata %x(%d %d)\n", x, tag, val);
DEBUG(3,"newdata %x(%d %d)\n", (unsigned) x, tag, val);

Possible loss of info in changing types.
```
size_t *used
...
int z = (*used)++;
```
The type name object misleads. In C, an object can be a struct, union, complex double, etc. Much code depends on object being a real value.
```
void *getptr(object d) {
 if (d < 0) return NULL;
```

Adding common name to global space. These will certainly collide on a large project. Assume your good code will be re-used on larger applications. Suggest putting all the following in a struct variable, maybe called APL.

integer nulldata; // = { .data = { .tag = NULLOBJ, .val = 0 } };
object null /* = nulldata.int32 */;
integer markdata; // = { .data = { .tag = MARKOBJ, .val = 0 } };
object mark /* = markdata.int32 */;
object nil;
object blank;

Question 5

For point 1, I've added links to the files. Is that sufficient or should I inline them here?

Question 6

I see what you mean in point 5, but I felt it was important to highlight the one fall-through that was hard to see, viz. when none of the sub-switch's cases are taken, and the outer case falls through to the default (which itself then falls-on-through to the next case). But all the other falls are easy to see by comparison. Perhaps I should add a default: ; /*fallthru*/ to the inner switch?

Question 7

For 14, instead of object, I've also considered token, datum, atom or entity. Is one of these better, or something else?

Question 8

@luser droog How about APL_object? I'm thinking of how to use your good code in the future with other code.

Question 9

I think the solution for 5 is to break it into 2 or more functions. It's just doing too many things.

chux chux 36.2k2 gold badges43 silver badges96 bronze badges · Accepted Answer · 2016-09-10 13:48:26Z

Any problem spots or infelicities? Any "too clever" spots?

Overall: Recommend to not causally mix int/unsigned, int32_t/uint32_t and size_t. Too many cases where code would fail by using both pairs. Suggest using int/unsigned instead of int32_t/uint32_t with a #if INT_MAX >= 2147483647 test. (or fix/test the casual conversions.)

Disagree with not detailing ../ppnarg.h. If the contents of ../ppnarg.h are common enough to be common, why is not common.h also considered common enough to be not posted? Lacking this tile, a review cannot completely compile the code. Without compilation, automated tools are prevented from use. Without automated tools, this review request becomes unnecessarily extra burdensome.
common.h is too common. Consider your successful code being used on a larger effort. Perhaps APL_common.h.?
What/where is #include "array.h"?
"retrived" --> "retrieved"
Unclear double fallthru with the comment /*fallthru*/ at the end of a switch statement. It appears to apply to the case PCHAR: which then falls though to the default: Strange to see a case after default. Code uses a /*fallthru*/ comment there but lacks 2 such comments after case 0x00d7: and case 0x00f7:. This inconsistent style confuses. Top level switch/case indents the case. Nested switch/case does not indent the case. Inconsistent style unnecessarily confuses the review. This implies OP is not using automated formatting. Coding styles that depend on manual formatting are high maintenance. Use an automated formatter.
```
switch(gettag(d)){
 case PCHAR:
 switch(getval(d)){
 case '+':
 return 0;
 case 0x00d7: // Times
 case 0x00f7: // Divided-By
 case '*':
 return 1;
 } /*fallthru*/
 default:
 case LITERAL:
```
Incomplete prototype declaration. Using () allows following code to wrongly pass parameters and the compiler does not warn.
```
// void init_en();
void init_en(void);
```
Defeat of data abstraction. integer does not imply size, bit of some size. int32 implies valuable is 32-bit. I'd expect either 1) integer to be integer32 if it is always going to be 32-bit or 2) if integer is abstract, int32 have a name without an implied size.
```
int gettag(object d) {
 integer int32;
 int32.int32 = d;
```
Possible data loss. d is of type object (presently int) and int32.int32 is of type int32_t. This inconsistency should be resolved. Recommend typedef union integer { unsigned u32; int i32; } integer;
```
int32.int32 = d; // Potential narrowing of type.
int getval(object d) {
 if (d < 0) return d; // Potential narrowing of type.
```
Even discussing the type array and struct array is confusing in context with code that uses arrays. Consider a different name.
```
// typedef struct array *array;
// Suggest
typedef struct APL_array *APL_array;
```
Not a fan of using the same name for a type and struct/union. Although these exist in different name spaces in C, they readily collide in a reviewers understanding. Suggest typedef struct APL_array_S *APL_array; or drop the typedef and only use struct APL_array.

Pedantic: Not portable when int < 24 bit.

// int32.uint32 = ((unsigned) tag << 24) | ((unsigned) val & ((1U << 24) - 1));
// more portable
int32.uint32 = (unsigned) tag;
int32.uint32 = (int32.uint32 << 24) | ((unsigned) val & ((1LU << 24) - 1));

Pedantic: Printing int values with %x outside the [0...INT_MAX] range is UB.

int x = ...
// DEBUG(3,"newdata %x(%d %d)\n", x, tag, val);
DEBUG(3,"newdata %x(%d %d)\n", (unsigned) x, tag, val);

Possible loss of info in changing types.
```
size_t *used
...
int z = (*used)++;
```
The type name object misleads. In C, an object can be a struct, union, complex double, etc. Much code depends on object being a real value.
```
void *getptr(object d) {
 if (d < 0) return NULL;
```

Adding common name to global space. These will certainly collide on a large project. Assume your good code will be re-used on larger applications. Suggest putting all the following in a struct variable, maybe called APL.

integer nulldata; // = { .data = { .tag = NULLOBJ, .val = 0 } };
object null /* = nulldata.int32 */;
integer markdata; // = { .data = { .tag = MARKOBJ, .val = 0 } };
object mark /* = markdata.int32 */;
object nil;
object blank;

For point 1, I've added links to the files. Is that sufficient or should I inline them here?
I see what you mean in point 5, but I felt it was important to highlight the one fall-through that was hard to see, viz. when none of the sub-switch's cases are taken, and the outer case falls through to the default (which itself then falls-on-through to the next case). But all the other falls are easy to see by comparison. Perhaps I should add a default: ; /*fallthru*/ to the inner switch?
For 14, instead of object, I've also considered token, datum, atom or entity. Is one of these better, or something else?
@luser droog How about APL_object? I'm thinking of how to use your good code in the future with other code.
I think the solution for 5 is to break it into 2 or more functions. It's just doing too many things.

Stack Exchange Network

Encoding data into integer handles for language interpreter

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Encoding data into integer handles for language interpreter

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions