As an integral part (pun intended) of my APL interpreter (previous questions: 1 2 3 4), this suite of functions converts any data into integer handles. APL operates on arrays of data cells, so with this encoding my cells are all int
but may semantically contain other types. This enables me to have heterogeneous arrays with a homogeneous representation. Elements can even be arrays nested to arbitrary depths.
common.h
contains definitions that are needed all over the place in the source. Some are not strictly relevant here, but the master enum
of tags and the various implicit pointers, ie. typedef struct verb *verb
, are important to be aware of. Hiding pointers in typedef
s is questionable, to be sure. But I hope that the very short list of them compensates somewhat. array
, verb
, xverb
, symtab
, and magic
are all (the) implicit pointers-to-struct types. I omit the contents of ../ppnarg.h
because it is not relevant here (but common enough to be common
).
/*
* The central concept of encoding data is the use of the basic `int` type
* for "everything". We chop the 32 bits into an 8 bit tag[*] and 24 bit value.
* So we can't deal with literal numbers that are larger than 16.7 million
* or so.
*
* An `int` which contains one of our encoded-integer values should be
* declared `object` to convey this semantics to the reader.
* Conversely, having determined that an object's tag is LITERAL,
* code may feel free to treat it as a restricted-range integer value.
*
* [*] since we treat negative numbers as encoding to themselves, in essence
* we only have a 7bit tag to play with.
*/
#ifndef COMMON_H_
#define COMMON_H_
#include <stdarg.h>
#include <stdint.h>
#include <stdlib.h>
#include "../ppnarg.h"
#define MODE1(x) (x+(1<<7)) //add hi bit of ascii char
typedef int object;
typedef union integer {
uint32_t uint32;
int32_t int32;
} integer;
enum tag {
LITERAL, /* val is a 24-bit 2's comp integer */
CHAR, /* val is a 21-bit Unicode code point padded with zeros */
PCHAR, /* val is a an executable char */
MARKOBJ, /* val is irrelevant (s.b. 0) */
NULLOBJ, /* val is irrelevant (s.b. 0) */
LABEL, /* the statement number, counting from 1 */
LPAROBJ,
RPAROBJ,
SEMIOBJ,
RBRACOBJ,
FIRST_INDEXED_TYPE,
NUMBER = FIRST_INDEXED_TYPE, /* val is an index in the number table */
PROG, /* val is an (index to an) executable code fragment (array of PCHAR)*/
ARRAY, /* val is a(n index to a) boxed array */
SYMTAB, /* val is a(n index to a) symbol table */
LBRACOBJ, /* val is an (index to an) array of the bracket contents */
ANALYSIS, /* del function header info */
MAGIC, /* get/set function pair */
VERB, /* val is a(n index to a) verb object */
ADVERB, /* val is a(n index to a) verb object */
XVERB, /* val is a(n index to a) struct containing a verb and adverb */
EXPR, /* val is a(n index to a) vector of the expression contents */
BLOCK, /* val is a(n index to a) vector of expressions, a PROGN */
LAST_INDEXED_TYPE = BLOCK,
};
typedef struct array *array;
typedef struct verb *verb; // also used for adverbs
typedef object nilad(verb v);
typedef object monad(object w,verb v);
typedef object dyad(object a,object w,verb v);
typedef struct xverb *xverb;
typedef struct symtab *symtab;
typedef struct magic *magic;
#ifdef DEBUGMODE
#define DEBUG(LVL,...) if (LVL<=DEBUGMODE) fprintf(stderr, __VA_ARGS__)
#define IFDEBUG(LVL,...) do if (LVL<=DEBUGMODE) { __VA_ARGS__; } while(0)
#else
#define DEBUG(...)
#define IFDEBUG(...)
#endif
#endif
encoding.h
declares a few singleton objects and the functions that are exposed. init_en()
must be called before any other encoding functions.
#ifndef ENCODING_H_
#define ENCODING_H_
#include "common.h"
extern object null;
extern object mark;
extern object nil;
extern object blank;
void init_en();
int gettag(object d);
int getval(object d);
object newdata(int tag, int val);
object cache(int tag, void *ptr);
void *getptr(object d);
object getfill(object d);
#endif
And the implementation. For simple types, the newdata
function constructs an encoded value by ORing the tag and value bits. For the pointer types, the cache
function saves the pointer in a data bank structure (called memory_bank
) and yields an index to be used as the value argument to newdata
. The behavior depends upon the careful arrangement of the pointer type tags into a contiguous range so the tag number can be arithmetically manipulated to/from index values.
/* Encoding
*
* this file defines the sub-typing of data atoms.
* All data are packed into integer handles. The benefit for
* array operations is all data atoms will have a uniform
* size no matter what the content actually is. This replaces
* the intptr_t hackery (ab)used in earlier versions
* (not portable to 64bit build).
*
* the array data are always just straight 32bit integers.
* but we treat as a 7bit tag and 24bit integer value.
* An immediate integer value is indicated by a negative
* sign-bit or all-zero tag. In essence, a 25bit sign/magnitude
* rep with no -0. This also means that we're not really using
* up all the available bits. Depending upon the final suite
* of distinct types and the desired "word size", this arrangement
* might be optimized further.
*
* Composite objects (boxed or reference objects) have
* an associated pointer stored in an array associated
* with the tag. Thus an array object can be enclosed
* into a scalar (integer handle) with
*
* int x;
* x = cache(ARRAY, array_new_dims(3,3)); //3x3 matrix
*
* To better convey the abstract use of this integer type,
* we will make use of this typedef to designate such int-handles.
*
* commont.h:
* typedef int object;
*
* the array data structure (which is implicitly a pointer
* to its struct) can be retrived from the handle
* with
*
* array a;
* a = getptr(x);
*
* Most functions will need to check the types of their
* arguments in order to determine how to proceed.
* This can be accomplished with `gettag()`.
*
* switch(gettag(x)){
* case LITERAL: // handle atomic integer
* break;
* case ARRAY: {
* array X = getptr(x);
* }
* }
*/
#include <stdio.h>
#include <stdlib.h>
#include "common.h"
#include "encoding.h"
#include "array.h"
int gettag(object d){
if (d<0) return 0; /* negatives are literals */
integer int32;
int32.int32 = d;
return int32.uint32 >> 24;
}
int getval(object d){
if (d<0) return d;
integer int32;
int32.int32 = d;
return int32.uint32 & ((1U<<24)-1);
}
object newdata(int tag, int val){
if (tag==LITERAL && val<0) return val;
integer int32;
int32.uint32 = ((unsigned)tag << 24) | ((unsigned)val & ((1U<<24)-1));
int x = int32.int32;
DEBUG(3,"newdata %x(%d %d)\n", x, tag, val);
return x;
}
integer nulldata;// = { .data = { .tag = NULLOBJ, .val = 0 } };
object null /* = nulldata.int32 */;
integer markdata;// = { .data = { .tag = MARKOBJ, .val = 0 } };
object mark /* = markdata.int32 */;
object nil;
object blank;
void init_en(void){
nulldata.uint32 = newdata(NULLOBJ, 0);
null = nulldata.int32;
markdata.uint32 = newdata(MARKOBJ, 0);
mark = markdata.int32;
cache(LBRACOBJ, array_new_rank_dims(0));
blank = newdata(CHAR, ' ');
}
int addnewtocache(size_t *used, size_t *max, void ***data, void *ptr){
if (*used == *max){
*max = *max * 7 + 11;
void *tmp = realloc(*data, *max * sizeof(void*));
if (!tmp) return null;
*data = tmp;
}
int z = (*used)++;
(*data)[z] = ptr;
DEBUG(3,"addnew %d %p %p\n", z, ptr, (*data)[z]);
return z;
}
struct memory_bank {
size_t used, max;
void **tab;
} memory_bank[LAST_INDEXED_TYPE - FIRST_INDEXED_TYPE + 1];
object cache(int tag, void *ptr){
if (tag < FIRST_INDEXED_TYPE || tag > LAST_INDEXED_TYPE)
return null;
int idx = tag - FIRST_INDEXED_TYPE;
return newdata(tag,
addnewtocache(&memory_bank[idx].used,
&memory_bank[idx].max,
&memory_bank[idx].tab,
ptr));
}
void *getptr(object d){
if (d<0) return NULL;
int tag = gettag(d);
if (tag < FIRST_INDEXED_TYPE || tag > LAST_INDEXED_TYPE)
return NULL;
int idx = tag - FIRST_INDEXED_TYPE;
return memory_bank[idx].tab[getval(d)];
}
// fill returns a "blank" value for any type
// and identity elements for verbs
object getfill(object d){
switch(gettag(d)){
case PCHAR:
switch(getval(d)){
case '+':
return 0;
case 0x00d7: // Times
case 0x00f7: // Divided-By
case '*':
return 1;
} /*fallthru*/
default:
case LITERAL:
return newdata(CHAR, 0x2300); //null
return newdata(CHAR, 0x2316); //position
return newdata(CHAR, 0x2218); //jot
//return newdata(LITERAL, (1<<24)-1);
case CHAR: return newdata(CHAR, ' ');
}
}
Additional referenced files from the same commit: array.h ../ppnarg.h. See my SO Question for an in-depth explanation of the array code. ../ppnarg.h
implements a macro which counts how many arguments it was called with. This is used to provide variadic constructors for the array type. It's a wonderous little macro that should be part of everyone's cpp arsenal, but I didn't write it.
Any problem spots or infelicities? Any "too clever" spots?
I fear I may have made it difficult to design a garbage collector since my allocations are scattered all over this "data bank". It doesn't lend itself to a generational collector or a copying collector, and even mark-sweep will leave unused handles unless there's an extra free-list just for handles. Is there a better way to design my data bank to make it easier to scan for garbage?
-
\$\begingroup\$ For `reference: Wiki note APL \$\endgroup\$chux– chux2016年09月10日 12:31:15 +00:00Commented Sep 10, 2016 at 12:31
-
\$\begingroup\$ comp.lang.c meta-discussion of my GC obsession with this code. \$\endgroup\$luser droog– luser droog2016年10月01日 02:24:40 +00:00Commented Oct 1, 2016 at 2:24
1 Answer 1
Any problem spots or infelicities? Any "too clever" spots?
Overall: Recommend to not causally mix int/unsigned
, int32_t/uint32_t
and size_t
. Too many cases where code would fail by using both pairs. Suggest using int/unsigned
instead of int32_t/uint32_t
with a #if INT_MAX >= 2147483647
test. (or fix/test the casual conversions.)
Disagree with not detailing
../ppnarg.h
. If the contents of../ppnarg.h
are common enough to be common, why is notcommon.h
also considered common enough to be not posted? Lacking this tile, a review cannot completely compile the code. Without compilation, automated tools are prevented from use. Without automated tools, this review request becomes unnecessarily extra burdensome.common.h
is too common. Consider your successful code being used on a larger effort. PerhapsAPL_common.h
.?What/where is
#include "array.h"
?"retrived" --> "retrieved"
Unclear double fallthru with the comment
/*fallthru*/
at the end of a switch statement. It appears to apply to thecase PCHAR:
which then falls though to thedefault:
Strange to see acase
afterdefault
. Code uses a/*fallthru*/
comment there but lacks 2 such comments aftercase 0x00d7:
andcase 0x00f7:
. This inconsistent style confuses. Top levelswitch/case
indents thecase
. Nestedswitch/case
does not indent thecase
. Inconsistent style unnecessarily confuses the review. This implies OP is not using automated formatting. Coding styles that depend on manual formatting are high maintenance. Use an automated formatter.switch(gettag(d)){ case PCHAR: switch(getval(d)){ case '+': return 0; case 0x00d7: // Times case 0x00f7: // Divided-By case '*': return 1; } /*fallthru*/ default: case LITERAL:
Incomplete prototype declaration. Using
()
allows following code to wrongly pass parameters and the compiler does not warn.// void init_en(); void init_en(void);
Defeat of data abstraction.
integer
does not imply size, bit of some size.int32
implies valuable is 32-bit. I'd expect either 1)integer
to beinteger32
if it is always going to be 32-bit or 2) ifinteger
is abstract,int32
have a name without an implied size.int gettag(object d) { integer int32; int32.int32 = d;
Possible data loss.
d
is of typeobject
(presentlyint
) andint32.int32
is of typeint32_t
. This inconsistency should be resolved. Recommendtypedef union integer { unsigned u32; int i32; } integer;
int32.int32 = d; // Potential narrowing of type. int getval(object d) { if (d < 0) return d; // Potential narrowing of type.
Even discussing the type
array
andstruct array
is confusing in context with code that uses arrays. Consider a different name.// typedef struct array *array; // Suggest typedef struct APL_array *APL_array;
Not a fan of using the same name for a type and
struct/union
. Although these exist in different name spaces in C, they readily collide in a reviewers understanding. Suggesttypedef struct APL_array_S *APL_array;
or drop the typedef and only usestruct APL_array
.Pedantic: Not portable when
int
< 24 bit.// int32.uint32 = ((unsigned) tag << 24) | ((unsigned) val & ((1U << 24) - 1)); // more portable int32.uint32 = (unsigned) tag; int32.uint32 = (int32.uint32 << 24) | ((unsigned) val & ((1LU << 24) - 1));
Pedantic: Printing
int
values with%x
outside the[0...INT_MAX]
range is UB.int x = ... // DEBUG(3,"newdata %x(%d %d)\n", x, tag, val); DEBUG(3,"newdata %x(%d %d)\n", (unsigned) x, tag, val);
Possible loss of info in changing types.
size_t *used ... int z = (*used)++;
The type name
object
misleads. In C, an object can be astruct
,union
,complex double
, etc. Much code depends onobject
being a real value.void *getptr(object d) { if (d < 0) return NULL;
Adding common name to global space. These will certainly collide on a large project. Assume your good code will be re-used on larger applications. Suggest putting all the following in a
struct
variable, maybe calledAPL
.integer nulldata; // = { .data = { .tag = NULLOBJ, .val = 0 } }; object null /* = nulldata.int32 */; integer markdata; // = { .data = { .tag = MARKOBJ, .val = 0 } }; object mark /* = markdata.int32 */; object nil; object blank;
-
\$\begingroup\$ For point 1, I've added links to the files. Is that sufficient or should I inline them here? \$\endgroup\$luser droog– luser droog2016年09月10日 22:31:19 +00:00Commented Sep 10, 2016 at 22:31
-
\$\begingroup\$ I see what you mean in point 5, but I felt it was important to highlight the one fall-through that was hard to see, viz. when none of the sub-switch's cases are taken, and the outer case falls through to the default (which itself then falls-on-through to the next case). But all the other falls are easy to see by comparison. Perhaps I should add a
default: ; /*fallthru*/
to the innerswitch
? \$\endgroup\$luser droog– luser droog2016年09月10日 23:07:38 +00:00Commented Sep 10, 2016 at 23:07 -
\$\begingroup\$ For 14, instead of object, I've also considered
token
,datum
,atom
orentity
. Is one of these better, or something else? \$\endgroup\$luser droog– luser droog2016年09月10日 23:43:09 +00:00Commented Sep 10, 2016 at 23:43 -
\$\begingroup\$ @luser droog How about
APL_object
? I'm thinking of how to use your good code in the future with other code. \$\endgroup\$chux– chux2016年09月11日 01:53:26 +00:00Commented Sep 11, 2016 at 1:53 -
1\$\begingroup\$ I think the solution for 5 is to break it into 2 or more functions. It's just doing too many things. \$\endgroup\$luser droog– luser droog2016年09月28日 19:24:43 +00:00Commented Sep 28, 2016 at 19:24