How should I represent to-be binary data?

Question 1

I'm writing some serialization code, and I'm wondering how to deal with binary data. As I'm doing it in Python, my goal is to make it very simple, not require a lot of programmer overhead, etc.

Three options I am considering:

The fields that will be binary data are represented as hex-encoded strings. Thus you'd have something like:
```
obj = {
 'foo': 100,
 'bar': [1, 2, 3, 4],
 'baz': "ab0123ffbbaa55",
}
ObjectSpec.loads(ObjectSpec(obj).dumps())
```
ObjectSpec is the class which determines how to serialize the object.
- The pros: it's easy to look at, easy to make object literals, easy to print out.
- The cons: you have to remember to hex-encode the fields. If you have bytes, you have to hex-encode them before the serialization code then hex-decodes them. If you want to store the objects, there's more overhead unless you hex-decode the strings first.
The fields are byte strings, instead, e.g.:
```
obj = {
 'foo': 100,
 'bar': [1, 2, 3, 4],
 'baz': '\xab\x01#\xff\xbb\xaaU',
}
```
- The pros: less overhead, both in space, and in not having to hex-encode if you already have bytes.
- The cons: harder to make literals, harder to print out. If you accidentally leave in a hex-encoded string then it will serialize the wrong thing (the hex representation instead of the thing itself).
The binary data fields use some custom type, e.g. bson.Binary:
```
from bson import Binary
obj = {
 'foo': 100,
 'bar': [1, 2, 3, 4],
 'baz': Binary('\xab\x01#\xff\xbb\xaaU'),
}
```
- The pros: Same as #2, but also clearly delineates binary types.
- The cons: Same as #2, except harder to accidentally encode the wrong thing. Requires wrapping the data in a type just to get the serialization code to accept it, instead of leaving bytes in.

What would the most sensible approach be? Is there another variant that is better?

Question 2

Have you considered Google Protocol Buffers ?

Question 3

@BenCottrell: Thanks for the suggestion! I'm actually re-implementing some C++ app's custom serialization (yep...), so I can't take that option in this case.

Question 4

So you have no way to modify the C++ code for that app? Protocol Buffers is portable and has bindings in many languages, including C++.

Question 5

@BenCottrell: I can indeed change the C++ code. Can I do custom serialization with the protocol buffers though? The serialization has to be backwards-compatible.

Question 6

That depends what you mean - protocol buffers is a wire format in itself (which you can't change); so it has serialisation types for int/string/double/etc. It also allows you to package up any kind of existing arbitrary binary data using bytes. You can create a byte array and wrapper that up inside a protobuf message if you need to preserve the existing format. However if you can modify the C++ code, then you could create new messages using the bindings (The bindings are auto-generated classes using getters/setters for int/bool/std::string/etc. )

Question 7

You can define binary and hexadecimal values in python, just use 0xff or 0b1001010101001. They are defined as sub-class of int. chr and ord function reads them very clearly.

object = {
 "foo": 100 
 "bar": [1,2,3,4],
 "baz": 0xffaff34441faabc # i realy dont get what foo,bar and baz are so i dont really know what your string should represent.
}

And then use binary operators and binary shift left, binary shift right to manage it into a right order. for example sending 13 byte data to someport with imaginary syntax PORT:STOP:MESSAGE : 0xFF00737461636B65786368616E6765, or send "stackexchange" to port 255, where 0xff is port, 0x00 is STOP, and the rest is message. So your function would be something like this, first 2 bytes are port, if next iz 0x00 then split and the rest "translate" to string.

If you want to test in safe enviorment you can use
map(chr, [0x73,0x74,0x61,0x63,0x6B,0x65,0x78,0x63,0x68,0x61,0x6E,0x67,0x65])

Danilo Danilo 1314 bronze badges · Answer 1 · 2016-12-30 18:27:32Z

You can define binary and hexadecimal values in python, just use 0xff or 0b1001010101001. They are defined as sub-class of int. chr and ord function reads them very clearly.

object = {
 "foo": 100 
 "bar": [1,2,3,4],
 "baz": 0xffaff34441faabc # i realy dont get what foo,bar and baz are so i dont really know what your string should represent.
}

And then use binary operators and binary shift left, binary shift right to manage it into a right order. for example sending 13 byte data to someport with imaginary syntax PORT:STOP:MESSAGE : 0xFF00737461636B65786368616E6765, or send "stackexchange" to port 255, where 0xff is port, 0x00 is STOP, and the rest is message. So your function would be something like this, first 2 bytes are port, if next iz 0x00 then split and the rest "translate" to string.

If you want to test in safe enviorment you can use
map(chr, [0x73,0x74,0x61,0x63,0x6B,0x65,0x78,0x63,0x68,0x61,0x6E,0x67,0x65])

Stack Exchange Network

How should I represent to-be binary data?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How should I represent to-be binary data?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions