Can you kindly take a look at the code? Any suggestions regarding performance, readability, potential bugs, portability, security measures, etc. are highly appreciated.
The code reads a character, a byte is formed by using 4 msbs of the data and a random number placed in 4 lsbs which is an offset to lower half of the first byte. The gap filled by (pseudo) random numbers.
/* File: fat.c: Reads from stdin, writes the byte manipulated
output to stdout in hex string or raw binary. */
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <time.h>
// init random number generator using time and PID
#ifdef _WIN32
#include <windows.h>
#define RANDOMIZE() \
(srand((unsigned)time(NULL) ^ (unsigned)GetCurrentProcessId()))
#else
#include <unistd.h>
#define RANDOMIZE() \
(srand((unsigned)time(NULL) ^ (unsigned)getpid()))
#endif
#define HIGH_NIBBLE 0xf0
#define LOW_NIBLLE 0x0f
#define PADDING_MASK 0x0f
// define available output formats and corresponding format strings
enum output_formats{
FMT_HEX_STR,
FMT_BINARY
};
const char *const formatstrings[] = {
"%02x",
"%c"
};
void encrypt_print(unsigned char data, int format)
{
unsigned padding;
unsigned char output;
padding = rand() & PADDING_MASK; // random N (0 to 15) bytes for padding
output = data | padding; // set low nibble to N padding bytes
printf(formatstrings[format], output);
for(int i=0; i < padding; i++)
printf(formatstrings[format],(unsigned char) rand());
}
int main(int argc, char const *argv[])
{
unsigned char hi_nib, lo_nib;
int input_chr, format;
clearerr(stdin); // fgetc() doesn't set errno
RANDOMIZE();
format = FMT_HEX_STR; // alternate: FMT_BINARY
while ((input_chr = fgetc(stdin)) != EOF) {
hi_nib = input_chr & HIGH_NIBBLE;
lo_nib = input_chr & LOW_NIBLLE;
encrypt_print(hi_nib, format); // higher 4 bits of input byte
encrypt_print(lo_nib << 4, format); // lower 4 bits of input shifts left
}
// In case of a read error...
if(ferror(stdin)){
fprintf(stderr, "Error while reading input\n");
fflush(stderr);
return EIO;
}
fflush(stdout);
return EXIT_SUCCESS;
} // main
P.S: All great and helpful insights, selected the highest voted answer.
4 Answers 4
DRY
This is Too Complicated:
#ifdef _WIN32
#include <windows.h>
#define RANDOMIZE() \
(srand((unsigned)time(NULL) ^ (unsigned)GetCurrentProcessId()))
#else
#include <unistd.h>
#define RANDOMIZE() \
(srand((unsigned)time(NULL) ^ (unsigned)getpid()))
#endif
Define GET_PID instead.
And then you're in a position to portably combine it with srand()
.
single responsibility
void encrypt_print( ...
The identifier is a dead giveaway. We're doing two things:
- encrypting
- printing
Define encrypt()
which writes crypted bytes to
a supplied binary output buffer.
And then worry about how to print them.
(Yes, I know, average buffer consumption will be just half of
the worst case consumption, get over it, memory is cheap.
Or define a fancy API which doesn't always consume
the entire input buffer.)
crypto
The review context speaks of "byte manipulation encryption". But this is at best an encoding scheme.
The OP code completely ignores Kerchoffs' principle. For one thing, there's no keying material. Please don't use the word "encrypt" to describe this mapping function.
Many authors would use the word "chaff" to describe OP's "padding".
specification
The review context didn't really spell out what the "on the wire" format is.
And encrypt_print()
comes with no accompanying comments
on the gazintas and the gazoutas.
We really should see an English sentence that explains
caller must pass in data
with the low nybble cleared to zero.
Ideally the function would signal an error if that pre-condition
is violated.
(And who spells it "nibble", anyway? Maybe that's half a "bite"?)
indistinguishability
The OP includes no requirements or desiderata, but one might be that "someone who encounters the output cannot distinguish it from rand() output," even if the input is English sentences rendered with printable 7-bit ASCII characters plus SPACE and newline. Output is roughly an order of magnitude bigger than input.
When viewing output nybbles, some will be more popular than others. And they always appear in adjacent bytes. This may be enough to flunk a randomness test. The amount of chaff dwarfs the wheat, presumably so the statistics won't look too far off.
If you want to produce more compact output, consider viewing it as a bit stream, so you can flexibly insert one or more bits of chaff. That way the wheat will appear at all eight possible bit offsets.
As a separate idea, consider replacing the fair coin toss of rand() with a biased coin, so roughly half the output bits will be set. Count the number of ones output in some recent window, and as it strays from 50% we adjust the bias of our unfair coin. This will make it harder to notice we processed simple ASCII input.
sanity check
Consider incorporating things like lengths or especially checksums into the specified output format. Then if the middle one third of a transmission is lost or mangled, we can find where the last third begins, and successfully recover it.
For example, we could start with some keying material, and take SHA3 of that catenated with input to be the padding, rather than use a PRNG or true /dev/random bits as padding. There's also an opportunity to begin the output with random salt which contributes to such a cumulative checksum.
Consider rolling a random 64-bit number which you put in the source code, and then always start the output stream with that magic number. That will make it easy to recognize this peculiar file format. If you switch to some v2 version of the format, change the magic number as well.
test suite
We see the encoding half here, without the corresponding source code to decode a bytestream. This setup is perfect for "roundtrip" tests, where we mangle the input and then bring it back alive. Write some automated tests, and then apply a fuzzer.
-
\$\begingroup\$ TYSM for valuable feedback. GET_PID done, also separated the two functions. about the encryption you're totally correct, I misused the word. This code I wrote mainly for encoding my small data (passes,urls, etc) bc it has a large overhead (but it doesn't mean to continue with my mistakes). I'm working on others. Just saw in your latest edit some will be more popular than others. And they always appear in adjacent bytes like what? and also about memory, you don't mean a block like 4096B? ig 32B are enough for now? \$\endgroup\$user174174– user1741742024年05月06日 08:21:55 +00:00Commented May 6, 2024 at 8:21
-
\$\begingroup\$ GPG might be a good fit for your crypted passphrases. In general you want AES or similar, rather than the OP rot13 approach. // "more popular" When examining the output from functions like SHA3 or AES, we expect roughly 50% of the bits to be
0
, 1/16th of the nybbles to be0000
, and 1/256th of the bytes to be all zero. Any deviation from that lets us win a game where we distinguish /dev/random output from OP's output. English ASCII text is not uniform, and you send such nybbles unmodified. There is structure: two together, then chaff. \$\endgroup\$J_H– J_H2024年05月06日 16:30:11 +00:00Commented May 6, 2024 at 16:30 -
1\$\begingroup\$ @J_H I think you've misread the operation. This IS weak, but... There are (randomly) 0 to 15 bytes of chaff following each n[yi]bble of output. Chaff is everywhere... \$\endgroup\$Fe2O3– Fe2O32024年05月06日 19:58:44 +00:00Commented May 6, 2024 at 19:58
-
1\$\begingroup\$ Uggh, quite right. Nonetheless, simple frequency analysis on nybbles quickly leads to distinguishability. Next we look for structure. "etaoin shrdlu" tells us the nybble
0x6
will be popular, so having found one we try to predict where another will appear. Which may lead the analyst to write down position deltas, and to notice that0x6
often shows up within 32, and0x5
and0x1
appear abnormally often within 16 positions, at which point Kerchoffs may become irrelevant. \$\endgroup\$J_H– J_H2024年05月06日 20:12:30 +00:00Commented May 6, 2024 at 20:12
Use the standard prototype for main()
:
The C standard specifies:
int main(void)
and:
int main(int argc, char **argv) // or *argv[]
There's none where argv
has type char const *argv[]
. Though you haven't used argc
and argv
in the code, so I don't see why you aren't using the first form.
(Here I am assuming that you do not have any interest in parsing command line arguments to switch between the two output formats).
Unnecessary clearing of stdin
's error indidcator:
clearerr(stdin); // fgetc() doesn't set errno
It may not. There's no reason for stdin
's error indicator to be set at the beginning of the program that would necessitate this call. Simply elide it.
Confine variables to the smallest scope possible:
Since C99, variables don't need to be declared at the top of a block.
unsigned char hi_nib, lo_nib;
int input_chr, format;
Besides input_chr
, all of these variables are used inside the while
loop, and that's where you should declare them.
And hi_nib
and low_nib
can probably be eliminated.
Superfluous comments:
// In case of a read error...
if(ferror(stdin)){
fprintf(stderr, "Error while reading input\n");
fflush(stderr);
return EIO;
}
fflush(stdout);
return EXIT_SUCCESS;
} // main
ferror(stdin)
was clear enough, you need not explain that it is checking for an error.
And it is very clear where main()
ends, so // main
should be removed.
Unnecessary flushing of stderr
:
By default, stderr
is unbuffered. There's no need to explicitly call fflush(stderr)
.
Lack of error-checking for write operations, and lack of newline:
We have ignored the return value of printf()
and the subsequent call to fflush()
. According to the ISO C Standard:
The
printf
function returns the number of characters transmitted, or a negative value if an output or encoding error occurred.The
fflush
function sets the error indicator for the stream and returns EOF if a write error occurs, otherwise it returns zero.
It would be harmful to return success status and mislead the user into removing the source file if the destination isn't fully written.
I'd do:
if (printf(formatstrings[format],(unsigned char) rand()) < 0) {
/* Fail with an appropriate error code. */
}
and
if (fflush(stdout) == EOF) {
/* Likewise. */
}
You have not printed a newline after all this, which messes up the subsequent prompt in my terminal.
Minor:
format_strings
should be static
, and you don't need to use a tag for the enumerated type here. And it would be more ergonomic if the printing option was taken as a command line argument. In encrypt_print()
, you're comparing a signed int
to an unsigned int
in the loop condition. The compiler would have warned about it, had you enabled compiler warnings.
This is how I would write the code in C23/C24:
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
// The level of indirection is needed, INDIRECT(X) would expand to "X".
#define STRINGIFY(X) INDIRECT(X)
#define INDIRECT(X) #X
static constexpr const uint8_t HIGH_NIBBLE = 0xf0;
static constexpr const uint8_t LOW_NIBBLE = 0x0f;
static constexpr const uint8_t PADDING_MASK = 0x0f;
[[gnu::always_inline]] static inline void randomize(void)
{
#ifdef _WIN32
#include <windows.h>
srand((unsigned) time(nullptr) ^ (unsigned) GetCurrentProcessId());
/* All current UNIX-style OSes, including BSD, Linux, OSX, and Solaris. */
#elif defined(__unix__) || defined(__unix) || (defined(__APPLE__) && defined(__MACH__))
#include <unistd.h>
srand((unsigned) time(nullptr) ^ (unsigned) getpid());
#else
unreachable();
#endif
}
// Credit: @Fe2O3 (except the inlining part)
[[nodiscard, gnu::always_inline]] static inline bool encrypt_print(uint8_t ch)
{
uint8_t pad = rand() & PADDING_MASK, uch = ch | pad; // set low nibble (0 - 0xF)
// NB: pad ALSO acts as # of appended "chaff" bytes
// decode() can determine how many following "chaff" bytes to ignore
// Consider:
// if pad == 0, at least ch escapes
// if pad == 1, ch escapes, pad decrements, and one "chaff" escapes
// and so on...
do {
// set via "cc -D FMT=c a.c -o binchar.out"
// alternate "cc -D FMT=02x a.c -o hexpair.out"
if (printf("%" STRINGIFY(FMT), uch) < 0) {
return false;
}
// until done, keep reloading with a random byte value (0x01 - 0xFF)
} while(pad-- && (uch = (uint8_t)(rand() + 1)) != 0);
return true;
}
int main(void)
{
randomize();
int input_chr = {};
while ((input_chr = fgetc(stdin)) != EOF) {
if (!encrypt_print(input_chr & HIGH_NIBBLE) // higher 4 bits of input byte
|| !encrypt_print((input_chr & LOW_NIBBLE) << 4)) { // lower 4 bits of input shifts left
fprintf(stderr, "Error while writing output.\n");
return EIO;
}
}
if (ferror(stdin)) {
fprintf(stderr, "Error while reading input.\n");
return EIO;
}
putchar('\n');
if (fflush(stdout) == EOF) {
fprintf(stderr, "Error while writing output.\n");
return EIO;
}
return EXIT_SUCCESS;
}
Some new goodies I have used:
nullptr
constant- fixed underlying types for enumerated types
unreachable
macroconstexpr
for object definitions- attribute specifier sequence
- initialization of objects with empty braces
I haven't done it myself, but as @J_H suggested, encrypt_print()
should be replaced with two separate encrypt()
and print()
functions.
I have removed the array of strings (there were only two output formats).
Instead of assuming, I have added a check for all UNIX style operating systems. Note that executing unreachable()
would invoke undefined behavior, and you'd be better off replacing it with the error
preprocessor directive. You can also use the new __has_include
instead of checking for the presence of macros.
In OP's original code, if the format
has to be changed to FMT_BINARY
, format = FMT_HEX_STR
must be changed, and the program recompiled for alternate executable. I have tried avoiding that by passing a -D
flag to the compiler (examples are included in the code). This has some benefits:
- Both versions of executable can optionally (makefile targets) be built at once.
- Eliminate
enum
, the corresponding array of strings, and the ternary and the messy "edit/compile/re-edit/compile" cycle to generate both versions. format
is no longer a function parameter.- The code is simplified, and the LOC is reduced.
You may also wonder why I haven't included stdbool.h
for the definition of bool
, true
, and false
. It is because the inclusion is no longer required, as bool
, true
, and false
are keywords in C23/C24 (so is static_assert
, and thread_local
).
You could also elide the void
in main()
's declaration. Because function declarations whose parameter list is empty is now treated the same as a parameter list which contain a single void
, same as C++.
And if an implementation doesn't support the attributes, they shalt be ignored.
-
\$\begingroup\$ Instead of removing
fflush(stdout)
, isn't it better to actually use its return value, and warn the user if we were unable to completely write the output? Returning frommain()
without flushing or closing causes any write errors at that stage to be ignored. (And there's other places where we're missing error checking when we write - it would be harmful to return success status and mislead the user into removing the source file if the destination isn't fully written). \$\endgroup\$Toby Speight– Toby Speight2024年05月07日 06:43:31 +00:00Commented May 7, 2024 at 6:43 -
\$\begingroup\$ TYSM for sharing this answer. unfortunately I couldn't run your code (my coompiler is old gcc 12.2 for your bleeding edge code). main() yes, vscode extensions... Unnecessary flushing of stdout, and lack of newline ig it's a blocking function I'm aware of closing streams on exit, just to ensure everything has written or as @TobySpeight mentioned for future error checking. Omitting the newline is intentional there would be one extra byte in output which corrupts the data when decoding. My future goal was to use command line options for format, a file name, and max pad size. still long way... \$\endgroup\$user174174– user1741742024年05月07日 08:03:46 +00:00Commented May 7, 2024 at 8:03
-
\$\begingroup\$ ...And please explain why you inline randomize but not encrypt_print? \$\endgroup\$user174174– user1741742024年05月07日 08:11:57 +00:00Commented May 7, 2024 at 8:11
-
1\$\begingroup\$ The advice about the declaration of
main
is not correct. The C standard (5.1.2.2.1) specifiesint main(void)
,int main(int argc, char *argv[])
, "or equivalent." So, the original declaration is fine. (Or if you don’t think so,char **argv
is wrong.) \$\endgroup\$Davislor– Davislor2024年05月08日 21:13:43 +00:00Commented May 8, 2024 at 21:13 -
1\$\begingroup\$ @Davislor: No, 6.7.6.3 does not say that. What 6.7.6.3 14 does say means that parameter declarations of
T *argv[]
,T *argv[const]
, andT ** const argv
would be taken as the same when considering function compatibility. Note that the qualifier on the parameter directly is removed when considering compatibility, not the qualifier on what the parameter points to or any further levels of indirection. Specifically, it says "... In the determination of type compatibility... each parameter declared with qualified type is taken as having the unqualified version of its declared type." \$\endgroup\$Eric Postpischil– Eric Postpischil2024年05月09日 11:25:32 +00:00Commented May 9, 2024 at 11:25
A few that no one else has brought up yet:
Use the OS’ Entropy Pools
Since you’re using #ifdef
anyway to support Windows and Linux, you can use ProcessPrng
on Windows or fread
from /dev/urandom
on Linux. You can either write your own agnostic wrapper that calls one or the other, or extract the seed for a deterministic PRNG if you need to be able to reproduce the same output.
If you’re combining different sources of entropy this way, I’d probably mix the lower-order bits together in a bitfield. The higher-order bits of time(NULL)
will be very predictable (An attacker can probably guess the decade.) and so will those of the process ID (There were probably many fewer than 4 billion processes running on the server at that time.) You can portably get additional entropy from the wall-clock time in nanoseconds from timespec_get
, the process’ running time from clock
, and pointers to objects with different storage classes on OSes with address-space layout randomization.
Error-Handling
You currently have:
// In case of a read error...
if(ferror(stdin)){
fprintf(stderr, "Error while reading input\n");
fflush(stderr);
return EIO;
}
fflush(stdout);
The biggest problem here is that you don’t flush stdout
, which is buffered, before writing to stderr
. Since both write to the console by default, this can garble your output.
You do not, however, need to flush stderr
before you terminate the program, when you don’t write to stdout
again, since all files are flushed on exit.
It would be better to print an error message that includes the file and line number where the failure occurred, and the specific error from strerror
. If you always terminate on an output error, rather than retrying, it would be best to put this check in your output function, so the stack trace you get in the debugger has the full context, rather than in main()
. As others have said, separate encoding and printing.
The constant EIO
is technically not in the C standard, although it is in the C++ standard, POSIX, and the Windows API. If you wanted maximum portability, you could return EXIT_FAILURE
.
Some boilerplate I’ve often used (for those library functions that set errno
) is:
#define fatal_system_error(msg) \
fatal_system_error_handler( (msg), errno, __FILE__, __LINE__ )
Which calls the helper:
void fatal_system_error_handler( const char* const msg,
const int err,
const char* sourcefile,
const int line )
{
fflush(stdout); // Don't cross the streams!
fprintf( stderr,
"%s (%s:%d): %s\n",
msg,
sourcefile,
line,
strerror(err) );
exit(EXIT_FAILURE);
}
In C23, [[noreturn]]
is standard, so you can add that. If you need to support compilers that don’t have this, #define NORETURN
to how each compiler spells it.
Use Feature-Test Macros
Since there are several incompatible versions of the POSIX API, best practice is to have your source declare which one it is using:
#define _POSIX_C_SOURCE 200809L
#define _XOPEN_SOURCE 700
Put these before any system headers.
On Windows, you either want to #include <sdkddkver.h>
to get the latest Windows API your libraries know about, or set the version of Windows you are targeting. For example, to target version 1903 (which adds UTF-8 support) and up:
#define _WIN32_WINNT 0x0A00 // Windows 10
#define NTDDI_VERSION 0x0A000007 // version 1903
#include <sdkddkver.h>
-
\$\begingroup\$ @OP "In C23,
[[noreturn]]
is standard, so you can add that" ==> For C11 and above, you can also use the_Noreturn
keyword, or the convenience macronoreturn
fromstdnoreturn.h
. (These are termed as deprecated in C23 of course.) \$\endgroup\$Madagascar– Madagascar2024年05月07日 18:12:05 +00:00Commented May 7, 2024 at 18:12 -
\$\begingroup\$ I didn't know about
/dev/urandom
. Is that available on other UNIX-like operating systems too? \$\endgroup\$Madagascar– Madagascar2024年05月07日 18:14:35 +00:00Commented May 7, 2024 at 18:14 -
\$\begingroup\$ "It would be better to print an error message that includes the file and line number where the failure occurred, and the specific error from
strerror
." ==> Fair enough, but does the ISO C Standard specify thatfgetc()
would seterrno
to something meaningful on an error? (The POSIX Standard does, I believe.) \$\endgroup\$Madagascar– Madagascar2024年05月07日 18:18:55 +00:00Commented May 7, 2024 at 18:18 -
\$\begingroup\$ @Harith
_Noreturn
is deprecated and wouldn’t work in C++, but it would make sense to check#elif __STDC_VERSION__ >= 201112L
and#define NORETURN _Noreturn
if so. \$\endgroup\$Davislor– Davislor2024年05月07日 18:22:17 +00:00Commented May 7, 2024 at 18:22 -
1\$\begingroup\$ @J_H Other way around:
/dev/random
blocks until it can get enough high-quality entropy;/dev/urandom
uses a PRNG to fill in the rest of the bits if needed. (The U is for unlimited.) \$\endgroup\$Davislor– Davislor2024年05月07日 18:35:24 +00:00Commented May 7, 2024 at 18:35
Separation of concerns
Instead of a program that includes code to write either raw characters or hex, simplify to a program that writes the raw output, and allow users to pipe into the hex converter of their choice (od
, xxd
etc.). That makes for a program that's easier to maintain (and review!) and gives the user more flexibility (octal output, anyone?)
-
\$\begingroup\$ TYSM, I don't have a regular programming experience or even using windows in recent years, and don't think od or xxd are available there. Yes dumping binary is easier but also I may come up with some specific formats of my own. again I can write a separate code for that but for now there's much to improve ignoring the format. indeed you're correct thanks for valuable feedback. \$\endgroup\$user174174– user1741742024年05月07日 20:45:02 +00:00Commented May 7, 2024 at 20:45
/+
and=
for padding. I will try it... the idea is good but much harder to implement for me :) the idea behind this code was from FAT16... and ultimatelyecho -n SomeText | fat -e -f h -b 4 > file.txt
-[e]ncode -[d]ecode] -f [h]ex [b]in -b [1,2,3,4] bits to use for chaff. \$\endgroup\$