Parse data file and verify checksum

Question 1

Here's my assignment:

Write a program that takes a filename as its only command line argument and prints the results to std::out formatted as follows:
The first line of the output should be whether or not the checksum passed ("Checksum passed", "Checksum failed")

If the checksum passed, the rest of the file will contain the parsed data. Print label value pairs separated by a colon for each field, one per line (all numbers should be represented as decimals).
An expected output file is provided for the first binary file to demonstrate format.

File structure (Little Endian):
 size: 4 bytes
 entry count: 2 bytes
 --- entry ---
 first name length: 4 bytes
 first name: first name length bytes
 last name length: 4 bytes
 last name: last name length bytes
 flags: 1 byte
 * age: 1 byte
 * height: 1 byte
 -------------
 zero padding: x bytes
 checksum: 4 bytes
 * = optional
Field descriptions:
size: Total number of bytes in the file

entry count: Number of entries in the file

entry

first name length: Number of characters in the following first name string since it is not NULL terminated

first name: Non-null terminated string

last name length: Number of characters in the following last name string since it is not NULL terminated

last name: Non-null terminated string

flags: Bit mask which indicates presence of either age or height fields.

age field present

height field present

age: Age in years. Optional field.

height: Height in inches. Optional field.

zero padding - The file is zero padded to ensure 4 byte alignment.

checksum: This value is calculated by summing every four bytes of the data (not including the actual checksum value)

FileParser.cpp

// FileParser.cpp : Defines the entry point for the console application.
//
#include<iostream>
#include <string>
#include <fstream>
#include <sstream>
void ParseFile(char* dataBuffer);
int main(int argc, char *argv[])
{
 std::string inputFileName = argv[1];
 std::ifstream inputFile(inputFileName, std::ios::binary | std::ios::ate);
 if (inputFile.good())
 {
 //get File size
 std::streamoff size = inputFile.tellg();
 char *data = new char[size];
 inputFile.seekg(0, std::ios::beg);
 inputFile.read(data, size);
 inputFile.close();
 int checkSum = 0;
 for (int i = 4; i <= size - 4; i = i + 4)
 {
 checkSum += (data[i - 1] << 24) |
 (data[i - 2] << 16) |
 (data[i - 3] << 8) |
 (data[i - 4]);
 }
 int checksumfromFile = 0;
 memcpy(&checksumfromFile, &data[size - 4], sizeof(int));
 if (checkSum == checksumfromFile)
 {
 std::cout << "Checksum passed" << std::endl;
 ParseFile(data);
 }
 else
 std::cout << "Checksum Failed" << std::endl;
 }
 return 0;
}
void ParseFile(char* data)
{
 unsigned char sizeinBytes = 4;
 unsigned char count = 2;
 unsigned char firstNameLength = 4;
 unsigned char lastNameLength = 4;
 unsigned char flags = 1;
 unsigned char age_byte = 1;
 unsigned char height_byte = 1;
 unsigned char checksum = 4;
 int offset = 0;
 // size
 int size;
 memcpy(&size, &data[offset], sizeof(int));
 std::cout << "size:" << size << std::endl;
 offset += sizeinBytes;
 //count of entries
 short entryCount;
 memcpy(&entryCount, &data[offset], sizeof(short));
 std::cout << "entry count::" << entryCount << std::endl;
 offset += count;
 for (int i = 0; i < entryCount; i++)
 {
 //firstNameLength
 int lengthofFirstName;
 memcpy(&lengthofFirstName, &data[offset], sizeof(int));
 std::cout << "first name length:" << lengthofFirstName << std::endl;
 offset += firstNameLength;
 //firstName
 char* firstName = new char[lengthofFirstName];
 memcpy(firstName, &data[offset], lengthofFirstName);
 firstName[lengthofFirstName] = '0円';
 std::cout << "first name:" << firstName << std::endl;
 offset += lengthofFirstName;
 // 4 bytes
 int lengthofLastName;
 memcpy(&lengthofLastName, &data[offset], sizeof(int));
 std::cout << "last name length:" << lengthofLastName << std::endl;
 offset += lastNameLength;
 char* lastName = new char[lengthofLastName];
 memcpy(lastName, &data[offset], lengthofLastName);
 lastName[lengthofLastName] = '0円';
 std::cout << "last name:" << lastName << std::endl;
 offset += lengthofLastName;
 char flag = data[offset];
 std::cout << "flags:" << (int)flag << std::endl;
 offset += flags;
 char age;
 char height;
 if (flag == 3)
 {
 age = data[offset];
 height = data[offset + 1];
 offset += age_byte + height_byte;
 std::cout << "age:" << (int)age << std::endl;
 std::cout << "height:" << (int)height << std::endl;
 }
 else if (flag == 2)
 {
 height = data[offset];
 offset += height_byte;
 std::cout << "height:" << (int)height << std::endl;
 }
 else if (flag == 1)
 {
 age = data[offset];
 offset += age_byte;
 std::cout << "age:" << (int)age << std::endl;
 }
 }
 int checkSum;
 memcpy(&checkSum, &data[size - 4], sizeof(int));
 std::cout << "checksum:" << checkSum << std::endl;
}

Question 2

(Welcome to CR!) It might help if you explicitly stated what you are looking for in reviews. Can you provide a sample input with, say, 2 entries? Does your program work for that input?

Question 3

Just curious about the close votes. They currently say either that the question doesn't include code or the code is broken, or that the code is pseudocode or hypothetical code or someone else's code. Clearly, the code is here, so it's not the first one. I've tested it on a simple file and it works fine, so it's not broken. It's not pseudocode or hypothetical code. I can't say whether it's someone else's, but these close votes make no sense otherwise. Downvotes are understandable given the lack of description, but not close votes.

Question 4

@user1118321: tested it on a simple file and it works fine glad to hear, having 2nd thoughts reading inputFile(...)...streamoff size = inputFile.tellg() ... new char[size] (ate was beyond the right margin of CR's code peephole...).

Question 5

Your question currently leaves some things to be desired. I'd recommend taking a look at Simon's Guide to posting a good question. In particular, you could improve the formatting of your question (provide a pure english part outside the code block) and state whether or not you have tested this code.

Question 6

I've added a new question which includes C++ code that generates a test file for this format.

Question 7

I like the idea of being able to checksum a file to see if it's been altered or corrupted. That can be very useful in a variety of situations! Here are some thoughts on how you could improve it.

Simplify

I find this loop a little hard to understand at first:

 int checkSum = 0;
 for (int i = 4; i <= size - 4; i = i + 4)
 {
 checkSum += (data[i - 1] << 24) |
 (data[i - 2] << 16) |
 (data[i - 3] << 8) |
 (data[i - 4]);
 }

Starting i at 4 and then accessing data[ i - # ] seems a lot harder to understand than starting i at 0 and accessing data[ i + # ]. I think this is the first time I've ever seen it done this way. It's not wrong, it's just different from what 99% of programmers would do, so it takes longer to understand. Note that your checksum technique assumes it's running on a little endian architecture. That's probably fine here, but it's worth noting that there are other architectures out there where this would produce incorrect results. A comment explicitly pointing that out would be sufficient here, in my opinion.

Careful with Assumptions

In getting the checksum out of the file you've made an assumption that's not necessarily true:

 int checksumfromFile = 0;
 memcpy(&checksumfromFile, &data[size - 4], sizeof(int));

This explicitly claims that sizeof(int) is 4 bytes. That is likely to be true on common architectures today, but isn't guaranteed by the language's standard. However, the specification for this file format does say the checksum will be 4 bytes. So you need to ensure that you only attempt to copy 4 bytes from the file. I recommend using a sized type like int32_t instead of int since you know that you actually need 4 bytes:

int32_t checksumfromFile = 0;
memcpy(&checksumfromFile, &data[ size - sizeof(checksumfromFile) ], sizeof(checksumfromFile));

I also set the thing inside of sizeof() to be the thing you're writing to. This will ensure you never write past the end of the variable you're copying data into.

Remove Unused Variables

The variable checksum is unused in the ParseFile() function. You should remove it. You should also turn up the warnings on your compiler to tell you about these sorts of issues.

Put Common Code into a Function

If you look at the code for reading in the first name and the last name, you'll see that it's largely the same. I recommend making either a single function to read a length and a string, or 2 functions – 1 to get the length and 1 to get the actual string:

int32_t getStringLength(const char* data, const size_t offset)
{
 int32_t length = 0;
 memcpy(&length, data [ offset ], sizeof(length);
 return length;
}
char* copyString(const char* from, const int32_t length)
{
 char* to = new char [ length + 1 ];
 if (to != nullptr)
 {
 memcpy(to, from, length);
 to [ length ] = '0円';
 }
 
 return to;
}

Then you can call it in 2 places like so:

for (int i = 0; i < entryCount; i++)
{
 //firstNameLength
 int32_t lengthofFirstName = getStringLength(data, offset);
 std::cout << "first name length:" << lengthofFirstName << std::endl;
 offset += firstNameLength;
 //firstName
 char* firstName = copyString(data, offset, lengthofFirstName);
 if (firstName != nullptr)
 {
 std::cout << "first name:" << firstName << "\n";
 }
 offset += lengthofFirstName;
 // ... etc.

Then you can do the same thing for the last name. I fixed a memory bug when writing the function, so if you use the function, it's fixed in both cases. You were allocating 1 byte too few for the name plus terminating NUL character.

Naming

Most of your naming is good. That's rare! I would change a few things, though. First, you're using camelCase, but you're not actually capitalizing each word. You're leaving prepositions lowercase. I would capitalize them. It's not a huge deal, but is unexpected.

Next, the names firstNameLength and lengthofFirstName are too close and confusing. firstNameLength is not the length of the first name. It's the length of the data that holds the length of the first name. Why not call it something like sizeBytes or firstNameSizeBytes or numBytesFirstNameSize or something that explains how it differs from lengthofFirstName? Same with the last name. Perhaps you could have a named constant at the top of the file that holds the size of each record element? Something like this:

const size_t kNumBytesFirstNameSize = 4;
const size_t kNumBytesLastNameSize = 4;
const size_t kNumBytesFlagsSize = 1;
const size_t kNumBytesChecksum = 4;

I would also add named constants for what the flags mean:

enum PersonFlag {
 kAgeMask = 0x01,
 kHeightMask = 0x02
}

Then you can check them via:

if (flag & kAgeMask) 
{
 // ...read in age...
}
if (flag & kHeightMask)
{
 // ... read in height ...
}

Note that there's less code here. You only need 2 checks instead of 3.

user1118321 user1118321 11.9k1 gold badge20 silver badges46 bronze badges · Answer 1 · 2018-02-19 03:33:21Z

I like the idea of being able to checksum a file to see if it's been altered or corrupted. That can be very useful in a variety of situations! Here are some thoughts on how you could improve it.

Simplify

I find this loop a little hard to understand at first:

 int checkSum = 0;
 for (int i = 4; i <= size - 4; i = i + 4)
 {
 checkSum += (data[i - 1] << 24) |
 (data[i - 2] << 16) |
 (data[i - 3] << 8) |
 (data[i - 4]);
 }

Starting i at 4 and then accessing data[ i - # ] seems a lot harder to understand than starting i at 0 and accessing data[ i + # ]. I think this is the first time I've ever seen it done this way. It's not wrong, it's just different from what 99% of programmers would do, so it takes longer to understand. Note that your checksum technique assumes it's running on a little endian architecture. That's probably fine here, but it's worth noting that there are other architectures out there where this would produce incorrect results. A comment explicitly pointing that out would be sufficient here, in my opinion.

Careful with Assumptions

In getting the checksum out of the file you've made an assumption that's not necessarily true:

 int checksumfromFile = 0;
 memcpy(&checksumfromFile, &data[size - 4], sizeof(int));

This explicitly claims that sizeof(int) is 4 bytes. That is likely to be true on common architectures today, but isn't guaranteed by the language's standard. However, the specification for this file format does say the checksum will be 4 bytes. So you need to ensure that you only attempt to copy 4 bytes from the file. I recommend using a sized type like int32_t instead of int since you know that you actually need 4 bytes:

int32_t checksumfromFile = 0;
memcpy(&checksumfromFile, &data[ size - sizeof(checksumfromFile) ], sizeof(checksumfromFile));

I also set the thing inside of sizeof() to be the thing you're writing to. This will ensure you never write past the end of the variable you're copying data into.

Remove Unused Variables

The variable checksum is unused in the ParseFile() function. You should remove it. You should also turn up the warnings on your compiler to tell you about these sorts of issues.

Put Common Code into a Function

If you look at the code for reading in the first name and the last name, you'll see that it's largely the same. I recommend making either a single function to read a length and a string, or 2 functions – 1 to get the length and 1 to get the actual string:

int32_t getStringLength(const char* data, const size_t offset)
{
 int32_t length = 0;
 memcpy(&length, data [ offset ], sizeof(length);
 return length;
}
char* copyString(const char* from, const int32_t length)
{
 char* to = new char [ length + 1 ];
 if (to != nullptr)
 {
 memcpy(to, from, length);
 to [ length ] = '0円';
 }
 
 return to;
}

Then you can call it in 2 places like so:

for (int i = 0; i < entryCount; i++)
{
 //firstNameLength
 int32_t lengthofFirstName = getStringLength(data, offset);
 std::cout << "first name length:" << lengthofFirstName << std::endl;
 offset += firstNameLength;
 //firstName
 char* firstName = copyString(data, offset, lengthofFirstName);
 if (firstName != nullptr)
 {
 std::cout << "first name:" << firstName << "\n";
 }
 offset += lengthofFirstName;
 // ... etc.

Then you can do the same thing for the last name. I fixed a memory bug when writing the function, so if you use the function, it's fixed in both cases. You were allocating 1 byte too few for the name plus terminating NUL character.

Naming

Most of your naming is good. That's rare! I would change a few things, though. First, you're using camelCase, but you're not actually capitalizing each word. You're leaving prepositions lowercase. I would capitalize them. It's not a huge deal, but is unexpected.

Next, the names firstNameLength and lengthofFirstName are too close and confusing. firstNameLength is not the length of the first name. It's the length of the data that holds the length of the first name. Why not call it something like sizeBytes or firstNameSizeBytes or numBytesFirstNameSize or something that explains how it differs from lengthofFirstName? Same with the last name. Perhaps you could have a named constant at the top of the file that holds the size of each record element? Something like this:

const size_t kNumBytesFirstNameSize = 4;
const size_t kNumBytesLastNameSize = 4;
const size_t kNumBytesFlagsSize = 1;
const size_t kNumBytesChecksum = 4;

I would also add named constants for what the flags mean:

enum PersonFlag {
 kAgeMask = 0x01,
 kHeightMask = 0x02
}

Then you can check them via:

if (flag & kAgeMask) 
{
 // ...read in age...
}
if (flag & kHeightMask)
{
 // ... read in height ...
}

Note that there's less code here. You only need 2 checks instead of 3.

Stack Exchange Network

Parse data file and verify checksum

File structure (Little Endian):

Field descriptions:

entry

FileParser.cpp

1 Answer 1

Simplify

Careful with Assumptions

Remove Unused Variables

Put Common Code into a Function

Naming

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Parse data file and verify checksum

File structure (Little Endian):

Field descriptions:

entry

FileParser.cpp

1 Answer 1

Simplify

Careful with Assumptions

Remove Unused Variables

Put Common Code into a Function

Naming

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions