How can I improve this hash matching program using either abstraction or search/sort algorithms?
We briefly were introduced to searching methods such as bubble sort and merge sort, but I'm not sure how they can be implemented to improve my code.
The program takes a command-line hash input, compares it against generated hashes, and if it finds a match, prints out the key used to generate the matched hash.
(This is part of Harvard's CS50 on eDx Week 2 Problem Set)
#include <cs50.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <crypt.h>
#define _XOPEN_SOURCE
#include <unistd.h>
/*
* Goal: Program that takes cmd ln arg of password hash
* and compares it to program generated hashes until a
* match is found.
* Assumptions: password is max 5 chars && all chars alpha
*/
bool crack(string hash);
int main(int argc, string argv[])
{
// Check for 2 command line arguments
if (argc != 2)
{
printf("[ERROR 1] Usage: ./crack hash\n");
return 1;
}
string hash = argv[1];
if (!crack(hash))
{
printf("[ERROR 2] Could not crack! \n");
return 1;
}
}
bool crack(string hash)
{
string alphas = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
char key[6];
char salt[3];
salt[0] = hash[0];
salt[1] = hash[1];
salt[2] = '0円';
// 1 char pw test
for (int i = 0; i < strlen(alphas); i++)
{
// Iterate over alpha chars to generate test key
key[0] = alphas[i];
key[1] = '0円';
// Create new test hash
string crackhash = crypt(key, salt);
// Compare hashes
if(strcmp(crackhash, hash) == 0)
{
printf("%s \n", key); // Array of chars is a string
return true;
}
}
// 2 char pw test
for (int i = 0; i < strlen(alphas); i++)
{
key[0] = alphas[i];
for (int j = 0; j < strlen(alphas); j++)
{
key[1] = alphas[j];
key[2] = '0円';
char* crackhash = crypt(key, salt);
if(strcmp(crackhash, hash) == 0)
{
printf("%s \n", key);
return true;
}
}
}
// 3 char pw test
for (int i = 0; i < strlen(alphas); i++)
{
key[0] = alphas[i];
for (int j = 0; j < strlen(alphas); j++)
{
key [1] = alphas[j];
for (int k = 0; k < strlen(alphas); k++)
{
key[2] = alphas[k];
key[3] = '0円';
char* crackhash = crypt(key, salt);
if(strcmp(crackhash, hash) == 0)
{
printf("%s \n", key);
return true;
}
}
}
}
// 4 char pw test
for (int i = 0; i < strlen(alphas); i++)
{
key[0] = alphas[i];
for (int j = 0; j < strlen(alphas); j++)
{
key [1] = alphas[j];
for (int k = 0; k < strlen(alphas); k++)
{
key [2] = alphas[k];
for (int l = 0; l < strlen(alphas); l++)
{
key[3] = alphas[l];
key[4] = '0円';
char* crackhash = crypt(key, salt);
if(strcmp(crackhash, hash) == 0)
{
printf("%s \n", key);
return true;
}
}
}
}
}
// 5 char pw test
for (int i = 0; i < strlen(alphas); i++)
{
key[0] = alphas[i];
for (int j = 0; j < strlen(alphas); j++)
{
key [1] = alphas[j];
for (int k = 0; k < strlen(alphas); k++)
{
key [2] = alphas[k];
for (int l = 0; l < strlen(alphas); l++)
{
key[3] = alphas[l];
for (int m = 0; m < strlen(alphas); m++)
{
key[4] = alphas[m];
key[5] = '0円';
char* crackhash = crypt(key, salt);
if(strcmp(crackhash, hash) == 0)
{
printf("%s \n", key);
return true;
}
}
}
}
}
}
return false;
}
-
1\$\begingroup\$ Do not edit your question once an answer has been posted. (See What should I do when someone answers my question?. If you have a new problem with updated code, ask a new question. \$\endgroup\$1201ProgramAlarm– 1201ProgramAlarm2019年08月09日 02:58:46 +00:00Commented Aug 9, 2019 at 2:58
4 Answers 4
Iterating over all possible keys
You have the problem that you need to nest more and more loops for longer passwords.
Instead of having nested for-loops, where each loop has an iterator that represents one character of the key, you should treat the whole key as one single iterator. For example, with a key of length 5, the lowest value of that iterator is "aaaaa"
, and the highest is "ZZZZZ"
. Now think of how you write a regular for-loop, and instead of int i
, use char key[]
as the iterator variable. Then, you have to check whether the key is valid; let's assume we will set it to the empty string if we reached the end. At the end of the loop we have to increment our iterator. That's too much work to do inside a single for-statement, but we can write functions for it. So the for-loop should look like:
for(char key[] = "aaaaa"; key[0] != '0円'; increment(key)) {
char *crackhash = crypt(key, salt);
if(strcmp(crackhash, hash) == 0) {
printf("%s\n", key);
return true;
}
}
The function increment()
should look like this:
void increment(char *key) {
static const char *alphas = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
/* Increment the first character of key.
If it was 'Z', then reset the first character to 'a',
and continue with the next character, otherwise exit.
If every character is 'Z', make the key an empty string.
*/
for (int i = 0; key[i] != '0円'; i++) {
char *value = strchr(alphas, key[i]);
ptrdiff_t pos = value - alphas;
pos++;
if (alphas[pos]) {
key[i] = alphas[pos];
break;
} else {
if (key[i + 1] == '0円') {
key[0] = '0円';
} else {
key[i] = alphas[0];
}
}
}
}
The first time increment()
is called with a key of length 5, key
will be "aaaaa"
. The function will then look at the first character, and look up where in the string alphas
this character appears. Since 'a'
is the first character in alphas
, pos
will be 0
. It then increments pos
to 1
, and will write back alphas[pos]
, which is 'b'
, to the first position in key
. Thus, the contents of key
will then be "baaaa"
. At this point, it breaks out of the for-loop and returns. The next time increment()
will be called, it will set key
to "caaaa"
, and so on, until it is "Zaaaa"
. When increment()
is called with that value of the key, it will notice that after incrementing pos
, alphas[pos]
is the NUL character at the end of the the string alphas
. It then sets the first character of key
to the first character in alphas
('a'
), and instead of breaking out of the for loop, it will continue with the second character of 'key'
. Here it does exactly the same as it did for the first character, so it sees an 'a'
and turns it into a 'b'
. So after "Zaaaa"
, key
will now become "abaaa"
. Next time it will again only increment the first character, until it becomes "Zbaaa"
, then the next key will be "acaaa"
, and so on until it reaches "ZZaaa"
. At that time, the for-loop goes all the way to the third character in key
, and the next value will be "aabaa"
. This goes on and on until key
is "ZZZZZ"
; when increment()
is called with this value, it will reach the point where it sees that there is no sixth character to increment (key[i + 1] == '0円'
is true), and will then set the first character of key
to the NUL byte, which means key
will become an empty string. This is the signal that it has completed iterating over all possible keys.
Note how we now just have two for loops, one in increment()
and one in the main function. Also note that nowhere did we actually have to specify the length of the key, so the above code works for any length of key! So we can write just one more, outer for-loop to check keys of all possible sizes:
char key[MAX_LEN + 1] = "";
for(int len = 1; len <= MAX_LEN; len++) {
for(memset(key, 'a', len); key[0] != '0円'; increment(key)) {
char *crackhash = crypt(key, salt);
...
}
}
The call to memset()
is used to set the initial value for the key. It sets the first len
bytes of key
to the value 'a'
. So with len
being 1, this results in key
becoming the string "a"
. If len is 2, then key
will become "aa"
, and so on. It uses the fact that all the other bytes have already been initialized to zero by char key[MAX_LEN + 1] = ""
.
-
1\$\begingroup\$ I don't know if this has a name. It's just for-loops. The only thing special is that the iterator variable is a string, and incrementing a string is just something you don't do often. Maybe it is analogous to en.wikipedia.org/wiki/Arbitrary-precision_arithmetic. \$\endgroup\$G. Sliepen– G. Sliepen2019年08月13日 05:46:56 +00:00Commented Aug 13, 2019 at 5:46
-
\$\begingroup\$ never seen MAX_LEN before, can you explain the usage in this context? It's giving errors. \$\endgroup\$stiction– stiction2019年08月14日 21:25:51 +00:00Commented Aug 14, 2019 at 21:25
-
1\$\begingroup\$ It's just a placeholder for the maximum length of keys you want to test. \$\endgroup\$G. Sliepen– G. Sliepen2019年08月14日 21:27:55 +00:00Commented Aug 14, 2019 at 21:27
-
\$\begingroup\$ Can you articulate what's going on in the increment() function? It works, but this seems cryptic for someone new to CS and C. Also, can you explain the use of memset(key, 'a', len) in this context? \$\endgroup\$stiction– stiction2019年08月17日 01:34:51 +00:00Commented Aug 17, 2019 at 1:34
stderr
I don't know the requirements of the problem, but in a real program, error messages should be written to stderr
:
fprintf(stderr, "[ERROR] Whatever\n");
char *
If CS50 allows it, use char *
instead of string
. It's the same (CS50 writes a typedef
in <cs50.h>
), but you will have it more present.
Thanks to @TobySpeight for noting that it's not only something visual, but also the typedef
(string
) has problems when mixed with const
(the behaviour is non-obvious; const
is applied to the pointer, and not to the variable type).
Don't leave whitespace in a line ending
printf("%s \n", key);
This code will write a space at the end of a line, which is useless, and sometimes not good at all.
DRY
Don't repeat yourself: the 5 loops are very similar. In fact the inner loop of each one is identical and could be put inside a function. Repetition usually leads to typos and bugs.
Compiler specific optimizations
AFAIK, CS50 uses Clang, which can apply optimizations beyond what you could write in standard C. To do that, you need to write code that only some compilers will understand (you can write that code inside #if
s to support other compilers, or decide that your code will only support a specific set of compilers). GCC, Clang, and possibly other compilers make use of function attributes to tell the compiler that a function is able to be optimized more than what it may think.
For example, @chux told in his answer that good compilers know that even if you write strlen
many times, as long as the string isn't modified in between, the compiler can reuse the length from the last call. The compiler uses one of those attributes to know that (it isn't black magic). This is what the prototype for strlen
might look like in Clang's libc:
__attribute__((pure))
size_t strlen(const char *s);
Your function crack has the same property: if you call it twice without modifying the string, the return value will be the same (and it has no side effects). You could write your prototype as:
__attribute__((pure))
bool crack(const char *hash);
You could write a macro to support compilers that don't understand __attribute__
:
#if defined(__GNUC__)
#define pure__ __attribute__((pure))
#else
#define pure__
#endif
If you want other compilers to be able to optimize your functions as pure but use different constructions for that, you can extend this macro. The usage would be:
pure__
bool crack(const char *hash);
Note: there are other attributes that would be applicable to that function. I leave it to you to find them (read the link above).
static
Functions that are private to a .c
file should be declared static
. crack
is one of those:
__attribute__((pure))
static
bool crack(const char *hash);
...
static
bool crack(const char *hash)
{
...
}
-
1\$\begingroup\$ Why do you say to prefer char*? \$\endgroup\$Josiah– Josiah2019年08月03日 06:30:17 +00:00Commented Aug 3, 2019 at 6:30
-
4\$\begingroup\$ @Josiah Because it's the reality.
string
doesn't exist in C. \$\endgroup\$alx - recommends codidact– alx - recommends codidact2019年08月03日 07:33:41 +00:00Commented Aug 3, 2019 at 7:33 -
\$\begingroup\$ @Josiah The
bool
variable type also doesn't exist in C unless stdbool.h is included. \$\endgroup\$2019年08月03日 17:04:14 +00:00Commented Aug 3, 2019 at 17:04 -
1\$\begingroup\$ @pacmaninbw But
_Bool
exists always; it's not the same (since C99, of course). It would have been a good comparison 21 years ago, but I never use C89, and I would recommend everyone to do the same. Even in Linux they are trying to move to C11 :) \$\endgroup\$alx - recommends codidact– alx - recommends codidact2019年08月03日 18:34:57 +00:00Commented Aug 3, 2019 at 18:34 -
1\$\begingroup\$ To clarify, I meant the Linux Kernel code, which now uses GNU C89 (
gnu89
), but they want to move tognu11
. In Linux, using GCC, C17 is almost fully supported. @pacmaninbw I only wrote one C program before 1999, and none before 1989 (I was born in '93), so I practically learnt C already withbool
(lucky me); it's very sad that Windows has such bad support for C (not only C99, but C), many decades after. \$\endgroup\$alx - recommends codidact– alx - recommends codidact2019年08月03日 20:19:13 +00:00Commented Aug 3, 2019 at 20:19
I agree with everything @CacahueteFrito mentioned in their answer. There is one observation there I would like to expand on, and one observation I would like to add.
Use the Type size_t When Indexing Arrays
Each of the for
loops in the function bool crack(char *hash)
uses a type int
variable as the loop control variable. When indexing into arrays, it is better to use the type size_t which is defined as unsigned int in both C and C++. The C library functions strlen and sizeof
return the type size_t
.
One of the benefits of using an unsigned index is that a negative number can not be used to index an array. Since arrays in C start at zero this is important and prevent bugs. A second possible benefit of using size_t
is that it may prevent warning messages about type mismatches from some compilers when using functions such as strlen
or sizeof
.
Complexity of Functions
When @CacahueteFrito mentions the DRY Programming Principle he is providing a fix for function complexity. There is another programming principle that is involved here as well, the Single Responsibility Principle. The single responsibility principle states
... every module, class, or function should have responsibility over a single part of the functionality provided by the software.
While the single responsibility principle primarily applies to object oriented programming, the function part of the statement refers to non object oriented programming as well. Well written functions will have a single goal to accomplish and will generally be short. In the code there are comments of the form // 2 char pw test
, each of the code blocks following these comments could be a function. Well named functions here would reduce the need for these comments and the overall effect would be a much simpler version of the function `bool crack(char* hash).
Small simple functions are much easier to debug and maintain then functions that are over 100 lines long. This would also decrease the scope of variables such as char key[6]
.
-
3\$\begingroup\$ I agree in everything but the type. I agree that there is (actually, are) a specialized type for arrays, though. To avoid giving my subjective point of view, here are both types with their pros and cons:
ptrdiff_t
(signed) vssize_t
(unsigned). But the point is there: don't useint
for accessing arrays. \$\endgroup\$alx - recommends codidact– alx - recommends codidact2019年08月03日 18:53:37 +00:00Commented Aug 3, 2019 at 18:53 -
\$\begingroup\$ @CacahueteFrito nice link! One gets both views from the answers. \$\endgroup\$2019年08月03日 19:13:30 +00:00Commented Aug 3, 2019 at 19:13
-
\$\begingroup\$ Appreciate the articulation. My goal was to have working code, and then see how I can optimize it from there. As for DRY principles, I am going to start with the easiest abstraction here which is the part where it generates a crackhash and tests versus the hash into a function. The harder part conceptually, for me as a novice programmer, is how I can write a function that generates another char in the key, and adds the next loop until the length of the key reaches a certain point. \$\endgroup\$stiction– stiction2019年08月06日 02:30:44 +00:00Commented Aug 6, 2019 at 2:30
-
\$\begingroup\$
size_t
(andstd::size_t
) is an unsigned type, but not necessarilyunsigned int
. That's kind of the point: it can beunsigned long
orunsigned long long
as appropriate for the process's address space. \$\endgroup\$Toby Speight– Toby Speight2019年08月06日 08:44:46 +00:00Commented Aug 6, 2019 at 8:44
Below can lead to terrible performance.
A good compiler will recognize alphas
does not change and so perform strlen(alphas)
only once - which time complexity is the length of the string - say N
.
A lesser compiler will execute strlen(alphas)
each and every time, thus the 5 char pw test
loops will cost N*N*N*N*N
time!
// 5 char pw test
for (int i = 0; i < strlen(alphas); i++)
{
key[0] = alphas[i];
for (int j = 0; j < strlen(alphas); j++)
{
....
Instead simply use
size_t n = strlen(alphas);
for (size_t i = 0; i < n; i++)
{
key[0] = alphas[i];
for (size_t j = 0; j < n; j++)
{
....
Or idiomatic C and look for the null character.
for (size_t i = 0; alphas[i]; i++)
{
key[0] = alphas[i];
for (size_t j = 0; alphas[j]; j++)
{
....
Code only needed once. Recommend to move key[N] = '0円';
to just before each set of loops.
key[5] = '0円';
Minor: Use const
Some compiler will emit more efficient code when told that hash
does not point to changeable data. Some compilers may deduce this already. Also see @Toby Speight
// bool crack(string hash)
bool crack(const char *hash)
-
\$\begingroup\$ Awesome stuff here. These are optimizations I was hoping to learn. The 5 char password hashes were indeed taking a significant amount of time to crack. Going to implement these changes and note the difference in run time. Can you elaborate on moving the null character for the key array to outside the loops. Since each iteration is is a different length (example one password is OK another is YES), how would you tell it to be at the end regardless of the number of chars in the cracked key. \$\endgroup\$stiction– stiction2019年08月06日 02:39:12 +00:00Commented Aug 6, 2019 at 2:39
-
\$\begingroup\$ @stiction Every iteration within
// 5 char pw test for (int i = 0; i < strlen(alphas); i++)
is 5 sokey[5] = '0円';
can precede those loops. Every iteration within// 4 char pw test for (int i = 0; i < strlen(alphas); i++)
is 4 sokey[4] = '0円';
can precede those loops. ... \$\endgroup\$chux– chux2019年08月06日 04:25:47 +00:00Commented Aug 6, 2019 at 4:25 -
\$\begingroup\$ Typo:
for (size_t j = 0; alphas[i]; j++)
->for (size_t j = 0; alphas[j]; j++)
\$\endgroup\$alx - recommends codidact– alx - recommends codidact2019年08月06日 07:15:44 +00:00Commented Aug 6, 2019 at 7:15 -
\$\begingroup\$ @stiction CS50 uses Clang (AFAIK), which is a good compiler; you shouldn't notice any difference with the
strlen
calls. In fact, what Clang (probably) uses to know that it only needs to callstrlen
once is something special:__attribute__((pure))
. I will add that to my answer, because you can apply it to your function. \$\endgroup\$alx - recommends codidact– alx - recommends codidact2019年08月06日 07:51:42 +00:00Commented Aug 6, 2019 at 7:51 -
2\$\begingroup\$
const string
is reallychar *const
, notchar const*
(thus demonstrating the problems with that typedef). \$\endgroup\$Toby Speight– Toby Speight2019年08月06日 08:42:11 +00:00Commented Aug 6, 2019 at 8:42