Cryptopals Challenge 1.3: Single-Byte XOR Cipher

This challenge was the first one that actually felt like real cryptography. We have a hex string that was XOR’d against a single unknown byte, and we need to figure out which byte it was and decrypt the message.

Challenge

This is where things start getting interesting. Instead of just converting between formats, we are now actually breaking a cipher.

Description

Here is what the challenge asks us to do:

The hex encoded string:

1b37373331363f78151b7f2b783431333d78397828372d363c78373e783a393b3736

… has been XOR’d against a single character. Find the key, decrypt the message.

You can do this by hand. But don’t: write code to do it for you.

How? Devise some method for “scoring” a piece of English plaintext. Character frequency is a good metric. Evaluate each output and choose the one with the best score.

So we have a cypher-text that was encrypted by XOR’ing every byte with the same single character. Since a single byte can only be a value from 0 to 255, there are only 256 possible keys. We can just try all of them and see which one produces something that looks like English.

The real question is: how do we automatically tell the computer what “looks like English” means?

Solution

My First Idea: Use an NLP Library

The first thing that came to my mind was to create an ASCII map, XOR against the encoded string with every possible key, and then use some NLP (Natural Language Processing) library to evaluate which decrypted output makes the most sense as an English sentence. They basically use frequency map to do those stuff anyways, and that’s what we need.

But I quickly realized this would be unnecessarily heavy. We don’t need a full NLP library to figure out if something looks like English — we just need a simple heuristic. The challenge itself hints at this: “Character frequency is a good metric.”

The Scoring Approach

Instead of NLP, I decided to write a simple scoring function. The idea is:

  • English text has lots of spaces, so spaces get a high score
  • Common English letters like e, t, a, o, i, n appear frequently, so they get bonus points
  • Regular letters still get some points
  • Punctuation and digits are fine too
  • But if we see unprintable characters (stuff below ASCII 32 or above 126), that’s a strong signal that this is not English, so we penalize heavily

Here is the scoring function I came up with:

int score_calc(const std::string &text)
{
    int score = 0;
    for (char c : text) {
        // Spaces are very common in English
        if (c == ' ')
            score += 3;
        // Common letters (e t a o i n)
        else if (c == 'e' || c == 'E' || c == 't' || c == 'T' ||
                 c == 'a' || c == 'A' || c == 'o' || c == 'O' ||
                 c == 'i' || c == 'I' || c == 'n' || c == 'N')
            score += 2;
        // Less common but still valid letters
        else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            score += 1;
        // Punctuation and digits are acceptable
        else if ((c >= '0' && c <= '9') || c == '.' || c == ',' ||
                 c == '!' || c == '?' || c == '\'' || c == '"' ||
                 c == ';' || c == ':')
            score += 1;
        // Penalise unprintable characters
        else if (c < 32 || c > 126)
            score -= 5;
        else
            score += 0;
    }
    return score;
}

The weighting is pretty intuitive. Spaces get 3 points because English has a space roughly every 5 characters. The “ETAOIN” letters (the most frequent in English) get 2 points. Everything else that is printable gets 1 point. And unprintable garbage gets -5 because if we are seeing control characters, we are almost certainly looking at a wrong key.

Putting It All Together

Now we just need to try all 256 possible keys, XOR each one against our cypher-text, score the result, and keep track of the best one.

First, the usual setup — hex decode the input to raw bytes:

#include <iostream>
#include <cryptopp/hex.h>
#include <cryptopp/filters.h>
#include <cryptopp/base64.h>

using namespace CryptoPP;
std::string usr_input = "1b37373331363f78151b7f2b783431333d78397828372d"
                        "363c78373e783a393b3736";
                     
std::string cyphertext;
StringSource ss1(usr_input, true,
                 new HexDecoder(new StringSink(cyphertext)));

Then we brute-force all 256 possible single-byte keys. For each key, we XOR every byte of the cypher-text with that key (same idea as Challenge 2, but now every byte gets XOR’d with the same key instead of a different one):

int best_score = 0;
char best_key = 0;
std::string best_plaintext;

for (int key = 0; key < 256; key++) {
    std::string temp_plaintext;
    for (size_t i = 0; i < cyphertext.size(); i++) {
        temp_plaintext += cyphertext[i] ^ key;
    }
 
    int current_score = score_calc(temp_plaintext);
 
    if (current_score > best_score) {
        best_score = current_score;
        best_key = key;
        best_plaintext = temp_plaintext;
    }
}

And finally, print the winner:

std::cout << "Best key: " << (int)best_key << " ('" << best_key << "')\n";
std::cout << "Best score: " << best_score << "\n";
std::cout << "Message: " << best_plaintext << "\n";

Running this gives us the key 88 (which is the character 'X') and the decrypted message. I won’t spoil it here — go run the code yourself!

You can reach the full code at my GitHub: https://github.com/AydoganArslantash/Cryptopals-Solutions/blob/main/set1/chal3.cpp