Cryptopals Challenge 1.6: Breaking Repeating-Key XOR

I’ll be honest, this one took a bit of time for me to solve it. The goal: you’re given a base64-encoded file that was encrypted with repeating-key XOR, and you have to figure out both the key length and the key itself, with no hints other than the algorithm description.

The General Idea

Before touching any code, it helps to write down how to solve the question. When you XOR a byte with a key byte, the result depends on both values. But here’s the trick: if you take two bytes that were encrypted with the same key byte and XOR them together, the key cancels out and you’re left with just the XOR of two plaintext bytes. English text has predictable statistical properties, so the XOR of two English characters tends to have fewer set bits than the XOR of two random bytes. This is the entire basis of the attack.

So the plan is:

  1. Find the key length (KEYSIZE) using Hamming distance
  2. Split the ciphertext into KEYSIZE-sized blocks and transpose them
  3. Solve each transposed block as single-byte XOR (already done in challenge 3)
  4. Assemble the key bytes into the full key and decrypt

Step 1 — Base64 Decoding

The file comes base64-encoded, so the first thing is decoding it. With Crypto++ you can’t feed an ifstream directly into StringSource, so you read the file into a string first using iterators:

std::ifstream file("6.txt");
std::string encoded((std::istreambuf_iterator<char>(file)),
                     std::istreambuf_iterator<char>());
std::string decoded;
StringSource ss(encoded, true,
                new Base64Decoder(new StringSink(decoded)));

Step 2 — Hamming Distance

The challenge asks you to implement Hamming distance between two byte strings. Hamming distance is just the number of bits that differ between them. The elegant way to compute this: XOR the two bytes (giving 1 wherever bits differ) and then count the set bits with popcount.

I initially tried to implement this without looking up how to do so and I coincidentally came out with Hamming weight which is counting set bits from zero. That’s related but not the same thing. Hamming distance is between two values, Hamming weight is the distance of a single value from zero.

int hammingDistance(const std::vector<uint8_t>& a,
                    const std::vector<uint8_t>& b) {
    int dist = 0;
    for (size_t i = 0; i < a.size(); i++) {
        dist += __builtin_popcount(a[i] ^ b[i]);
    }
    return dist;
}

The challenge gives a test case: "this is a test" vs "wokka wokka!!!" should give 37. Running this function against those two strings confirmed it was correct before moving on. (yay!)

Step 3 — Finding KEYSIZE

This is where I got stuck the longest. The idea is: for each candidate KEYSIZE, take chunks of that size from the ciphertext, compute their normalized Hamming distance, and the KEYSIZE with the lowest score is probably correct.

Normalizing (dividing by KEYSIZE) is necessary for the same reason you average when comparing groups of different sizes, larger KEYSIZEs produce higher raw distances just because you’re comparing more bytes, not because they’re worse guesses.

My first attempt only compared two chunks, which was too noisy and kept returning wrong answers like KEYSIZE 2 or 5. The fix was to average across all consecutive block pairs in the entire ciphertext:

for (int i = 2; i <= 40; i++) {
    if (2*i > (int)decoded.size()) continue;
 
    int numBlocks = decoded.size() / i;
    double score = 0;
    int count = 0;
 
    for (int j = 0; j < numBlocks - 1; j++) {
        std::vector<uint8_t> chunk1(decoded.begin() + j*i,
                                    decoded.begin() + (j+1)*i);
        std::vector<uint8_t> chunk2(decoded.begin() + (j+1)*i,
                                    decoded.begin() + (j+2)*i);
        score += (double)hammingDistance(chunk1, chunk2) / i;
        count++;
    }
 
    score /= count;
    scores.push_back({score, i});
}

After sorting by score, KEYSIZE 29 came out on top. The challenge also hints at printing the top 2-3 candidates, which is good practice in case the first guess turns out to be wrong.

Step 4 — Transposing the Blocks

Once you know KEYSIZE, you split the ciphertext into KEYSIZE-length blocks and transpose them. Transposing means: instead of [block0_byte0, block0_byte1, ...], you make new blocks where the Nth transposed block contains the Nth byte from every original block.

The reason this works is every byte at position N was encrypted with the same key byte (key[N % KEYSIZE]). So transposed block N is just a single-byte XOR problem in disguise.

So let’s implement it like this:

std::vector<std::vector<uint8_t>> transposed(KEYSIZE);
for (int i = 0; i < (int)decoded.size(); i++) {
    transposed[i % KEYSIZE].push_back(decoded[i]);
}

i % KEYSIZE naturally routes each byte to the correct transposed block. Byte 0 goes to block 0, byte 1 to block 1, byte 29 wraps back to block 0, and so on.

Step 5 — Solving Each Block

This is just challenge 3 wrapped in a function. For each transposed block, try all 256 possible key bytes, score the result using character frequency, and return the winner:

char findBestKey(const std::vector<uint8_t>& block) {
    int best_score = 0;
    char best_key = 0;
    for (int key = 0; key < 256; key++) {
        std::string temp_plaintext;
        for (size_t i = 0; i < block.size(); i++)
            temp_plaintext += block[i] ^ key;
        int current_score = score_calc(temp_plaintext);
        if (current_score > best_score) {
            best_score = current_score;
            best_key = key;
        }
    }
    return best_key;
}

Running this across all 29 transposed blocks and assembling the results gave the key: Terminator X: Bring the noise. The decrypted plaintext was Vanilla Ice’s “Play That Funky Music”.

You can reach the full code at my GitHub: https://github.com/AydoganArslantash/Cryptopals-Solutions/blob/main/set1/chal6.cpp