ZITKT'L FG EKNOFU OF DQZITDQZOEL¶
References¶
- Introduction to Cryptography with Coding Theory (Second Edition): a textbook providing comprehensive coverage of cryptographic principles and coding theory, and a strong theoretical foundation for understanding classical and modern cryptographic systems.
- Practical Cryptography: an online resource offering insights into various cryptographic techniques and their practical applications. Its section on simple substitution ciphers was particularly helpful in understanding the real-world implications of these classical methods.
- Google Scholar: a platform for accessing scholarly articles and papers, instrumental in reviewing recent research and advancements in cryptanalysis techniques and automated methods.
- Project Gutenberg: an online library of free eBooks and the first provider of free electronic books. Its founder, Michael Hart, invented the eBook in 1971.
Example: Substitution Cipher
# Example of a substitution cipher
alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
key = 'QWERTYUIOPASDFGHJKLZXCVBNM'
message = "THERE'S NO CRYING IN MATHEMATICS"
encrypted_message = ''.join([key[alphabet.index(char)] if char in alphabet else char for char in message])
print('Encrypted:', encrypted_message)
Encrypted: ZITKT'L FG EKNOFU OF DQZITDQZOEL
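Decryption inverts the same key: look up each ciphertext letter's position in the key and read off the standard alphabet. A minimal sketch reusing the names from the cell above:
# Decrypt by inverting the lookup: position in key -> letter in alphabet
decrypted_message = ''.join([alphabet[key.index(char)] if char in alphabet else char for char in encrypted_message])
print('Decrypted:', decrypted_message)  # THERE'S NO CRYING IN MATHEMATICS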
History and Background of Substitution Ciphers¶
Substitution ciphers are one of the oldest methods of encryption, dating back to ancient times. They involve replacing each letter in the plaintext with another letter, number, or symbol, based on a fixed system or key.
Origins and Development¶
Ancient Roots: The use of substitution ciphers can be traced back to ancient civilizations, such as the Greeks and Romans. One of the earliest known substitution ciphers is the Caesar Cipher, named after Julius Caesar, who used it to encrypt military messages.
Middle Ages: During the Middle Ages, substitution ciphers were widely used in diplomatic and military communications. The Vigenère Cipher, introduced in the 16th century, was a significant advancement that employed a repeating key to increase complexity.
Renaissance and Beyond: As literacy and the exchange of information increased during the Renaissance, the need for more secure encryption methods became apparent. Substitution ciphers continued to evolve, incorporating more sophisticated techniques.
Historical Significance¶
Cryptanalysis Challenges: Despite their simplicity, substitution ciphers posed significant challenges to cryptanalysts for centuries. The development of frequency analysis, attributed to the Arab scholar Al-Kindi in the 9th century, marked a turning point in the ability to break these ciphers.
Modern Influence: Although simple substitution ciphers are no longer secure by today's standards, they laid the groundwork for modern cryptography. They introduced key concepts such as encryption keys, letter frequency analysis, and the importance of secrecy.
Importance in Cryptography¶
Substitution ciphers play a foundational role in the history of cryptography, illustrating the ongoing battle between code makers and code breakers. Understanding these early methods provides insight into the development of more complex encryption systems and the continuous evolution of cryptanalysis techniques.
The exploration of automated cryptanalysis, as presented in this notebook, builds upon these historical foundations, demonstrating how modern algorithms and technologies can unravel even the most sophisticated substitution ciphers.
Theoretical Complexity of Substitution Ciphers¶
Substitution ciphers operate by replacing each letter of the plaintext with another letter from the alphabet. A simple substitution cipher uses a fixed permutation of the alphabet as the key. Theoretically, this results in a vast number of possible key combinations.
Number of Possible Keys¶
Mathematical Calculation: For a substitution cipher using the English alphabet, the key is a permutation of the 26 letters: the first letter can map to any of 26 letters, the second to any of the remaining 25, and so on. This results in $26!$ (factorial of 26) possible permutations of the alphabet.
Numerical Perspective: The number of possible keys can be calculated as: $$ 26! = 26 \times 25 \times 24 \times \ldots \times 2 \times 1 $$
Approximate Value: The value of $26!$ is approximately $4.03 \times 10^{26}$, which is an astronomically large number.
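This figure is easy to check directly; a quick sanity check using Python's standard library:
import math
# Exact size of the substitution-cipher key space (26!)
print(f"26! = {math.factorial(26):,}")                      # 403,291,461,126,605,635,584,000,000
print(f"which is approximately {math.factorial(26):.2e}")   # ~4.03e+26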
Practical Limitations¶
Real-world Constraints: Despite the theoretical complexity, the actual security provided by substitution ciphers is limited. This is due to patterns in the language, such as letter frequency, which can be exploited by cryptanalysts.
Predictable Patterns: Languages like English have predictable patterns, such as the high frequency of certain letters (e.g., 'E', 'T', 'A') and common letter combinations (e.g., 'TH', 'HE'). These patterns reduce the effective key space that needs to be searched.
Cryptanalysis Techniques: Techniques like frequency analysis and pattern recognition significantly reduce the effort required to break substitution ciphers, making them insecure for modern applications.
Implications for Cryptanalysis¶
The existence of $26!$ possible keys underscores the theoretical complexity of substitution ciphers but also highlights the limitations of relying solely on permutation for security. Modern cryptanalysis techniques exploit language patterns to efficiently break these ciphers, demonstrating the importance of incorporating additional security measures in cryptographic systems.
Understanding the balance between theoretical complexity and practical vulnerabilities is crucial for advancing cryptographic research and developing robust encryption methods.
Mathematical Foundations of Cryptanalysis¶
Frequency Analysis¶
Frequency analysis is a cryptanalysis technique based on the study of the frequency of letters or groups of letters in a ciphertext. In the English language, certain letters and combinations of letters appear more frequently than others.
Key Points:
- Each language has a characteristic letter frequency distribution.
- By comparing the frequency of letters in the ciphertext with expected frequencies in the language, we can make educated guesses about the substitution used.
Example: Letter Frequency in English
%matplotlib inline
import matplotlib.pyplot as plt
def load_ngrams(file_path):
"""Load n-grams and their counts from a given file."""
ngram_counts = {}
with open(file_path, 'r') as f:
for line in f:
ngram, count = line.split()
ngram_counts[ngram] = int(count)
return ngram_counts
def calculate_frequency_percentages(ngram_counts):
"""Calculate frequency percentages for n-grams."""
total_count = sum(ngram_counts.values())
return {ngram: (count / total_count) * 100 for ngram, count in ngram_counts.items()}
def plot_frequencies(frequencies, title):
"""Plot n-gram frequencies as a bar chart."""
sorted_ngrams = sorted(frequencies.items(), key=lambda x: x[1], reverse=True)
ngrams, freq_percentages = zip(*sorted_ngrams[:10]) # Show top 10 for better visualization
plt.figure(figsize=(10, 6))
plt.bar(ngrams, freq_percentages, color='skyblue')
plt.title(title)
plt.xlabel('N-gram')
plt.ylabel('Frequency (%)')
plt.xticks(rotation=45)
plt.show()
# Load and process n-grams
MONOGRAMS = load_ngrams('data/english_monograms.txt')
BIGRAMS = load_ngrams('data/english_bigrams.txt')
TRIGRAMS = load_ngrams('data/english_trigrams.txt')
QUADGRAMS = load_ngrams('data/english_quadgrams.txt')
# Calculate frequencies
monogram_frequencies = calculate_frequency_percentages(MONOGRAMS)
bigram_frequencies = calculate_frequency_percentages(BIGRAMS)
trigram_frequencies = calculate_frequency_percentages(TRIGRAMS)
quadgram_frequencies = calculate_frequency_percentages(QUADGRAMS)
# Plot frequencies
plot_frequencies(monogram_frequencies, 'Monogram Frequency in English')
plot_frequencies(bigram_frequencies, 'Bigram Frequency in English')
plot_frequencies(trigram_frequencies, 'Trigram Frequency in English')
plot_frequencies(quadgram_frequencies, 'Quadgram Frequency in English')
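As a bridge between these expected frequencies and an actual attack, here is a minimal sketch (not part of the original pipeline) that pairs ciphertext letters with English letters purely by frequency rank, using the MONOGRAMS counts loaded above. On short texts this yields only a noisy first guess that a cryptanalyst would then refine:
from collections import Counter
def guess_mapping_by_frequency(ciphertext, expected_counts):
    """Pair ciphertext letters with English letters by frequency rank."""
    ct_letters = [c for c in ciphertext.upper() if c.isalpha()]
    ct_ranked = [c for c, _ in Counter(ct_letters).most_common()]
    en_ranked = sorted(expected_counts, key=expected_counts.get, reverse=True)
    return dict(zip(ct_ranked, en_ranked))
# Noisy first guess for the (very short) title ciphertext
print(guess_mapping_by_frequency("ZITKT'L FG EKNOFU OF DQZITDQZOEL", MONOGRAMS))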
Pattern Recognition¶
Pattern recognition involves identifying common linguistic patterns, such as repeated letters or sequences of letters, in the ciphertext. This can help identify likely substitutions.
Key Points:
- Identifying common patterns like double letters (e.g., "LL" in "HELLO") can help narrow down possible substitutions.
- Recognizing common word patterns (e.g., three-letter words like "THE") aids in deciphering; a short pattern-search sketch follows the example below.
Example: Double Letters and Common Words
import string
import random
def generate_random_key():
"""Generate a random substitution key."""
alphabet = list(string.ascii_uppercase)
random_key = alphabet[:]
random.shuffle(random_key)
return ''.join(random_key)
def encrypt_with_key(plaintext, key):
"""Encrypt the plaintext using the given substitution key."""
alphabet = string.ascii_uppercase
key_map = {alphabet[i]: key[i] for i in range(len(alphabet))}
ciphertext = ''.join(key_map.get(char, char) for char in plaintext.upper())
return ciphertext
# Generate a random key
original_key = generate_random_key()
print("Random Key:", original_key)
Random Key: TGPCOLVYBSKHWUENJQRFMIDAZX
# Example text for analysis
sample_text = "Thus determined, she reconnoitred the field, and practised her address so successfully, that in less than half an hour she was loaded with ermine and embroidery, and disposed to retreat with her burden, when her regards were solicited by a splendid bundle, which she descried at some distance lying on the ground. This was no other than an unhappy officer of hussars; who, after having the good fortune to take a Turkish standard, was desperately wounded in the thigh, and obliged to quit his horse; finding himself in such a helpless condition, he had wrapped his acquisition round his body, that whatever might happen, he and his glory should not be parted; and thus shrouded, among the dying and the dead, he had observed the progress of our heroine, who stalked about the field, like another Atropos, finishing, wherever she came, the work of death. He did not at all doubt, that he himself would be visited in the course of her peregrinations, and therefore provided for her reception, with a pistol ready cocked in his hand, while he lay perdue beneath his covert, in all appearance bereft of life. He was not deceived in his prognostic; she no sooner eyed the golden crescent than, inflamed with curiosity or cupidity, she directed thitherward her steps, and discerning the carcase of a man, from which, she thought, there would be a necessity for disengaging it, she lifted up her weapon, in order to make sure of her purchase; and in the very instant of discharging her blow, received a brace of bullets in her brain.".upper()
print("Sample Text: ", sample_text, "\n")
# Encrypt the plaintext using the random key
ciphertext = encrypt_with_key(sample_text, original_key)
print("Ciphertext: ", ciphertext)
Sample Text: THUS DETERMINED, SHE RECONNOITRED THE FIELD, AND PRACTISED HER ADDRESS SO SUCCESSFULLY, THAT IN LESS THAN HALF AN HOUR SHE WAS LOADED WITH ERMINE AND EMBROIDERY, AND DISPOSED TO RETREAT WITH HER BURDEN, WHEN HER REGARDS WERE SOLICITED BY A SPLENDID BUNDLE, WHICH SHE DESCRIED AT SOME DISTANCE LYING ON THE GROUND. THIS WAS NO OTHER THAN AN UNHAPPY OFFICER OF HUSSARS; WHO, AFTER HAVING THE GOOD FORTUNE TO TAKE A TURKISH STANDARD, WAS DESPERATELY WOUNDED IN THE THIGH, AND OBLIGED TO QUIT HIS HORSE; FINDING HIMSELF IN SUCH A HELPLESS CONDITION, HE HAD WRAPPED HIS ACQUISITION ROUND HIS BODY, THAT WHATEVER MIGHT HAPPEN, HE AND HIS GLORY SHOULD NOT BE PARTED; AND THUS SHROUDED, AMONG THE DYING AND THE DEAD, HE HAD OBSERVED THE PROGRESS OF OUR HEROINE, WHO STALKED ABOUT THE FIELD, LIKE ANOTHER ATROPOS, FINISHING, WHEREVER SHE CAME, THE WORK OF DEATH. HE DID NOT AT ALL DOUBT, THAT HE HIMSELF WOULD BE VISITED IN THE COURSE OF HER PEREGRINATIONS, AND THEREFORE PROVIDED FOR HER RECEPTION, WITH A PISTOL READY COCKED IN HIS HAND, WHILE HE LAY PERDUE BENEATH HIS COVERT, IN ALL APPEARANCE BEREFT OF LIFE. HE WAS NOT DECEIVED IN HIS PROGNOSTIC; SHE NO SOONER EYED THE GOLDEN CRESCENT THAN, INFLAMED WITH CURIOSITY OR CUPIDITY, SHE DIRECTED THITHERWARD HER STEPS, AND DISCERNING THE CARCASE OF A MAN, FROM WHICH, SHE THOUGHT, THERE WOULD BE A NECESSITY FOR DISENGAGING IT, SHE LIFTED UP HER WEAPON, IN ORDER TO MAKE SURE OF HER PURCHASE; AND IN THE VERY INSTANT OF DISCHARGING HER BLOW, RECEIVED A BRACE OF BULLETS IN HER BRAIN. Ciphertext: FYMR COFOQWBUOC, RYO QOPEUUEBFQOC FYO LBOHC, TUC NQTPFBROC YOQ TCCQORR RE RMPPORRLMHHZ, FYTF BU HORR FYTU YTHL TU YEMQ RYO DTR HETCOC DBFY OQWBUO TUC OWGQEBCOQZ, TUC CBRNEROC FE QOFQOTF DBFY YOQ GMQCOU, DYOU YOQ QOVTQCR DOQO REHBPBFOC GZ T RNHOUCBC GMUCHO, DYBPY RYO CORPQBOC TF REWO CBRFTUPO HZBUV EU FYO VQEMUC. FYBR DTR UE EFYOQ FYTU TU MUYTNNZ ELLBPOQ EL YMRRTQR; DYE, TLFOQ YTIBUV FYO VEEC LEQFMUO FE FTKO T FMQKBRY RFTUCTQC, DTR CORNOQTFOHZ DEMUCOC BU FYO FYBVY, TUC EGHBVOC FE JMBF YBR YEQRO; LBUCBUV YBWROHL BU RMPY T YOHNHORR PEUCBFBEU, YO YTC DQTNNOC YBR TPJMBRBFBEU QEMUC YBR GECZ, FYTF DYTFOIOQ WBVYF YTNNOU, YO TUC YBR VHEQZ RYEMHC UEF GO NTQFOC; TUC FYMR RYQEMCOC, TWEUV FYO CZBUV TUC FYO COTC, YO YTC EGROQIOC FYO NQEVQORR EL EMQ YOQEBUO, DYE RFTHKOC TGEMF FYO LBOHC, HBKO TUEFYOQ TFQENER, LBUBRYBUV, DYOQOIOQ RYO PTWO, FYO DEQK EL COTFY. YO CBC UEF TF THH CEMGF, FYTF YO YBWROHL DEMHC GO IBRBFOC BU FYO PEMQRO EL YOQ NOQOVQBUTFBEUR, TUC FYOQOLEQO NQEIBCOC LEQ YOQ QOPONFBEU, DBFY T NBRFEH QOTCZ PEPKOC BU YBR YTUC, DYBHO YO HTZ NOQCMO GOUOTFY YBR PEIOQF, BU THH TNNOTQTUPO GOQOLF EL HBLO. YO DTR UEF COPOBIOC BU YBR NQEVUERFBP; RYO UE REEUOQ OZOC FYO VEHCOU PQORPOUF FYTU, BULHTWOC DBFY PMQBERBFZ EQ PMNBCBFZ, RYO CBQOPFOC FYBFYOQDTQC YOQ RFONR, TUC CBRPOQUBUV FYO PTQPTRO EL T WTU, LQEW DYBPY, RYO FYEMVYF, FYOQO DEMHC GO T UOPORRBFZ LEQ CBROUVTVBUV BF, RYO HBLFOC MN YOQ DOTNEU, BU EQCOQ FE WTKO RMQO EL YOQ NMQPYTRO; TUC BU FYO IOQZ BURFTUF EL CBRPYTQVBUV YOQ GHED, QOPOBIOC T GQTPO EL GMHHOFR BU YOQ GQTBU.
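Applying the pattern search described under Key Points to the ciphertext just produced (a minimal illustration; doubled letters and frequent three-letter words are candidates for common English features such as 'LL', 'THE', and 'AND'):
import re
from collections import Counter
# Doubled letters in the ciphertext (candidates for LL, SS, EE, ...)
double_letters = Counter(m.group(0) for m in re.finditer(r'([A-Z])\1', ciphertext))
# Frequent three-letter words (candidates for THE, AND, ...)
three_letter_words = Counter(w for w in re.findall(r'[A-Z]+', ciphertext) if len(w) == 3)
print("Most common doubled letters:", double_letters.most_common(5))
print("Most common 3-letter words:", three_letter_words.most_common(5))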
N-gram Analysis Using spaCy¶
In this section, we perform n-gram analysis on a sample text using spaCy. This library is a powerful natural language processing tool that offers efficient tokenization and linguistic processing capabilities.
Key Concepts¶
- Monograms: Single characters. In cryptography, analyzing the frequency of individual letters can reveal substitution patterns.
- Bigrams: Pairs of consecutive characters. These reveal relationships between adjacent elements, such as common letter pairs.
- Trigrams: Triplets of consecutive characters. These are useful for identifying common sequences or patterns.
- Quadgrams: Groups of four consecutive characters. These provide deeper insight into language structure and context.
Analysis Approach¶
- Text Preprocessing: We clean the text by removing spaces and punctuation to focus on character-level analysis.
- N-gram Generation: We generate bigrams, trigrams, and quadgrams from the character sequences to analyze the structure and patterns.
- Frequency Calculation: We calculate the frequency distribution of the n-grams to identify common sequences.
- Visualization: We use bar charts and word clouds to highlight the most frequent n-grams in the text.
Insights¶
- Character-based bigrams, trigrams, and quadgrams reveal the text's structure and common sequences.
- The analysis provides insights into linguistic patterns that are valuable for cryptanalysis and text mining.
- Visualizing these patterns can help in recognizing and predicting the structure of the underlying message.
Sample Text Analysis¶
We use a sample text to demonstrate the generation and analysis of n-grams. By calculating the frequency of monograms, bigrams, trigrams, and quadgrams, and visualizing the results, we can highlight common patterns.
- The most frequent monograms indicate the prominence of certain letters.
- Bigrams, trigrams, and quadgrams reveal common pairings and sequences that can be exploited in cryptanalysis.
- Comparing these patterns with known language models helps in deciphering encoded texts.
import spacy
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter
# Load spaCy's English language model
nlp = spacy.load('en_core_web_sm')
def preprocess_text(text):
"""Convert text to lowercase and remove non-alphabetic characters."""
return ''.join([char for char in text.lower() if char.isalpha()])
def generate_char_ngrams(text, n):
"""Generate n-grams from text at the character level."""
return [text[i:i + n] for i in range(len(text) - n + 1)]
def calculate_frequency(ngrams):
"""Calculate frequency of n-grams."""
return Counter(ngrams)
def plot_ngram_freq(freq_dist, title, top_n=10):
"""Plot the frequency distribution of n-grams."""
common_ngrams = freq_dist.most_common(top_n)
ngrams, frequencies = zip(*common_ngrams)
plt.figure(figsize=(10, 6))
plt.bar(ngrams, frequencies, color='lightcoral')
plt.title(title)
plt.xlabel('N-gram')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
def plot_wordcloud(frequencies, title):
"""Plot a word cloud for the given n-gram frequencies."""
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(frequencies)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title(title)
plt.axis('off')
plt.show()
# Process text using spaCy and preprocess for character n-gram analysis
doc = nlp(sample_text)
clean_text = preprocess_text(doc.text)
# Generate and analyze character-based n-grams
for n in [2, 3, 4]: # Bigrams, trigrams, quadgrams
ngrams_char = generate_char_ngrams(clean_text, n)
freq_char = calculate_frequency(ngrams_char)
# Display the most common character n-grams
print(f"Most common character {n}-grams:", freq_char.most_common(5))
# Plot frequencies and word cloud
plot_ngram_freq(freq_char, f'Character {n}-gram Frequency in Sample Text')
plot_wordcloud(freq_char, f'Character {n}-gram Word Cloud')
Most common character 2-grams: [('he', 48), ('th', 39), ('er', 36), ('ed', 29), ('in', 27)]
Most common character 3-grams: [('the', 20), ('her', 19), ('and', 12), ('she', 8), ('ere', 8)]
Most common character 4-grams: [('ther', 6), ('dthe', 5), ('here', 4), ('edth', 4), ('with', 4)]
Automated Techniques: Algorithms and Mathematics¶
Hill Climbing Algorithm for Substitution Cipher¶
The Hill Climbing algorithm is a local search algorithm used to iteratively improve a solution by making small changes and selecting modifications that enhance the solution according to a specific objective function. In the context of breaking substitution ciphers, it is useful for navigating the large and complex solution space.
Pseudo-code¶
- Start with a random key.
- Calculate the fitness of the decrypted message.
- Make small changes to the key and calculate the new fitness.
- If the new key is better, adopt it; repeat until convergence.
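A compact sketch of this loop (the full implementation appears later in this notebook); score_fn here stands in for the n-gram objective function described next:
import random
def hill_climb(key, score_fn, steps=10000):
    """Repeatedly swap two key letters, keeping a swap only if it improves the score."""
    best, best_score = key, score_fn(key)
    for _ in range(steps):
        i, j = random.sample(range(len(best)), 2)  # small change: swap two positions
        cand = list(best)
        cand[i], cand[j] = cand[j], cand[i]
        cand = ''.join(cand)
        s = score_fn(cand)
        if s > best_score:  # adopt the new key only if it scores better
            best, best_score = cand, s
    return best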
Objective Function¶
The objective function evaluates how well a given decryption matches known linguistic patterns. The goal is to maximize the match between the decrypted text and the expected frequency distributions of letters and n-grams in the target language, such as English.
Mathematical Details¶
Monogram Frequency:
$$ \text{Score}_{\text{monogram}} = \sum_{c \in \text{Alphabet}} \left( \text{Frequency}_{\text{text}}(c) \times \text{ExpectedFrequency}_{\text{English}}(c) \right) $$
Bigram Frequency:
$$ \text{Score}_{\text{bigram}} = \sum_{b \in \text{Bigrams}} \left( \text{Frequency}_{\text{text}}(b) \times \text{ExpectedFrequency}_{\text{English}}(b) \right) $$
Trigram Frequency:
$$ \text{Score}_{\text{trigram}} = \sum_{t \in \text{Trigrams}} \left( \text{Frequency}_{\text{text}}(t) \times \text{ExpectedFrequency}_{\text{English}}(t) \right) $$
The overall score is a weighted sum of these individual scores:
$$ \text{TotalScore} = w_1 \cdot \text{Score}_{\text{monogram}} + w_2 \cdot \text{Score}_{\text{bigram}} + w_3 \cdot \text{Score}_{\text{trigram}} $$
where $w_1, w_2, w_3$ are weights that can be adjusted based on the importance of each frequency type.
Complexity Analysis¶
The complexity of the Hill Climbing algorithm is influenced by the number of iterations, operations performed per iteration, and evaluation of the objective function.
Big O Complexity¶
Initialization:
- Generating an initial key: $O(n)$, where $n$ is the length of the alphabet (26 for English).
Per Iteration:
- Generating a new candidate key: $O(1)$, as it involves swapping two letters.
- Decrypting the ciphertext: $O(m)$, where $m$ is the length of the ciphertext.
- Scoring the decryption:
- Monogram analysis: $O(m)$
- Bigram analysis: $O(m)$
- Trigram analysis: $O(m)$
Total per iteration complexity: $O(m)$
Overall Complexity:
- Given $k$ iterations, each with complexity $O(m)$, the overall complexity is $O(k \times m)$.
Maximum Iterations Without Improvement:
- The algorithm terminates when a maximum number of iterations without improvement is reached, denoted $\text{max\_no\_improvement}$; this bounds the total number of iterations and controls the runtime.
Practical Considerations¶
- Local Optima: Hill Climbing is prone to getting stuck in local optima, which is why using multiple starting keys and random restarts can be effective strategies.
- Tuning: Adjusting weights for the scoring function and setting appropriate limits for iterations are crucial for balancing performance and runtime.
- Scalability: The approach scales well with text length but may require more sophisticated heuristics or hybrid methods for larger alphabets or more complex ciphers.
This detailed analysis provides a mathematical and computational understanding of the Hill Climbing algorithm as applied to substitution cipher cryptanalysis.
import time
def generate_initial_key(verbose=False):
"""Generate an initial random substitution key."""
alphabet = list(string.ascii_uppercase)
random_key = alphabet[:]
random.shuffle(random_key)
if verbose:
print(f"Generated initial key: {random_key}")
return ''.join(random_key)
def decrypt_with_key(ciphertext, key, verbose=False):
"""Decrypt the ciphertext using the given substitution key."""
alphabet = string.ascii_uppercase
reverse_key_map = {key[i]: alphabet[i] for i in range(len(alphabet))}
plaintext = ''.join(reverse_key_map.get(char, char) for char in ciphertext)
if verbose:
print(f"Decrypted text with key {key}: {plaintext[:60]}...") # Show first 60 characters
return plaintext
def score_decryption(text, w1, w2, w3, verbose=False):
"""Score the decrypted text based on n-gram frequency analysis with given weights."""
score = 0
# Single letter frequency analysis
freq_analysis = Counter(text)
for char, freq in freq_analysis.items():
if char in MONOGRAMS:
score += w1 * MONOGRAMS[char] * freq
# Bigram frequency analysis
bigram_freq = Counter(text[i:i+2] for i in range(len(text) - 1))
for bigram, freq in bigram_freq.items():
if bigram in BIGRAMS:
score += w2 * BIGRAMS[bigram] * freq
# Trigram frequency analysis
trigram_freq = Counter(text[i:i+3] for i in range(len(text) - 2))
for trigram, freq in trigram_freq.items():
if trigram in TRIGRAMS:
score += w3 * TRIGRAMS[trigram] * freq
if verbose:
print(f"Score for text: {score}")
return score
def hill_climbing(ciphertext, initial_key, w1=1, w2=1, w3=1, max_no_improvement=5000, verbose=False):
"""Perform the hill-climbing algorithm to decrypt the substitution cipher."""
if verbose:
print(f"Starting hill climbing with initial key: {initial_key}")
current_key = initial_key
current_decryption = decrypt_with_key(ciphertext, current_key, verbose)
current_score = score_decryption(current_decryption, w1, w2, w3, verbose)
iterations = 0
no_improvement_count = 0
start_time = time.time()
while no_improvement_count < max_no_improvement:
new_key = list(current_key)
i, j = random.sample(range(len(new_key)), 2)
new_key[i], new_key[j] = new_key[j], new_key[i]
new_key = ''.join(new_key)
new_decryption = decrypt_with_key(ciphertext, new_key, verbose)
new_score = score_decryption(new_decryption, w1, w2, w3, verbose)
if new_score > current_score:
current_key = new_key
current_decryption = new_decryption
current_score = new_score
no_improvement_count = 0
if verbose:
print(f"Improved key found with score {current_score}: {current_key}")
else:
no_improvement_count += 1
iterations += 1
end_time = time.time()
duration = end_time - start_time
return current_decryption, current_key, iterations, duration
def decrypt_with_hill_climbing(ciphertext, num_guesses=5, verbose=False):
"""Decrypt the ciphertext using multiple guesses with the hill-climbing algorithm."""
initial_keys = [generate_initial_key(verbose) for _ in range(num_guesses)]
best_decryption = None
best_key = None
best_score = float('-inf')
total_iterations = 0
total_duration = 0
for initial_key in initial_keys:
decryption, key, iterations, duration = hill_climbing(ciphertext, initial_key, verbose=verbose)
score = score_decryption(decryption, 1, 1.25, 1.5, verbose)
total_iterations += iterations
total_duration += duration
if score > best_score:
best_decryption = decryption
best_key = key
best_score = score
avg_iterations = total_iterations // num_guesses
avg_duration = total_duration / num_guesses
print("\nBest Key:", best_key)
print("Decrypted Plaintext:", best_decryption)
print(f"Average Iterations: {avg_iterations}, Average Duration: {avg_duration:.2f} seconds")
return best_decryption, best_key, avg_iterations, avg_duration
# Measure the speed
start_time = time.time()
# Perform decryption using hill climbing
best_decryption, best_key, avg_iterations, avg_duration = decrypt_with_hill_climbing(ciphertext, num_guesses=5)
end_time = time.time()
hill_climb_time = end_time - start_time
# Calculate key accuracy
correct_key_mappings = sum([1 for a, b in zip(original_key, best_key) if a == b])
hill_climb_key_accuracy = (correct_key_mappings / 26) * 100
# Calculate text accuracy
correct_text_mappings = sum([1 for a, b in zip(sample_text, best_decryption) if a == b])
hill_climb_text_accuracy = (correct_text_mappings / len(sample_text)) * 100
print(f"Hill Climbing Key Accuracy: {hill_climb_key_accuracy:.2f}%")
print(f"Hill Climbing Text Accuracy: {hill_climb_text_accuracy:.2f}%")
print(f"Hill Climbing Speed: {hill_climb_time:.2f} seconds")
Best Key: TWPCOLVYBJKHNUEGSQRFMIDXZA Decrypted Plaintext: THUS DETERBINED, SHE RECONNOITRED THE FIELD, AND MRACTISED HER ADDRESS SO SUCCESSFULLY, THAT IN LESS THAN HALF AN HOUR SHE WAS LOADED WITH ERBINE AND EBPROIDERY, AND DISMOSED TO RETREAT WITH HER PURDEN, WHEN HER REGARDS WERE SOLICITED PY A SMLENDID PUNDLE, WHICH SHE DESCRIED AT SOBE DISTANCE LYING ON THE GROUND. THIS WAS NO OTHER THAN AN UNHAMMY OFFICER OF HUSSARS; WHO, AFTER HAVING THE GOOD FORTUNE TO TAKE A TURKISH STANDARD, WAS DESMERATELY WOUNDED IN THE THIGH, AND OPLIGED TO JUIT HIS HORSE; FINDING HIBSELF IN SUCH A HELMLESS CONDITION, HE HAD WRAMMED HIS ACJUISITION ROUND HIS PODY, THAT WHATEVER BIGHT HAMMEN, HE AND HIS GLORY SHOULD NOT PE MARTED; AND THUS SHROUDED, ABONG THE DYING AND THE DEAD, HE HAD OPSERVED THE MROGRESS OF OUR HEROINE, WHO STALKED APOUT THE FIELD, LIKE ANOTHER ATROMOS, FINISHING, WHEREVER SHE CABE, THE WORK OF DEATH. HE DID NOT AT ALL DOUPT, THAT HE HIBSELF WOULD PE VISITED IN THE COURSE OF HER MEREGRINATIONS, AND THEREFORE MROVIDED FOR HER RECEMTION, WITH A MISTOL READY COCKED IN HIS HAND, WHILE HE LAY MERDUE PENEATH HIS COVERT, IN ALL AMMEARANCE PEREFT OF LIFE. HE WAS NOT DECEIVED IN HIS MROGNOSTIC; SHE NO SOONER EYED THE GOLDEN CRESCENT THAN, INFLABED WITH CURIOSITY OR CUMIDITY, SHE DIRECTED THITHERWARD HER STEMS, AND DISCERNING THE CARCASE OF A BAN, FROB WHICH, SHE THOUGHT, THERE WOULD PE A NECESSITY FOR DISENGAGING IT, SHE LIFTED UM HER WEAMON, IN ORDER TO BAKE SURE OF HER MURCHASE; AND IN THE VERY INSTANT OF DISCHARGING HER PLOW, RECEIVED A PRACE OF PULLETS IN HER PRAIN. Average Iterations: 6505, Average Duration: 6.57 seconds Hill Climbing Key Accuracy: 73.08% Hill Climbing Text Accuracy: 96.07% Hill Climbing Speed: 32.89 seconds
Genetic Algorithm for Substitution Cipher¶
The genetic algorithm is a search heuristic that mimics the process of natural selection. It is used to generate high-quality solutions for optimization and search problems by relying on bio-inspired operators such as mutation, crossover, and selection.
Cryptanalysis of Simple Substitution Ciphers with Genetic Algorithms
Pseudocode¶
The genetic algorithm for breaking a substitution cipher can be summarized with the following pseudocode:
Initialize a population of random keys.
For each generation:
- Evaluate the fitness of each key in the population.
- Select the best-performing keys as parents.
- Create a new population by performing crossover on parents.
- Apply mutation to introduce variation.
- Replace the old population with the new population.
- Track the best key and decryption.
Return the best decryption and key found.
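A compact sketch of one generation of this loop; fitness, crossover, and mutate stand in for the functions implemented below:
import random
def next_generation(population, fitness, crossover, mutate):
    """Select parents in proportion to fitness, then breed and mutate a full new population."""
    scores = [fitness(key) for key in population]
    parents = random.choices(population, weights=scores, k=len(population))
    children = []
    for p1, p2 in zip(parents[::2], parents[1::2]):
        c1, c2 = crossover(p1, p2)
        children += [mutate(c1), mutate(c2)]
    return children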
Objective Function¶
The objective function evaluates how well a decrypted text matches known linguistic patterns. It uses n-gram frequency analysis to score the decrypted text based on its similarity to expected English letter frequencies.
- Monogram Frequency:
$$ \text{Score}_{\text{monogram}} = \sum_{c \in \text{Alphabet}} \left( \text{Frequency}_{\text{text}}(c) \times \text{ExpectedFrequency}_{\text{English}}(c) \right) $$
- Bigram Frequency:
$$ \text{Score}_{\text{bigram}} = \sum_{b \in \text{Bigrams}} \left( \text{Frequency}_{\text{text}}(b) \times \text{ExpectedFrequency}_{\text{English}}(b) \right) $$
- Trigram Frequency:
$$ \text{Score}_{\text{trigram}} = \sum_{t \in \text{Trigrams}} \left( \text{Frequency}_{\text{text}}(t) \times \text{ExpectedFrequency}_{\text{English}}(t) \right) $$
The overall score is a weighted sum of these individual scores:
$$ \text{TotalScore} = w_1 \cdot \text{Score}_{\text{monogram}} + w_2 \cdot \text{Score}_{\text{bigram}} + w_3 \cdot \text{Score}_{\text{trigram}} $$
where $w_1, w_2, w_3$ are weights that can be adjusted based on the importance of each frequency type.
Complexity Analysis¶
Initialization:
- Generating the initial population of keys: $O(p \times n)$, where $p$ is the population size and $n$ is the length of the alphabet (26 for English).
Fitness Evaluation:
- Decrypting and scoring each key: $O(p \times m)$, where $m$ is the length of the ciphertext.
Crossover and Mutation:
- Crossover: $O(p)$ to combine parent keys and repair duplicates.
- Mutation: $O(p \times n)$ to introduce random changes.
Overall Complexity per Generation:
- $O(p \times m + p \times n)$
Total Complexity:
- Given $g$ generations, the total complexity is $O(g \times (p \times m + p \times n))$.
Practical Considerations¶
Population Diversity: Maintaining diversity in the population is crucial to explore a wide solution space and avoid local optima.
Balance Between Exploration and Exploitation: The crossover and mutation rates should be tuned to balance exploration of new solutions and exploitation of known good solutions.
Convergence: The algorithm should be allowed to run for enough generations to converge on a satisfactory solution but should be monitored to prevent excessive computation.
Scalability: The genetic algorithm is suitable for larger search spaces due to its ability to explore multiple solutions simultaneously, making it effective for complex cryptanalysis tasks.
def crossover(parent1, parent2):
"""Perform crossover between two parent keys to produce a child key."""
size = len(parent1)
crossover_point = random.randint(0, size - 1)
# Create child with half from each parent
child1 = list(parent1[:crossover_point] + parent2[crossover_point:])
child2 = list(parent2[:crossover_point] + parent1[crossover_point:])
    # Repair child so it remains a valid permutation of the alphabet:
    # in two-parent crossover a duplicated letter appears at most twice,
    # so replacing its first occurrence with a missing letter suffices
    def repair_key(child):
        missing_letters = set(string.ascii_uppercase) - set(child)
        for i, char in enumerate(child):
            if child.count(char) > 1:
                child[i] = missing_letters.pop()
        return ''.join(child)
return repair_key(child1), repair_key(child2)
def mutate(key, mutation_rate=0.1):
"""Mutate the key by swapping letters based on the mutation rate."""
key_list = list(key)
for i in range(len(key_list)):
if random.random() < mutation_rate:
j = random.randint(0, len(key_list) - 1)
key_list[i], key_list[j] = key_list[j], key_list[i]
return ''.join(key_list)
def select_parents(population, scores, num_parents):
"""Select parents based on their scores using a roulette wheel selection."""
total_score = sum(scores)
selection_probs = [score / total_score for score in scores]
selected_indices = random.choices(range(len(population)), weights=selection_probs, k=num_parents)
return [population[i] for i in selected_indices]
def genetic_algorithm(ciphertext, population_size=100, generations=1000, mutation_rate=0.1, verbose=False):
"""Perform a genetic algorithm to decrypt the substitution cipher."""
# Initialize population
population = [generate_random_key() for _ in range(population_size)]
best_key = None
best_decryption = None
best_score = float('-inf')
for generation in range(generations):
if verbose:
print(f"Generation {generation}")
# Evaluate fitness scores
        scores = [score_decryption(decrypt_with_key(ciphertext, key), 1, 2, 3) for key in population]
# Track the best solution
for i, score in enumerate(scores):
if score > best_score:
best_score = score
best_key = population[i]
best_decryption = decrypt_with_key(ciphertext, best_key)
if verbose:
print(f"Best score in generation {generation}: {best_score}")
# Select parents
parents = select_parents(population, scores, population_size // 2)
        # Generate new population through crossover and mutation,
        # pairing parents until the population is restored to full size
        # (iterating over the selected parents only once would halve it)
        new_population = []
        while len(new_population) < population_size:
            parent1, parent2 = random.sample(parents, 2)
            child1, child2 = crossover(parent1, parent2)
            new_population.append(mutate(child1, mutation_rate))
            new_population.append(mutate(child2, mutation_rate))
        population = new_population[:population_size]
print("\nFinal Best Key:", best_key)
print("Decrypted Plaintext:", best_decryption)
print(f"Best Score: {best_score}")
return best_decryption, best_key
# Measure the speed
start_time = time.time()
# Perform decryption using genetic algorithm
genetic_decryption, genetic_key = genetic_algorithm(ciphertext, population_size=100, generations=500, mutation_rate=0.1, verbose=False)
end_time = time.time()
genetic_time = end_time - start_time
# Calculate key accuracy
correct_key_mappings = sum([1 for a, b in zip(original_key, genetic_key) if a == b])
genetic_key_accuracy = (correct_key_mappings / 26) * 100
# Calculate text accuracy
correct_text_mappings = sum([1 for a, b in zip(sample_text, genetic_decryption) if a == b])
genetic_text_accuracy = (correct_text_mappings / len(sample_text)) * 100
print(f"Genetic Algorithm Key Accuracy: {genetic_key_accuracy:.2f}%")
print(f"Genetic Algorithm Text Accuracy: {genetic_text_accuracy:.2f}%")
print(f"Genetic Algorithm Speed: {genetic_time:.2f} seconds")
Final Best Key: OMICFAPXUJNZVYTWSBQEHLKDRG Decrypted Plaintext: ENBY DAEASPRIAD, YNA SAGTIITRESAD ENA VRAUD, OID KSOGERYAD NAS ODDSAYY YT YBGGAYYVBUUL, ENOE RI UAYY ENOI NOUV OI NTBS YNA XOY UTODAD XREN ASPRIA OID APZSTRDASL, OID DRYKTYAD ET SAESAOE XREN NAS ZBSDAI, XNAI NAS SAMOSDY XASA YTURGREAD ZL O YKUAIDRD ZBIDUA, XNRGN YNA DAYGSRAD OE YTPA DRYEOIGA ULRIM TI ENA MSTBID. ENRY XOY IT TENAS ENOI OI BINOKKL TVVRGAS TV NBYYOSY; XNT, OVEAS NOCRIM ENA MTTD VTSEBIA ET EOWA O EBSWRYN YEOIDOSD, XOY DAYKASOEAUL XTBIDAD RI ENA ENRMN, OID TZURMAD ET JBRE NRY NTSYA; VRIDRIM NRPYAUV RI YBGN O NAUKUAYY GTIDRERTI, NA NOD XSOKKAD NRY OGJBRYRERTI STBID NRY ZTDL, ENOE XNOEACAS PRMNE NOKKAI, NA OID NRY MUTSL YNTBUD ITE ZA KOSEAD; OID ENBY YNSTBDAD, OPTIM ENA DLRIM OID ENA DAOD, NA NOD TZYASCAD ENA KSTMSAYY TV TBS NASTRIA, XNT YEOUWAD OZTBE ENA VRAUD, URWA OITENAS OESTKTY, VRIRYNRIM, XNASACAS YNA GOPA, ENA XTSW TV DAOEN. NA DRD ITE OE OUU DTBZE, ENOE NA NRPYAUV XTBUD ZA CRYREAD RI ENA GTBSYA TV NAS KASAMSRIOERTIY, OID ENASAVTSA KSTCRDAD VTS NAS SAGAKERTI, XREN O KRYETU SAODL GTGWAD RI NRY NOID, XNRUA NA UOL KASDBA ZAIAOEN NRY GTCASE, RI OUU OKKAOSOIGA ZASAVE TV URVA. NA XOY ITE DAGARCAD RI NRY KSTMITYERG; YNA IT YTTIAS ALAD ENA MTUDAI GSAYGAIE ENOI, RIVUOPAD XREN GBSRTYREL TS GBKRDREL, YNA DRSAGEAD ENRENASXOSD NAS YEAKY, OID DRYGASIRIM ENA GOSGOYA TV O POI, VSTP XNRGN, YNA ENTBMNE, ENASA XTBUD ZA O IAGAYYREL VTS DRYAIMOMRIM RE, YNA URVEAD BK NAS XAOKTI, RI TSDAS ET POWA YBSA TV NAS KBSGNOYA; OID RI ENA CASL RIYEOIE TV DRYGNOSMRIM NAS ZUTX, SAGARCAD O ZSOGA TV ZBUUAEY RI NAS ZSORI. Best Score: 350985523019 Genetic Algorithm Key Accuracy: 3.85% Genetic Algorithm Text Accuracy: 25.49% Genetic Algorithm Speed: 25.89 seconds
Cryptanalysis of Simple Substitution-Permutation Cipher Using Artificial Neural Network¶
The method described in the 2020 IEEE paper involves using artificial neural networks (ANN) to decrypt substitution-permutation ciphers. This approach leverages the pattern recognition capabilities of neural networks to predict the key used in encryption, thereby recovering the plaintext from ciphertext.
Cryptanalysis of Simple Substitution-Permutation Cipher Using Artificial Neural Networks
Overview¶
The goal is to train a neural network to recognize patterns in ciphertexts and map them to their corresponding plaintexts. The network learns these mappings by analyzing a large set of training data consisting of plaintext-ciphertext pairs. Once trained, the network can predict plaintexts for new, unseen ciphertexts.
Methodology¶
Data Preparation:
- Generate a dataset of plaintext-ciphertext pairs using a known key for encryption.
- Convert text data into numerical form suitable for neural network input and output.
Neural Network Design:
- Construct a neural network with an input layer representing the ciphertext and an output layer predicting the plaintext.
- Use hidden layers to learn complex patterns and relationships.
Training the Neural Network:
- Train the network using the generated dataset to minimize the loss between predicted plaintexts and actual plaintexts.
- Utilize a suitable loss function (e.g., categorical cross-entropy) and an optimizer (e.g., Adam) for training.
Decrypting Ciphertext:
- Input a ciphertext into the trained network to obtain a predicted plaintext.
Pseudocode¶
Initialize Dataset:
- Generate random plaintexts and corresponding ciphertexts using a known key.
- Convert text to numerical data.
Design Neural Network:
- Input layer size = Length of ciphertext.
- Hidden layers: Experiment with different sizes and activations.
- Output layer size = Length of plaintext.
Train Neural Network:
- Use training data to optimize the network weights.
- Evaluate performance on a validation set.
Decrypt New Ciphertext:
- Input ciphertext into the trained network.
- Output predicted plaintext.
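A minimal sketch of the data-preparation step, reusing the generate_random_key and encrypt_with_key helpers defined earlier; the plaintexts below are placeholder examples, not the dataset behind the reported results:
import string
import torch
def text_to_tensor(text):
    """Encode uppercase letters as integer indices (A=0 .. Z=25), skipping other characters."""
    return torch.tensor([ord(c) - ord('A') for c in text if c in string.ascii_uppercase])
# Build toy plaintext-ciphertext training pairs with a known key
toy_key = generate_random_key()
for plain in ["ATTACK AT DAWN", "MEET ME AT NOON"]:
    cipher = encrypt_with_key(plain, toy_key)
    print(cipher, "->", text_to_tensor(cipher).tolist())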
Complexity Analysis¶
Data Generation:
- Generating the training dataset: $O(s \times n)$, where $s$ is the number of samples and $n$ is the length of the text.
Training Complexity:
- Forward pass and backpropagation per epoch: $O(e \times (n \times d + d^2))$, where $e$ is the number of epochs and $d$ is the number of neurons in the hidden layers.
Inference Complexity:
- Forward pass for decryption: $O(n \times d)$
Practical Considerations¶
Data Requirements: The network's success depends on having a large and diverse training dataset. More data helps in learning robust patterns.
Network Architecture: The choice of the number of layers, neurons, and activation functions can significantly affect performance. Experimentation and tuning are often required.
Generalization: While neural networks are powerful, they may not generalize well to all unseen ciphertexts, especially if training data is not representative.
Scalability: Neural networks can be computationally expensive to train, especially for large datasets or complex architectures. However, they can efficiently handle large volumes of data and complex patterns once trained.
Conclusion¶
Using neural networks for cryptanalysis offers a promising approach to breaking substitution ciphers by leveraging pattern recognition and learning from data. This method provides a novel perspective on cryptanalysis by combining traditional techniques with modern machine learning approaches.
import pandas as pd
import numpy as np
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence, pad_packed_sequence
from sklearn.model_selection import train_test_split
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
'''
# Load the data
df = pd.read_csv('data/encrypted_sentences.csv')
# Separate inputs and outputs
input_texts = df['Encrypted_Sentence'].values
output_texts = df['Sentence'].values
# Tokenization
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
# Tokenize and build vocabulary
def yield_tokens(data_iter):
for text in data_iter:
yield tokenizer(text)
vocab = build_vocab_from_iterator(yield_tokens(input_texts), min_freq=3, specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])
# Encode and pad sequences
def encode_and_pad(texts, vocab):
tokenized = [torch.tensor(vocab(tokenizer(text)), dtype=torch.long) for text in texts]
return pad_sequence(tokenized, batch_first=True, padding_value=vocab["<pad>"])
input_encoded = encode_and_pad(input_texts, vocab)
output_encoded = encode_and_pad(output_texts, vocab)
# Custom Dataset
class TextDataset(Dataset):
def __init__(self, input_data, output_data):
self.input_data = input_data
self.output_data = output_data
def __len__(self):
return len(self.input_data)
def __getitem__(self, idx):
return self.input_data[idx], self.output_data[idx]
# Split dataset
train_inputs, valid_inputs, train_outputs, valid_outputs = train_test_split(
input_encoded, output_encoded, test_size=0.2, random_state=42
)
train_dataset = TextDataset(train_inputs, train_outputs)
valid_dataset = TextDataset(valid_inputs, valid_outputs)
# Collate function to handle variable-length sequences
def collate_fn(batch):
inputs, outputs = zip(*batch)
input_lengths = torch.tensor([len(seq) for seq in inputs])
output_lengths = torch.tensor([len(seq) for seq in outputs])
inputs_padded = pad_sequence(inputs, batch_first=True, padding_value=vocab["<pad>"])
outputs_padded = pad_sequence(outputs, batch_first=True, padding_value=vocab["<pad>"])
return inputs_padded, input_lengths, outputs_padded, output_lengths
# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=collate_fn)
valid_loader = DataLoader(valid_dataset, batch_size=64, shuffle=False, collate_fn=collate_fn)
# Define the model
class LSTMModel(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout=0.5):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=True, dropout=dropout, batch_first=True)
self.fc1 = nn.Linear(hidden_dim * 2, output_dim)
def forward(self, text, text_lengths):
embedded = self.embedding(text)
packed_embedded = pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True, enforce_sorted=False)
packed_output, (hidden, cell) = self.rnn(packed_embedded)
output, _ = pad_packed_sequence(packed_output, batch_first=True)
output = self.fc1(output)
return output
model = LSTMModel(len(vocab), 128, 64, len(vocab), 2)
# Loss and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=vocab["<pad>"])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Training loop
epochs = 100
for epoch in range(epochs):
model.train()
for batch in train_loader:
input_text, input_lengths, output_text, output_lengths = batch
optimizer.zero_grad()
output = model(input_text, input_lengths)
# Reshape the output and target tensors for the loss calculation
output = output.view(-1, output.size(-1))
target = output_text.view(-1)
loss = criterion(output, target)
loss.backward()
optimizer.step()
print(f"Epoch {epoch + 1}, Loss: {loss.item()}")
# Save the model
torch.save(model.state_dict(), 'model/lstm_sentence_decryption_model.pt')
print("Model training complete and saved as 'sentence_decryption_model.pt'.")
'''
import pandas as pd
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence, pad_packed_sequence
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
# Load the data to build the same vocabulary
df = pd.read_csv('data/encrypted_sentences.csv')
input_texts = df['Encrypted_Sentence'].values
# Tokenization
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
# Tokenize and build vocabulary
def yield_tokens(data_iter):
for text in data_iter:
yield tokenizer(text)
vocab = build_vocab_from_iterator(yield_tokens(input_texts), min_freq=3, specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])
# Define the model class (should match the definition during training)
class LSTMModel(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout=0.5):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=True, dropout=dropout,
batch_first=True)
self.fc1 = nn.Linear(hidden_dim * 2, output_dim)
def forward(self, text, text_lengths):
embedded = self.embedding(text)
packed_embedded = pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True, enforce_sorted=False)
packed_output, (hidden, cell) = self.rnn(packed_embedded)
output, _ = pad_packed_sequence(packed_output, batch_first=True)
output = self.fc1(output)
return output
# Initialize the model with the correct vocabulary size
model = LSTMModel(len(vocab), 128, 64, len(vocab), 2)
# Load the trained model weights
model.load_state_dict(torch.load('model/lstm_sentence_decryption_model.pt'))
print("Model loaded successfully.")
Model loaded successfully.
# Decrypt using the trained model
def decrypt_with_model(ciphertext, model, vocab, device='cpu'):
# Tokenization
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
tokens = tokenizer(ciphertext)
# Convert tokens to numerical IDs using the vocabulary
numeric_input = torch.tensor([vocab[token] for token in tokens], dtype=torch.long).unsqueeze(0)
# Pad sequence to ensure consistent input size
text_length = torch.tensor([len(numeric_input[0])])
# Move model and data to the appropriate device
model = model.to(device)
numeric_input = numeric_input.to(device)
text_length = text_length.to(device)
# Set model to evaluation mode
model.eval()
with torch.no_grad():
output = model(numeric_input, text_length)
# Get the predicted tokens
predicted_indices = output.argmax(dim=-1).squeeze().cpu().numpy()
predicted_tokens = [vocab.lookup_token(index) for index in predicted_indices]
# Convert predicted tokens back to text
predicted_text = ' '.join(predicted_tokens)
return predicted_text
from torchtext.vocab import vocab as Vocab
# Fallback: build a placeholder vocabulary only if none exists yet.
# In real use, reuse the vocabulary built from the training data above;
# replacing it with an empty vocabulary maps every token to <unk>.
if 'vocab' not in globals():
    counter = Counter()
    vocab = Vocab(counter)
    vocab.insert_token("<unk>", 0)
    vocab.insert_token("<pad>", 1)
    vocab.set_default_index(vocab["<unk>"])
# Measure the speed
start_time = time.time()
# Perform decryption using the neural network
neural_decryption = decrypt_with_model(ciphertext, model, vocab)
end_time = time.time()
neural_time = end_time - start_time
# Calculate text accuracy
correct_text_mappings = sum([1 for a, b in zip(sample_text, neural_decryption) if a == b])
neural_text_accuracy = (correct_text_mappings / len(sample_text)) * 100
print(f"Neural Network Text Accuracy: {neural_text_accuracy:.2f}%")
print(f"Neural Network Speed: {neural_time:.2f} seconds")
Neural Network Text Accuracy: 2.03%
Neural Network Speed: 0.56 seconds
Performance Analysis and Insights¶
This project explores the implementation of three cryptanalysis techniques—Hill Climbing, Genetic Algorithms, and Artificial Neural Networks (ANNs)—to decrypt substitution ciphers. Here are some key insights and performance metrics from the experiment:
Model Performance¶
Hill Climbing¶
Key Accuracy: Achieved a key accuracy of approximately 73.1%, indicating its effectiveness in finding close matches to the original key.
Text Accuracy: High text accuracy of 96.1% shows its ability to decrypt ciphertexts accurately.
Speed: The method is relatively slow, with an execution time of about 32.9 seconds, reflecting its iterative nature.
Genetic Algorithms¶
Key Accuracy: The genetic algorithm reached a key accuracy of only about 3.8%, showing difficulty in recovering the exact key and achieving just partial recovery.
Text Accuracy: Achieved a text accuracy of 25.5%, indicating limited effectiveness in decryption compared to Hill Climbing.
Speed: Faster than Hill Climbing at about 25.9 seconds, benefiting from population diversity and exploration strategies.
Neural Networks¶
Key Accuracy: The neural network was not designed to recover keys directly, thus showing 0% key accuracy.
Text Accuracy: Achieved a text accuracy of 2.0%, indicating limited success in generalizing from the training data to new ciphertexts.
Speed: Significantly faster with an inference time of just 0.56 seconds, making it ideal for quick decryption tasks once trained.
Key Factors Affecting Performance¶
Data Quality and Quantity: The neural network's low accuracy underscores the importance of training data quality and quantity. More diverse and representative datasets could improve performance.
Algorithm Parameters: For genetic algorithms, parameters such as population size, mutation rate, and number of generations are crucial for achieving better results.
Complexity and Generalization: While neural networks offer fast inference, their complexity requires careful tuning and larger datasets to generalize effectively.
Limitations¶
Local Optima: Hill Climbing can get stuck in local optima, affecting its ability to find the global solution, though it performed well in this experiment.
Training Data Dependency: The neural network's performance is heavily dependent on the quality of training data, limiting its effectiveness with unfamiliar ciphertexts.
Conclusions and Future Directions¶
The combined use of Hill Climbing, Genetic Algorithms, and Artificial Neural Networks for cryptanalysis provides a comprehensive approach to tackling classical encryption challenges. Here are some concluding thoughts and future directions:
Conclusions¶
Efficacy: Hill Climbing demonstrated high efficacy in this project with strong key and text accuracy, while Genetic Algorithms provided moderate results, and Neural Networks struggled due to data limitations.
Scalability: Hill Climbing and Genetic Algorithms can be scaled with additional iterations and generations, while Neural Networks require enhanced datasets for better scaling.
Practical Applications: Despite challenges, these models can be integrated into cryptanalysis tools, offering automated solutions that reduce manual intervention.
Future Directions¶
Enhancing Models: Improve training datasets for neural networks and explore more sophisticated architectures to capture complex patterns.
Hybrid Approaches: Combine the strengths of Hill Climbing and Genetic Algorithms with neural networks to enhance decryption capabilities.
Transfer Learning: Apply transfer learning techniques to leverage pretrained models on large datasets, improving performance on new datasets.
Variable Length Adaptation: Develop models capable of handling variable-length ciphertexts to increase flexibility and applicability to real-world scenarios.
Security Implications: Understanding the capabilities and limitations of these approaches can inform the design of more secure encryption systems resistant to attacks.
This project highlights the intersection of machine learning and cryptography, offering promising avenues for research and development in secure communication systems.
# Prepare data for plotting
methods = ['Hill Climbing', 'Genetic Algorithm', 'Neural Network']
key_accuracy = [hill_climb_key_accuracy, genetic_key_accuracy, 0] # 0 for Neural Network as it doesn't recover keys
text_accuracy = [hill_climb_text_accuracy, genetic_text_accuracy, neural_text_accuracy]
speed = [hill_climb_time, genetic_time, neural_time]
# Create subplots for comparison
fig, ax1 = plt.subplots(figsize=(12, 6))
# Bar positions
bar_width = 0.25
index = np.arange(len(methods))
# Bar plot for key accuracy
ax1.bar(index, key_accuracy, bar_width, label='Key Accuracy (%)', color='b', alpha=0.6)
# Bar plot for text accuracy
ax1.bar(index + bar_width, text_accuracy, bar_width, label='Text Accuracy (%)', color='g', alpha=0.6)
# Secondary y-axis for speed
ax2 = ax1.twinx()
ax2.bar(index + 2 * bar_width, speed, bar_width, label='Speed (s)', color='r', alpha=0.6)
# Labels and titles
ax1.set_xlabel('Methods')
ax1.set_ylabel('Accuracy (%)')
ax2.set_ylabel('Speed (s)')
ax1.set_title('Comparison of Cryptanalysis Methods')
ax1.set_xticks(index + bar_width)
ax1.set_xticklabels(methods)
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.tight_layout()
plt.show()
# Create a comparison table for accuracy data
def create_accuracy_table(hill_climb_key_accuracy, genetic_key_accuracy, hill_climb_text_accuracy,
genetic_text_accuracy, neural_text_accuracy, hill_climb_time, genetic_time, neural_time):
methods = ['Hill Climbing', 'Genetic Algorithm', 'Neural Network']
key_accuracy = [hill_climb_key_accuracy, genetic_key_accuracy, 0] # 0 for Neural Network as it doesn't recover keys
text_accuracy = [hill_climb_text_accuracy, genetic_text_accuracy, neural_text_accuracy]
speed = [hill_climb_time, genetic_time, neural_time]
# Gather data into a DataFrame
accuracy_data = {
'Method': methods,
'Key Accuracy (%)': key_accuracy,
'Text Accuracy (%)': text_accuracy,
'Speed (s)': speed
}
# Create a DataFrame
accuracy_df = pd.DataFrame(accuracy_data)
# Display the DataFrame as a table
print("\nComparison Table:")
print(accuracy_df)
# Call the function with the provided values
create_accuracy_table(hill_climb_key_accuracy, genetic_key_accuracy, hill_climb_text_accuracy,
genetic_text_accuracy, neural_text_accuracy, hill_climb_time, genetic_time, neural_time)
Comparison Table:
              Method  Key Accuracy (%)  Text Accuracy (%)  Speed (s)
0      Hill Climbing         73.076923          96.068152  32.885385
1  Genetic Algorithm          3.846154          25.491481  25.890031
2     Neural Network          0.000000           2.031455   0.555757
print(best_decryption)
THUS DETERBINED, SHE RECONNOITRED THE FIELD, AND MRACTISED HER ADDRESS SO SUCCESSFULLY, THAT IN LESS THAN HALF AN HOUR SHE WAS LOADED WITH ERBINE AND EBPROIDERY, AND DISMOSED TO RETREAT WITH HER PURDEN, WHEN HER REGARDS WERE SOLICITED PY A SMLENDID PUNDLE, WHICH SHE DESCRIED AT SOBE DISTANCE LYING ON THE GROUND. THIS WAS NO OTHER THAN AN UNHAMMY OFFICER OF HUSSARS; WHO, AFTER HAVING THE GOOD FORTUNE TO TAKE A TURKISH STANDARD, WAS DESMERATELY WOUNDED IN THE THIGH, AND OPLIGED TO JUIT HIS HORSE; FINDING HIBSELF IN SUCH A HELMLESS CONDITION, HE HAD WRAMMED HIS ACJUISITION ROUND HIS PODY, THAT WHATEVER BIGHT HAMMEN, HE AND HIS GLORY SHOULD NOT PE MARTED; AND THUS SHROUDED, ABONG THE DYING AND THE DEAD, HE HAD OPSERVED THE MROGRESS OF OUR HEROINE, WHO STALKED APOUT THE FIELD, LIKE ANOTHER ATROMOS, FINISHING, WHEREVER SHE CABE, THE WORK OF DEATH. HE DID NOT AT ALL DOUPT, THAT HE HIBSELF WOULD PE VISITED IN THE COURSE OF HER MEREGRINATIONS, AND THEREFORE MROVIDED FOR HER RECEMTION, WITH A MISTOL READY COCKED IN HIS HAND, WHILE HE LAY MERDUE PENEATH HIS COVERT, IN ALL AMMEARANCE PEREFT OF LIFE. HE WAS NOT DECEIVED IN HIS MROGNOSTIC; SHE NO SOONER EYED THE GOLDEN CRESCENT THAN, INFLABED WITH CURIOSITY OR CUMIDITY, SHE DIRECTED THITHERWARD HER STEMS, AND DISCERNING THE CARCASE OF A BAN, FROB WHICH, SHE THOUGHT, THERE WOULD PE A NECESSITY FOR DISENGAGING IT, SHE LIFTED UM HER WEAMON, IN ORDER TO BAKE SURE OF HER MURCHASE; AND IN THE VERY INSTANT OF DISCHARGING HER PLOW, RECEIVED A PRACE OF PULLETS IN HER PRAIN.
Live Coding Event¶
Project Gutenberg¶
class_text = "From fairest creatures we desire increase, That thereby beauty’s rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou contracted to thine own bright eyes, Feed’st thy light’s flame with self-substantial fuel, Making a famine where abundance lies, Thyself thy foe, to thy sweet self too cruel: Thou that art now the world’s fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And, tender churl, mak’st waste in niggarding: Pity the world, or else this glutton be, To eat the world’s due, by the grave and thee.".upper()
class_key = generate_random_key()
# Encrypt the plaintext using the random key
class_ciphertext = encrypt_with_key(class_text, class_key)
print("Ciphertext: ", class_ciphertext)
start_time = time.time()
# Perform decryption using hill climbing
best_decryption, best_key, avg_iterations, avg_duration = decrypt_with_hill_climbing(class_ciphertext, num_guesses=5)
end_time = time.time()
class_time = end_time - start_time
Best Key: XMFLJPUITZCROQVHDBWKGNYASE Decrypted Plaintext: YROM YAIREST WREATURES FE DESIRE INWREASE, THAT THEREPC PEAUTC’S ROSE MIGHT NEVER DIE, PUT AS THE RIBER SHOULD PC TIME DEWEASE, HIS TENDER HEIR MIGHT PEAR HIS MEMORC: PUT THOU WONTRAWTED TO THINE OFN PRIGHT ECES, YEED’ST THC LIGHT’S YLAME FITH SELY-SUPSTANTIAL YUEL, MAKING A YAMINE FHERE APUNDANWE LIES, THCSELY THC YOE, TO THC SFEET SELY TOO WRUEL: THOU THAT ART NOF THE FORLD’S YRESH ORNAMENT, AND ONLC HERALD TO THE GAUDC SBRING, FITHIN THINE OFN PUD PURIEST THC WONTENT, AND, TENDER WHURL, MAK’ST FASTE IN NIGGARDING: BITC THE FORLD, OR ELSE THIS GLUTTON PE, TO EAT THE FORLD’S DUE, PC THE GRAVE AND THEE. Average Iterations: 7316, Average Duration: 3.29 seconds
# Calculate text accuracy
class_text_mappings = sum([1 for a, b in zip(class_text, best_decryption) if a == b])
class_text_accuracy = (class_text_mappings / len(class_text)) * 100
print(f"Hill Climb (Class) Text Accuracy: {class_text_accuracy:.2f}%")
print(f"Hill Climb (Class) Speed: {class_time:.2f} seconds")
Hill Climb (Class) Text Accuracy: 35.91%
Hill Climb (Class) Speed: 16.45 seconds
Q&A Session¶