This Python code is a tool for encoding sentences into numerical vectors using a tokenizer, encrypting these vectors, and allowing them to be decrypted back into sentences. It uses the SentencePiece library for tokenization (converting sentences into numerical vector representations) and the cryptography.fernet module for encryption and decryption. Here's a detailed breakdown of the code's functionality:
Key Components and What It Does
- Sentence Tokenization with SentencePiece:
  - The script uses SentencePiece to turn sentences into tokenized vectors (lists of numerical IDs).
  - It trains a tokenizer model (tokenizer.model) using a predefined corpus of sentences.
  - When tokenizing, the sentence is encoded into a list of numerical tokens.
- Encryption:
  - The numerical tokens (vectors) obtained from tokenization are serialized into JSON format.
  - The serialized tokens are encrypted using the Fernet symmetric encryption scheme provided by the cryptography library.
  - The encrypted data (ciphertext) is then saved to a file (1output.txt).
- Decryption:
  - The script reads an encrypted file, decrypts it using the specified encryption key, and retrieves the original vector representation.
  - The tokenized vector is then converted back to the original sentence using the trained SentencePiece tokenizer.
- Encryption Key Handling:
  - The encryption key is loaded from a specified file (key.txt).
  - If the key file doesn’t exist during encryption, a new key is generated, saved, and then used for encryption.
- Command-Line Arguments:
  - The script can be run in two modes, as specified by command-line flags:
    - Encryption Mode (-1): Takes user input (a sentence), encodes it into vectors, encrypts it, and saves it to 1output.txt.
    - Decryption Mode (-2): Reads an encrypted file (e.g., 1output.txt), decrypts it, decodes the vectors back to the original sentence, and saves the result in 2output.txt.
  - Both modes require a key file (key.txt).
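The two modes and the key-file option described above could be parsed along the following lines. This is a minimal sketch that assumes argparse; the description does not say which parser the script actually uses. Only the flags -1, -2, and -k come from the description. The function name build_parser and the attribute names encrypt, decrypt_file, and key_file are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the -1 / -2 / -k interface (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="vspeech.py",
        description="Tokenize and encrypt a sentence, or decrypt and decode a file.",
    )

    # Exactly one of the two modes must be chosen.
    mode = parser.add_mutually_exclusive_group(required=True)
    # Because the parser defines options that look like negative numbers,
    # argparse treats "-1" and "-2" as options rather than numeric values.
    mode.add_argument(
        "-1", dest="encrypt", action="store_true",
        help="encryption mode: prompt for a sentence and write 1output.txt",
    )
    mode.add_argument(
        "-2", dest="decrypt_file", metavar="FILE",
        help="decryption mode: read FILE and write 2output.txt",
    )

    # Both modes need a key file.
    parser.add_argument(
        "-k", dest="key_file", metavar="KEYFILE", required=True,
        help="path to the Fernet key file, e.g. key.txt",
    )
    return parser
```

With this layout, parsing ["-2", "1output.txt", "-k", "key.txt"] leaves the encrypted file name in decrypt_file and the key path in key_file, while ["-1", "-k", "key.txt"] sets encrypt to True.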
Detailed Steps
- Setup:
  - The setup_tokenizer function ensures a SentencePiece tokenizer model is available. If the tokenizer.model file doesn’t exist, the script trains one using a predefined corpus of example sentences.
- Encrypt Mode (encrypt_mode):
  - Prompts the user to enter a sentence.
  - Uses the SentencePiece tokenizer to convert the input sentence into numerical vectors.
  - Encrypts these vectors using a key (loaded from or saved to key.txt).
  - Outputs the encrypted data to 1output.txt.
- Decrypt Mode (decrypt_mode):
  - Loads the encryption key from the key file (key.txt).
  - Reads encrypted data from the file (like 1output.txt).
  - Decrypts the data to retrieve the numerical vectors.
  - Converts the vectors back into a readable sentence using the SentencePiece tokenizer.
  - Saves the reconstructed sentence to 2output.txt.
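- Key Generation and Saving:
  - If the encryption key file doesn’t exist when encrypting, the script creates a new key using Fernet's generate_key() function and saves it, so encryption can proceed without a pre-existing key. Because Fernet is symmetric, the same key file is needed later for decryption.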
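The key handling described in this section can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the function name load_or_create_key and the allow_create parameter are hypothetical, and the only library call taken from the description is Fernet.generate_key().

```python
from pathlib import Path


def load_or_create_key(key_path, allow_create=True):
    """Return the key stored at key_path, creating it first if permitted.

    allow_create is a hypothetical switch: encryption would pass True,
    decryption would pass False so that a missing key is reported
    instead of silently replaced.
    """
    path = Path(key_path)
    if path.exists():
        # Strip a trailing newline in case the file was edited by hand.
        return path.read_bytes().strip()

    if not allow_create:
        raise FileNotFoundError(f"key file not found: {path}")

    # Third-party dependency, imported only when a new key is needed.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()  # URL-safe base64-encoded 32-byte key
    path.write_bytes(key)
    return key
```

Refusing to create a key in decryption mode is a design choice made for this sketch: a freshly generated key could never decrypt an existing ciphertext, so failing early gives a clearer error.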
Usage Instructions
The script is designed to be run via the command line with the following options:
- Encrypt a Sentence:
    python vspeech.py -1 -k key.txt
  - Prompts you to input a sentence.
  - Encrypts it after tokenization and saves the result in 1output.txt.
- Decrypt an Encrypted File:
    python vspeech.py -2 1output.txt -k key.txt
  - Decrypts 1output.txt using the key from key.txt.
  - Reconstructs the original sentence and saves it in 2output.txt.
Example Workflow
- Encrypt a Sentence:
  - Input: "meet at location alpha at fourteen thirty"
  - Tokenized: [1, 21, 14, 33, 42, 19] (example tokens)
  - Encrypted and saved as 1output.txt.
- Decrypt the File:
  - Reads 1output.txt.
  - Decrypts and decodes back to: "meet at location alpha at fourteen thirty".
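The middle of this workflow, going from a token list to ciphertext and back, might look roughly like the sketch below. It assumes the tokens are serialized as UTF-8 JSON before being passed to Fernet, as the description states. The four function names are hypothetical, and the token list in the usage note is the illustrative one from the example, not real tokenizer output.

```python
import json


def tokens_to_bytes(token_ids):
    """Serialize a list of integer token IDs to UTF-8 encoded JSON."""
    return json.dumps(token_ids).encode("utf-8")


def bytes_to_tokens(payload):
    """Inverse of tokens_to_bytes: parse UTF-8 JSON back into a list."""
    return json.loads(payload.decode("utf-8"))


def encrypt_tokens(token_ids, key):
    """Encrypt the serialized token list with a Fernet key."""
    # Third-party dependency, imported here so the JSON helpers
    # above remain usable without it.
    from cryptography.fernet import Fernet

    return Fernet(key).encrypt(tokens_to_bytes(token_ids))


def decrypt_tokens(ciphertext, key):
    """Decrypt Fernet ciphertext and return the original token list."""
    from cryptography.fernet import Fernet

    # Fernet raises InvalidToken here if the key is wrong or the
    # ciphertext has been modified.
    return bytes_to_tokens(Fernet(key).decrypt(ciphertext))
```

With the cryptography package installed and a key from Fernet.generate_key(), calling decrypt_tokens(encrypt_tokens([1, 21, 14, 33, 42, 19], key), key) returns the same list, which the tokenizer can then decode back into the sentence.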
Applications
This script demonstrates:
- Secure storage of sensitive information (like sentences) as encrypted data.
- Conversion of text into compact tokenized vectors (lists of integer IDs) for efficient processing.
- A use case for SentencePiece in customizing natural language preprocessing pipelines.
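For the SentencePiece use case in particular, the tokenizer setup step described earlier could be sketched as below. Everything specific here is an assumption: the corpus sentences, the corpus file name, and the vocabulary size are placeholders, and write_corpus is a hypothetical helper. The script's real corpus and training options are not given in the description. Training on a corpus this small may need further tuning of the options.

```python
from pathlib import Path

# Placeholder corpus; the script's actual predefined sentences are not shown.
CORPUS = [
    "meet at location alpha at fourteen thirty",
    "move to location bravo at nine fifteen",
    "hold position until further notice",
]


def write_corpus(sentences, corpus_path):
    """Write one sentence per line, the input format SentencePiece expects."""
    path = Path(corpus_path)
    path.write_text("\n".join(sentences) + "\n", encoding="utf-8")
    return path


def setup_tokenizer(model_prefix="tokenizer", corpus_path="corpus.txt", vocab_size=64):
    """Load <model_prefix>.model, training it first if the file is missing."""
    # Third-party dependency, imported only when the tokenizer is needed.
    import sentencepiece as spm

    model_file = Path(f"{model_prefix}.model")
    if not model_file.exists():
        write_corpus(CORPUS, corpus_path)
        spm.SentencePieceTrainer.train(
            input=str(corpus_path),
            model_prefix=model_prefix,
            vocab_size=vocab_size,   # assumed value
            hard_vocab_limit=False,  # treat vocab_size as a soft limit for tiny corpora
        )
    return spm.SentencePieceProcessor(model_file=str(model_file))
```

Given the returned processor sp, sp.encode(sentence, out_type=int) yields the list of token IDs and sp.decode(ids) turns such a list back into text.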