This code is a word unscrambler program that uses a...

August 30, 2025 at 09:31 AM

from enum import Enum
import logging 
from string import ascii_lowercase 
import sys 
import random

import bootstrap 
import config

cfg = config.add_commandline_args(
    f"Unscramble! ({__file__})",
    "A fast word unscrambler."
)

cfg.add_argument(
    "--sparsefile",
    help="Path to the sparsefile to populate or query.",
    metavar='FILENAME',
    default="/usr/share/dict/sparse_words",
)
cfg.add_argument(
    "--dictfile",
    help="Path to a file containing language words.",
    metavar='FILENAME',
    default="/usr/share/dict/words"
)
group = cfg.add_mutually_exclusive_group()
group.add_argument(
    '--query',
    '-q',
    type=str,
    help="The scrambled form of a word to ponder."
)
group.add_argument(
    "--populate_destructively",
    help="Discard, overwrite and rebuild the sparsefile using a new random seed.",
    action="store_true"
)

logger = logging.getLogger(name) 
letter_sigs = {} 
sparse = {} 
sparsefile_bits = 38  # 238 = 256Gb sparsefile 
sparsefile_length = 2 ** sparsefile_bits 
sparsefile_mask = sparsefile_length - 1 
length_multiplier = int(sparsefile_length / 100)


def populate_letter_sigs() -> None:
    for x in ascii_lowercase:
        letter_sigs[x] = random.getrandbits(sparsefile_bits)


def compute_word_sig(word: str) -> int:
    sig = (len(word) * length_multiplier) & sparsefile_mask
    pop = {}
    for letter in word:
        if letter in letter_sigs:
            pop[letter] = pop.get(letter, 0) + 1
    for letter in pop:
        factor = (letter_sigs[letter] * pop[letter]) & sparsefile_mask
        sig = sig ^ factor
        logger.debug(
            "(%d x %s) => 0x%x (factor=0x%x)" % (pop[letter], letter, sig, factor)
        )
    return sig


def read_dict_write_sparsefile(dictfile: str, sparsefile: str) -> None:
    with open(dictfile, "r") as f:
        for word in f:
            word = word.replace('\n', '')
            word = word.lower()
            sig = compute_word_sig(word)
            logger.debug("%s => %x" % (word, sig))
            if sig in sparse:
                sparse[sig] += ",%s" % word
            else:
                sparse[sig] = word

    with open(sparsefile, "r+b") as f:
        total = len(sparse)
        count = 0
        for sig in sparse:
            word = bytes(sparse[sig], "ascii")
            count += 1
            logger.debug(
                f"[{count} / {total}]: Write {len(word)} bytes at {sig:x} for \"{word}\""
            )

            # Ensure no collisions
            f.seek(sig)
            for x in range(0, len(word)):
                c = bytes(f.read(1))
                if c != b'\0':
                    logger.error(f"About to clobber {c} @ 0x{f.tell():x}")
                    sys.exit(1)

            # Write this one out
            f.seek(sig)
            f.write(word)


class State(Enum):
    FIRST_CHAR = 1
    SKIP_EMPTY = 2
    ACCUMULATE_NOT_EMPTY = 3


@bootstrap.initialize
def main() -> None:
    seed = 12347
    random.seed(seed)
    populate_letter_sigs()

    if config.config['populate_destructively']:
        print(f"Rebuilding the sparsefile using random seed {seed}")
        read_dict_write_sparsefile(
            config.config['dictfile'],
            config.config['sparsefile']
        )
        sys.exit(0)

    # Otherwise, we're looking up a word...
    assert(config.config['query'] is not None)
    sig = compute_word_sig(config.config['query'])
    with open(config.config['sparsefile']) as f:
        result = ""
        f.seek(sig)
        state = State.FIRST_CHAR
        failed = False

        while True:
            c = f.read(1)
            logger.debug("c='%c', result='%s', failed=%s, state=%s" % (
                c, result, failed, state))
            if state == State.FIRST_CHAR:
                if c == chr(0) or c == '':
                    failed = True
                    print("Lookup failed.  Searching for approximate match...")
                    state = State.SKIP_EMPTY
                else:
                    failed = False
                    result = result + c
                    state = State.ACCUMULATE_NOT_EMPTY
            elif state == State.SKIP_EMPTY:
                if c != chr(0) and c != '':
                    result = result + c
                    state = State.ACCUMULATE_NOT_EMPTY
            elif state == State.ACCUMULATE_NOT_EMPTY:
                if c != chr(0) and c != '':
                    result = result + c
                else:
                    break
    if failed:
        print("Closest next match was: '%s'" % result)
        sys.exit(1)
    else:
        print(result)
        sys.exit(0)


if __name__ == "__main__":
    main()

This code is a word unscrambler program that uses a technique involving a sparse file and hashing-like computations to map words to unique identifiers (signatures) for efficient lookup.

Here's a breakdown of what the code does:

Configuration and Argument Parsing:
- The config setup defines a command-line interface for the program, allowing users to run it with certain options.
  - --sparsefile: Specifies a file path for storing the computed sparse representations of words.
  - --dictfile: Specifies a file containing the dictionary of valid words to use.
  - --query: Lets the user input a scrambled word to search.
  - --populate_destructively: Rebuilds the sparse file, discarding any existing content.
Sparse File and Letter Signatures:
- A sparse file is a large virtual file used to map the words efficiently.
- Each letter of the alphabet is assigned a random hash-like signature (letter_sigs) using random.getrandbits, ensuring a unique mapping for each letter within a specific bit range.
- Words are processed letter by letter to compute their signature (compute_word_sig) using the formula:
```
sig = (length_multiplier * len(word)) XOR random_letter_signatures
```
  This approach ensures that small changes to letters or length significantly affect the signature, reducing collisions.
Rebuilding the Sparse File (--populate_destructively):
- The read_dict_write_sparsefile() function reads the specified dictfile (a dictionary of valid words), computes the unique signature for each word, and writes it to the correct position in the sparse file.
- Before writing to the sparse file, the code checks to ensure no signature collision/clobbering occurs.
Word Lookup (--query):
- If a query (scrambled word) is provided, the program computes its signature using the same hashing mechanism.
- It seeks the sparse file at the computed signature's position, reads back any stored word(s), and checks if the word matches.
- If no exact match is found, it attempts to find the closest match (based on subsequent sparse entries).
Error Handling and States:
- State is an Enum used to manage and track the state of the lookup process:
  - FIRST_CHAR: Indicates the first character read from a sparse file.
  - SKIP_EMPTY: Skips null bytes (empty regions in the file).
  - ACCUMULATE_NOT_EMPTY: Collects non-null bytes to form a word.
Main Program Operation:
- The main() function initializes the random seed and populates letter signatures.
- Depending on the flags, it either:
  - Rebuilds the sparse file.
  - Or queries and unscrambles the input word using the sparsefile lookup.

High-Level Workflow:

Initialization: Configure random seed and arguments. Assign random letter signatures.
Population (optional): Create or rebuild the sparse file by computing each dictionary word's signature and storing it in the sparsefile.
Query (unscramble): Given a scrambled word, compute its signature, look it up in the sparsefile, and return the unscrambled word or the closest match.

Example Use Case:

Populate a sparse file:
```
python unscramble.py --populate_destructively
```
This processes a dictionary and stores hashed representations into the sparse file.
Unscramble a word:
```
python unscramble.py --query "elppa"
```
If "elppa" is a scrambled form of "apple," it will compute the signature and return "apple."

Key Takeaways:

Sparse File: Efficient storage for a large key-value structure using direct file access.
Signature Computation: A hashing-like mechanism provides each word a unique identifier based on its letters and length.
Primary Functionality: Fast unscrambling of words via precomputed hashes.

Generate your own explanations

Download our vscode extension

Read other generated explanations

Built by @thebuilderjr

This code is a **word unscrambler** program that uses a...

High-Level Workflow:

Example Use Case:

Key Takeaways:

This code is a word unscrambler program that uses a...