This code performs a bioinformatics sequence alignment and statistical analysis...

March 22, 2025 at 06:14 PM

import scipy
from Bio import SeqIO
from Bio.Align import PairwiseAligner
from scipy import stats


dog_breeds_fa = r"C:\Users\jadec\OneDrive\biocomputing\Coursework Project\project_dog_dna\dog_breeds.fa"
mystery_fa = r"C:\Users\jadec\OneDrive\biocomputing\Coursework Project\project_dog_dna\mystery.fa"

def read_mystery_seq(mystery_fa):
    mystery_sequence = SeqIO.read(mystery_fa, "fasta") 
    return str(mystery_sequence.seq)  # Convert to string

def read_dog_breeds(dog_breeds_fa):
    dog_breeds = {}
    for record in SeqIO.parse(dog_breeds_fa, "fasta"):
        description = record.description
        dog_breeds[description] = str(record.seq)
    return dog_breeds

def find_most_similar_breed(mystery_sequence, dog_breeds):
    aligner = PairwiseAligner()
    aligner.mode = "global"
    aligner.match_score = 1
    aligner.mismatch_score = 0
    aligner.open_gap_score = -1
    aligner.extend_gap_score = -1

    top_score = float('-inf')
    most_similar_description = None
    most_similar_sequence = None
    highest_alignment = None
    similarity_scores = []  # Store scores for statistical comparison

    for description, breed_sequence in dog_breeds.items():
        alignments = aligner.align(mystery_sequence, breed_sequence)
        score = alignments[0].score
        similarity_scores.append(score)  # Store for statistical comparison

        if score > top_score:
            top_score = score
            most_similar_description = description
            most_similar_sequence = breed_sequence
            highest_alignment = alignments[0]

    if highest_alignment:
        aligned_seq1 = highest_alignment[0]
        aligned_seq2 = highest_alignment[1]

    print("\nDifferences between 'mystery sequence' and most similar sequence match found:")
    for i, (char1, char2) in enumerate(zip(aligned_seq1, aligned_seq2)):
        if char1 != char2:
            if char1 == "-":
                print(f"Position {i+1}: gap in mystery sequence -> {char2}")
            elif char2 == "-":
                print(f"Position {i+1}: {char1} -> gap in breed sequence")
            else:
                print(f"Position {i+1}: {char1} -> {char2}")

    # **P-value Calculation across the database**
    p_value = stats.ttest_1samp(similarity_scores, top_score).pvalue  # Compare top score to full distribution

    print(f"\nThe closest sequence is: {most_similar_description}")
    print(f"Similarity Score: {top_score}")
    print(f"P-value: {p_value}")

    return most_similar_description, most_similar_sequence, p_value

# Run the analysis
find_most_similar_breed(read_mystery_seq(mystery_fa), read_dog_breeds(dog_breeds_fa))

This code performs a bioinformatics sequence alignment and statistical analysis to identify the most similar DNA sequence in a database of dog breeds to a mystery DNA sequence. Here is a detailed explanation of what the code does:

Imports Modules:
- scipy.stats: Provides statistical functions.
- Bio.SeqIO: Used for reading and parsing sequence data from FASTA files.
- Bio.Align.PairwiseAligner: Provides pairwise sequence alignment functionality.
Reads Input Files:
- dog_breeds.fa: A FASTA file containing DNA sequences from various dog breeds, each with a description.
- mystery.fa: A FASTA file containing a single DNA sequence that is unidentified (the "mystery sequence").
read_mystery_seq Function:
- Reads and parses the mystery sequence from the mystery_fa file (in FASTA format).
- Extracts and returns the sequence as a string.
read_dog_breeds Function:
- Parses the dog_breeds_fa file to read all dog breed sequences and their descriptions.
- Returns a dictionary where the key is the breed description and the value is the DNA sequence as a string.
find_most_similar_breed Function:
- Initialization: Sets up the PairwiseAligner for global sequence alignment with specific scoring parameters:
  - Match score: +1
  - Mismatch score: 0
  - Gap open and extension penalties: -1
- Similarity Search:
  - Iterates through all dog breed sequences.
  - Aligns each breed's sequence with the mystery sequence using pairwise alignment.
  - Records the alignment score for each comparison.
  - Tracks the sequence with the highest alignment score (most similar) and its corresponding description.
- Alignment Differences:
  - Compares the aligned sequences to highlight differences (substitutions, insertions, and deletions).
  - Prints the positions and types of differences.
- Statistical Analysis:
  - Performs a one-sample t-test (ttest_1samp) on the alignment scores with the top score as the sample.
  - Computes a p-value to assess whether the similarity of the top match is statistically significant relative to other scores.
- Output:
  - Prints the most similar dog's description, the similarity score, and the p-value.
  - Returns the breed description, sequence of the most similar breed, and the p-value.
Run the Analysis:
- Reads the mystery sequence and dog breed sequences.
- Calls find_most_similar_breed() to perform the comparison and find the closest matching breed.

Key Results:

Identifies the dog breed with the most similar DNA sequence to the mystery DNA.
Highlights specific differences between the sequences.
Outputs a statistical p-value indicating the significance of the match.

Example Use Case:

This script could be used in a bioinformatics context where the DNA of an unknown dog is analyzed to identify the breed by comparing its sequence to a database of known breeds. The statistical analysis helps quantify the robustness of the match.

Generate your own explanations

Download our vscode extension

Read other generated explanations

Built by @thebuilderjr