This code performs a bioinformatics sequence alignment and statistical analysis...

March 22, 2025 at 06:14 PM

import scipy from Bio import SeqIO from Bio.Align import PairwiseAligner from scipy import stats dog_breeds_fa = r"C:\Users\jadec\OneDrive\biocomputing\Coursework Project\project_dog_dna\dog_breeds.fa" mystery_fa = r"C:\Users\jadec\OneDrive\biocomputing\Coursework Project\project_dog_dna\mystery.fa" def read_mystery_seq(mystery_fa): mystery_sequence = SeqIO.read(mystery_fa, "fasta") return str(mystery_sequence.seq) # Convert to string def read_dog_breeds(dog_breeds_fa): dog_breeds = {} for record in SeqIO.parse(dog_breeds_fa, "fasta"): description = record.description dog_breeds[description] = str(record.seq) return dog_breeds def find_most_similar_breed(mystery_sequence, dog_breeds): aligner = PairwiseAligner() aligner.mode = "global" aligner.match_score = 1 aligner.mismatch_score = 0 aligner.open_gap_score = -1 aligner.extend_gap_score = -1 top_score = float('-inf') most_similar_description = None most_similar_sequence = None highest_alignment = None similarity_scores = [] # Store scores for statistical comparison for description, breed_sequence in dog_breeds.items(): alignments = aligner.align(mystery_sequence, breed_sequence) score = alignments[0].score similarity_scores.append(score) # Store for statistical comparison if score > top_score: top_score = score most_similar_description = description most_similar_sequence = breed_sequence highest_alignment = alignments[0] if highest_alignment: aligned_seq1 = highest_alignment[0] aligned_seq2 = highest_alignment[1] print("\nDifferences between 'mystery sequence' and most similar sequence match found:") for i, (char1, char2) in enumerate(zip(aligned_seq1, aligned_seq2)): if char1 != char2: if char1 == "-": print(f"Position {i+1}: gap in mystery sequence -> {char2}") elif char2 == "-": print(f"Position {i+1}: {char1} -> gap in breed sequence") else: print(f"Position {i+1}: {char1} -> {char2}") # **P-value Calculation across the database** p_value = stats.ttest_1samp(similarity_scores, top_score).pvalue # Compare top score to full distribution print(f"\nThe closest sequence is: {most_similar_description}") print(f"Similarity Score: {top_score}") print(f"P-value: {p_value}") return most_similar_description, most_similar_sequence, p_value # Run the analysis find_most_similar_breed(read_mystery_seq(mystery_fa), read_dog_breeds(dog_breeds_fa))

This code performs a bioinformatics sequence alignment and statistical analysis to identify the most similar DNA sequence in a database of dog breeds to a mystery DNA sequence. Here is a detailed explanation of what the code does:

  1. Imports Modules:

    • scipy.stats: Provides statistical functions.
    • Bio.SeqIO: Used for reading and parsing sequence data from FASTA files.
    • Bio.Align.PairwiseAligner: Provides pairwise sequence alignment functionality.
  2. Reads Input Files:

    • dog_breeds.fa: A FASTA file containing DNA sequences from various dog breeds, each with a description.
    • mystery.fa: A FASTA file containing a single DNA sequence that is unidentified (the "mystery sequence").
  3. read_mystery_seq Function:

    • Reads and parses the mystery sequence from the mystery_fa file (in FASTA format).
    • Extracts and returns the sequence as a string.
  4. read_dog_breeds Function:

    • Parses the dog_breeds_fa file to read all dog breed sequences and their descriptions.
    • Returns a dictionary where the key is the breed description and the value is the DNA sequence as a string.
  5. find_most_similar_breed Function:

    • Initialization: Sets up the PairwiseAligner for global sequence alignment with specific scoring parameters:
      • Match score: +1
      • Mismatch score: 0
      • Gap open and extension penalties: -1
    • Similarity Search:
      • Iterates through all dog breed sequences.
      • Aligns each breed's sequence with the mystery sequence using pairwise alignment.
      • Records the alignment score for each comparison.
      • Tracks the sequence with the highest alignment score (most similar) and its corresponding description.
    • Alignment Differences:
      • Compares the aligned sequences to highlight differences (substitutions, insertions, and deletions).
      • Prints the positions and types of differences.
    • Statistical Analysis:
      • Performs a one-sample t-test (ttest_1samp) on the alignment scores with the top score as the sample.
      • Computes a p-value to assess whether the similarity of the top match is statistically significant relative to other scores.
    • Output:
      • Prints the most similar dog's description, the similarity score, and the p-value.
      • Returns the breed description, sequence of the most similar breed, and the p-value.
  6. Run the Analysis:

    • Reads the mystery sequence and dog breed sequences.
    • Calls find_most_similar_breed() to perform the comparison and find the closest matching breed.

Key Results:

  • Identifies the dog breed with the most similar DNA sequence to the mystery DNA.
  • Highlights specific differences between the sequences.
  • Outputs a statistical p-value indicating the significance of the match.

Example Use Case:

This script could be used in a bioinformatics context where the DNA of an unknown dog is analyzed to identify the breed by comparing its sequence to a database of known breeds. The statistical analysis helps quantify the robustness of the match.

Generate your own explanations
Download our vscode extension
Read other generated explanations

Built by @thebuilderjr
Sponsored by beam analytics
Read our terms and privacy policy
Forked from openai-quickstart-node