
August 26, 2025 at 09:04 AM

int rnn_compute_frame_features(DenoiseState *st, kiss_fft_cpx *X, kiss_fft_cpx *P,
                               float *Ex, float *Ep, float *Exp, float *features,
                               const float *in) {
  int i;
  int NB_BANDS = 32;
  float E = 0;
  float Ly[NB_BANDS];
  float p[WINDOW_SIZE];
  float pitch_buf[PITCH_BUF_SIZE>>1];
  int pitch_index;
  float gain;
  float *(pre[1]);
  float follow, logMax;

  rnn_frame_analysis(st, X, Ex, in);
  RNN_MOVE(st->pitch_buf, &st->pitch_buf[FRAME_SIZE], PITCH_BUF_SIZE-FRAME_SIZE);
  RNN_COPY(&st->pitch_buf[PITCH_BUF_SIZE-FRAME_SIZE], in, FRAME_SIZE);
  pre[0] = &st->pitch_buf[0];
  rnn_pitch_downsample(pre, pitch_buf, PITCH_BUF_SIZE, 1);
  rnn_pitch_search(pitch_buf+(PITCH_MAX_PERIOD>>1), pitch_buf, PITCH_FRAME_SIZE,
                   PITCH_MAX_PERIOD-3*PITCH_MIN_PERIOD, &pitch_index);
  pitch_index = PITCH_MAX_PERIOD-pitch_index;
  gain = rnn_remove_doubling(pitch_buf, PITCH_MAX_PERIOD, PITCH_MIN_PERIOD,
                             PITCH_FRAME_SIZE, &pitch_index, st->last_period, st->last_gain);
  st->last_period = pitch_index;
  st->last_gain = gain;
  for (i=0;i<WINDOW_SIZE;i++)
    p[i] = st->pitch_buf[PITCH_BUF_SIZE-WINDOW_SIZE-pitch_index+i];
  apply_window(p);
  forward_transform(P, p);
  compute_band_energy(Ep, P);
  compute_band_corr(Exp, X, P);
  for (i=0;i<NB_BANDS;i++)
    Exp[i] = Exp[i]/sqrt(.001+Ex[i]*Ep[i]);
  dct(&features[NB_BANDS], Exp);
  features[2*NB_BANDS] = .01*(pitch_index-300);
  logMax = -2;
  follow = -2;
  for (i=0;i<NB_BANDS;i++) {
    Ly[i] = log10(1e-2+Ex[i]);
    Ly[i] = MAX16(logMax-7, MAX16(follow-1.5, Ly[i]));
    logMax = MAX16(logMax, Ly[i]);
    follow = MAX16(follow-1.5, Ly[i]);
    E += Ex[i];
  }
  if (!TRAINING && E < 0.04) {
    /* If there's no audio, avoid messing up the state. */
    RNN_CLEAR(features, NB_FEATURES);
    return 1;
  }
  dct(features, Ly);
  features[0] -= 12;
  features[1] -= 4;
  return TRAINING && E < 0.1;
}

This code appears to compute feature vectors for an audio frame using techniques common in speech enhancement and other audio machine-learning tasks; the rnn_* function names and the DenoiseState type suggest an RNNoise-style noise suppressor. Here's what the code does step by step:

  1. Initialization of Variables: Variables such as E, Ly, p, and pitch_buf hold intermediate results. NB_BANDS is declared locally as 32 here, while WINDOW_SIZE, FRAME_SIZE, and PITCH_BUF_SIZE are compile-time constants that define the frame geometry.

  2. Frame Analysis:

    • rnn_frame_analysis transforms the input frame (presumably via an FFT) into the spectrum X and computes the per-band energies Ex.
  3. Pitch Buffer Update:

    • The pitch buffer (st->pitch_buf) is updated to include the current frame using a sliding window (RNN_MOVE and RNN_COPY).
  4. Pitch Processing:

    • The signal is downsampled using rnn_pitch_downsample to reduce the pitch computation complexity.
    • Pitch search is performed with rnn_pitch_search to estimate the pitch period (the lag at which the signal best matches a delayed copy of itself).
    • Doubling of pitch estimates is corrected by applying rnn_remove_doubling, which refines the pitch period and computes a pitch gain.
  5. Reconstruction of Pitch-Synchronous Signal:

    • A copy of the signal delayed by the pitch period is extracted from the pitch buffer, windowed (apply_window), and transformed (forward_transform) into the frequency domain as the pitch spectrum P.
  6. Per-Band Energy and Correlation Computations:

    • compute_band_energy calculates the energy in different frequency bands for the current frame.
    • compute_band_corr calculates the per-band correlation between the input spectrum (X) and the pitch spectrum (P).
    • The correlations are then normalized as Exp[i] = Exp[i]/sqrt(.001 + Ex[i]*Ep[i]), so identical spectra give values near 1.
  7. Feature Vector Construction:

    • The pitch correlations (Exp) are converted to features using a Discrete Cosine Transform (DCT).
    • The pitch period itself is encoded as a single scaled feature: features[2*NB_BANDS] = .01*(pitch_index-300).
    • Logarithmic spectral energy (Ly) for each frequency band is computed, with smoothing and clipping applied.
    • Total energy is accumulated across bands (E).
  8. Energy Thresholding:

    • In non-training mode, if the total energy (E) falls below 0.04, the features are cleared (RNN_CLEAR) and the function returns 1 immediately, so silent or near-silent frames don't corrupt the model state.
  9. Final Feature Extraction:

    • Another DCT is applied to the log energy values (Ly) to extract features.
    • Biasing adjustments to the first two features (e.g., features[0] -= 12) are applied.
  10. Return Value:

    • The final return value is TRAINING && E < 0.1: in training mode it flags low-energy frames so they can be treated specially; in non-training mode this expression is always 0 (the silent case already returned 1 above).

Summary

This function extracts feature vectors (features) from an audio frame for processing in an RNN (Recurrent Neural Network) or similar model. It includes:

  • Spectral analysis (FFT and energy computation),
  • Pitch estimation and correction,
  • Per-band feature extraction (including pitch correlation and spectral energy),
  • Noise handling (energy thresholds).

These features are often used in applications like noise suppression, speech enhancement, or other audio processing tasks.
