The provided code defines a function `calculate_summary_stats` that calculates and...

May 18, 2025 at 04:29 AM

def calculate_summary_stats(df):
    numeric_cols = ['Unique Key', 'Incident Zip', 'Latitude', 'Longitude', 'Request_Closing_Time']
    
    print("Numeric columns available for calculations:")
    print(numeric_cols)
    print("\n" + "="*50 + "\n")

    stats_df = pd.DataFrame({
        'Column': numeric_cols,
        'Sum': df[numeric_cols].sum().round(2).values,
        'Mean': df[numeric_cols].mean().round(2).values,
        'Std Dev': df[numeric_cols].std().round(2).values,
        'Skewness': df[numeric_cols].apply(skew).round(2).values,
        'Kurtosis': df[numeric_cols].apply(kurtosis).round(2).values
    })
    
    print("             Summary Statistics:")
    print("="*50)
    print(stats_df.to_string(index=False))
    print("\nNote: Statistics calculated only for numeric columns")


calculate_summary_stats(df_cleaned_data_no_nulls)

The provided code defines a function calculate_summary_stats that calculates and prints summary statistics for specific numeric columns of a Pandas DataFrame. Below is a breakdown of what the code does:

Input Parameter: The function takes one input, df, which is expected to be a Pandas DataFrame containing data.
Specification of Numeric Columns:
- The list numeric_cols defines the columns of the DataFrame to be used for calculations. Specifically: 'Unique Key', 'Incident Zip', 'Latitude', 'Longitude', and 'Request_Closing_Time'.
Print Numeric Columns:
- It prints a title message: "Numeric columns available for calculations:".
- It displays the column names defined in numeric_cols.
- A blank line, a row of 50 = characters, and another blank line are then printed to separate this list from the statistics table.
Compute Summary Statistics:
- A new DataFrame stats_df is created to hold summarized statistics for the specified numeric columns:
  - Column: The name of each column taken from numeric_cols.
  - Sum: The sum of values for each of the numeric columns, rounded to two decimal places.
  - Mean: The average value of each numeric column, rounded to two decimal places.
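  - Std Dev: The sample standard deviation for each column (pandas' std uses ddof=1 by default), rounded to two decimal places.
  - Skewness: The skewness (asymmetry of the data distribution) for each column, computed by applying skew to each column and rounding to two decimal places. The snippet does not show its imports, but this is presumably scipy.stats.skew.
  - Kurtosis: The kurtosis (tailedness of the data distribution) for each column, computed by applying kurtosis (presumably scipy.stats.kurtosis) and rounding to two decimal places. If it is the SciPy function, the default result is excess kurtosis, which is 0 for a normal distribution.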
Print Summary Statistics:
- It prints a header message, "Summary Statistics:", padded with leading spaces so that it sits roughly centered over the separator.
- Another = separator is printed for formatting.
- The resulting DataFrame stats_df is printed without an index using to_string(index=False).
Note for Numeric Columns:
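- At the end, it prints a fixed message stating that the statistics are calculated only for numeric columns. The message is hard-coded: the function never checks column dtypes, so a non-numeric column in numeric_cols would not be filtered out.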
Function Call:
- The function is called at the end using calculate_summary_stats(df_cleaned_data_no_nulls). Here, df_cleaned_data_no_nulls is a DataFrame defined elsewhere in the code; judging by its name, it holds cleaned data with null values already removed.
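
Passing a DataFrame without nulls matters for the last two statistics. pandas reductions such as sum and mean skip NaN by default, whereas scipy.stats.skew and scipy.stats.kurtosis return nan unless nan_policy='omit' is passed. The following is a minimal sketch of that difference, assuming the snippet's skew is SciPy's; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew  # assumed source of `skew` in the snippet

# Made-up sample with one missing value
values = pd.Series([1.0, 2.0, 4.0, np.nan])

# pandas skips NaN by default, so the mean is taken over 1, 2 and 4
mean_skipping_nan = values.mean()

# SciPy's default nan_policy is 'propagate': one NaN makes the result nan
skew_default = float(skew(values.to_numpy()))

# Two ways to ignore the missing value instead
skew_omit = float(skew(values.to_numpy(), nan_policy='omit'))
skew_dropna = float(skew(values.dropna().to_numpy()))

print(mean_skipping_nan, skew_default, skew_omit, skew_dropna)
```

If nulls had been left in a column, the Sum and Mean entries would still look plausible while Skewness and Kurtosis would show nan, which is one reason to clean the data before calling the function.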

Prerequisites:

The code assumes pandas is imported as pd.
It also assumes that skew and kurtosis are already imported, most likely from the scipy.stats module.
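
With those imports in place, the table-building step can be run on its own. The sketch below is a self-contained variant under stated assumptions: the scipy.stats import is a guess at what the original uses, and build_stats_table, sample, and the columns 'a' and 'b' are hypothetical names and made-up data, not part of the original code.

```python
import pandas as pd
from scipy.stats import skew, kurtosis  # assumed; the original does not show its imports


def build_stats_table(df, cols):
    """Build the same table as calculate_summary_stats, but return it instead of printing."""
    return pd.DataFrame({
        'Column': cols,
        'Sum': df[cols].sum().round(2).values,
        'Mean': df[cols].mean().round(2).values,
        'Std Dev': df[cols].std().round(2).values,            # sample std (ddof=1)
        'Skewness': df[cols].apply(skew).round(2).values,     # applied column by column
        'Kurtosis': df[cols].apply(kurtosis).round(2).values, # excess kurtosis by default
    })


# Hypothetical data: 'a' is symmetric, 'b' has one large value on the right
sample = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [10.0, 10.0, 10.0, 50.0],
})

stats = build_stats_table(sample, ['a', 'b'])
print(stats.to_string(index=False))
```

Returning the table rather than printing it is a design choice made here for illustration; it makes the result easier to inspect or reuse than the printed output of the original function.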

Purpose:

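This function provides a quick statistical summary (sum, mean, standard deviation, skewness, kurtosis) for a fixed list of columns in a DataFrame, which makes it useful for a first look at how the numeric data is distributed. Note that 'Unique Key' and 'Incident Zip' look like identifiers rather than measurements, so their sums and means are unlikely to be meaningful even though they can be computed.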
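
When reading the skewness and kurtosis columns, it helps to know which estimator produced them. Assuming the functions come from scipy.stats, they use the biased (population) estimators by default, and kurtosis reports excess kurtosis. pandas' own Series.skew and Series.kurt apply a sample-size bias correction, so the two libraries can disagree noticeably on small samples. The sketch below compares them on made-up data.

```python
import pandas as pd
from scipy.stats import skew, kurtosis  # assumed source of the snippet's functions

# Made-up sample with one large value
data = pd.Series([10.0, 10.0, 10.0, 50.0, 12.0, 11.0])
arr = data.to_numpy()

# Skewness: SciPy default (biased), SciPy bias-corrected, pandas (bias-corrected)
scipy_skew = skew(arr)
scipy_skew_unbiased = skew(arr, bias=False)
pandas_skew = data.skew()

# Kurtosis: SciPy default is excess (Fisher) kurtosis with the biased estimator
scipy_kurt = kurtosis(arr)
scipy_kurt_unbiased = kurtosis(arr, bias=False)
pandas_kurt = data.kurt()

# Pearson's definition adds 3 back, so a normal distribution scores 3 instead of 0
scipy_kurt_pearson = kurtosis(arr, fisher=False)

print("skewness:", scipy_skew, scipy_skew_unbiased, pandas_skew)
print("kurtosis:", scipy_kurt, scipy_kurt_unbiased, pandas_kurt, scipy_kurt_pearson)
```

On a large dataset the difference between the biased and bias-corrected values is usually small, but it is worth keeping in mind when comparing this function's output against df.skew() or df.kurt().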
