The provided Python function `dataReadTableDiscover(jobname, tablename, lake, conn)` performs a...

June 30, 2025 at 08:58 AM

def dataReadTableDiscover(jobname, tablename, lake, conn):
    if tablename.startswith('TEMP_'):
        joinedDF = pd.DataFrame()
        temp_df = pd.DataFrame()
        fe_name = tablename.split('_')[1]
        fe_inputs = ast.literal_eval(conn.get_fe_inputs(jobname, fe_name))
        print('fe_inputs', fe_inputs)
        fe_logic = conn.get_feature_logic_(jobname, fe_name)
        datafiles, dbconnector = conn.get_source(lake)
        table_names = [table_name.split('__')[0] for table_name in fe_inputs]
        print('table_names', table_names)

        # st.write(table_names)
        file_paths = []
        for table in table_names:
            file_path = datafiles + "/" + table + '.csv'
            file_paths.append(file_path)
        for file in file_paths:
            datadf = pd.read_csv(file, sep=',')
            rename_dict = {}
            for each_column in datadf.columns:
                file_name = os.path.basename(file).split('.')[0]
                rename_dict[each_column] = file_name + '__' + each_column
            datadf.rename(columns=rename_dict, inplace=True)
            joinedDF = pd.concat([joinedDF, datadf], axis=1)
        print('joinedDF', joinedDF)
        fe_logic = fe_logic.replace('self.temp_df', 'temp_df')
        azure_embeddings = AzureOpenAIEmbeddings(
            deployment="aicloud-text-embed-ada-002",
            openai_api_key="be67446ab5c846d28461736d6c4928c2",
            azure_endpoint="https://ai-openai-ft.openai.azure.com/",
            chunk_size=1000
        )

        # Logic for converting object to datetime
        # Updated regex pattern to include MM/DD/YYYY
        date_regex = re.compile(
            r"""^(
                (\d{1,4}[-/]\d{1,2}[-/]\d{1,4}) |             # e.g. 2024-01-31 or 31/12/2023
                (\d{1,2}/\d{1,2}/\d{4})                       # e.g. 01/15/2024 or 1/5/2024
            )$""",
            re.VERBOSE
        )
        parsed_dates = []
        # for value in joinedDF["LAND__date"].astype(str):
        columns = joinedDF.columns
        print(f"Joined Df column: {columns}")
        for index, column in enumerate(columns):
            for value in joinedDF[column].astype(str):
                cleaned_value = value.strip()
                if re.match(date_regex, cleaned_value):
                    print("inloop")
                    try:
                        parsed_date = parser.parse(cleaned_value, fuzzy=False)
                        parsed_dates.append(parsed_date)
                        print(f"Date column{index} ::::{column}")
                    except Exception as e:
                        print(f"Error: {e}")
                        # parsed_dates.append(np.nan)
                else:
                    parsed_dates.append(np.nan)

            print("commadata", parsed_dates)

            # Save parsed dates in both date_df and datadf
            joinedDF[column] = parsed_dates
            print("commadate", joinedDF[column])   #correction required for land__date
        # temp_df['year'] = joinedDF['LAND__date'].apply(lambda x: str(x).split("/")[-1])
        # print(f"Temporary DataFrame: \n{temp_df}")
        exec(fe_logic, locals())
        for col in temp_df.columns:
            if 'year' in col.lower():
                temp_df[col] = temp_df[col].apply(lambda x: str(int(x)) if pd.notnull(x) else '')
        return temp_df
        # st.write(temp_df)
        # st.write(file_paths)
        # st.write(fe_name)
        # st.write(fe_inputs)
        # st.code(fe_logic)
    else:
        datafiles, dbconnector = conn.get_source(lake)
        file_name = datafiles + "/" + tablename + '.csv'
        print("file path : ", file_name)
        print("file path : ", st.session_state['delimiter'])
        datadf = pd.read_csv(file_name, nrows=1000, delimiter=st.session_state['delimiter'])
        return datadf

The provided Python function dataReadTableDiscover(jobname, tablename, lake, conn) performs a variety of operations primarily focused on data extraction, processing, and transformation. Here's what the code does step-by-step:

Purpose:

The function appears to read tabular data from storage (e.g., a cloud lake or CSV files), process it based on business rules, and optionally transform the data for further use. The exact behavior differs based on whether the provided tablename starts with "TEMP_" or not.

Steps and Behavior:

1. Check if the `tablename` starts with "TEMP_":

If tablename starts with "TEMP_", the function proceeds with more complex logic to read multiple files and apply feature engineering or custom rules.
Otherwise, it performs a simpler operation to read a single file.

IF `tablename` starts with "TEMP_":

(a) Initialization:

Initializes empty pandas DataFrames (joinedDF and temp_df).
Extracts a feature engineering (FE) name from the table name (fe_name is the substring after "TEMP_").
Retrieves fe_inputs (input tables/features for processing) and fe_logic (custom processing logic) using methods from the provided conn object.
Determines source (e.g., file paths and connections) from the lake using conn.get_source(lake).

(b) Read Multiple Tables:

Extracts the input table names from fe_inputs and constructs corresponding file paths for their .csv files.
Reads each CSV file into a DataFrame and renames all columns to include the file name as a prefix (e.g., column becomes filename__column).

(c) Combine Multiple Tables:

Concatenates all the processed DataFrames along their columns (forming a "wide" table).

(d) Date Parsing:

Uses a regex (date_regex) to detect date-like strings (e.g., YYYY-MM-DD or MM/DD/YYYY) in the combined DataFrame.
Attempts to parse these dates into proper datetime objects and replaces invalid or unmatched date strings with NaN.

(e) Apply Feature Logic:

Executes custom transformations specified by fe_logic using the exec() function. This seems to modify temp_df dynamically based on the logic provided.

(f) Final Processing:

Ensures any "year" columns in temp_df contain string representations of integers (rather than floating-point or NaN values).
Returns the transformed temp_df for downstream use.

ELSE (when `tablename` does not start with "TEMP_"):

(a) Read a Single File:

Retrieves the file path for the specified tablename and reads its contents from CSV format.
Reads only the first 1000 rows for preview purposes, using a delimiter specified in st.session_state.

(b) Return Data:

Returns the DataFrame containing the sampled rows.

Key Operations Highlighted:

Dynamic File Loading:
- Reads either a single file or multiple files, depending on the tablename prefix.
Feature Engineering Setup:
- Dynamically processes input tables/features via custom logic (fe_inputs) and rules (fe_logic) retrieved from conn.
Column Renaming and Concatenation:
- Renames columns for clarity (prefixing with the filename) and combines input DataFrames into a single DataFrame.
Date Parsing:
- Extracts and parses valid date strings from the data using regex and the dateutil.parser module.
Custom Logic Execution:
- Executes external feature logic dynamically via exec(), which makes the code flexible but potentially unsafe.

Potential Issues & Dangers:

Security Risk (Dynamic Execution):
- Using exec() to execute untrusted fe_logic could lead to security vulnerabilities.
Hardcoded API Keys:
- The code includes hardcoded API keys for Azure OpenAI Embeddings, which is a serious security concern.
Global Variable Manipulation:
- The code heavily relies on globals and dynamically evaluates expressions, which complicates traceability and debugging.
DataFrame Processing Logic:
- The code does not reset parsed_dates after processing each column, which can lead to incorrect results during date parsing.

Output:

If tablename starts with "TEMP_", the output is temp_df, a transformed DataFrame based on custom feature engineering logic and processed input data.
Otherwise, the output is datadf, a DataFrame containing the first 1000 rows of the specified table.

This code is a dynamic data processing utility with custom logic execution capabilities, suitable for exploratory data analysis or feature engineering in machine learning pipelines.

Generate your own explanations

Download our vscode extension

Read other generated explanations

Built by @thebuilderjr