
June 30, 2025 at 06:50 AM

```python
# Imports inferred from usage; the AzureOpenAIEmbeddings and Modal sources
# (langchain_openai, streamlit_modal) are assumptions based on typical setups.
import ast
import os
import re

import numpy as np
import pandas as pd
import streamlit as st
from dateutil import parser
from langchain_openai import AzureOpenAIEmbeddings
from streamlit_modal import Modal


def dataReadTableDiscover(jobname, tablename, lake, conn):
    if tablename.startswith('TEMP_'):
        joinedDF = pd.DataFrame()
        temp_df = pd.DataFrame()
        fe_name = tablename.split('_')[1]
        fe_inputs = ast.literal_eval(conn.get_fe_inputs(jobname, fe_name))
        print('fe_inputs', fe_inputs)
        fe_logic = conn.get_feature_logic_(jobname, fe_name)
        datafiles, dbconnector = conn.get_source(lake)
        table_names = [table_name.split('__')[0] for table_name in fe_inputs]
        print('table_names', table_names)

        file_paths = [datafiles + "/" + table + '.csv' for table in table_names]
        for file in file_paths:
            datadf = pd.read_csv(file, sep=',')
            # Prefix every column with its source table name: table__column
            file_name = os.path.basename(file).split('.')[0]
            rename_dict = {col: file_name + '__' + col for col in datadf.columns}
            datadf.rename(columns=rename_dict, inplace=True)
            joinedDF = pd.concat([joinedDF, datadf], axis=1)
        print('joinedDF', joinedDF)

        fe_logic = fe_logic.replace('self.temp_df', 'temp_df')

        # Created but never used in this function (dead code in the original).
        # The original hard-coded the API key in source; read it from the
        # environment instead (variable name here is illustrative).
        azure_embeddings = AzureOpenAIEmbeddings(
            deployment="aicloud-text-embed-ada-002",
            openai_api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
            azure_endpoint="https://ai-openai-ft.openai.azure.com/",
            chunk_size=1000
        )

        # Convert object values to datetime; pattern covers numeric
        # YYYY-MM-DD / DD-MM-YYYY / MM/DD/YYYY style dates.
        date_regex = re.compile(
            r"""^(
                (\d{1,4}[-/]\d{1,2}[-/]\d{1,4}) |  # e.g. 2024-01-31 or 31/12/2023
                (\d{1,2}/\d{1,2}/\d{4})            # e.g. 01/15/2024 or 1/5/2024
            )$""",
            re.VERBOSE,
        )
        parsed_dates = []
        for value in joinedDF["LAND__date"].astype(str):
            cleaned_value = value.strip()
            if re.match(date_regex, cleaned_value):
                try:
                    parsed_dates.append(parser.parse(cleaned_value, fuzzy=False))
                except Exception:
                    parsed_dates.append(np.nan)
            else:
                parsed_dates.append(np.nan)
        print("commadata", parsed_dates)

        # Save parsed dates back into joinedDF
        joinedDF["LAND__date"] = parsed_dates
        print("commadate", joinedDF['LAND__date'])

        # Caution: executing user-supplied strings is a security risk, and
        # writes into locals() inside a function are not guaranteed to rebind
        # local names, so fe_logic must mutate temp_df in place.
        exec(fe_logic, locals())

        # Render year-like columns as integer strings (e.g. 2024.0 -> '2024')
        for col in temp_df.columns:
            if 'year' in col.lower():
                temp_df[col] = temp_df[col].apply(
                    lambda x: str(int(x)) if pd.notnull(x) else '')
        return temp_df
    else:
        datafiles, dbconnector = conn.get_source(lake)
        file_name = datafiles + "/" + tablename + '.csv'
        print("file path : ", file_name)
        print("delimiter : ", st.session_state['delimiter'])
        datadf = pd.read_csv(file_name, nrows=1000,
                             delimiter=st.session_state['delimiter'])
        return datadf


def openSampleDataDiscover(job_name, cap_button, conn):
    prompt_window = Modal(title="Sample Data - Discover",
                          key="p_window_sample_discover_data",
                          padding=0, max_width=1200)
    if cap_button:
        prompt_window.open()
    if prompt_window.is_open():
        with prompt_window.container():
            df_catlg = conn.get_catalog_list()
            mask = df_catlg['catalogName'] == st.session_state['catalog']
            table_list = df_catlg[mask]['tableName'].tolist()
            table_list.extend(df_catlg[mask]['parent_table'].tolist())
            print("table list in sample data ", table_list)
            table_list = list(set(table_list))
            tablename = st.selectbox(label="", options=table_list,
                                     key='catalogtablelist1')
            df = dataReadTableDiscover(job_name, tablename,
                                       st.session_state['lake'], conn)
            df = df.reset_index(drop=True)
            st.write(df)
            st.write(
                '<style>div.row-widget.stRadio > div'
                '{flex-direction:row;justify-content: space-evenly;}</style>',
                unsafe_allow_html=True)
```

The provided code contains two primary functions: dataReadTableDiscover and openSampleDataDiscover. Here's what each does:


dataReadTableDiscover Function

This function performs the following tasks:

  1. Check for Temporary Table:

    • If the table name starts with 'TEMP_', it processes it as a temporary table. Otherwise, it treats it as a regular table.
  2. Temporary Table Logic (TEMP_):

    • Extracts the feature name from the table name via split('_')[1] (e.g., "TEMP_feature1" yields "feature1"; note this truncates feature names that themselves contain underscores).
    • Retrieves feature inputs (fe_inputs) and feature logic (fe_logic) from the database connection (conn) for the specific job and feature name.
    • Loads a CSV file for each extracted table name, renames columns to include their table name as a prefix (table__column), and concatenates all frames side by side with pd.concat(axis=1) into a single DataFrame (joinedDF). Note this aligns rows by position/index, not by any join key.
    • Processes the "LAND__date" column:
      • Validates each stripped value against a date regex, then parses matches with dateutil's parser.parse(fuzzy=False). Values that fail either the regex or the parse become NaN.
      • Updates the "LAND__date" column in joinedDF with the parsed dates.
  3. Feature Logic Execution:

    • Dynamically executes the feature logic (fe_logic) via exec(fe_logic, locals()). Inside a function, writes into locals() are not guaranteed to rebind local names, so the logic must mutate temp_df in place rather than reassign it.
  4. Post-Processing:

    • Converts any column whose name contains 'year' to integer strings (e.g. 2024.0 becomes '2024'), with empty strings for nulls.
    • Returns temp_df, which contains processed data based on the feature logic.
  5. Regular Table Processing:

    • If the table name doesn't start with 'TEMP_', the CSV is read directly (first 1,000 rows via nrows=1000) using the delimiter stored in st.session_state['delimiter'].

openSampleDataDiscover Function

This function handles displaying and interacting with sampled data:

  1. Triggers a Modal Window:

    • Opens a modal for viewing and selecting sample data tables.
  2. Fetch Catalog Information:

    • Retrieves the list of tables from the catalog of the connected database.
    • Combines the tableName and parent_table fields into a single unique list.
  3. User Interaction:

    • Displays a dropdown (st.selectbox) for users to select a table from the unique list of tables.
  4. Process Selected Table:

    • Passes the selected table to the dataReadTableDiscover function to load and process the data.
    • Displays the processed DataFrame (df) in the modal using Streamlit widgets.
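Steps 2 and 3 above are ordinary pandas filtering plus de-duplication. A minimal sketch with an invented catalog frame; `sorted()` is an addition here, since `list(set(...))` in the original yields an unstable ordering for the dropdown:

```python
import pandas as pd

def catalog_tables(df_catlg, catalog_name):
    """Combine tableName and parent_table for one catalog into a unique, sorted list."""
    mask = df_catlg["catalogName"] == catalog_name
    tables = df_catlg.loc[mask, "tableName"].tolist()
    tables.extend(df_catlg.loc[mask, "parent_table"].tolist())
    # set() removes duplicates but loses order; sort for a stable dropdown
    return sorted(set(tables))

# Invented catalog contents for illustration
df_catlg = pd.DataFrame({
    "catalogName":  ["sales", "sales", "hr"],
    "tableName":    ["orders", "items", "staff"],
    "parent_table": ["orders", "orders", "staff"],
})
print(catalog_tables(df_catlg, "sales"))  # ['items', 'orders']
```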

Additional Notes:

  • Logic Highlight:

    • Dynamically executing feature logic via exec(fe_logic, locals()) can be risky, as it introduces the potential for security vulnerabilities if fe_logic contains malicious code.
  • AzureOpenAIEmbeddings Object:

    • Instantiates an embedding client for an Azure OpenAI endpoint. The object is never used in the function, so it is effectively dead code here; the original also hard-codes the API key in source, which should instead come from environment configuration.
  • Regex for Dates:

    • Acts as a pre-filter matching common numeric date layouts (YYYY-MM-DD, DD/MM/YYYY, MM/DD/YYYY).
    • Values that pass the regex are parsed with parser.parse(..., fuzzy=False), i.e. strict (not fuzzy) parsing; anything unparseable falls back to NaN.
  • CSV File Handling:

    • Reads CSV files dynamically based on input, processes the data, and joins columns with unique prefixes to prevent naming conflicts.
  • Modular Design:

    • The dataReadTableDiscover function is used as a helper inside the openSampleDataDiscover function for data preparation, ensuring separation of logic between data loading and user interaction.
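Beyond the security concern, exec(fe_logic, locals()) has a subtler correctness problem: inside a function, writes to locals() are not guaranteed to rebind local variables, so feature logic that reassigns temp_df can be silently lost. Passing an explicit namespace dict and reading the result back avoids this; the feature-logic string below is a stand-in:

```python
import pandas as pd

def run_feature_logic(fe_logic, joinedDF):
    """Execute feature logic in an explicit namespace instead of locals()."""
    ns = {"pd": pd, "joinedDF": joinedDF, "temp_df": pd.DataFrame()}
    # NOTE: exec on untrusted strings is still dangerous; this only fixes the
    # locals() rebinding pitfall, not the security risk.
    exec(fe_logic, ns)
    return ns["temp_df"]  # picks up reassignment, not just in-place mutation

joined = pd.DataFrame({"LAND__price": [1, 2, 3]})
logic = "temp_df = joinedDF[['LAND__price']] * 10"  # illustrative feature logic
result = run_feature_logic(logic, joined)
print(result["LAND__price"].tolist())  # [10, 20, 30]
```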

What the Code Does Overall:

The code aims to:

  1. Read, process, and prepare data from specified tables (e.g., temporary, regular) in a flexible way.
  2. Dynamically execute user-defined feature logic on the data.
  3. Provide an interactive interface (via Streamlit) for users to select and view sample data from tables in a data catalog.

This is useful in scenarios like data exploration, feature engineering, and creating customizable views for data scientists or analysts within a Streamlit app.
