
June 30, 2025 at 06:50 AM

```python
# Imports inferred from usage; the AzureOpenAIEmbeddings and Modal sources
# (langchain_openai, streamlit_modal) are assumptions based on typical setups.
import ast
import os
import re

import numpy as np
import pandas as pd
import streamlit as st
from dateutil import parser
from langchain_openai import AzureOpenAIEmbeddings
from streamlit_modal import Modal


def dataReadTableDiscover(jobname, tablename, lake, conn):
    if tablename.startswith('TEMP_'):
        joinedDF = pd.DataFrame()
        temp_df = pd.DataFrame()
        fe_name = tablename.split('_')[1]
        fe_inputs = ast.literal_eval(conn.get_fe_inputs(jobname, fe_name))
        print('fe_inputs', fe_inputs)
        fe_logic = conn.get_feature_logic_(jobname, fe_name)
        datafiles, dbconnector = conn.get_source(lake)
        table_names = [table_name.split('__')[0] for table_name in fe_inputs]
        print('table_names', table_names)

        file_paths = [datafiles + "/" + table + '.csv' for table in table_names]
        for file in file_paths:
            datadf = pd.read_csv(file, sep=',')
            # Prefix every column with its source table name: table__column
            file_name = os.path.basename(file).split('.')[0]
            rename_dict = {col: file_name + '__' + col for col in datadf.columns}
            datadf.rename(columns=rename_dict, inplace=True)
            joinedDF = pd.concat([joinedDF, datadf], axis=1)
        print('joinedDF', joinedDF)

        fe_logic = fe_logic.replace('self.temp_df', 'temp_df')

        # Created but never used in this function (dead code in the original).
        # The original hard-coded the API key in source; read it from the
        # environment instead (variable name here is illustrative).
        azure_embeddings = AzureOpenAIEmbeddings(
            deployment="aicloud-text-embed-ada-002",
            openai_api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
            azure_endpoint="https://ai-openai-ft.openai.azure.com/",
            chunk_size=1000
        )

        # Convert object values to datetime; pattern covers numeric
        # YYYY-MM-DD / DD-MM-YYYY / MM/DD/YYYY style dates.
        date_regex = re.compile(
            r"""^(
                (\d{1,4}[-/]\d{1,2}[-/]\d{1,4}) |  # e.g. 2024-01-31 or 31/12/2023
                (\d{1,2}/\d{1,2}/\d{4})            # e.g. 01/15/2024 or 1/5/2024
            )$""",
            re.VERBOSE,
        )
        parsed_dates = []
        for value in joinedDF["LAND__date"].astype(str):
            cleaned_value = value.strip()
            if re.match(date_regex, cleaned_value):
                try:
                    parsed_dates.append(parser.parse(cleaned_value, fuzzy=False))
                except Exception:
                    parsed_dates.append(np.nan)
            else:
                parsed_dates.append(np.nan)
        print("commadata", parsed_dates)

        # Save parsed dates back into joinedDF
        joinedDF["LAND__date"] = parsed_dates
        print("commadate", joinedDF['LAND__date'])

        # Caution: executing user-supplied strings is a security risk, and
        # writes into locals() inside a function are not guaranteed to rebind
        # local names, so fe_logic must mutate temp_df in place.
        exec(fe_logic, locals())

        # Render year-like columns as integer strings (e.g. 2024.0 -> '2024')
        for col in temp_df.columns:
            if 'year' in col.lower():
                temp_df[col] = temp_df[col].apply(
                    lambda x: str(int(x)) if pd.notnull(x) else '')
        return temp_df
    else:
        datafiles, dbconnector = conn.get_source(lake)
        file_name = datafiles + "/" + tablename + '.csv'
        print("file path : ", file_name)
        print("delimiter : ", st.session_state['delimiter'])
        datadf = pd.read_csv(file_name, nrows=1000,
                             delimiter=st.session_state['delimiter'])
        return datadf


def openSampleDataDiscover(job_name, cap_button, conn):
    prompt_window = Modal(title="Sample Data - Discover",
                          key="p_window_sample_discover_data",
                          padding=0, max_width=1200)
    if cap_button:
        prompt_window.open()
    if prompt_window.is_open():
        with prompt_window.container():
            df_catlg = conn.get_catalog_list()
            mask = df_catlg['catalogName'] == st.session_state['catalog']
            table_list = df_catlg[mask]['tableName'].tolist()
            table_list.extend(df_catlg[mask]['parent_table'].tolist())
            print("table list in sample data ", table_list)
            table_list = list(set(table_list))
            tablename = st.selectbox(label="", options=table_list,
                                     key='catalogtablelist1')
            df = dataReadTableDiscover(job_name, tablename,
                                       st.session_state['lake'], conn)
            df = df.reset_index(drop=True)
            st.write(df)
            st.write(
                '<style>div.row-widget.stRadio > div'
                '{flex-direction:row;justify-content: space-evenly;}</style>',
                unsafe_allow_html=True)
```

The provided code contains two primary functions: dataReadTableDiscover and openSampleDataDiscover. Here's what each does:


dataReadTableDiscover Function

This function performs the following tasks:

  1. Check for Temporary Table:

    • If the table name starts with 'TEMP_', it processes it as a temporary table. Otherwise, it treats it as a regular table.
  2. Temporary Table Logic (TEMP_):

    • Extracts the feature name from the table name via split('_')[1] (e.g., "TEMP_feature1" yields "feature1"; note this truncates feature names that themselves contain underscores).
    • Retrieves feature inputs (fe_inputs) and feature logic (fe_logic) from the database connection (conn) for the specific job and feature name.
    • Loads a CSV file for each extracted table name, renames columns to include their table name as a prefix (table__column), and concatenates all frames side by side with pd.concat(axis=1) into a single DataFrame (joinedDF). Note this aligns rows by position/index, not by any join key.
    • Processes the "LAND__date" column:
      • Validates each stripped value against a date regex, then parses matches with dateutil's parser.parse(fuzzy=False). Values that fail either the regex or the parse become NaN.
      • Updates the "LAND__date" column in joinedDF with the parsed dates.
  3. Feature Logic Execution:

    • Dynamically executes the feature logic (fe_logic) via exec(fe_logic, locals()). Inside a function, writes into locals() are not guaranteed to rebind local names, so the logic must mutate temp_df in place rather than reassign it.
  4. Post-Processing:

    • Converts any column whose name contains 'year' to integer strings (e.g. 2024.0 becomes '2024'), with empty strings for nulls.
    • Returns temp_df, which contains processed data based on the feature logic.
  5. Regular Table Processing:

    • If the table name doesn't start with 'TEMP_', the CSV is read directly (first 1,000 rows via nrows=1000) using the delimiter stored in st.session_state['delimiter'].

openSampleDataDiscover Function

This function handles displaying and interacting with sampled data:

  1. Triggers a Modal Window:

    • Opens a modal for viewing and selecting sample data tables.
  2. Fetch Catalog Information:

    • Retrieves the list of tables from the catalog of the connected database.
    • Combines the tableName and parent_table fields into a single unique list.
  3. User Interaction:

    • Displays a dropdown (st.selectbox) for users to select a table from the unique list of tables.
  4. Process Selected Table:

    • Passes the selected table to the dataReadTableDiscover function to load and process the data.
    • Displays the processed DataFrame (df) in the modal using Streamlit widgets.
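Steps 2 and 3 above are ordinary pandas filtering plus de-duplication. A minimal sketch with an invented catalog frame; `sorted()` is an addition here, since `list(set(...))` in the original yields an unstable ordering for the dropdown:

```python
import pandas as pd

def catalog_tables(df_catlg, catalog_name):
    """Combine tableName and parent_table for one catalog into a unique, sorted list."""
    mask = df_catlg["catalogName"] == catalog_name
    tables = df_catlg.loc[mask, "tableName"].tolist()
    tables.extend(df_catlg.loc[mask, "parent_table"].tolist())
    # set() removes duplicates but loses order; sort for a stable dropdown
    return sorted(set(tables))

# Invented catalog contents for illustration
df_catlg = pd.DataFrame({
    "catalogName":  ["sales", "sales", "hr"],
    "tableName":    ["orders", "items", "staff"],
    "parent_table": ["orders", "orders", "staff"],
})
print(catalog_tables(df_catlg, "sales"))  # ['items', 'orders']
```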

Additional Notes:

  • Logic Highlight:

    • Dynamically executing feature logic via exec(fe_logic, locals()) can be risky, as it introduces the potential for security vulnerabilities if fe_logic contains malicious code.
  • AzureOpenAIEmbeddings Object:

    • Instantiates an embedding client for an Azure OpenAI endpoint. The object is never used in the function, so it is effectively dead code here; the original also hard-codes the API key in source, which should instead come from environment configuration.
  • Regex for Dates:

    • Acts as a pre-filter matching common numeric date layouts (YYYY-MM-DD, DD/MM/YYYY, MM/DD/YYYY).
    • Values that pass the regex are parsed with parser.parse(..., fuzzy=False), i.e. strict (not fuzzy) parsing; anything unparseable falls back to NaN.
  • CSV File Handling:

    • Reads CSV files dynamically based on input, processes the data, and joins columns with unique prefixes to prevent naming conflicts.
  • Modular Design:

    • The dataReadTableDiscover function is used as a helper inside the openSampleDataDiscover function for data preparation, ensuring separation of logic between data loading and user interaction.
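Beyond the security concern, exec(fe_logic, locals()) has a subtler correctness problem: inside a function, writes to locals() are not guaranteed to rebind local variables, so feature logic that reassigns temp_df can be silently lost. Passing an explicit namespace dict and reading the result back avoids this; the feature-logic string below is a stand-in:

```python
import pandas as pd

def run_feature_logic(fe_logic, joinedDF):
    """Execute feature logic in an explicit namespace instead of locals()."""
    ns = {"pd": pd, "joinedDF": joinedDF, "temp_df": pd.DataFrame()}
    # NOTE: exec on untrusted strings is still dangerous; this only fixes the
    # locals() rebinding pitfall, not the security risk.
    exec(fe_logic, ns)
    return ns["temp_df"]  # picks up reassignment, not just in-place mutation

joined = pd.DataFrame({"LAND__price": [1, 2, 3]})
logic = "temp_df = joinedDF[['LAND__price']] * 10"  # illustrative feature logic
result = run_feature_logic(logic, joined)
print(result["LAND__price"].tolist())  # [10, 20, 30]
```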

What the Code Does Overall:

The code aims to:

  1. Read, process, and prepare data from specified tables (e.g., temporary, regular) in a flexible way.
  2. Dynamically execute user-defined feature logic on the data.
  3. Provide an interactive interface (via Streamlit) for users to select and view sample data from tables in a data catalog.

This is useful in scenarios like data exploration, feature engineering, and creating customizable views for data scientists or analysts within a Streamlit app.
