I wish I had a function that would inspect a DataFrame with duplicate values and yield, for each group of rows that share a duplicate value, a subset of the input DataFrame featuring only the columns that vary.
Feature Description
````python
from typing import Union

import pandas as pd


class ExtendedDF(pd.DataFrame):

    @property
    def _constructor(self):
        return ExtendedDF

    def inspect_duplicates(self, key_col: str) -> Union[pd.DataFrame, 'ExtendedDF']:
        """Inspect a DataFrame that has duplicate values in the `key_col`
        column and yield, for each group of rows that share the same
        `key_col` value, a subset of the input DataFrame featuring only
        the columns that vary.

        Args:
            key_col (str): name of the column with duplicate values

        Yields:
            pd.DataFrame: for each group of rows that share the same
                `key_col` value, a subset of the input DataFrame
                featuring only the columns that vary.

        Examples:
            Consider a dataset containing ramen ratings with duplicates:

            ```py
            df = pd.DataFrame({
                'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
                'style': ['cup', 'pack', 'cup', 'cup', 'pack'],
                'rating': [4, 3.5, 4, 15, 5],
                'col_that_doesnt_change': ['so yummy' for _ in range(5)],
                'another_col_that_doesnt_change': ['mmm love it' for _ in range(5)],
            })
            df = ExtendedDF(df)
            ```
            ```
            df
                 brand style  rating col_that_doesnt_change another_col_that_doesnt_change
            0  Yum Yum   cup     4.0               so yummy                    mmm love it
            1  Yum Yum  pack     3.5               so yummy                    mmm love it
            2  Indomie   cup     4.0               so yummy                    mmm love it
            3  Indomie   cup    15.0               so yummy                    mmm love it
            4  Indomie  pack     5.0               so yummy                    mmm love it
            ```

            Inspect the duplicates using the 'brand' column as the key:

            ```py
            print(*df.inspect_duplicates('brand'))
            ```
            ```
                 brand  rating
            2  Indomie     4.0
            3  Indomie    15.0
            4  Indomie     5.0
                 brand style  rating
            0  Yum Yum   cup     4.0
            1  Yum Yum  pack     3.5
            ```

            Inspect the duplicates using the 'style' column as the key:

            ```py
            print(*df.inspect_duplicates('style'))
            ```
            ```
              style    brand
            0   cup  Yum Yum
            2   cup  Indomie
            3   cup  Indomie
              style    brand  rating
            1  pack  Yum Yum     3.5
            4  pack  Indomie     5.0
            ```

            Inspect the duplicates using the 'rating' column as the key:

            ```py
            print(*df.inspect_duplicates('rating'))
            ```
            ```
               rating    brand
            0     4.0  Yum Yum
            2     4.0  Indomie
            ```

            You can also concatenate everything that is yielded into a
            single DataFrame:

            ```py
            print(pd.concat([*df.inspect_duplicates('brand')]))
            ```
            ```
                 brand style  rating
            0  Yum Yum   cup     4.0
            1  Yum Yum  pack     3.5
            2  Indomie   NaN     4.0
            3  Indomie   NaN    15.0
            4  Indomie   NaN     5.0
            ```
        """
        # mask selecting every row whose key value occurs more than once
        mark_all_dupl_mask = self.duplicated(key_col, keep=False)
        df_dupl = self.loc[mark_all_dupl_mask]
        for k in set(df_dupl[key_col].values):
            sub_df = self.loc[self[key_col] == k]
            # columns where the first two rows of the group differ
            mask_eq = sub_df.iloc[0] != sub_df.iloc[1]
            diff_cols = mask_eq.loc[mask_eq].index.values
            yield sub_df.loc[:, [key_col] + list(diff_cols)]
````
This doesn't look like a general enough use case to add it to the pandas API. As a result, I'm -1 on this.
You can achieve something like this using .groupby().apply() and a custom lambda function.
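A minimal sketch of that alternative, using the ramen example from the docstring above. The `varying_columns` helper name is mine, and unlike the proposed implementation it checks every row in each group for variation rather than only the first two, so it may report more varying columns:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'pack', 'cup', 'cup', 'pack'],
    'rating': [4, 3.5, 4, 15, 5],
})

def varying_columns(group, key_col):
    # keep the key column plus every column with more than one distinct value
    keep = [c for c in group.columns
            if c == key_col or group[c].nunique(dropna=False) > 1]
    return group[keep]

# restrict to rows whose key value is duplicated, then group by the key
duplicated = df[df.duplicated('brand', keep=False)]
subsets = [varying_columns(g, 'brand') for _, g in duplicated.groupby('brand')]

for s in subsets:
    print(s, end='\n\n')
```

A list comprehension over `groupby` is used here for clarity; passing the helper to `.groupby(...).apply(...)` instead would concatenate the per-group results into a single DataFrame (with `NaN` in columns absent from a group), much like the `pd.concat` example in the docstring.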
Feature Type

- [x] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Alternative Solutions
None
Additional Context