ENH: allow rolling with non-numerical (eg string) data #23002

Open
randomgambit opened this issue Oct 5, 2018 · 20 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Enhancement, Window (rolling, ewma, expanding)

Comments

@randomgambit

Hi the Pandas dream team.

I think it would be nice if rolling could accept strings as well (see https://stackoverflow.com/questions/52657429/rolling-with-string-variables)

With the abundance of textual data nowadays, we want Pandas to stay ahead of the curve!

import pandas as pd
import numpy as np

df = pd.DataFrame({'mytime': [pd.to_datetime('2018-01-01 14:34:12.340'),
                              pd.to_datetime('2018-01-01 14:34:13.0'),
                              pd.to_datetime('2018-01-01 14:34:15.342'),
                              pd.to_datetime('2018-01-01 14:34:16.42'),
                              pd.to_datetime('2018-01-01 14:34:28.742')],
                   'myvalue': [1, 2, np.NaN, 3, 1],
                   'mychart': ['a', 'b', 'c', 'd', 'e']})

df.set_index('mytime', inplace = True)

df
Out[15]: 
                        mychart  myvalue
mytime                                  
2018-01-01 14:34:12.340       a      1.0
2018-01-01 14:34:13.000       b      2.0
2018-01-01 14:34:15.342       c      NaN
2018-01-01 14:34:16.420       d      3.0
2018-01-01 14:34:28.742       e      1.0

Here I want to concatenate the strings in mychart using the values in the last 2 seconds (not the last two observations).

Unfortunately, both attempts below fail miserably


df.mychart.rolling(window = '2s', closed = 'right').apply(lambda x: ' '.join(x), raw = False)
df.mychart.rolling(window = '2s', closed = 'right').apply(lambda x: (x + ' ').cumsum(), raw = False)

TypeError: cannot handle this type -> object

What do you think?
Thanks!
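For reference, one workaround that runs on current pandas is to roll over integer positions (which rolling does accept) and collect the joined strings as a side effect. This is a sketch, not a supported API; the `collect` callback and the `joined` column are made-up names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'mytime': pd.to_datetime(['2018-01-01 14:34:12.340',
                                             '2018-01-01 14:34:13.0',
                                             '2018-01-01 14:34:15.342',
                                             '2018-01-01 14:34:16.42',
                                             '2018-01-01 14:34:28.742']),
                   'mychart': ['a', 'b', 'c', 'd', 'e']}).set_index('mytime')

# Roll over integer positions instead of the strings themselves; the
# callback receives the row positions in each 2-second window and
# stashes the joined strings in a side list.
joined = []
def collect(positions):
    joined.append(' '.join(df['mychart'].iloc[positions.astype(int)]))
    return 0.0  # rolling.apply must return a scalar

pd.Series(np.arange(len(df)), index=df.index).rolling('2s').apply(collect, raw=True)
df['joined'] = joined
```

With the default `closed='right'` for offset windows, each window is `(t - 2s, t]`, so the result is `['a', 'a b', 'c', 'c d', 'e']`.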

@WillAyd added the Enhancement, Difficulty Intermediate, Needs Discussion (requires discussion from core team before further action) and Window (rolling, ewma, expanding) labels on Oct 5, 2018
@WillAyd
Member

WillAyd commented Oct 5, 2018

Related to #18129; #20244 may also be of note. Rolling with object types is not supported in a lot of cases (most ops require or cast to float). IMO it can be supported in theory, though it may conflict with other issues like #12537.

@alfonsomhc
Contributor

+1 to this!

Even if most ops require a cast to float, apply should work on strings. I have a table with timestamps and strings, and I was hoping to group records with time windows and process the strings using apply and a custom function. While there may be a workaround, this seems to me the most natural way of doing it.

Hopefully this can be fixed at some point :)

@alfonsomhc
Contributor

In the stackoverflow page mentioned above, it was suggested to use resample as a workaround. If you are rolling on a time-based window, that's a fine workaround. However, if you are rolling on an offset-based window, resample can't help you...
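To illustrate that workaround (with made-up data): resample accepts a plain Python function on object columns, so binning strings into fixed 2-second buckets works even though rolling does not. Note these are fixed, non-overlapping bins, not sliding windows, which is exactly the limitation pointed out above:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e'],
              index=pd.to_datetime(['2018-01-01 14:34:12', '2018-01-01 14:34:13',
                                    '2018-01-01 14:34:15', '2018-01-01 14:34:16',
                                    '2018-01-01 14:34:28']))

# Aggregate each fixed 2-second bin with a custom string function;
# this works because resample dispatches to the Python callable
# instead of casting to float.
binned = s.resample('2s').apply(' '.join)
```

Here the bin labelled 14:34:12 covers [12s, 14s) and yields 'a b'; empty bins in between come back as missing values.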

@alicelynch

+1

I came across this issue today when trying to apply aggregates as below.

I managed to find another workaround for my use-case of count unique instances in a time-based rolling window, by mapping each unique string to a numeric value.

It would be great to have a fix for this issue, as my workaround is not scalable for large numbers of distinct strings. Thanks!

df.groupby('group_col').rolling('90d', on='time_col').agg({'num_col': 'sum', 'string_col': pd.Series.nunique})

with exception:

ValueError                                Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
    221             try:
--> 222                 values = _ensure_float64(values)
    223             except (ValueError, TypeError):

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()

ValueError: could not convert string to float: '<string>'

During handling of the above exception, another exception occurred:

...

/opt/conda/lib/python3.6/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
    223             except (ValueError, TypeError):
    224                 raise TypeError("cannot handle this type -> {0}"
--> 225                                 "".format(values.dtype))
    226 
    227         if kill_inf:

TypeError: cannot handle this type -> object

@jorisvandenbossche changed the title from "rolling with string variables" to "ENH: allow rolling with non-numerical (eg string) data" on Jun 20, 2019
@jorisvandenbossche
Member

Another example where this could be useful:

import random, string

In [101]: strings = np.random.choice([''.join(random.choices(string.ascii_uppercase + string.digits, k=5)) for _ in range(20)], 100) 

In [102]: s = pd.Series(strings, index=pd.date_range("2019-01-01", freq='H', periods=100)) 

In [103]: s.rolling("1D").apply(pd.Series.nunique, raw=False) 
...
TypeError: cannot handle this type -> object

Now, you first need to convert it to numerical values:

In [105]: s.astype('category').cat.codes.rolling('1D').apply(pd.Series.nunique, raw=False)
Out[105]: 
2019-01-01 00:00:00     1.0
2019-01-01 01:00:00     2.0
2019-01-01 02:00:00     2.0
...

@ThomasKluiters
Contributor

I'll happily pick this issue up; I'll open a PR soon.

@xunhui-yu-awx

Need to find out first/last on a string column in a rolling window.

@mateusccoelho

Is someone working on this issue? It seems that it has some demand.

I have a similar problem, but with lists:

example = pd.DataFrame({
    'id':[1,1,1,1,2,2],
    'val':[[1], [2], [2,3], [4], [1,2], [1]]
})

I want to count how many different values in the lists each rolling window has. I created a lambda that receives a Series and returns what I want, but it won't work because operations over object columns are not supported.

from itertools import chain

# this lambda does what I want
count_unique_list_values = lambda x: len(set(list(chain.from_iterable(list(x)))))

# but this doesn't work
example['val'].rolling(3).apply(count_unique_list_values)

I really think using apply is the most logical way to do this given the API we have. Moreover, I can't figure out a solution for this problem, so if someone has an idea it would be very appreciated.
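Until rolling supports object columns, a plain-Python fallback gets the same numbers, just without the rolling machinery. A sketch; the helper name `rolling_list_nunique` is made up, and it uses partial windows at the start rather than the NaNs that `rolling(3)` would emit:

```python
from itertools import chain

import pandas as pd

example = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2],
    'val': [[1], [2], [2, 3], [4], [1, 2], [1]]
})

# Slide a window of `window` rows over the lists and count distinct
# elements across each window.
def rolling_list_nunique(values, window):
    out = []
    for i in range(len(values)):
        window_vals = values[max(0, i - window + 1):i + 1]
        out.append(len(set(chain.from_iterable(window_vals))))
    return out

example['n_unique'] = rolling_list_nunique(list(example['val']), 3)
```

For the data above this yields [1, 2, 3, 3, 4, 3].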

@jreback
Contributor

jreback commented Feb 26, 2021

you simply factorize and then rolling count
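Spelling that suggestion out as a minimal sketch (with made-up data): factorize the strings to integer codes, then run the rolling count on the codes, since nunique over the codes equals nunique over the original values:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'b', 'c', 'a'])

# factorize maps each distinct value to an integer code
codes, uniques = pd.factorize(s)

# rolling nunique over the integer codes stands in for rolling
# nunique over the strings themselves
result = pd.Series(codes).rolling(3, min_periods=1).apply(pd.Series.nunique, raw=False)
```

For this series the result is [1.0, 2.0, 2.0, 2.0, 3.0].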

@rfilmyer

rfilmyer commented Nov 3, 2021

To put @jreback's explanation into code, I factorized by creating an index with arbitrary integer IDs, then used those IDs to do the unique counts.

Also, you can use pd.Series.nunique in an apply() as well as your existing lambda (I have not tested which is faster; pd.Series.nunique works fast enough for my needs).

import pandas as pd

example = pd.DataFrame({
    'id':[1,1,1,1,2,2],
    'val':[[1], [2], [2,3], [4], [1,2], [1]]
})

# Lists are unhashable, so convert each list to a tuple and keep one copy of each unique value.
unique_values_index = example["val"].apply(tuple).drop_duplicates().reset_index(drop=True)

# Now in order to work with the `reindex`ing we need to assign an arbitrary integer ID. The integer part is important because of limitations to be discussed later.
# First, we create a dataframe where the values are the unique value "index" that we created before. This is a handy way of making a sequential index (remember, the actual "val_id" numbers are arbitrary and order doesn't matter)
unique_values_df = pd.DataFrame({"val": unique_values_index})
unique_values_df.index.name = "val_id"
# Now we have a df with one column named "val" and an index named "val_id"...
# But to work with the reindexing that we do next, we need to actually *flip* those around, so we end up with a `Series` where the index is the unique "val" value column, and the values are the ID numbers
unique_values = pd.Series(data=unique_values_df.index, index=unique_values_df["val"])

# Now we can add a new ID column by mapping each row's value to its arbitrary integer ID.
example["val_id"] = unique_values.reindex(example["val"].apply(tuple)).values

# Above code is necessary because of a number of shortcomings within pandas:
# 1. A rolling window does not have an `.nunique()` method to count the number of unique values in a rolling window of a groupby - https://github.com./pandas-dev/pandas/issues/26958
# 2. You can only operate on float64 values (and by extension, integers) - https://github.com./pandas-dev/pandas/issues/23002
#
# Now, with all that out of the way, this line 1) creates rolling windows looking at the last *n* entries (an arbitrary value, can be set as high as necessary), and 2) counts the number of values of `val` within those pseudo windows.
example_with_windows = example.groupby("id")["val_id"].rolling(2, min_periods=1).apply(pd.Series.nunique, raw=False)
example_with_windows

@robertour

robertour commented Dec 30, 2021

Here is another example (w/o using rolling) that might be useful. I haven't generalized it to groups either.

example = pd.DataFrame({
    'id':[1,1,1,1,2,2],
    'val':[{1},{2},{2,3},{4},{1,2},{1}]
})

# merge the consecutive sets
def join_sets(s, rolling_n=2):
    return [set().union(*s[max(0, i - rolling_n + 1):i + 1]) for i in range(len(s))]

example['rolled_sets'] = join_sets(example['val'], rolling_n=2)

@mrucker

mrucker commented Mar 16, 2022

Adding to the list, I have a column of numpy arrays that I'd like to perform rolling group by operations on.

@jreback
Contributor

jreback commented Mar 16, 2022

Adding to the list, I have a column of numpy arrays that I'd like to perform rolling group by operations on.

pls show a concrete example

@mrucker

mrucker commented Mar 16, 2022

Here's a simple abstract example.

df = pd.DataFrame({"vector":[np.array([1,2]),np.array([1,2]),np.array([3,4]), np.array([3,4])], "group":['a','a','b','b']})

df.groupby("group")['vector'].rolling(2).apply(lambda x: np.mean(x, axis=0), raw=True)

In practice, I'm working with a stream of data arriving from various sensors asynchronously. Because the data is arriving asynchronously there isn't an easy way to turn it into row based data without imputing a ton of values. The data also has different dimensions depending on which sensor it is coming from thus why sensor data is stored as an abstract vector instead of semantically consistent columns.
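For the sensor-vector case above, a workaround that avoids rolling entirely is a groupby-apply that slides a positional window over each group's vectors. A sketch; the helper name `rolling_vector_mean` is made up, and partial windows are averaged at the start instead of producing NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"vector": [np.array([1, 2]), np.array([1, 2]),
                              np.array([3, 4]), np.array([3, 4])],
                   "group": ['a', 'a', 'b', 'b']})

# rolling.apply rejects object dtype, but per-group we can stack the
# vectors in each positional window ourselves and take the mean.
def rolling_vector_mean(s, window=2):
    vals = list(s)
    means = [np.mean(np.stack(vals[max(0, i - window + 1):i + 1]), axis=0)
             for i in range(len(vals))]
    return pd.Series(means, index=s.index)

result = df.groupby("group")["vector"].apply(rolling_vector_mean)
```

For this frame each window mean equals the group's vector, i.e. [1, 2] for group 'a' rows and [3, 4] for group 'b' rows.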

@alexanderfrey

@mrucker Having exactly the same issue. I'm considering a for loop construct now until the rolling+apply combination works for this use case.

@mrucker

mrucker commented Mar 28, 2022

@alexanderfrey Thanks! For what it is worth, we ended up finding that support is a little better than we realized. For example, this works:

df.groupby("group")['vector'].apply(lambda x: np.mean(x, axis=0))

I know it's not quite the same, but just pointing it out in case it helps.

For our "work-around" we just ended up not using rolling. We wanted to use rolling to apply a simple noise-reduction filter, but ultimately decided the benefit of doing so would be marginal at best because of additional post-processing that was also happening.

@changhiskhan
Contributor

+1 on this issue. Would make it easy to do context window processing for NLP data

@randomgambit
Author

A related (simpler) example:

dd = pd.DataFrame({'timestamp' : [pd.to_datetime('2023-01-01 10:00:00'),
                                  pd.to_datetime('2023-01-01 10:00:05'),
                                  pd.to_datetime('2023-01-01 10:00:10'),
                                  pd.to_datetime('2023-01-01 10:00:15'),
                                  pd.to_datetime('2023-01-01 10:00:20'),
                                  pd.to_datetime('2023-01-01 10:00:25')],
                   'myvalue' : [1,2,3,4,5,6]})

#works, of course
dd['roll1'] = dd.rolling(2).myvalue.max()

#does not work
dd['roll2'] = dd.rolling(2).timestamp.max()
DataError: No numeric types to aggregate

Given that internally timestamps are just int64 nanoseconds, I am surprised that this does not work. Keeping track of the last timestamp in the rolling dataframe is important to make sure there is no look-ahead bias... What do you think?
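A workaround that leans on exactly that internal representation: view the datetimes as int64 nanoseconds, roll over those, and convert back. A sketch, not a supported API; rolling ops return float64, and float64 cannot hold full nanosecond precision at this epoch, hence the rounding step:

```python
import pandas as pd

dd = pd.DataFrame({'timestamp': pd.to_datetime(['2023-01-01 10:00:00',
                                                '2023-01-01 10:00:05',
                                                '2023-01-01 10:00:10',
                                                '2023-01-01 10:00:15',
                                                '2023-01-01 10:00:20',
                                                '2023-01-01 10:00:25']),
                   'myvalue': [1, 2, 3, 4, 5, 6]})

# View the datetimes as int64 nanoseconds, roll, then convert back.
# The float64 round-trip can be off by a few hundred ns, so round
# back to whole seconds.
ns = dd['timestamp'].astype('int64')
dd['roll2'] = pd.to_datetime(ns.rolling(2).max()).dt.round('s')
```

The first row is NaT (the window is not yet full); each later row holds the later of the two timestamps in its window.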

@adam-1312

+1, I also support implementing this; I needed this functionality and found out it is not implemented!

@tymsoncyferki

+1 for this issue. I needed it for concatenating strings from preceding rows.
