ENH: allow rolling with non-numerical (eg string) data #23002

Open
randomgambit opened this issue Oct 5, 2018 · 20 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Enhancement, Window (rolling, ewma, expanding)

Comments

@randomgambit

Hi the Pandas dream team.

I think it would be nice if rolling could accept strings as well (see https://stackoverflow.com/questions/52657429/rolling-with-string-variables)

With the abundance of textual data nowadays, we want Pandas to stay ahead of the curve!

import pandas as pd
import numpy as np

df = pd.DataFrame({'mytime': [pd.to_datetime('2018-01-01 14:34:12.340'),
                              pd.to_datetime('2018-01-01 14:34:13.0'),
                              pd.to_datetime('2018-01-01 14:34:15.342'),
                              pd.to_datetime('2018-01-01 14:34:16.42'),
                              pd.to_datetime('2018-01-01 14:34:28.742')],
                   'myvalue': [1, 2, np.NaN, 3, 1],
                   'mychart': ['a', 'b', 'c', 'd', 'e']})

df.set_index('mytime', inplace = True)

df
Out[15]: 
                        mychart  myvalue
mytime                                  
2018-01-01 14:34:12.340       a      1.0
2018-01-01 14:34:13.000       b      2.0
2018-01-01 14:34:15.342       c      NaN
2018-01-01 14:34:16.420       d      3.0
2018-01-01 14:34:28.742       e      1.0

Here I want to concatenate the strings in mychart using the values in the last 2 seconds (not the last two observations).

Unfortunately, both attempts below fail miserably


df.mychart.rolling(window = '2s', closed = 'right').apply(lambda x: ' '.join(x), raw = False)
df.mychart.rolling(window = '2s', closed = 'right').apply(lambda x: (x + ' ').cumsum(), raw = False)

TypeError: cannot handle this type -> object

What do you think?
Thanks!
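For reference, one workaround that runs on current pandas is to roll over integer positions (which rolling does accept) and collect the joined strings as a side effect. This is a sketch, not a supported API; the `collect` callback and the `joined` column are made-up names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'mytime': pd.to_datetime(['2018-01-01 14:34:12.340',
                                             '2018-01-01 14:34:13.0',
                                             '2018-01-01 14:34:15.342',
                                             '2018-01-01 14:34:16.42',
                                             '2018-01-01 14:34:28.742']),
                   'mychart': ['a', 'b', 'c', 'd', 'e']}).set_index('mytime')

# Roll over integer positions instead of the strings themselves; the
# callback receives the row positions in each 2-second window and
# stashes the joined strings in a side list.
joined = []
def collect(positions):
    joined.append(' '.join(df['mychart'].iloc[positions.astype(int)]))
    return 0.0  # rolling.apply must return a scalar

pd.Series(np.arange(len(df)), index=df.index).rolling('2s').apply(collect, raw=True)
df['joined'] = joined
```

With the default `closed='right'` for offset windows, each window is `(t - 2s, t]`, so the result is `['a', 'a b', 'c', 'c d', 'e']`.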

@WillAyd added the Enhancement, Difficulty Intermediate, Needs Discussion (requires discussion from core team before further action) and Window (rolling, ewma, expanding) labels on Oct 5, 2018
@WillAyd
Member

WillAyd commented Oct 5, 2018

Related to #18129; #20244 may also be of note. Rolling with object types is not supported in a lot of cases (most ops require or cast to float). IMO it can be supported in theory, though it may conflict with other issues like #12537.

@alfonsomhc
Contributor

+1 to this!

Even if most ops require a cast to float, apply should work on strings. I have a table with timestamps and strings, and I was hoping to group records with time windows and process the strings using apply and a custom function. While there may be a workaround, this seems to me the most natural way of doing it.

Hopefully this can be fixed at some point :)

@alfonsomhc
Contributor

In the stackoverflow page mentioned above, it was suggested to use resample as a workaround. If you are rolling on a time-based window, that's a fine workaround. However, if you are rolling on an offset-based window, resample can't help you...
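To illustrate that workaround (with made-up data): resample accepts a plain Python function on object columns, so binning strings into fixed 2-second buckets works even though rolling does not. Note these are fixed, non-overlapping bins, not sliding windows, which is exactly the limitation pointed out above:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e'],
              index=pd.to_datetime(['2018-01-01 14:34:12', '2018-01-01 14:34:13',
                                    '2018-01-01 14:34:15', '2018-01-01 14:34:16',
                                    '2018-01-01 14:34:28']))

# Aggregate each fixed 2-second bin with a custom string function;
# this works because resample dispatches to the Python callable
# instead of casting to float.
binned = s.resample('2s').apply(' '.join)
```

Here the bin labelled 14:34:12 covers [12s, 14s) and yields 'a b'; empty bins in between come back as missing values.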

@alicelynch

+1

I came across this issue today when trying to apply aggregates as below.

I managed to find another workaround for my use-case of count unique instances in a time-based rolling window, by mapping each unique string to a numeric value.

It would be great to have a fix for this issue, as my workaround is not scalable for large numbers of distinct strings. Thanks!

df.groupby('group_col').rolling('90d', on='time_col').agg({'num_col': 'sum', 'string_col': pd.Series.nunique})

with exception:

ValueError                                Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
    221             try:
--> 222                 values = _ensure_float64(values)
    223             except (ValueError, TypeError):

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()

ValueError: could not convert string to float: '<string>'

During handling of the above exception, another exception occurred:

...

/opt/conda/lib/python3.6/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
    223             except (ValueError, TypeError):
    224                 raise TypeError("cannot handle this type -> {0}"
--> 225                                 "".format(values.dtype))
    226 
    227         if kill_inf:

TypeError: cannot handle this type -> object

@jorisvandenbossche changed the title from "rolling with string variables" to "ENH: allow rolling with non-numerical (eg string) data" on Jun 20, 2019
@jorisvandenbossche
Member

Another example where this could be useful:

import random, string

In [101]: strings = np.random.choice([''.join(random.choices(string.ascii_uppercase + string.digits, k=5)) for _ in range(20)], 100) 

In [102]: s = pd.Series(strings, index=pd.date_range("2019-01-01", freq='H', periods=100)) 

In [103]: s.rolling("1D").apply(pd.Series.nunique, raw=False) 
...
TypeError: cannot handle this type -> object

Now, you first need to convert it to numerical values:

In [105]: s.astype('category').cat.codes.rolling('1D').apply(pd.Series.nunique, raw=False)
Out[105]: 
2019-01-01 00:00:00     1.0
2019-01-01 01:00:00     2.0
2019-01-01 02:00:00     2.0
...

@ThomasKluiters
Contributor

I'll happily pick this issue up; I'll open a PR soon.

@xunhui-yu-awx

Need to find out first/last on a string column in a rolling window.

@mateusccoelho

Is someone working on this issue? It seems that it has some demand.

I have a similar problem, but with lists:

example = pd.DataFrame({
    'id':[1,1,1,1,2,2],
    'val':[[1], [2], [2,3], [4], [1,2], [1]]
})

I want to count how many different values in the lists each rolling window has. I created a lambda that receives a Series and returns what I want, but it won't work because operations over object columns are not supported.

from itertools import chain

# this lambda does what I want
count_unique_list_values = lambda x: len(set(list(chain.from_iterable(list(x)))))

# but this doesn't work
example['val'].rolling(3).apply(count_unique_list_values)

I really think using apply is the most logical way to do this given the API we have. Moreover, I can't figure out a solution for this problem, so if someone has an idea it would be very appreciated.
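Until rolling supports object columns, a plain-Python fallback gets the same numbers, just without the rolling machinery. A sketch; the helper name `rolling_list_nunique` is made up, and it uses partial windows at the start rather than the NaNs that `rolling(3)` would emit:

```python
from itertools import chain

import pandas as pd

example = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2],
    'val': [[1], [2], [2, 3], [4], [1, 2], [1]]
})

# Slide a window of `window` rows over the lists and count distinct
# elements across each window.
def rolling_list_nunique(values, window):
    out = []
    for i in range(len(values)):
        window_vals = values[max(0, i - window + 1):i + 1]
        out.append(len(set(chain.from_iterable(window_vals))))
    return out

example['n_unique'] = rolling_list_nunique(list(example['val']), 3)
```

For the data above this yields [1, 2, 3, 3, 4, 3].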

@jreback
Contributor

jreback commented Feb 26, 2021

you simply factorize and then rolling count
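Spelling that suggestion out as a minimal sketch (with made-up data): factorize the strings to integer codes, then run the rolling count on the codes, since nunique over the codes equals nunique over the original values:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'b', 'c', 'a'])

# factorize maps each distinct value to an integer code
codes, uniques = pd.factorize(s)

# rolling nunique over the integer codes stands in for rolling
# nunique over the strings themselves
result = pd.Series(codes).rolling(3, min_periods=1).apply(pd.Series.nunique, raw=False)
```

For this series the result is [1.0, 2.0, 2.0, 2.0, 3.0].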

@rfilmyer

rfilmyer commented Nov 3, 2021

To put @jreback's explanation into code, I factorized by creating an index with arbitrary integer IDs, then used those IDs to do the unique counts.

Also, you can use pd.Series.nunique in an apply() as well as your existing lambda (I have not tested which is faster; pd.Series.nunique works fast enough for my needs).

import pandas as pd

example = pd.DataFrame({
    'id':[1,1,1,1,2,2],
    'val':[[1], [2], [2,3], [4], [1,2], [1]]
})

# Lists are unhashable, so convert each list to a tuple and keep one copy of each unique value.
unique_values_index = example["val"].apply(tuple).drop_duplicates().reset_index(drop=True)

# Now in order to work with the `reindex`ing we need to assign an arbitrary integer ID. The integer part is important because of limitations to be discussed later.
# First, we create a dataframe where the values are the unique value "index" that we created before. This is a handy way of making a sequential index (remember, the actual "val_id" numbers are arbitrary and order doesn't matter)
unique_values_df = pd.DataFrame({"val": unique_values_index})
unique_values_df.index.name = "val_id"
# Now we have a df with one column named "val" and an index named "val_id"...
# But to work with the reindexing that we do next, we need to actually *flip* those around, so we end up with a `Series` where the index is the unique "val" value column, and the values are the ID numbers
unique_values = pd.Series(data=unique_values_df.index, index=unique_values_df["val"])

# Now we can add a new ID column by mapping each row's value to its arbitrary integer ID.
example["val_id"] = unique_values.reindex(example["val"].apply(tuple)).values

# Above code is necessary because of a number of shortcomings within pandas:
# 1. A rolling window does not have an `.nunique()` method to count the number of unique values in a rolling window of a groupby - https://github.com./pandas-dev/pandas/issues/26958
# 2. You can only operate on float64 values (and by extension, integers) - https://github.com./pandas-dev/pandas/issues/23002
#
# Now, with all that out of the way, this line 1) creates rolling windows looking at the last *n* entries (an arbitrary value, can be set as high as necessary), and 2) counts the number of values of `val` within those pseudo windows.
example_with_windows = example.groupby("id")["val_id"].rolling(2, min_periods=1).apply(pd.Series.nunique, raw=False)
example_with_windows

@robertour

robertour commented Dec 30, 2021

Here is another example (w/o using rolling) that might be useful. I haven't generalized it to groups either.

example = pd.DataFrame({
    'id':[1,1,1,1,2,2],
    'val':[{1},{2},{2,3},{4},{1,2},{1}]
})

# merge the consecutive sets
def join_sets(s, rolling_n=2):
    return [set().union(*s[max(0, i - rolling_n + 1):i + 1]) for i in range(len(s))]

example['rolled_sets'] = join_sets(example['val'], rolling_n=2)

@mrucker

mrucker commented Mar 16, 2022

Adding to the list, I have a column of numpy arrays that I'd like to perform rolling group by operations on.

@jreback
Contributor

jreback commented Mar 16, 2022

Adding to the list, I have a column of numpy arrays that I'd like to perform rolling group by operations on.

pls show a concrete example

@mrucker

mrucker commented Mar 16, 2022

Here's a simple abstract example.

df = pd.DataFrame({"vector":[np.array([1,2]),np.array([1,2]),np.array([3,4]), np.array([3,4])], "group":['a','a','b','b']})

df.groupby("group")['vector'].rolling(2).apply(lambda x: np.mean(x, axis=0), raw=True)

In practice, I'm working with a stream of data arriving from various sensors asynchronously. Because the data is arriving asynchronously there isn't an easy way to turn it into row based data without imputing a ton of values. The data also has different dimensions depending on which sensor it is coming from thus why sensor data is stored as an abstract vector instead of semantically consistent columns.
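For the sensor-vector case above, a workaround that avoids rolling entirely is a groupby-apply that slides a positional window over each group's vectors. A sketch; the helper name `rolling_vector_mean` is made up, and partial windows are averaged at the start instead of producing NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"vector": [np.array([1, 2]), np.array([1, 2]),
                              np.array([3, 4]), np.array([3, 4])],
                   "group": ['a', 'a', 'b', 'b']})

# rolling.apply rejects object dtype, but per-group we can stack the
# vectors in each positional window ourselves and take the mean.
def rolling_vector_mean(s, window=2):
    vals = list(s)
    means = [np.mean(np.stack(vals[max(0, i - window + 1):i + 1]), axis=0)
             for i in range(len(vals))]
    return pd.Series(means, index=s.index)

result = df.groupby("group")["vector"].apply(rolling_vector_mean)
```

For this frame each window mean equals the group's vector, i.e. [1, 2] for group 'a' rows and [3, 4] for group 'b' rows.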

@alexanderfrey

@mrucker Having exactly the same issue. I'm considering a for loop construct now until the rolling+apply combination works for this use case.

@mrucker

mrucker commented Mar 28, 2022

@alexanderfrey Thanks! For what it is worth, we ended up finding that support is a little better than we realized. For example, this works:

df.groupby("group")['vector'].apply(lambda x: np.mean(x, axis=0))

I know it's not quite the same, but just pointing it out in case it helps.

For our "work-around" we just ended up not using rolling. We wanted to use rolling to apply a simple noise-reduction filter, but ultimately decided the benefit of doing so would be marginal at best because of additional post-processing that was also happening.

@changhiskhan
Contributor

+1 on this issue. Would make it easy to do context window processing for NLP data

@randomgambit
Author

A related (simpler) example:

dd = pd.DataFrame({'timestamp' : [pd.to_datetime('2023-01-01 10:00:00'),
                                  pd.to_datetime('2023-01-01 10:00:05'),
                                  pd.to_datetime('2023-01-01 10:00:10'),
                                  pd.to_datetime('2023-01-01 10:00:15'),
                                  pd.to_datetime('2023-01-01 10:00:20'),
                                  pd.to_datetime('2023-01-01 10:00:25')],
                   'myvalue' : [1,2,3,4,5,6]})

#works, of course
dd['roll1'] = dd.rolling(2).myvalue.max()

#does not work
dd['roll2'] = dd.rolling(2).timestamp.max()
DataError: No numeric types to aggregate

Given that internally timestamps are just int64 nanoseconds, I am surprised that this does not work. Keeping track of the last timestamp in the rolling dataframe is important to make sure there is no look-ahead bias... What do you think?
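A workaround that leans on exactly that internal representation: view the datetimes as int64 nanoseconds, roll over those, and convert back. A sketch, not a supported API; rolling ops return float64, and float64 cannot hold full nanosecond precision at this epoch, hence the rounding step:

```python
import pandas as pd

dd = pd.DataFrame({'timestamp': pd.to_datetime(['2023-01-01 10:00:00',
                                                '2023-01-01 10:00:05',
                                                '2023-01-01 10:00:10',
                                                '2023-01-01 10:00:15',
                                                '2023-01-01 10:00:20',
                                                '2023-01-01 10:00:25']),
                   'myvalue': [1, 2, 3, 4, 5, 6]})

# View the datetimes as int64 nanoseconds, roll, then convert back.
# The float64 round-trip can be off by a few hundred ns, so round
# back to whole seconds.
ns = dd['timestamp'].astype('int64')
dd['roll2'] = pd.to_datetime(ns.rolling(2).max()).dt.round('s')
```

The first row is NaT (the window is not yet full); each later row holds the later of the two timestamps in its window.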

@adam-1312

+1, I also support implementing this; I needed this functionality and found out it is not implemented!

@tymsoncyferki

+1 for this issue. I needed it for concatenating strings from preceding rows.
