ENH: allow rolling with non-numerical (eg string) data #23002
+1 to this! Even if most ops require a cast to float, apply should work on strings. I have a table with timestamps and strings, and I was hoping to group records with time windows and process the strings using apply and a custom function. While there may be a workaround, this seems to me the most natural way of doing it. Hopefully this can be fixed at some point :)
On the Stack Overflow page mentioned above, it was suggested to use resample as a workaround. If you're rolling on a time-based window, that's a fine workaround. However, if you're rolling on an offset-based window, resample can't help you...
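A minimal sketch of that resample workaround on string data (the timestamps and values here are illustrative, not from the thread):

```python
import pandas as pd

idx = pd.to_datetime([
    "2021-01-01 00:00:00",
    "2021-01-01 00:00:30",
    "2021-01-01 00:01:10",
])
s = pd.Series(["a", "b", "c"], index=idx)

# resample works on object dtype, so strings can be aggregated per time bin
joined = s.resample("1min").apply("".join)
print(joined.tolist())
```

This only gives fixed time bins, though, not a window anchored at every row.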
+1 I came across this issue today when trying to apply aggregates as below. I managed to find another workaround for my use case of counting unique instances in a time-based rolling window, by mapping each unique string to a numeric value. It would be great to have a fix for this issue, as my workaround does not scale to large numbers of distinct strings. Thanks!
with exception:
Another example where this could be useful:
Now, you first need to convert it to numerical values:
I'll happily pick this issue up and open a PR soon.
Need to find the first/last value on a string column in a rolling window.
Is someone working on this issue? It seems that it has some demand. I have a similar problem, but with lists:

```python
import pandas as pd

example = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2],
    'val': [[1], [2], [2, 3], [4], [1, 2], [1]]
})
```

I want to count how many different values the lists in each rolling window contain. I created a lambda that receives a Series and returns what I want, but it won't work because operations over object columns are not supported:

```python
from itertools import chain

# this lambda does what I want
count_unique_list_values = lambda x: len(set(chain.from_iterable(x)))

# but this doesn't work
example['val'].rolling(3).apply(count_unique_list_values)
```

I really think using apply is the most logical way to do this given the API we have. Moreover, I can't figure out a solution for this problem, so if someone has an idea it would be much appreciated.
you simply factorize and then do a rolling count
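As a sketch of that suggestion (the data here is made up): factorize the object column into integer codes, then roll over the codes.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a"])

# factorize maps each distinct value to an integer code, in order of appearance
codes, uniques = pd.factorize(s)
coded = pd.Series(codes, index=s.index)

# rolling accepts the integer codes, so we can count distinct values per window
n_unique = coded.rolling(3, min_periods=1).apply(lambda w: w.nunique(), raw=False)
print(n_unique.tolist())  # [1.0, 2.0, 2.0, 3.0, 2.0]
```

Note this works for hashable values (strings, tuples); lists would need to be converted to tuples first.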
To put @jreback's explanation into code, I factorized by creating an index with arbitrary integer IDs, then used those IDs to do the unique counts.

```python
import pandas as pd

example = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2],
    'val': [[1], [2], [2, 3], [4], [1, 2], [1]]
})
unique_values_index = example["val"]

# To work with the `reindex`ing we need to assign an arbitrary integer ID.
# The integer part is important because of limitations discussed below.
# First, we create a dataframe whose values are the unique value "index" that
# we created before. This is a handy way of making a sequential index
# (remember, the actual "val_id" numbers are arbitrary and order doesn't matter).
unique_values_df = pd.DataFrame({"val": unique_values_index})
unique_values_df.index.name = "val_id"

# Now we have a df with one column named "val" and an index named "val_id"...
# But for the reindexing we do next, we need to *flip* those around, so we end
# up with a Series where the index is the unique "val" value column, and the
# values are the ID numbers.
unique_values = pd.Series(data=unique_values_df.index, index=unique_values_df["val"])

# Now we can add a new ID column, after doing the same NA -> "x" transformation,
# for the same reason as above.
example["val_id"] = unique_values.reindex(example["val"]).values

# The code above is necessary because of a number of shortcomings within pandas:
# 1. A rolling window has no `.nunique()` method to count the number of unique
#    values in a rolling window of a groupby -
#    https://github.com/pandas-dev/pandas/issues/26958
# 2. You can only operate on float64 values (and by extension, integers) -
#    https://github.com/pandas-dev/pandas/issues/23002
#
# With all that out of the way, this line 1) creates rolling windows looking at
# the last *n* entries (an arbitrary value, can be set as high as necessary),
# and 2) counts the number of unique values of `val` within those pseudo-windows.
example_with_windows = example.groupby("id")["val_id"].rolling(2, min_periods=1).apply(pd.Series.nunique, raw=False)
example_with_windows
```
Here is another example (w/o using rolling) that might be useful. I haven't generalized it to groups either.
Adding to the list: I have a column of numpy arrays that I'd like to perform rolling groupby operations on.
pls show a concrete example
Here's a simple abstract example.
In practice, I'm working with a stream of data arriving from various sensors asynchronously. Because the data arrives asynchronously, there isn't an easy way to turn it into row-based data without imputing a ton of values. The data also has different dimensions depending on which sensor it comes from, which is why the sensor data is stored as an abstract vector instead of semantically consistent columns.
@mrucker Having exactly the same issue. I'm considering a for-loop construct now until the rolling+apply combination works for this use case.
@alexanderfrey Thanks! For what it is worth, we ended up finding that support is a little better than we realized. For example, this works:
I know it's not quite the same, but just pointing it out in case it helps. For our "workaround" we just ended up not using rolling. We wanted to use rolling to apply a simple noise-reduction filter, but ultimately decided the benefit of doing so would be marginal at best because of additional post-processing that was also happening.
+1 on this issue. It would make it easy to do context-window processing for NLP data.
A related (more simple) example:
Given that timestamps are internally stored as integers, I am surprised that this doesn't work. Keeping track of the last timestamp in the rolling dataframe is important to make sure there is no look-ahead bias... What do you think?
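A sketch of the usual workaround for this case (the dates are illustrative): view the datetime64 column as int64 nanoseconds, roll over those, and convert back.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04"]))

# datetime64[ns] -> integer nanoseconds since the epoch, which rolling accepts
ns = ts.astype("int64")
last_ns = ns.rolling(2, min_periods=1).max()

# rolling returns float64, so cast back to int64 before converting to timestamps
last_ts = pd.to_datetime(last_ns.astype("int64"))
```

This keeps the last (maximum) timestamp seen in each window without look-ahead.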
+1, I also support implementing this; I needed this functionality and found out it is not implemented!
+1 for this issue. I needed it for concatenating strings from preceding rows. |
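For the simple "previous row" case, a sketch that avoids rolling entirely (the data is made up): shift works on object dtype, so pairwise concatenation needs no numeric cast.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"])

# shift the column down one row and concatenate with itself
joined = s.shift(1, fill_value="") + s
print(joined.tolist())  # ['a', 'ab', 'bc', 'cd']
```

For a window of k preceding rows, the same idea can be repeated with shifts 1..k.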
Hi the Pandas dream team. I think it would be nice if `rolling` could accept strings as well (see https://stackoverflow.com/questions/52657429/rolling-with-string-variables). With the abundance of textual data nowadays, we want Pandas to stay at the top of the curve!

Here I want to concatenate the strings in `mychart` using the values in the last 2 seconds (not the last two observations). Unfortunately, both attempts below fail miserably.

What do you think?
Thanks!
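One hedged workaround for this time-window concatenation (a sketch; the timestamps and the `dummy` column are illustrative): roll over a numeric dummy column with `raw=False`, and use each window's index to look up the strings, collecting the results on the side since `apply` must return a number.

```python
import pandas as pd

idx = pd.to_datetime([
    "2021-01-01 00:00:00",
    "2021-01-01 00:00:01",
    "2021-01-01 00:00:04",
])
df = pd.DataFrame({"mychart": ["a", "b", "c"], "dummy": 0.0}, index=idx)

out = []
def collect(window):
    # window is a float Series whose index is the rows inside the time window,
    # so we can use that index to look up the string column
    out.append("".join(df.loc[window.index, "mychart"]))
    return 0.0  # rolling.apply must return a numeric scalar

df["dummy"].rolling("2s").apply(collect, raw=False)
print(out)  # ['a', 'ab', 'c']
```

It is clumsy, which is exactly why native support would be welcome.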