-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: groupby with no non-empty groups, #21624 #21849
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fixes bug where operations such as transform('sum') raise errors when only a single null group exists.
tests! |
Codecov Report
@@ Coverage Diff @@
## master #21849 +/- ##
==========================================
- Coverage 91.91% 91.91% -0.01%
==========================================
Files 164 164
Lines 49987 49992 +5
==========================================
+ Hits 45947 45951 +4
- Misses 4040 4041 +1
Continue to review full report at Codecov.
|
Added in a test for this specific case. The other case where there are some but not all nans present is already covered extensively in the tests. |
'values': [1, 2, 3]}) | ||
|
||
grouped = df.groupby('groups') | ||
summed = grouped['values'].transform('sum') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit but we mostly use result
to compare against expected
- can you change the variable name here and on the assertion line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
|
||
grouped = df.groupby('groups') | ||
result = grouped['values'].transform('sum') | ||
expected = Series([np.nan, np.nan, np.nan], name='values') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm maybe I'm missing the point but shouldn't these be 6
and not np.nan
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the behavior is to output np.nan
for anything in a group with label np.nan
. Making this output 6
would be changing the interpretation of np.nan
from not in any group
to in a group with label np.nan
, which is going to be a pretty significant change in how things work. The treatment here of ignoring NA values is specified in the groupby docs.
For example, see test_groupby_transform_with_nan_group()
in test_transform.py
pls rebase, just reorged the groupby code a bit. |
Should be ready for review now |
@@ -235,7 +235,7 @@ def size(self): | |||
if ngroup: | |||
out = np.bincount(ids[ids != -1], minlength=ngroup) | |||
else: | |||
out = ids | |||
out = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm something just seems strange to me about this. So your test example works fine, but what happens if we have a DataFrame that has other groups besides the all-NA one? Can you add a test for that and make sure that still works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. In that case, the else block should never be used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking more through this as other similar issues have popped up - are we now invalidating the docs below?
http://pandas.pydata.org/pandas-docs/stable/missing_data.html#na-values-in-groupby
If so we should do a separate PR to update docs as well, though somewhat surprised this didn't break any other tests
tm.assert_series_equal(result, expected) | ||
|
||
# Make sure the standard case still works too | ||
df = DataFrame({'groups': [np.nan, 'A', 'A', 'B', 'B'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this follows the same pattern as the code above it would be preferable to parametrize it instead of re-coding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done in the latest update
I think nothing is breaking because this shouldn't be changing anything other than preventing a ValueError in an edge case. I think there is some ambiguity in the docs where we want to ignore NA values in group labels but also have transform return a series that's the same shape as the input. It would definitely be good to have the behavior of transform in the presence of NA group labels explicitly spelled out somewhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you check out the current 23.X branch and see if your new test pass). its possible this was fixed in master, but I don't think the change was backported.
@jreback just tried locally and this wasn't working on master, so I think this change is still necessary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a whatsnew note for v0.23.5 and merge latest updates from master?
Closing as stale. Ping if you'd like to pick this back up |
Should close #21624. Didn't see any issues running test_fast.sh on my laptop so I don't think it broke anything in the process. The problem here is that when there is only a null group, then
self.result_index
is empty, butids
is just a list of -1s, so a ValueError is raised when trying to create the series.