Skip to content

BUG: groupby with no non-empty groups, #21624 #21849

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

lopez86
Copy link

@lopez86 lopez86 commented Jul 11, 2018

Should close #21624. Didn't see any issues running test_fast.sh on my laptop so I don't think it broke anything in the process. The problem here is that when there is only a null group, then self.result_index is empty, but ids is just a list of -1s, so a ValueError is raised when trying to create the series.

lopez86 added 2 commits July 10, 2018 19:46
Fixes bug where operations such as transform('sum') raise errors when only a single null group exists.
@jreback
Copy link
Contributor

jreback commented Jul 11, 2018

tests!

@codecov
Copy link

codecov bot commented Jul 11, 2018

Codecov Report

Merging #21849 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21849      +/-   ##
==========================================
- Coverage   91.91%   91.91%   -0.01%     
==========================================
  Files         164      164              
  Lines       49987    49992       +5     
==========================================
+ Hits        45947    45951       +4     
- Misses       4040     4041       +1
Flag Coverage Δ
#multiple 90.3% <100%> (-0.01%) ⬇️
#single 42.16% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/ops.py 96.35% <100%> (ø) ⬆️
pandas/core/reshape/merge.py 94.1% <0%> (-0.16%) ⬇️
pandas/_libs/tslibs/__init__.py 100% <0%> (ø) ⬆️
pandas/io/formats/style.py 96.16% <0%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d94a95...6d2855c. Read the comment docs.

@lopez86
Copy link
Author

lopez86 commented Jul 11, 2018

Added in a test for this specific case. The other case where there are some but not all nans present is already covered extensively in the tests.

'values': [1, 2, 3]})

grouped = df.groupby('groups')
summed = grouped['values'].transform('sum')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit but we mostly use result to compare against expected - can you change the variable name here and on the assertion line?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure


grouped = df.groupby('groups')
result = grouped['values'].transform('sum')
expected = Series([np.nan, np.nan, np.nan], name='values')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm maybe I'm missing the point but shouldn't these be 6 and not np.nan?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the behavior is to output np.nan for anything in a group with label np.nan. Making this output 6 would be changing the interpretation of np.nan from not in any group to in a group with label np.nan, which is going to be a pretty significant change in how things work. The treatment here of ignoring NA values is specified in the groupby docs.

For example, see test_groupby_transform_with_nan_group() in test_transform.py

@jreback
Copy link
Contributor

jreback commented Jul 12, 2018

pls rebase, just reorged the groupby code a bit.

@jreback jreback changed the title BUG: Fix groupby issue, #21624 BUG: groupby with no non-empty groups, #21624 Jul 12, 2018
@lopez86
Copy link
Author

lopez86 commented Jul 12, 2018

Should be ready for review now

@@ -235,7 +235,7 @@ def size(self):
if ngroup:
out = np.bincount(ids[ids != -1], minlength=ngroup)
else:
out = ids
out = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm something just seems strange to me about this. So your test example works fine, but what happens if we have a DataFrame that has other groups besides the all-NA one? Can you add a test for that and make sure that still works?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. In that case, the else block should never be used

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking more through this as other similar issues have popped up - are we now invalidating the docs below?

http://pandas.pydata.org/pandas-docs/stable/missing_data.html#na-values-in-groupby

If so we should do a separate PR to update docs as well, though somewhat surprised this didn't break any other tests

tm.assert_series_equal(result, expected)

# Make sure the standard case still works too
df = DataFrame({'groups': [np.nan, 'A', 'A', 'B', 'B'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this follows the same pattern as the code above it would be preferable to parametrize it instead of re-coding

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done in the latest update

@lopez86
Copy link
Author

lopez86 commented Jul 13, 2018

I think nothing is breaking because this shouldn't be changing anything other than preventing a ValueError in an edge case. I think there is some ambiguity in the docs where we want to ignore NA values in group labels but also have transform return a series that's the same shape as the input. It would definitely be good to have the behavior of transform in the presence of NA group labels explicitly spelled out somewhere.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you check out the current 23.X branch and see if your new test pass). its possible this was fixed in master, but I don't think the change was backported.

@WillAyd
Copy link
Member

WillAyd commented Aug 17, 2018

@jreback just tried locally and this wasn't working on master, so I think this change is still necessary

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a whatsnew note for v0.23.5 and merge latest updates from master?

@WillAyd
Copy link
Member

WillAyd commented Nov 23, 2018

Closing as stale. Ping if you'd like to pick this back up

@WillAyd WillAyd closed this Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby w/ only NULL-groups crashes since 0.23
3 participants