BUG: groupby with no non-empty groups, #21624 #21849

lopez86 · 2018-07-11T00:42:53Z

Should close #21624. Running test_fast.sh on my laptop showed no issues, so I don't think this broke anything. The problem is that when the only group is a null group, self.result_index is empty but ids contains only -1s, so a ValueError is raised when trying to create the Series.

Fixes bug where operations such as transform('sum') raise errors when only a single null group exists.
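
The length mismatch described above can be reproduced in isolation. This is a minimal sketch rather than the pandas internals: `ids` and `result_index` below are hypothetical stand-ins for the grouper state when every label is null.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the grouper state when every label is null:
# each row carries the sentinel id -1 and there are no groups to index by.
ids = np.array([-1, -1, -1])
result_index = pd.Index([], dtype=object)

# Pairing three values with an empty index is what raises the ValueError.
try:
    pd.Series(ids, index=result_index)
    raised = False
except ValueError:
    raised = True

# An empty set of sizes lines up with the empty index, which is the idea
# behind the fix.
sizes = pd.Series([], index=result_index, dtype="int64")
```

Here `raised` ends up true for the first construction, while the second succeeds because both the values and the index are empty.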

jreback · 2018-07-11T00:46:12Z

tests!

codecov · 2018-07-11T04:07:52Z

Codecov Report

Merging #21849 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21849      +/-   ##
==========================================
- Coverage   91.91%   91.91%   -0.01%     
==========================================
  Files         164      164              
  Lines       49987    49992       +5     
==========================================
+ Hits        45947    45951       +4     
- Misses       4040     4041       +1

Flag	Coverage Δ
#multiple	`90.3% <100%> (-0.01%)`	⬇️
#single	`42.16% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/ops.py	`96.35% <100%> (ø)`	⬆️
pandas/core/reshape/merge.py	`94.1% <0%> (-0.16%)`	⬇️
pandas/_libs/tslibs/__init__.py	`100% <0%> (ø)`	⬆️
pandas/io/formats/style.py	`96.16% <0%> (+0.03%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9d94a95...6d2855c.

lopez86 · 2018-07-11T14:13:35Z

Added a test for this specific case. The other case, where some but not all of the group labels are NaN, is already covered extensively in the tests.

WillAyd · 2018-07-11T15:26:50Z

pandas/tests/groupby/test_transform.py

+                    'values': [1, 2, 3]})
+
+    grouped = df.groupby('groups')
+    summed = grouped['values'].transform('sum')


Minor nit but we mostly use result to compare against expected - can you change the variable name here and on the assertion line?

WillAyd · 2018-07-11T18:47:05Z

pandas/tests/groupby/test_transform.py

+
+    grouped = df.groupby('groups')
+    result = grouped['values'].transform('sum')
+    expected = Series([np.nan, np.nan, np.nan], name='values')


Hmm maybe I'm missing the point but shouldn't these be 6 and not np.nan?

lopez86

Currently, the behavior is to output np.nan for anything in a group with label np.nan. Making this output 6 would change the interpretation of np.nan from "not in any group" to "in a group with label np.nan", which would be a pretty significant change in how things work. The treatment here of ignoring NA values is specified in the groupby docs.

For example, see test_groupby_transform_with_nan_group() in test_transform.py
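
To make the "not in any group" reading concrete, here is a small sketch with illustrative data (the values are assumptions, not taken from the test suite). It relies only on the public `groupby(...).transform('sum')` call; the exact value placed in the NaN-labelled row can depend on the pandas version, and under the interpretation described above it is expected to be NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data: one row with a NaN label, two regular groups.
df = pd.DataFrame({"groups": [np.nan, "A", "A", "B", "B"],
                   "values": [1, 2, 3, 4, 5]})

# transform returns one value per input row. Rows labelled A or B receive
# their group's sum; the NaN-labelled row belongs to no group, so there is
# no group sum to broadcast back to it.
result = df.groupby("groups")["values"].transform("sum")

# Sums for the rows that do belong to a group.
in_group = result[df["groups"].notna()]
```

The result keeps the input's shape, which is why the NaN-labelled row still needs some placeholder value.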

jreback · 2018-07-12T00:22:54Z

pls rebase, just reorged the groupby code a bit.

lopez86 · 2018-07-12T14:37:11Z

Should be ready for review now

WillAyd · 2018-07-13T01:47:58Z

pandas/core/groupby/ops.py

@@ -235,7 +235,7 @@ def size(self):
        if ngroup:
            out = np.bincount(ids[ids != -1], minlength=ngroup)
        else:
-            out = ids
+            out = []


Hmm something just seems strange to me about this. So your test example works fine, but what happens if we have a DataFrame that has other groups besides the all-NA one? Can you add a test for that and make sure that still works?

lopez86

Done. In that case, the else block should never be used.
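
A standalone sketch of the two branches, assuming the same convention as the hunk above (id -1 marks a null label). `group_sizes` is a hypothetical helper written for illustration, and it returns an empty array where the diff uses an empty list.

```python
import numpy as np

def group_sizes(ids, ngroup):
    """Count rows per group; rows with id -1 (null label) are skipped."""
    ids = np.asarray(ids)
    if ngroup:
        # At least one real group: drop the -1 rows, then count per id.
        return np.bincount(ids[ids != -1], minlength=ngroup)
    # No groups at all: return an empty result instead of ids itself.
    return np.array([], dtype=np.int64)

# One null row plus two groups of two rows each: the bincount branch runs.
mixed = group_sizes([-1, 0, 0, 1, 1], ngroup=2)

# Only null rows: ngroup is 0, so the fallback branch runs.
all_null = group_sizes([-1, -1, -1], ngroup=0)
```

With at least one non-null group, `ngroup` is non-zero and the fallback is never reached, which matches the point made above.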

WillAyd

Just thinking more through this as other similar issues have popped up - are we now invalidating the docs below?

http://pandas.pydata.org/pandas-docs/stable/missing_data.html#na-values-in-groupby

If so we should do a separate PR to update docs as well, though somewhat surprised this didn't break any other tests

WillAyd · 2018-07-13T14:45:43Z

pandas/tests/groupby/test_transform.py

+    tm.assert_series_equal(result, expected)
+
+    # Make sure the standard case still works too
+    df = DataFrame({'groups': [np.nan, 'A', 'A', 'B', 'B'],


Since this follows the same pattern as the code above it would be preferable to parametrize it instead of re-coding

lopez86

This is done in the latest update.
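
For readers unfamiliar with the pattern, a parametrized version could look roughly like the sketch below. The test name and the second data set are illustrative assumptions, not the code from the PR, and whether the all-NaN case passes depends on the pandas version, since that case is the bug under discussion.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize("groups, values, expected", [
    # Every label is NaN: the edge case this PR targets.
    ([np.nan, np.nan, np.nan], [1, 2, 3], [np.nan, np.nan, np.nan]),
    # One NaN label alongside regular groups (illustrative values).
    ([np.nan, "A", "A", "B", "B"], [1, 2, 3, 4, 5], [np.nan, 5, 5, 9, 9]),
])
def test_transform_sum_with_nan_labels(groups, values, expected):
    # Hypothetical test name; the body is shared by both cases.
    df = pd.DataFrame({"groups": groups, "values": values})
    result = df.groupby("groups")["values"].transform("sum")
    expected = pd.Series(expected, name="values", dtype="float64")
    pd.testing.assert_series_equal(result, expected)
```

Each tuple becomes its own test case, so a failure report names the data set that broke instead of stopping at the first assertion.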

lopez86 · 2018-07-13T15:31:11Z

I think nothing is breaking because this change only prevents a ValueError in an edge case. There is some ambiguity in the docs: we want to ignore NA values in group labels, but we also want transform to return a Series with the same shape as the input. It would definitely be good to have the behavior of transform in the presence of NA group labels explicitly spelled out somewhere.

jreback

Can you check out the current 23.X branch and see if your new tests pass? It's possible this was fixed in master, but I don't think the change was backported.

WillAyd · 2018-08-17T11:30:05Z

@jreback just tried locally and this wasn't working on master, so I think this change is still necessary

WillAyd

Can you add a whatsnew note for v0.23.5 and merge latest updates from master?

WillAyd · 2018-11-23T03:41:48Z

Closing as stale. Ping if you'd like to pick this back up

lopez86 added 2 commits July 10, 2018 19:46

BUG: Fix groupby bug pandas-dev#21624.

3543449

Fixes bug where operations such as transform('sum') raise errors when only a single null group exists.

Improving fix to bug pandas-dev#21624.

eddcbd1

Adding test to check that the issue is resolved.

59b3c45

WillAyd requested changes Jul 11, 2018

Changing variable name to match established practices

d47beea

WillAyd requested changes Jul 11, 2018

jreback added Bug Groupby labels Jul 12, 2018

jreback changed the title ~~BUG: Fix groupby issue, #21624~~ BUG: groupby with no non-empty groups, #21624 Jul 12, 2018

lopez86 added 2 commits July 11, 2018 23:28

Merge remote-tracking branch 'upstream/master' into fix_groupby_issue

bda628c

Update to deal with refactor

6d03bc9

WillAyd requested changes Jul 13, 2018

Adding text to make sure we didn't break the behavior in non-edge cases

2fd0e73

WillAyd requested changes Jul 13, 2018

Changed test to use parametrize decorator.

6d2855c

WillAyd approved these changes Jul 14, 2018

jreback requested changes Jul 20, 2018

WillAyd requested changes Aug 17, 2018

WillAyd closed this Nov 23, 2018
