
BUG: .stack(dropna=False) looks through views incorrectly for dataframe views with multi-index columns #8844

Closed
seth-p opened this issue Nov 17, 2014 · 12 comments · Fixed by #8855
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@seth-p
Contributor

seth-p commented Nov 17, 2014

xref #8850

In the example below, [11] is incorrectly reflecting columns in dfa that should not be visible to dfa1. Note that this is not a problem when the columns are not a multi-index ([5] and [6]), or when dropna=True (the default; [10] and [12]).

Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC v.1600 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 2.3.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame(np.zeros((2,5)), columns=list('ABCDE'))

In [4]: df1 = df[df.columns[:2]]

In [5]: df1.stack()
Out[5]:
0  A    0
   B    0
1  A    0
   B    0
dtype: float64

In [6]: df1.stack(dropna=False)
Out[6]:
0  A    0
   B    0
1  A    0
   B    0
dtype: float64

In [7]: dfa = pd.DataFrame(np.zeros((2,5)),
                           columns=pd.MultiIndex.from_tuples([(1,'A'), (1,'B'), (1,'C'), (1,'D'), (1,'E')],
                                                             names=['num', 'let']))

In [8]: dfa1 = dfa[dfa.columns[:2]]

In [10]: dfa1.stack()
Out[10]:
num    1
  let
0 A    0
  B    0
1 A    0
  B    0

In [11]: dfa1.stack(dropna=False)
Out[11]:
num     1
  let
0 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN
1 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN

In [12]: dfa1.stack(dropna=True)
Out[12]:
num    1
  let
0 A    0
  B    0
1 A    0
  B    0

In [13]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.1
nose: 1.3.4
Cython: 0.21.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: 0.6.0
IPython: 2.3.1
sphinx: None
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.9
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.6.3
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: None
@seth-p
Contributor Author

seth-p commented Nov 17, 2014

Curiously, doing a deep copy doesn't help:

In [14]: dfa2 = dfa[dfa.columns[:2]].copy()

In [15]: dfa2.stack(dropna=False)
Out[15]:
num     1
  let
0 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN
1 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN

Though specifying all levels does:

In [46]: dfa1.stack(level=[0,1], dropna=False)
Out[46]:
   num  let
0  1    A      0
        B      0
1  1    A      0
        B      0
dtype: float64

Other variants:

In [51]: dfa1.stack(level=0, dropna=False)
Out[51]:
let    A  B
  num
0 1    0  0
1 1    0  0

In [52]: dfa1.stack(level=1, dropna=False)
Out[52]:
num     1
  let
0 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN
1 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN

@seth-p
Contributor Author

seth-p commented Nov 17, 2014

More oddities I don't understand:

In [90]: dfb = pd.DataFrame(np.zeros((2,5)),
   ....:                    columns=pd.MultiIndex.from_tuples([(1,'A'), (1,'B'), (2,'C'), (2,'D'), (3,'E')],
   ....:                                                      names=['num', 'let']))

In [91]: dfb
Out[91]:
num  1     2     3
let  A  B  C  D  E
0    0  0  0  0  0
1    0  0  0  0  0

In [92]: dfb1 = dfb[dfb.columns[:3]]

In [93]: dfb1
Out[93]:
num  1     2
let  A  B  C
0    0  0  0
1    0  0  0

In [94]: dfb1.stack()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-94-a43c6a602bfd> in <module>()
----> 1 dfb1.stack()

C:\Python34\lib\site-packages\pandas\core\frame.py in stack(self, level, dropna)
   3430             return stack_multiple(self, level, dropna=dropna)
   3431         else:
-> 3432             return stack(self, level, dropna=dropna)
   3433
   3434     def unstack(self, level=-1):

C:\Python34\lib\site-packages\pandas\core\reshape.py in stack(frame, level, dropna)
    529
    530     if isinstance(frame.columns, MultiIndex):
--> 531         return _stack_multi_columns(frame, level=level, dropna=dropna)
    532     elif isinstance(frame.index, MultiIndex):
    533         new_levels = list(frame.index.levels)

C:\Python34\lib\site-packages\pandas\core\reshape.py in _stack_multi_columns(frame, level, dropna)
    648
    649     if len(drop_cols) > 0:
--> 650         new_columns = new_columns - drop_cols
    651
    652     N = len(this)

C:\Python34\lib\site-packages\pandas\core\index.py in _evaluate_numeric_binop(self, other)
   2139                 else:
   2140                     if not (com.is_float(other) or com.is_integer(other)):
-> 2141                         raise TypeError("can only perform ops with scalar values")
   2142
   2143                 # if we are a reversed non-communative op

TypeError: can only perform ops with scalar values

In [95]: dfb1.stack(level=0)
Out[95]:
let     A   B   C
  num
0 1     0   0 NaN
  2   NaN NaN   0
1 1     0   0 NaN
  2   NaN NaN   0

In [96]: dfb1.stack(level=1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-96-c2c854f1a484> in <module>()
----> 1 dfb1.stack(level=1)

C:\Python34\lib\site-packages\pandas\core\frame.py in stack(self, level, dropna)
   3430             return stack_multiple(self, level, dropna=dropna)
   3431         else:
-> 3432             return stack(self, level, dropna=dropna)
   3433
   3434     def unstack(self, level=-1):

C:\Python34\lib\site-packages\pandas\core\reshape.py in stack(frame, level, dropna)
    529
    530     if isinstance(frame.columns, MultiIndex):
--> 531         return _stack_multi_columns(frame, level=level, dropna=dropna)
    532     elif isinstance(frame.index, MultiIndex):
    533         new_levels = list(frame.index.levels)

C:\Python34\lib\site-packages\pandas\core\reshape.py in _stack_multi_columns(frame, level, dropna)
    648
    649     if len(drop_cols) > 0:
--> 650         new_columns = new_columns - drop_cols
    651
    652     N = len(this)

C:\Python34\lib\site-packages\pandas\core\index.py in _evaluate_numeric_binop(self, other)
   2139                 else:
   2140                     if not (com.is_float(other) or com.is_integer(other)):
-> 2141                         raise TypeError("can only perform ops with scalar values")
   2142
   2143                 # if we are a reversed non-communative op

TypeError: can only perform ops with scalar values

In [97]: dfb1.stack(level=[0,1])
Out[97]:
   num  let
0  1    A      0
        B      0
   2    C      0
1  1    A      0
        B      0
   2    C      0
dtype: float64

@seth-p
Contributor Author

seth-p commented Nov 17, 2014

And yet more oddities (I'd expect [102] to produce the same output as [101]):

In [98]: dfb
Out[98]:
num  1     2     3
let  A  B  C  D  E
0    0  0  0  0  0
1    0  0  0  0  0

In [99]: dfb2 = dfb[dfb.columns[:2]]

In [100]: dfb2
Out[100]:
num  1
let  A  B
0    0  0
1    0  0

In [101]: dfb2.stack(level=[0,1])
Out[101]:
   num  let
0  1    A      0
        B      0
1  1    A      0
        B      0
dtype: float64

In [102]: dfb2.stack(level=[0,1], dropna=False)
Out[102]:
   num  let
0  1    A       0
        B       0
   2    A     NaN
        B     NaN
   3    A     NaN
        B     NaN
1  1    A       0
        B       0
   2    A     NaN
        B     NaN
   3    A     NaN
        B     NaN
dtype: float64

@rockg
Contributor

rockg commented Nov 17, 2014

Regarding your last comment, dropna=True by default, so I expect [101] and [102] to be different.

@seth-p
Contributor Author

seth-p commented Nov 18, 2014

But there are no NaNs, so I'd expect them to be the same.

@jreback
Contributor

jreback commented Nov 18, 2014

This is somewhat related to #2770.

It's an easy fix, though: recompute the levels rather than taking them as is. Index objects are shared, so you need to create a new one here rather than reusing the original and just appending levels. That's why the 'hidden' values show up.
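
A minimal sketch of the aliasing described above, assuming a recent pandas in which `MultiIndex.remove_unused_levels()` is available (it did not exist in 0.15.1). Slicing a MultiIndex narrows which entries are visible but keeps the parent's full `levels`, which is where the 'hidden' C, D and E come from:

```python
import pandas as pd

# Columns shaped like `dfa` from the report.
cols = pd.MultiIndex.from_tuples(
    [(1, "A"), (1, "B"), (1, "C"), (1, "D"), (1, "E")],
    names=["num", "let"],
)

# Take the first two columns, as dfa.columns[:2] does.
sliced = cols[:2]

# Only two entries are visible ...
print(len(sliced))  # 2
# ... but the 'let' level still carries every label from the parent.
print(list(sliced.levels[1]))  # ['A', 'B', 'C', 'D', 'E']

# Recomputing the levels drops the labels that are no longer used.
recomputed = sliced.remove_unused_levels()
print(list(recomputed.levels[1]))  # ['A', 'B']
```

Any code that enumerates `levels` instead of the values actually present will see the stale labels, which is consistent with the extra NaN rows in [11].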

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 18, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 18, 2014
@seth-p
Contributor Author

seth-p commented Nov 18, 2014

So far as I can tell, all the problems listed above are solved by first calling the following function:

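def fix_multiindex_axes(obj):
    # Rebuild every MultiIndex axis from its tuples so that its levels hold
    # only the values actually present (written against pandas 0.15.1).
    for axis_number, axis_index in enumerate(obj.axes):
        if isinstance(axis_index, pd.MultiIndex):
            # In this version set_axis(axis, labels) replaces the axis in place.
            obj.set_axis(axis_number,
                         pd.MultiIndex.from_tuples(axis_index.get_values(),
                                                   names=axis_index.names))
    return obj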
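
For readers on a current pandas, the helper above is tied to the 0.15.1 API: `Index.get_values()` has since been removed, and `set_axis` now takes the labels first and returns a new object. A hedged sketch of the same idea follows; `rebuild_multiindex_axes` is a hypothetical name, and returning a copy rather than mutating the input is an assumption of this sketch, not something decided in the thread:

```python
import numpy as np
import pandas as pd


def rebuild_multiindex_axes(obj):
    """Return a copy of `obj` whose MultiIndex axes are rebuilt from their tuples."""
    out = obj.copy()
    for axis_number, axis_index in enumerate(obj.axes):
        if isinstance(axis_index, pd.MultiIndex):
            # Building from the tuples recomputes the levels from scratch.
            fresh = pd.MultiIndex.from_tuples(list(axis_index), names=axis_index.names)
            out = out.set_axis(fresh, axis=axis_number)
    return out


dfa = pd.DataFrame(
    np.zeros((2, 5)),
    columns=pd.MultiIndex.from_tuples(
        [(1, "A"), (1, "B"), (1, "C"), (1, "D"), (1, "E")], names=["num", "let"]
    ),
)
dfa1 = rebuild_multiindex_axes(dfa[dfa.columns[:2]])
print(list(dfa1.columns.levels[1]))  # ['A', 'B']
```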

@jreback
Contributor

jreback commented Nov 18, 2014

@seth-p yes, that essentially recreates the index from scratch, thereby removing any aliasing issues.

This should probably be done in unstack though (with a method called out to MultiIndex, something like _deep_copy / _recreate or something similar).

FYI, similar issue in #8850.

Would you like to do a PR?

@rockg
Contributor

rockg commented Nov 18, 2014

I think we also need to be cognizant of speed here if we are talking about recreating the entire index: how does it perform on large frames (e.g. a 500,000x5 DataFrame)?
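
One way to get a feel for that cost is a small timing harness. The sketch below is illustrative only: `time_rebuild` is a hypothetical helper, it assumes a pandas recent enough to have `MultiIndex.remove_unused_levels()`, and it puts the MultiIndex on the long axis since that is where a rebuild would be expensive. No timings are quoted because they depend on machine and version.

```python
import time

import numpy as np
import pandas as pd


def time_rebuild(n_rows):
    """Time two ways of dropping unused levels from a sliced MultiIndex."""
    full = pd.MultiIndex.from_arrays(
        [np.arange(2 * n_rows) // 2, np.arange(2 * n_rows)],
        names=["outer", "inner"],
    )
    sliced = full[:n_rows]  # levels still describe all 2 * n_rows entries

    # Option 1: rebuild the whole index from its tuples.
    start = time.perf_counter()
    from_scratch = pd.MultiIndex.from_tuples(list(sliced), names=sliced.names)
    t_tuples = time.perf_counter() - start

    # Option 2: keep the index and only trim its levels.
    start = time.perf_counter()
    trimmed = sliced.remove_unused_levels()
    t_trim = time.perf_counter() - start

    return from_scratch, trimmed, t_tuples, t_trim


from_scratch, trimmed, t_tuples, t_trim = time_rebuild(10_000)
print(f"from_tuples: {t_tuples:.4f}s  remove_unused_levels: {t_trim:.4f}s")
```

Calling `time_rebuild(500_000)` approximates the frame size mentioned above.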

@seth-p
Contributor Author

seth-p commented Nov 19, 2014

OK, I'll give it a shot, though if what I have in mind is right I don't think it will address #8850.

@seth-p
Contributor Author

seth-p commented Nov 19, 2014

OK, I think I have this more or less done, but have a question about the desired behavior of DataFrame.stack() when the MultiIndex levels aren't sorted. Granted this is rare, as it appears that MultiIndex.from_arrays(), MultiIndex.from_product(), and MultiIndex.from_tuples() all produce sorted levels; but one can construct a MultiIndex with unsorted levels directly via the constructor.

In the following example (production 0.15.1), should [140] df.stack() and [141] df1.stack() not produce the same results?

In [129]: mi = MultiIndex.from_tuples([('B','x'), ('B','z'), ('A','y'), ('C','x')],
                                      names=['Upper', 'Lower'])

In [130]: mi1 = MultiIndex(levels=[['B','A','C'], ['y','x','z']],
                           labels=[[0,0,1,2], [1,2,0,1]],
                           names=['Upper', 'Lower'])

In [131]: mi
Out[131]:
MultiIndex(levels=[['A', 'B', 'C'], ['x', 'y', 'z']],
           labels=[[1, 1, 0, 2], [0, 2, 1, 0]],
           names=['Upper', 'Lower'])

In [132]: mi1
Out[132]:
MultiIndex(levels=[['B', 'A', 'C'], ['y', 'x', 'z']],
           labels=[[0, 0, 1, 2], [1, 2, 0, 1]],
           names=['Upper', 'Lower'])

In [133]: mi.get_values()
Out[133]: array([('B', 'x'), ('B', 'z'), ('A', 'y'), ('C', 'x')], dtype=object)

In [134]: mi1.get_values()
Out[134]: array([('B', 'x'), ('B', 'z'), ('A', 'y'), ('C', 'x')], dtype=object)

In [135]: df = DataFrame(np.arange(12).reshape(3,4), columns=mi)

In [136]: df1 = DataFrame(np.arange(12).reshape(3,4), columns=mi1)

In [137]: df
Out[137]:
Upper  B      A   C
Lower  x  z   y   x
0      0  1   2   3
1      4  5   6   7
2      8  9  10  11

In [138]: df1
Out[138]:
Upper  B      A   C
Lower  x  z   y   x
0      0  1   2   3
1      4  5   6   7
2      8  9  10  11

In [139]: df.equals(df1)
Out[139]: True

In [140]: df.stack()
Out[140]:
Upper     A   B   C
  Lower
0 x     NaN   0   3
  y       2 NaN NaN
  z     NaN   1 NaN
1 x     NaN   4   7
  y       6 NaN NaN
  z     NaN   5 NaN
2 x     NaN   8  11
  y      10 NaN NaN
  z     NaN   9 NaN

In [141]: df1.stack()
Out[141]:
Upper     B   A   C
  Lower
0 y     NaN   2 NaN
  x       0 NaN   3
  z       1 NaN NaN
1 y     NaN   6 NaN
  x       4 NaN   7
  z       5 NaN NaN
2 y     NaN  10 NaN
  x       8 NaN  11
  z       9 NaN NaN

In [142]: df.stack().equals(df1.stack())
Out[142]: False
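
The difference between `mi` and `mi1` can be isolated without any DataFrame. This is a sketch for a current pandas, where the constructor argument `labels` used above has been renamed `codes`; everything else mirrors [129] and [130]:

```python
import pandas as pd

tuples = [("B", "x"), ("B", "z"), ("A", "y"), ("C", "x")]

# from_tuples factorizes each level, which sorts the level values.
mi = pd.MultiIndex.from_tuples(tuples, names=["Upper", "Lower"])

# The constructor stores the levels in the order given (note `codes=`, not `labels=`).
mi1 = pd.MultiIndex(
    levels=[["B", "A", "C"], ["y", "x", "z"]],
    codes=[[0, 0, 1, 2], [1, 2, 0, 1]],
    names=["Upper", "Lower"],
)

print(list(mi.levels[0]), list(mi1.levels[0]))  # ['A', 'B', 'C'] ['B', 'A', 'C']
print(list(mi) == list(mi1))  # True
```

The two indexes hold the same tuples in the same order and differ only in how their levels are ordered, so any result that follows level order, as in [140] and [141], can differ between them.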

@seth-p
Contributor Author

seth-p commented Nov 20, 2014

@jreback, I think this is done in #8855 pending the answer to the question above re: unordered MultiIndex.levels. Also, let me know if you think additional tests are required.
