BUG: .stack(dropna=False) looks through views incorrectly for dataframe views with multi-index columns #8844

seth-p · 2014-11-17T22:34:21Z

In the example below, [11] is incorrectly reflecting columns in dfa that should not be visible to dfa1. Note that this is not a problem when the columns are not a multi-index ([5] and [6]), or when dropna=True (the default; [10] and [12]).

Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC v.1600 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 2.3.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame(np.zeros((2,5)), columns=list('ABCDE'))

In [4]: df1 = df[df.columns[:2]]

In [5]: df1.stack()
Out[5]:
0  A    0
   B    0
1  A    0
   B    0
dtype: float64

In [6]: df1.stack(dropna=False)
Out[6]:
0  A    0
   B    0
1  A    0
   B    0
dtype: float64

In [7]: dfa = pd.DataFrame(np.zeros((2,5)),
                           columns=pd.MultiIndex.from_tuples([(1,'A'), (1,'B'), (1,'C'), (1,'D'), (1,'E')],
                                                             names=['num', 'let']))

In [8]: dfa1 = dfa[dfa.columns[:2]]

In [10]: dfa1.stack()
Out[10]:
num    1
  let
0 A    0
  B    0
1 A    0
  B    0

In [11]: dfa1.stack(dropna=False)
Out[11]:
num     1
  let
0 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN
1 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN

In [12]: dfa1.stack(dropna=True)
Out[12]:
num    1
  let
0 A    0
  B    0
1 A    0
  B    0

In [13]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.1
nose: 1.3.4
Cython: 0.21.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: 0.6.0
IPython: 2.3.1
sphinx: None
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.9
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.6.3
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: None


seth-p · 2014-11-17T22:44:10Z

Curiously, doing a deep copy doesn't help:

In [14]: dfa2 = dfa[dfa.columns[:2]].copy()

In [15]: dfa2.stack(dropna=False)
Out[15]:
num     1
  let
0 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN
1 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN

Though specifying all levels does:

In [46]: dfa1.stack(level=[0,1], dropna=False)
Out[46]:
   num  let
0  1    A      0
        B      0
1  1    A      0
        B      0
dtype: float64

Other variants:

In [51]: dfa1.stack(level=0, dropna=False)
Out[51]:
let    A  B
  num
0 1    0  0
1 1    0  0

In [52]: dfa1.stack(level=1, dropna=False)
Out[52]:
num     1
  let
0 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN
1 A     0
  B     0
  C   NaN
  D   NaN
  E   NaN

seth-p · 2014-11-17T23:18:13Z

More oddities I don't understand:

In [90]: dfb = pd.DataFrame(np.zeros((2,5)),
   ....:                    columns=pd.MultiIndex.from_tuples([(1,'A'), (1,'B'), (2,'C'), (2,'D'), (3,'E')],
   ....:                                                      names=['num', 'let']))

In [91]: dfb
Out[91]:
num  1     2     3
let  A  B  C  D  E
0    0  0  0  0  0
1    0  0  0  0  0

In [92]: dfb1 = dfb[dfb.columns[:3]]

In [93]: dfb1
Out[93]:
num  1     2
let  A  B  C
0    0  0  0
1    0  0  0

In [94]: dfb1.stack()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-94-a43c6a602bfd> in <module>()
----> 1 dfb1.stack()

C:\Python34\lib\site-packages\pandas\core\frame.py in stack(self, level, dropna)
   3430             return stack_multiple(self, level, dropna=dropna)
   3431         else:
-> 3432             return stack(self, level, dropna=dropna)
   3433
   3434     def unstack(self, level=-1):

C:\Python34\lib\site-packages\pandas\core\reshape.py in stack(frame, level, dropna)
    529
    530     if isinstance(frame.columns, MultiIndex):
--> 531         return _stack_multi_columns(frame, level=level, dropna=dropna)
    532     elif isinstance(frame.index, MultiIndex):
    533         new_levels = list(frame.index.levels)

C:\Python34\lib\site-packages\pandas\core\reshape.py in _stack_multi_columns(frame, level, dropna)
    648
    649     if len(drop_cols) > 0:
--> 650         new_columns = new_columns - drop_cols
    651
    652     N = len(this)

C:\Python34\lib\site-packages\pandas\core\index.py in _evaluate_numeric_binop(self, other)
   2139                 else:
   2140                     if not (com.is_float(other) or com.is_integer(other)):
-> 2141                         raise TypeError("can only perform ops with scalar values")
   2142
   2143                 # if we are a reversed non-communative op

TypeError: can only perform ops with scalar values

In [95]: dfb1.stack(level=0)
Out[95]:
let     A   B   C
  num
0 1     0   0 NaN
  2   NaN NaN   0
1 1     0   0 NaN
  2   NaN NaN   0

In [96]: dfb1.stack(level=1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-96-c2c854f1a484> in <module>()
----> 1 dfb1.stack(level=1)

C:\Python34\lib\site-packages\pandas\core\frame.py in stack(self, level, dropna)
   3430             return stack_multiple(self, level, dropna=dropna)
   3431         else:
-> 3432             return stack(self, level, dropna=dropna)
   3433
   3434     def unstack(self, level=-1):

C:\Python34\lib\site-packages\pandas\core\reshape.py in stack(frame, level, dropna)
    529
    530     if isinstance(frame.columns, MultiIndex):
--> 531         return _stack_multi_columns(frame, level=level, dropna=dropna)
    532     elif isinstance(frame.index, MultiIndex):
    533         new_levels = list(frame.index.levels)

C:\Python34\lib\site-packages\pandas\core\reshape.py in _stack_multi_columns(frame, level, dropna)
    648
    649     if len(drop_cols) > 0:
--> 650         new_columns = new_columns - drop_cols
    651
    652     N = len(this)

C:\Python34\lib\site-packages\pandas\core\index.py in _evaluate_numeric_binop(self, other)
   2139                 else:
   2140                     if not (com.is_float(other) or com.is_integer(other)):
-> 2141                         raise TypeError("can only perform ops with scalar values")
   2142
   2143                 # if we are a reversed non-communative op

TypeError: can only perform ops with scalar values

In [97]: dfb1.stack(level=[0,1])
Out[97]:
   num  let
0  1    A      0
        B      0
   2    C      0
1  1    A      0
        B      0
   2    C      0
dtype: float64

seth-p · 2014-11-17T23:21:41Z

And yet more oddities (I'd expect [102] to produce the same output as [101]):

In [98]: dfb
Out[98]:
num  1     2     3
let  A  B  C  D  E
0    0  0  0  0  0
1    0  0  0  0  0

In [99]: dfb2 = dfb[dfb.columns[:2]]

In [100]: dfb2
Out[100]:
num  1
let  A  B
0    0  0
1    0  0

In [101]: dfb2.stack(level=[0,1])
Out[101]:
   num  let
0  1    A      0
        B      0
1  1    A      0
        B      0
dtype: float64

In [102]: dfb2.stack(level=[0,1], dropna=False)
Out[102]:
   num  let
0  1    A       0
        B       0
   2    A     NaN
        B     NaN
   3    A     NaN
        B     NaN
1  1    A       0
        B       0
   2    A     NaN
        B     NaN
   3    A     NaN
        B     NaN
dtype: float64

rockg · 2014-11-17T23:25:39Z

Regarding your last comment, dropna=True by default, so I'd expect [101] and [102] to be different.

seth-p · 2014-11-18T05:30:19Z

But there are no NaNs, so I'd expect them to be the same.

jreback · 2014-11-18T11:36:34Z

This is somewhat related to #2770.

It's an easy fix, though: just recompute the levels rather than taking them as is. Index objects are shared, so you need to create a new one here rather than reusing the original and just appending levels. That's why you get the 'hidden' values showing up.
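To make the "hidden" values concrete, here is a minimal sketch that inspects the levels stored on the column subset from the original report. It assumes a later pandas release than the 0.15.1 used in this thread, since `MultiIndex.remove_unused_levels()` was only added in pandas 0.20; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same setup as In [7] and In [8] above.
columns = pd.MultiIndex.from_tuples(
    [(1, 'A'), (1, 'B'), (1, 'C'), (1, 'D'), (1, 'E')],
    names=['num', 'let'])
dfa = pd.DataFrame(np.zeros((2, 5)), columns=columns)
dfa1 = dfa[dfa.columns[:2]]

# Values actually present in the 'let' level of the subset.
used_values = list(dfa1.columns.get_level_values('let'))

# Levels stored on the subset's MultiIndex; these are carried over from dfa.
stored_levels = list(dfa1.columns.levels[1])

# remove_unused_levels() returns a new MultiIndex with trimmed levels.
trimmed = dfa1.columns.remove_unused_levels()
trimmed_levels = list(trimmed.levels[1])

print(used_values, stored_levels, trimmed_levels)
```

Comparing `used_values` with `stored_levels` is a quick way to check whether a frame carries levels it no longer uses.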

seth-p · 2014-11-18T20:52:01Z

So far as I can tell, all the problems listed above are solved by first calling the following function:

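import pandas as pd

def fix_multiindex_axes(obj):
    # Rebuild every MultiIndex axis from its tuples so that its levels contain
    # only the values actually present on that axis.
    for axis_number, axis_index in enumerate(obj.axes):
        if isinstance(axis_index, pd.MultiIndex):
            # pandas 0.15.1 signature: set_axis(axis, labels), assigns in place.
            obj.set_axis(axis_number,
                         pd.MultiIndex.from_tuples(axis_index.get_values(),
                                                   names=axis_index.names))
    return obj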

jreback · 2014-11-18T21:41:33Z

@seth-p yes, that essentially recreates the index from scratch, thereby removing any aliasing issues.

This should probably be done in unstack though (by calling out to a method on MultiIndex, something like _deep_copy / _recreate or similar).
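For readers on later pandas releases, where `get_values()` is no longer available and `set_axis` takes its arguments in a different order, the same rebuild-from-tuples idea can be sketched as below. The name `rebuild_multiindex_axes` is hypothetical, the sketch only handles DataFrames, and it is not necessarily what #8855 implements.

```python
import numpy as np
import pandas as pd

def rebuild_multiindex_axes(frame):
    """Return a copy of frame whose MultiIndex axes are rebuilt from tuples."""
    result = frame.copy()
    for axis_name in ("index", "columns"):
        axis_index = getattr(result, axis_name)
        if isinstance(axis_index, pd.MultiIndex):
            # Iterating a MultiIndex yields its tuples; from_tuples then
            # derives fresh levels from only those tuples.
            fresh = pd.MultiIndex.from_tuples(list(axis_index),
                                              names=axis_index.names)
            setattr(result, axis_name, fresh)
    return result

# Same frame as dfb / dfb2 in the comments above.
dfb = pd.DataFrame(
    np.zeros((2, 5)),
    columns=pd.MultiIndex.from_tuples(
        [(1, 'A'), (1, 'B'), (2, 'C'), (2, 'D'), (3, 'E')],
        names=['num', 'let']))
dfb2 = dfb[dfb.columns[:2]]
rebuilt = rebuild_multiindex_axes(dfb2)
print(rebuilt.columns.levels)
```

Unlike the in-place version above, this sketch returns a new frame and leaves its argument untouched.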

fyi, similar issue in #8850

Would you like to do a PR?

rockg · 2014-11-18T21:59:58Z

I think we also need to be cognizant of speed here if we are talking about recreating the entire index: how does it perform on large frames (e.g. a 500,000x5 DataFrame)?
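One rough way to measure this is sketched below. It assumes a recent pandas (for `remove_unused_levels()`), and `time_rebuild` and its frame layout are made up for illustration; the timings depend entirely on the machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

def time_rebuild(n_rows=500_000, repeats=3):
    """Time two ways of refreshing a MultiIndex on an n_rows x 5 frame."""
    # Hypothetical layout: a two-level row index, then a slice that leaves
    # half of the outer level unused.
    index = pd.MultiIndex.from_arrays(
        [np.arange(n_rows) // 10, np.arange(n_rows) % 10],
        names=['outer', 'inner'])
    frame = pd.DataFrame(np.zeros((n_rows, 5)), index=index,
                         columns=list('ABCDE'))
    view = frame.iloc[:n_rows // 2]

    # Full rebuild from Python tuples, as in fix_multiindex_axes.
    from_tuples = min(timeit.repeat(
        lambda: pd.MultiIndex.from_tuples(list(view.index),
                                          names=view.index.names),
        number=1, repeat=repeats))
    # Trimming only the unused level values.
    trimmed = min(timeit.repeat(
        lambda: view.index.remove_unused_levels(),
        number=1, repeat=repeats))
    return from_tuples, trimmed

if __name__ == "__main__":
    print(time_rebuild())
```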

seth-p · 2014-11-19T02:21:55Z

OK, I'll give it a shot, though if what I have in mind is right I don't think it will address #8850.

seth-p · 2014-11-19T22:16:01Z

OK, I think I have this more or less done, but have a question about the desired behavior of DataFrame.stack() when the MultiIndex levels aren't sorted. Granted this is rare, as it appears that MultiIndex.from_arrays(), MultiIndex.from_product(), and MultiIndex.from_tuples() all produce sorted levels; but one can construct a MultiIndex with unsorted levels directly via the constructor.

In the following example (production 0.15.1), should [140] df.stack() and [141] df1.stack() not produce the same results?

In [129]: mi = MultiIndex.from_tuples([('B','x'), ('B','z'), ('A','y'), ('C','x')],
                                      names=['Upper', 'Lower'])

In [130]: mi1 = MultiIndex(levels=[['B','A','C'], ['y','x','z']],
                           labels=[[0,0,1,2], [1,2,0,1]],
                           names=['Upper', 'Lower'])

In [131]: mi
Out[131]:
MultiIndex(levels=[['A', 'B', 'C'], ['x', 'y', 'z']],
           labels=[[1, 1, 0, 2], [0, 2, 1, 0]],
           names=['Upper', 'Lower'])

In [132]: mi1
Out[132]:
MultiIndex(levels=[['B', 'A', 'C'], ['y', 'x', 'z']],
           labels=[[0, 0, 1, 2], [1, 2, 0, 1]],
           names=['Upper', 'Lower'])

In [133]: mi.get_values()
Out[133]: array([('B', 'x'), ('B', 'z'), ('A', 'y'), ('C', 'x')], dtype=object)

In [134]: mi1.get_values()
Out[134]: array([('B', 'x'), ('B', 'z'), ('A', 'y'), ('C', 'x')], dtype=object)

In [135]: df = DataFrame(np.arange(12).reshape(3,4), columns=mi)

In [136]: df1 = DataFrame(np.arange(12).reshape(3,4), columns=mi1)

In [137]: df
Out[137]:
Upper  B      A   C
Lower  x  z   y   x
0      0  1   2   3
1      4  5   6   7
2      8  9  10  11

In [138]: df1
Out[138]:
Upper  B      A   C
Lower  x  z   y   x
0      0  1   2   3
1      4  5   6   7
2      8  9  10  11

In [139]: df.equals(df1)
Out[139]: True

In [140]: df.stack()
Out[140]:
Upper     A   B   C
  Lower
0 x     NaN   0   3
  y       2 NaN NaN
  z     NaN   1 NaN
1 x     NaN   4   7
  y       6 NaN NaN
  z     NaN   5 NaN
2 x     NaN   8  11
  y      10 NaN NaN
  z     NaN   9 NaN

In [141]: df1.stack()
Out[141]:
Upper     B   A   C
  Lower
0 y     NaN   2 NaN
  x       0 NaN   3
  z       1 NaN NaN
1 y     NaN   6 NaN
  x       4 NaN   7
  z       5 NaN NaN
2 y     NaN  10 NaN
  x       8 NaN  11
  z       9 NaN NaN

In [142]: df.stack().equals(df1.stack())
Out[142]: False
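The two indexes above hold the same tuples and differ only in how their levels are ordered. A small sketch of how to check that, and how to normalize the level order, follows. It assumes a recent pandas, where the constructor argument is `codes=` rather than `labels=`; `levels_match` and `normalized` are illustrative names.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [('B', 'x'), ('B', 'z'), ('A', 'y'), ('C', 'x')],
    names=['Upper', 'Lower'])
# Same tuples, but with deliberately unsorted levels.
mi1 = pd.MultiIndex(levels=[['B', 'A', 'C'], ['y', 'x', 'z']],
                    codes=[[0, 0, 1, 2], [1, 2, 0, 1]],
                    names=['Upper', 'Lower'])

# equals() compares the tuples the indexes represent, not the level order.
same_tuples = mi.equals(mi1)

# The stored levels differ, which is what stack() ends up iterating over.
levels_match = all(list(a) == list(b) for a, b in zip(mi.levels, mi1.levels))

# Rebuilding from tuples yields sorted levels again.
normalized = pd.MultiIndex.from_tuples(list(mi1), names=mi1.names)
print(same_tuples, levels_match, normalized.levels)
```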

seth-p · 2014-11-20T23:37:33Z

@jreback, I think this is done in #8855 pending the answer to the question above re: unordered MultiIndex.levels. Also, let me know if you think additional tests are required.

jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 18, 2014

jreback added this to the 0.15.2 milestone Nov 18, 2014

jreback mentioned this issue Nov 18, 2014

BUG: Accessing DataFrame multi-index column *seems* to modify its content #8850

Closed

seth-p mentioned this issue Nov 19, 2014

BUG: DataFrame.stack(..., dropna=False) with partial MultiIndex. #8855

Merged

jreback closed this as completed in #8855 Dec 2, 2014
