column multiindex and reindex inconsistency #16626

dfolch · 2017-06-07T22:11:30Z

In [3]: df1 = pd.DataFrame(np.random.randint(0,10,(4,3)), columns=['a','b','c'])

In [4]: df1
Out[4]: 
   a  b  c
0  9  3  2
1  9  0  2
2  7  7  6
3  3  4  2

In [5]: df2 = df1.reindex(columns=['c','a','d'])

In [6]: df2
Out[6]: 
   c  a   d
0  2  9 NaN
1  2  9 NaN
2  6  7 NaN
3  2  3 NaN

In [7]: df2.columns
Out[7]: Index([u'c', u'a', u'd'], dtype='object')

In [8]: ci = pd.MultiIndex.from_product([['x','y'],['a','b','c']])

In [10]: df3 = pd.DataFrame(np.random.randint(0,10,(4,6)), columns=ci)

In [11]: df3
Out[11]: 
   x        y      
   a  b  c  a  b  c
0  3  1  5  3  8  7
1  2  7  6  8  7  4
2  8  3  5  7  1  1
3  7  5  8  7  8  7

In [12]: df4 = df3.reindex(columns=['c','a','d'], level=1)

In [13]: df4
Out[13]: 
   x     y   
   c  a  c  a
0  5  3  7  3
1  6  2  4  8
2  5  8  1  7
3  8  7  7  7

In [14]: df4.columns
Out[14]: 
MultiIndex(levels=[[u'x', u'y'], [u'c', u'a', u'd']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

When passing a nonexistent column name to reindex on a DataFrame without MultiIndex columns, the result is:

- a NaN column with the "new" column name
- the columns attribute matches the columns in the DataFrame

The same action on a DataFrame with MultiIndex columns produces different results:

- there are no NaN columns (this may not be a problem)
- the columns attribute of the resulting DataFrame does not match the DataFrame's column names (this appears to be a bug)

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 16.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None


TomAugspurger · 2017-06-08T20:47:20Z

the columns attribute of the resulting dataframe does not match the dataframe column names (this appears to be a bug)

If you look at df4.columns, you'll notice that 'd' is present:

In [65]: df4.columns.levels[1]
Out[65]: Index(['c', 'a', 'd'], dtype='object')

Going back to your first point, that there are no NaN columns in the output: that's because there are no labels for 'd' in that level (notice that there aren't any 2s in the second labels array of your Out[14]). If you want similar behavior, you'd have to reindex on both levels:

In [77]: df3.reindex(columns=[('x', 'a'), ('x', 'd')])
Out[77]:
   x
   a   d
0  2 NaN
1  9 NaN
2  8 NaN
3  7 NaN

In [78]: df3.reindex(columns=[('x', 'a'), ('x', 'd')]).columns
Out[78]:
MultiIndex(levels=[['x'], ['a', 'd']],
           labels=[[0, 0], [0, 1]])

Make sense?
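The split between defined levels and the integer positions that point into them can be seen in isolation with a small sketch. It builds an index by hand that mirrors Out[14] rather than calling reindex, and it assumes a pandas version where the `labels` attribute shown in this thread is exposed as `codes` (0.24 or later); the variable names are illustrative only.

```python
import pandas as pd

# Hand-built index mirroring Out[14]: 'd' is defined in level 1,
# but no integer code points at it (position 2 is never used).
cols = pd.MultiIndex(
    levels=[["x", "y"], ["c", "a", "d"]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)

defined = list(cols.levels[1])                        # every label the level defines
present = list(cols.get_level_values(1))              # labels the columns actually carry
used_positions = sorted(set(cols.codes[1].tolist()))  # positions referenced by the codes

print(defined)         # ['c', 'a', 'd']
print(present)         # ['c', 'a', 'c', 'a']
print(used_positions)  # [0, 1]
```

Because no code equals 2, 'd' is listed in the level but never appears as a column.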

jreback · 2017-06-09T10:46:25Z

duplicate of this issue: #12319.

This is a deliberate choice at the moment. Feel free to comment on the other issue.

dfolch · 2017-06-10T21:44:52Z

The NaN column issue is tricky for sure. Not everyone wants the same lower level columns under every higher level column.

The columns attribute still seems a little weird since it can contain column names that are not in the dataframe. In my use case I did a lot of manipulations to a dataframe, and then wanted to see which column names remained. Below is my solution:

In [62]: ci2 = pd.MultiIndex.from_tuples([('x','a'),('y','a'),('x','b'),('y','b'),('x','c')])

In [63]: df5 = pd.DataFrame(np.random.randint(0,10,(4,5)), columns=ci2)  

In [64]: df6 = df5.drop(('x','c'), axis=1)

In [65]: df6
Out[65]: 
   x  y  x  y
   a  a  b  b
0  0  0  8  4
1  7  2  0  4
2  4  2  2  8
3  6  1  7  5

In [66]: df6.columns
Out[66]: 
MultiIndex(levels=[[u'x', u'y'], [u'a', u'b', u'c']],
           labels=[[0, 1, 0, 1], [0, 0, 1, 1]])

In [67]: df6.columns.levels[1][pd.unique(df6.columns.labels[1])]
Out[67]: Index([u'a', u'b'], dtype='object')
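A possibly simpler way to get the same answer, sketched here as an alternative rather than as the recommended approach, is to ask for the level values that the columns actually carry and deduplicate them. This avoids indexing into `levels` by hand and does not depend on the `labels` attribute, which later pandas releases call `codes`. The frame below uses zeros instead of random data so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Same column layout as df5 above, filled with zeros for reproducibility.
ci2 = pd.MultiIndex.from_tuples(
    [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b"), ("x", "c")]
)
df5 = pd.DataFrame(np.zeros((4, 5), dtype=int), columns=ci2)
df6 = df5.drop(("x", "c"), axis=1)

# get_level_values returns one label per remaining column,
# so unused level entries such as 'c' never show up.
remaining = df6.columns.get_level_values(1).unique()

print(list(remaining))  # ['a', 'b']
```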

jreback · 2017-06-10T22:37:42Z

This is defined behavior for how a MultiIndex works.
See http://pandas.pydata.org/pandas-docs/stable/advanced.html#defined-levels, and take a look at remove_unused_levels.

dfolch · 2017-06-11T15:53:48Z

I see now that there was a debate on this in #2770. remove_unused_levels is exactly what I needed. Thanks for implementing that @jreback.
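For anyone landing here later, a minimal sketch of how `remove_unused_levels` applies to the df6 example from this thread. It assumes a pandas version that provides the method (to my knowledge it was added in 0.20, so the 0.19.1 install listed above would need an upgrade) and uses zeros instead of random data.

```python
import numpy as np
import pandas as pd

ci2 = pd.MultiIndex.from_tuples(
    [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b"), ("x", "c")]
)
df5 = pd.DataFrame(np.zeros((4, 5), dtype=int), columns=ci2)
df6 = df5.drop(("x", "c"), axis=1)

# After the drop, 'c' is still defined in level 1 even though no column uses it.
before = list(df6.columns.levels[1])

# remove_unused_levels returns a new MultiIndex; assign it back to apply it.
df6.columns = df6.columns.remove_unused_levels()
after = list(df6.columns.levels[1])

print(before)  # ['a', 'b', 'c']
print(after)   # ['a', 'b']
```

The call is not applied automatically, so it has to be repeated after any operation that leaves levels unused.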

jreback · 2017-06-11T16:07:51Z

@dfolch great!

Yeah, it's not automatic, but at least it's possible now (in an easy way).

jreback closed this as completed Jun 9, 2017

jreback added the API Design, Indexing, and MultiIndex labels Jun 9, 2017

jreback added this to the No action milestone Jun 9, 2017
