column multiindex and reindex inconsistency #16626

Closed
dfolch opened this issue Jun 7, 2017 · 6 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@dfolch

dfolch commented Jun 7, 2017

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df1 = pd.DataFrame(np.random.randint(0,10,(4,3)), columns=['a','b','c'])

In [4]: df1
Out[4]: 
   a  b  c
0  9  3  2
1  9  0  2
2  7  7  6
3  3  4  2

In [5]: df2 = df1.reindex(columns=['c','a','d'])

In [6]: df2
Out[6]: 
   c  a   d
0  2  9 NaN
1  2  9 NaN
2  6  7 NaN
3  2  3 NaN

In [7]: df2.columns
Out[7]: Index([u'c', u'a', u'd'], dtype='object')

In [8]: ci = pd.MultiIndex.from_product([['x','y'],['a','b','c']])

In [10]: df3 = pd.DataFrame(np.random.randint(0,10,(4,6)), columns=ci)

In [11]: df3
Out[11]: 
   x        y      
   a  b  c  a  b  c
0  3  1  5  3  8  7
1  2  7  6  8  7  4
2  8  3  5  7  1  1
3  7  5  8  7  8  7

In [12]: df4 = df3.reindex(columns=['c','a','d'], level=1)

In [13]: df4
Out[13]: 
   x     y   
   c  a  c  a
0  5  3  7  3
1  6  2  4  8
2  5  8  1  7
3  8  7  7  7

In [14]: df4.columns
Out[14]: 
MultiIndex(levels=[[u'x', u'y'], [u'c', u'a', u'd']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

When passing a nonexistent column name to reindex on a dataframe without multiindex columns, the result is:

  • a NaN column with the "new" column name
  • the columns attribute matches the columns in the dataframe

The same action on a multiindex dataframe produces different results:

  • there are no NaN columns (this may not be a problem)
  • the columns attribute of the resulting dataframe does not match the dataframe column names (this appears to be a bug)

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 16.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

@TomAugspurger
Contributor

the columns attribute of the resulting dataframe does not match the dataframe column names (this appears to be a bug)

If you look at df4.columns, you'll notice that 'd' is present:

In [65]: df4.columns.levels[1]
Out[65]: Index(['c', 'a', 'd'], dtype='object')

Going back to your first point, that there are no NaN columns in the output: that's because no column carries a label for 'd' at that level (notice that there aren't any 2s in the labels of your Out[14]). If you want similar behavior, you'd have to reindex on both levels:

In [77]: df3.reindex(columns=[('x', 'a'), ('x', 'd')])
Out[77]:
   x
   a   d
0  2 NaN
1  9 NaN
2  8 NaN
3  7 NaN

In [78]: df3.reindex(columns=[('x', 'a'), ('x', 'd')]).columns
Out[78]:
MultiIndex(levels=[['x'], ['a', 'd']],
           labels=[[0, 0], [0, 1]])

Make sense?
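
The distinction between what a level declares and what the columns actually reference can be sketched as follows. This is only an illustration: it rebuilds the index from Out[14] by hand, uses small deterministic data instead of the random frames above, and assumes a recent pandas, where the `labels` attribute shown in the 2017 output is called `codes` (renamed in pandas 0.24). The variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Rebuild the column index from Out[14] directly. `codes` is the current
# name for what the 2017 output prints as `labels`.
cols = pd.MultiIndex(
    levels=[["x", "y"], ["c", "a", "d"]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)

declared = list(cols.levels[1])           # every label the level knows about
in_use = list(cols.get_level_values(1))   # labels the columns actually carry
d_position = declared.index("d")          # position of 'd' within the level
d_is_referenced = d_position in cols.codes[1]

# Reindexing with full tuples, as in In [77], does create the NaN column.
ci = pd.MultiIndex.from_product([["x", "y"], ["a", "b", "c"]])
df3 = pd.DataFrame(np.arange(24).reshape(4, 6), columns=ci)
both_levels = df3.reindex(columns=[("x", "a"), ("x", "d")])
```

Here `declared` includes 'd' while `in_use` does not, which is the mismatch described in the original report; `both_levels` has an all-NaN ('x', 'd') column because the missing key was requested as a complete tuple.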

@jreback
Contributor

jreback commented Jun 9, 2017

Duplicate of #12319.

This is a deliberate choice at the moment. Feel free to comment on the other issue.

@jreback closed this as completed Jun 9, 2017
@jreback added the API Design, Indexing (Related to indexing on series/frames, not to indexes themselves), and MultiIndex labels Jun 9, 2017
@jreback added this to the No action milestone Jun 9, 2017

@dfolch
Author

dfolch commented Jun 10, 2017

The NaN column issue is tricky for sure. Not everyone wants the same lower level columns under every higher level column.

The columns attribute still seems a little weird, since it can contain column names that are not in the dataframe. In my use case I did a lot of manipulations to a dataframe, and then wanted to see which column names remained. Below is my solution:

In [62]: ci2 = pd.MultiIndex.from_tuples([('x','a'),('y','a'),('x','b'),('y','b'),('x','c')])

In [63]: df5 = pd.DataFrame(np.random.randint(0,10,(4,5)), columns=ci2)  

In [64]: df6 = df5.drop(('x','c'), axis=1)

In [65]: df6
Out[65]: 
   x  y  x  y
   a  a  b  b
0  0  0  8  4
1  7  2  0  4
2  4  2  2  8
3  6  1  7  5

In [66]: df6.columns
Out[66]: 
MultiIndex(levels=[[u'x', u'y'], [u'a', u'b', u'c']],
           labels=[[0, 1, 0, 1], [0, 0, 1, 1]])

In [67]: df6.columns.levels[1][pd.unique(df6.columns.labels[1])]
Out[67]: Index([u'a', u'b'], dtype='object')
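
For readers on a current pandas, the same workaround might look like the sketch below. It assumes pandas 0.24 or later, where `MultiIndex.labels` is spelled `MultiIndex.codes`, and it builds the index from Out[66] by hand rather than from the random frame; the names `via_codes` and `via_values` are illustrative only.

```python
import pandas as pd

# The column index from Out[66], built directly. Level 1 still lists 'c'
# even though no column refers to it.
cols = pd.MultiIndex(
    levels=[["x", "y"], ["a", "b", "c"]],
    codes=[[0, 1, 0, 1], [0, 0, 1, 1]],
)

# Same idea as In [67], spelled with `codes` instead of `labels`:
# keep only the level entries that some column actually points at.
via_codes = cols.levels[1][pd.unique(cols.codes[1])]

# An alternative that does not touch the codes at all.
via_values = cols.get_level_values(1).unique()
```

Both expressions yield the labels 'a' and 'b' in order of first appearance; the second form is probably easier to read if the integer codes are of no other interest.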

@jreback
Contributor

jreback commented Jun 10, 2017

This is defined behavior for how a MultiIndex works.
See http://pandas.pydata.org/pandas-docs/stable/advanced.html#defined-levels, and remove_unused_levels if you think it fits your case.
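
A minimal sketch of how `remove_unused_levels` could be applied to the drop example above, assuming pandas 0.20 or later (where the method is available) and deterministic data in place of the random values; `levels_before` and `levels_after` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

ci2 = pd.MultiIndex.from_tuples(
    [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b"), ("x", "c")]
)
df5 = pd.DataFrame(np.arange(20).reshape(4, 5), columns=ci2)
df6 = df5.drop(("x", "c"), axis=1)

# After the drop, level 1 still declares 'c' although no column uses it.
levels_before = list(df6.columns.levels[1])

# remove_unused_levels returns a new MultiIndex; assign it back to apply it.
df6.columns = df6.columns.remove_unused_levels()
levels_after = list(df6.columns.levels[1])
```

The columns themselves are unchanged by this call; only the level metadata is trimmed so that it matches the labels still in use.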

@dfolch
Author

dfolch commented Jun 11, 2017

I see now that there was a debate on this in #2770. remove_unused_levels is exactly what I needed. Thanks for implementing that @jreback.

@jreback
Contributor

jreback commented Jun 11, 2017

@dfolch great!

Yeah, it's not automatic, but at least it's possible now (in an easy way).

3 participants