
Phantom Index when NaN in MultiIndex #5286


Closed
cancan101 opened this issue Oct 20, 2013 · 13 comments · Fixed by #30444
Labels: good first issue, Needs Tests (unit test(s) needed to prevent regressions)
Milestone: 1.0

Comments

@cancan101
Contributor

Using the following:

import StringIO  # Python 2; on Python 3 use io.StringIO instead
import pandas as pd

csv_str = """u1,u2,u3,d1,d2,d3,d4
Good Things,C,,1,1,1,1
Good Things,R,,1,1,1,1
Bad Things,C,,1,1,1,1
Bad Things,T,,1,1,1,1
Okay Things,N,B,1,1,1,1
Okay Things,N,D,1,1,1,1
Okay Things,B,,1,1,1,1
Okay Things,D,,1,1,1,1"""
d_nan = pd.read_csv(StringIO.StringIO(csv_str), index_col=[0, 1, 2])
d_no_nan = pd.read_csv(StringIO.StringIO(csv_str), index_col=[0, 1, 2], na_values=[], keep_default_na=False)

This gives the following DataFrame:

In [63]: d_nan
Out[63]: 
                    d1  d2  d3  d4
u1          u2 u3                 
Good Things C  NaN   1   1   1   1
            R  NaN   1   1   1   1
Bad Things  C  NaN   1   1   1   1
            T  NaN   1   1   1   1
Okay Things N  B     1   1   1   1
               D     1   1   1   1
            B  NaN   1   1   1   1
            D  NaN   1   1   1   1

and then I try to use loc on the index:

In [64]: d_nan.loc["Good Things"]
Out[64]: 
        d1  d2  d3  d4
u2 u3                 
C  NaN   1   1   1   1
R  NaN   1   1   1   1

When I then select 'C' on the next level, a mysterious D appears in u3:

In [65]: d_nan.loc["Good Things"].loc['C']
Out[65]: 
    d1  d2  d3  d4
u3                
D    1   1   1   1

If the DataFrame instead looks like this (no NaNs), as in d_no_nan:

In [67]: d_no_nan
Out[67]: 
                   d1  d2  d3  d4
u1          u2 u3                
Good Things C       1   1   1   1
            R       1   1   1   1
Bad Things  C       1   1   1   1
            T       1   1   1   1
Okay Things N  B    1   1   1   1
               D    1   1   1   1
            B       1   1   1   1
            D       1   1   1   1

then I have no problems:

In [68]: d_no_nan.loc["Good Things"].loc['C']
Out[68]: 
    d1  d2  d3  d4
u3                
     1   1   1   1
@jreback
Contributor

jreback commented Oct 20, 2013

Please post the creation code when you post a problem so that it can be copy-pasted to reproduce.

It makes it much easier to see if there is actually a bug.

@jreback
Contributor

jreback commented Oct 20, 2013

Please post it in a format which is easily copy-pasted.

In other words, put this in a string and use StringIO to read it in (via CSV).

@jreback
Contributor

jreback commented Oct 20, 2013

Great.

Please also edit the question so it's a simple copy-paste within IPython; you will get much faster responses this way.

It's best to make it as easy as possible for readers of this thread. No one wants to wade through many answers/responses just to see what the problem is.

@jreback
Contributor

jreback commented Oct 20, 2013

I had spent some time trying to get this to work previously (not related to this issue).

The basic problem is that the presence of more than one NaN makes the index non-unique and quite hard to deal with. This should probably throw an error (not when constructing it, but when selecting from it). It IS useful to be able to have the NaNs there during reshaping operations; e.g. imagine that you have NaNs in a column and you set_index, then reset_index, you'd expect that to work.
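
A minimal sketch, not from the thread, of that set_index / reset_index round-trip; the NaN label survives the trip:

import numpy as np
import pandas as pd

# a column containing a NaN becomes an index level and then comes back out
df = pd.DataFrame({"key": ["R1", np.nan, "R3"], "val": [10, 20, 30]})
roundtrip = df.set_index("key").reset_index()

# the NaN is still there after the round-trip
print(roundtrip)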

So I'll mark this as a bug for 0.14, but probably just to raise an error. That said, there are some specific cases that DO work, e.g.

In [9]: df = DataFrame({"a":['R1', 'R2', np.nan, 'R4'], 'b':["C1", "C2", "C3" , "C4"], "c":[10, 15, np.nan , 20]})

In [10]: df_nan = df.set_index(['a','b'],drop=False)

In [11]: df_nan
Out[11]: 
          a   b   c
a   b              
R1  C1   R1  C1  10
R2  C2   R2  C2  15
NaN C3  NaN  C3 NaN
R4  C4   R4  C4  20

In [12]: df_nan.loc[(np.nan,'C3')]
Out[12]: 
a    NaN
b     C3
c    NaN
Name: (nan, C3), dtype: object

This raises an error (in theory I can get this to work, but once you add another NaN it's very problematic):

In [13]: df_nan.loc[np.nan]
ValueError: cannot use label indexing with a null key

@ghost assigned jreback Oct 20, 2013
@cancan101
Contributor Author

The problem that I am trying to deal with is data that has a ragged hierarchy. By that I mean it is hierarchical data, but not all hierarchies have the same depth.

An example can be seen by looking at an income statement. One branch might be revenue/cost -> division -> subdivision. Another would be shares outstanding -> basic/diluted. The first has three levels, the second only two. You could argue that these should be stored in separate frames, which might be right, but initially they are all coming from one source (CSV, Excel, HTML, XBRL, etc.).
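
A minimal sketch, not from the thread, of what such ragged data looks like once it lands in a single frame (labels and column names are made up); the shallow branch ends up with NaN in the deepest level:

import numpy as np
import pandas as pd

# two branches of different depth; the shallow one gets NaN in the last level
idx = pd.MultiIndex.from_tuples([
    ("revenue", "North", "Retail"),
    ("revenue", "North", "Online"),
    ("shares outstanding", "basic", np.nan),
    ("shares outstanding", "diluted", np.nan),
], names=["item", "division", "subdivision"])
df = pd.DataFrame({"value": [100, 40, 500, 520]}, index=idx)
print(df)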

@jreback
Contributor

jreback commented Oct 20, 2013

I would just always fill the index labels with a string to avoid ambiguity. Indexes having NaN are by definition pretty ambiguous.

@jtratner
Contributor

It might be easier to work with if you keep the data as columns rather than a MultiIndex; probably more straightforward, plus you could use the new query methods in 0.13.
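
A minimal sketch, not from the thread, of that approach: keep the levels as ordinary columns and select with DataFrame.query (the data here is made up):

import pandas as pd

# keep u1/u2/u3 as regular columns instead of a MultiIndex
df = pd.DataFrame({
    "u1": ["Good Things", "Good Things", "Bad Things"],
    "u2": ["C", "R", "C"],
    "u3": [None, None, None],
    "d1": [1, 1, 1],
})

# boolean selection via query sidesteps the NaN-in-index problem entirely
print(df.query("u1 == 'Good Things' and u2 == 'C'"))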

@jreback
Contributor

jreback commented Oct 20, 2013

IMHO you are also better served NOT using a single frame and trying to shove everything into it. Yes, it can work, but most often you need specific behavior. I find that creating an object that holds a DataFrame (a has-a relationship) is a very nice idiom for holding data.

Just because data comes from a single source does not mean that that is the right storage medium. I often serialize in a very flat way because it's more efficient/easier, but that does not mean that it is the most 'natural' way.
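
A minimal sketch, not from the thread, of the has-a idiom described above (the class, method, and column names are made up):

import pandas as pd

class IncomeStatement:
    """Owns a flat DataFrame and keeps the domain-specific lookups with it."""

    def __init__(self, data: pd.DataFrame):
        self._data = data  # has-a: the frame is an attribute, not a base class

    def line_item(self, name):
        # domain-specific selection lives here rather than in index gymnastics
        return self._data[self._data["item"] == name]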

@cancan101
Contributor Author

Which raises another question. Let's say I do have the table called d_nan above.

How do I replace the NaNs in the index with another value? (In this case I would imagine some monotonic int or string would be ideal.)

@jreback
Contributor

jreback commented Oct 20, 2013

In [5]: d_nan
Out[5]: 
                    d1  d2  d3  d4
u1          u2 u3                 
Good Things C  NaN   1   1   1   1
            R  NaN   1   1   1   1
Bad Things  C  NaN   1   1   1   1
            T  NaN   1   1   1   1
Okay Things N  B     1   1   1   1
               D     1   1   1   1
            B  NaN   1   1   1   1
            D  NaN   1   1   1   1

In [7]: df = d_nan.reset_index()

In [9]: df['u3'] = df['u3'].fillna('_nan_')

In [11]: df.set_index(['u1','u2','u3'],inplace=True)

In [12]: df
Out[12]: 
                      d1  d2  d3  d4
u1          u2 u3                   
Good Things C  _nan_   1   1   1   1
            R  _nan_   1   1   1   1
Bad Things  C  _nan_   1   1   1   1
            T  _nan_   1   1   1   1
Okay Things N  B       1   1   1   1
               D       1   1   1   1
            B  _nan_   1   1   1   1
            D  _nan_   1   1   1   1

@cancan101
Contributor Author

Okay, that works. I think I will modify it to use the value in u2 rather than a fixed string (i.e. just copy it over for missing values).
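
A minimal sketch, not from the thread, of that variant (assuming the same d_nan frame as above); the missing u3 labels are copied from u2 instead of a fixed string:

df = d_nan.reset_index()

# where u3 is missing, copy the corresponding u2 value instead of '_nan_'
df['u3'] = df['u3'].fillna(df['u2'])

df.set_index(['u1', 'u2', 'u3'], inplace=True)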

@mroeschke
Member

As of 0.19.1, this original issue seems to have been fixed on master:

In [15]: d_nan.loc[("Good Things","C")]
/home/matt/anaconda/envs/pandas_dev/bin/ipython:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  #!/home/matt/anaconda/envs/pandas_dev/bin/python

Out[15]: 
     d1  d2  d3  d4
u3                 
NaN   1   1   1   1

In [16]: d_no_nan.loc[("Good Things","C")]
/home/matt/anaconda/envs/pandas_dev/bin/ipython:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  #!/home/matt/anaconda/envs/pandas_dev/bin/python
Out[16]: 
    d1  d2  d3  d4
u3                
     1   1   1   1

Should an error still be thrown in this case, or should a test be added to confirm this behavior?

@toobaz
Member

toobaz commented May 18, 2018

Should an error still be thrown in this case, or should a test be added to confirm this behavior?

I think this just needs a test.
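
A minimal sketch, not from the thread, of what such a regression test might look like (the test name is made up, and this is not the test that eventually landed via #30444):

import numpy as np
import pandas as pd

def test_loc_multiindex_with_nan_level():
    # GH 5286: selecting through a MultiIndex containing NaN labels must not
    # surface "phantom" labels that belong to other groups
    mi = pd.MultiIndex.from_tuples([
        ("Good Things", "C", np.nan),
        ("Good Things", "R", np.nan),
        ("Okay Things", "N", "B"),
        ("Okay Things", "N", "D"),
    ], names=["u1", "u2", "u3"])
    df = pd.DataFrame({"d1": [1, 1, 1, 1]}, index=mi)

    result = df.loc[("Good Things", "C")]

    # exactly one row remains, indexed by the (missing) u3 label; no stray 'D'
    assert len(result) == 1
    assert result.index.name == "u3"
    assert result.index.isna().all()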

@mroeschke added the Testing (pandas testing functions or related to the test suite) and good first issue labels and removed the Error Reporting (incorrect or improved errors from pandas) label May 18, 2018
@mroeschke added the Needs Tests (unit test(s) needed to prevent regressions) label and removed the Testing label Oct 6, 2019
@jreback changed the milestone from Contributions Welcome to 1.0 Dec 26, 2019