
Phantom Index when NaN in MultiIndex #5286


Closed
cancan101 opened this issue Oct 20, 2013 · 13 comments · Fixed by #30444
Labels: good first issue, Needs Tests (unit test(s) needed to prevent regressions)
Milestone: 1.0

Comments

@cancan101
Contributor

Using the following:

import StringIO  # Python 2; on Python 3 use io.StringIO instead
import pandas as pd

csv_str = """u1,u2,u3,d1,d2,d3,d4
Good Things,C,,1,1,1,1
Good Things,R,,1,1,1,1
Bad Things,C,,1,1,1,1
Bad Things,T,,1,1,1,1
Okay Things,N,B,1,1,1,1
Okay Things,N,D,1,1,1,1
Okay Things,B,,1,1,1,1
Okay Things,D,,1,1,1,1"""
d_nan = pd.read_csv(StringIO.StringIO(csv_str), index_col=[0, 1, 2])
d_no_nan = pd.read_csv(StringIO.StringIO(csv_str), index_col=[0, 1, 2], na_values=[], keep_default_na=False)

This gives the following DataFrame:

In [63]: d_nan
Out[63]: 
                    d1  d2  d3  d4
u1          u2 u3                 
Good Things C  NaN   1   1   1   1
            R  NaN   1   1   1   1
Bad Things  C  NaN   1   1   1   1
            T  NaN   1   1   1   1
Okay Things N  B     1   1   1   1
               D     1   1   1   1
            B  NaN   1   1   1   1
            D  NaN   1   1   1   1

and then I try to use loc on the index:

In [64]: d_nan.loc["Good Things"]
Out[64]: 
        d1  d2  d3  d4
u2 u3                 
C  NaN   1   1   1   1
R  NaN   1   1   1   1

When I then select 'C' on the next level, a mysterious D appears in u3:

In [65]: d_nan.loc["Good Things"].loc['C']
Out[65]: 
    d1  d2  d3  d4
u3                
D    1   1   1   1

If the DataFrame instead looks like this (no NaNs), as in d_no_nan:

In [67]: d_no_nan
Out[67]: 
                   d1  d2  d3  d4
u1          u2 u3                
Good Things C       1   1   1   1
            R       1   1   1   1
Bad Things  C       1   1   1   1
            T       1   1   1   1
Okay Things N  B    1   1   1   1
               D    1   1   1   1
            B       1   1   1   1
            D       1   1   1   1

then I have no problems:

In [68]: d_no_nan.loc["Good Things"].loc['C']
Out[68]: 
    d1  d2  d3  d4
u3                
     1   1   1   1
@jreback
Contributor

jreback commented Oct 20, 2013

Please post the creation code when you post a problem so that it can be copy-pasted to reproduce.

It makes it much easier to see if there is actually a bug.

@jreback
Contributor

jreback commented Oct 20, 2013

Please post it in a format which is easily copy-pasted.

In other words, put this in a string and use StringIO to read it in (via CSV).

@jreback
Contributor

jreback commented Oct 20, 2013

Great.

Please also edit the question so it's a simple copy-paste within IPython; you will get much faster responses this way.

It's best to make it as easy as possible for readers of this thread. No one wants to wade through many answers/responses just to see what the problem is.

@jreback
Contributor

jreback commented Oct 20, 2013

I had spent some time trying to get this to work previously (not related to this issue).

The basic problem is that the presence of more than one NaN makes the index non-unique and quite hard to deal with. This should probably throw an error (not when constructing it, but when selecting from it). It IS useful to be able to have the NaNs there during reshaping operations; e.g. imagine that you have NaNs in a column and you set_index, then reset_index, you'd expect that to work.
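
A minimal sketch, not from the thread, of that set_index / reset_index round-trip; the NaN label survives the trip:

import numpy as np
import pandas as pd

# a column containing a NaN becomes an index level and then comes back out
df = pd.DataFrame({"key": ["R1", np.nan, "R3"], "val": [10, 20, 30]})
roundtrip = df.set_index("key").reset_index()

# the NaN is still there after the round-trip
print(roundtrip)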

So I'll mark this as a bug for 0.14, but probably just to raise an error. That said, there are some specific cases that DO work, e.g.

In [9]: df = DataFrame({"a":['R1', 'R2', np.nan, 'R4'], 'b':["C1", "C2", "C3" , "C4"], "c":[10, 15, np.nan , 20]})

In [10]: df_nan = df.set_index(['a','b'],drop=False)

In [11]: df_nan
Out[11]: 
          a   b   c
a   b              
R1  C1   R1  C1  10
R2  C2   R2  C2  15
NaN C3  NaN  C3 NaN
R4  C4   R4  C4  20

In [12]: df_nan.loc[(np.nan,'C3')]
Out[12]: 
a    NaN
b     C3
c    NaN
Name: (nan, C3), dtype: object

This raises an error (in theory I can get this to work, but once you add another NaN it's very problematic):

In [13]: df_nan.loc[np.nan]
ValueError: cannot use label indexing with a null key

@ghost assigned jreback Oct 20, 2013
@cancan101
Contributor Author

The problem that I am trying to deal with is data that has a ragged hierarchy. By that I mean it is hierarchical data, but not all hierarchies have the same depth.

An example can be seen by looking at an income statement. One branch might be revenue/cost -> division -> subdivision. Another would be shares outstanding -> basic/diluted. The first has three levels, the second only two. You could argue that these should be stored in separate frames, which might be right, but initially they are all coming from one source (CSV, Excel, HTML, XBRL, etc.).
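
A minimal sketch, not from the thread, of what such ragged data looks like once it lands in a single frame (labels and column names are made up); the shallow branch ends up with NaN in the deepest level:

import numpy as np
import pandas as pd

# two branches of different depth; the shallow one gets NaN in the last level
idx = pd.MultiIndex.from_tuples([
    ("revenue", "North", "Retail"),
    ("revenue", "North", "Online"),
    ("shares outstanding", "basic", np.nan),
    ("shares outstanding", "diluted", np.nan),
], names=["item", "division", "subdivision"])
df = pd.DataFrame({"value": [100, 40, 500, 520]}, index=idx)
print(df)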

@jreback
Contributor

jreback commented Oct 20, 2013

I would just always fill the index labels with a string to avoid ambiguity. Indexes having NaN are by definition pretty ambiguous.

@jtratner
Contributor

It might be easier to work with if you keep the data as columns rather than a MultiIndex; probably more straightforward, plus you could use the new query methods in 0.13.
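
A minimal sketch, not from the thread, of that approach: keep the levels as ordinary columns and select with DataFrame.query (the data here is made up):

import pandas as pd

# keep u1/u2/u3 as regular columns instead of a MultiIndex
df = pd.DataFrame({
    "u1": ["Good Things", "Good Things", "Bad Things"],
    "u2": ["C", "R", "C"],
    "u3": [None, None, None],
    "d1": [1, 1, 1],
})

# boolean selection via query sidesteps the NaN-in-index problem entirely
print(df.query("u1 == 'Good Things' and u2 == 'C'"))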

@jreback
Contributor

jreback commented Oct 20, 2013

IMHO you are also better served NOT using a single frame and trying to shove everything into it. Yes, it can work, but most often you need specific behavior. I find that creating an object that holds a DataFrame (a has-a relationship) is a very nice idiom for holding data.

Just because data comes from a single source does not mean that that is the right storage medium. I often serialize in a very flat way because it's more efficient/easier, but that does not mean that it is the most 'natural' way.
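
A minimal sketch, not from the thread, of the has-a idiom described above (the class, method, and column names are made up):

import pandas as pd

class IncomeStatement:
    """Owns a flat DataFrame and keeps the domain-specific lookups with it."""

    def __init__(self, data: pd.DataFrame):
        self._data = data  # has-a: the frame is an attribute, not a base class

    def line_item(self, name):
        # domain-specific selection lives here rather than in index gymnastics
        return self._data[self._data["item"] == name]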

@cancan101
Contributor Author

Which raises another question. Let's say I do have the table called d_nan above.

How do I replace the NaNs in the index with another value? (In this case I would imagine some monotonic int or string would be ideal.)

@jreback
Contributor

jreback commented Oct 20, 2013

In [5]: d_nan
Out[5]: 
                    d1  d2  d3  d4
u1          u2 u3                 
Good Things C  NaN   1   1   1   1
            R  NaN   1   1   1   1
Bad Things  C  NaN   1   1   1   1
            T  NaN   1   1   1   1
Okay Things N  B     1   1   1   1
               D     1   1   1   1
            B  NaN   1   1   1   1
            D  NaN   1   1   1   1

In [7]: df = d_nan.reset_index()

In [9]: df['u3'] = df['u3'].fillna('_nan_')

In [11]: df.set_index(['u1','u2','u3'],inplace=True)

In [12]: df
Out[12]: 
                      d1  d2  d3  d4
u1          u2 u3                   
Good Things C  _nan_   1   1   1   1
            R  _nan_   1   1   1   1
Bad Things  C  _nan_   1   1   1   1
            T  _nan_   1   1   1   1
Okay Things N  B       1   1   1   1
               D       1   1   1   1
            B  _nan_   1   1   1   1
            D  _nan_   1   1   1   1

@cancan101
Contributor Author

Okay, that works. I think I will modify it to use the value in u2 rather than a fixed string (i.e. just copy it over for missing values).
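
A minimal sketch, not from the thread, of that variant (assuming the same d_nan frame as above); the missing u3 labels are copied from u2 instead of a fixed string:

df = d_nan.reset_index()

# where u3 is missing, copy the corresponding u2 value instead of '_nan_'
df['u3'] = df['u3'].fillna(df['u2'])

df.set_index(['u1', 'u2', 'u3'], inplace=True)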

@mroeschke
Member

As of 0.19.1, this original issue seems to have been fixed on master:

In [15]: d_nan.loc[("Good Things","C")]
/home/matt/anaconda/envs/pandas_dev/bin/ipython:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  #!/home/matt/anaconda/envs/pandas_dev/bin/python

Out[15]: 
     d1  d2  d3  d4
u3                 
NaN   1   1   1   1

In [16]: d_no_nan.loc[("Good Things","C")]
/home/matt/anaconda/envs/pandas_dev/bin/ipython:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  #!/home/matt/anaconda/envs/pandas_dev/bin/python
Out[16]: 
    d1  d2  d3  d4
u3                
     1   1   1   1

Should an error still be thrown in this case, or should a test be added to confirm this behavior?

@toobaz
Member

toobaz commented May 18, 2018

Should an error still be thrown in this case, or should a test be added to confirm this behavior?

I think this just needs a test.
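
A minimal sketch, not from the thread, of what such a regression test might look like (the test name is made up, and this is not the test that eventually landed via #30444):

import numpy as np
import pandas as pd

def test_loc_multiindex_with_nan_level():
    # GH 5286: selecting through a MultiIndex containing NaN labels must not
    # surface "phantom" labels that belong to other groups
    mi = pd.MultiIndex.from_tuples([
        ("Good Things", "C", np.nan),
        ("Good Things", "R", np.nan),
        ("Okay Things", "N", "B"),
        ("Okay Things", "N", "D"),
    ], names=["u1", "u2", "u3"])
    df = pd.DataFrame({"d1": [1, 1, 1, 1]}, index=mi)

    result = df.loc[("Good Things", "C")]

    # exactly one row remains, indexed by the (missing) u3 label; no stray 'D'
    assert len(result) == 1
    assert result.index.name == "u3"
    assert result.index.isna().all()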

@mroeschke added the Testing (pandas testing functions or related to the test suite) and good first issue labels and removed the Error Reporting (incorrect or improved errors from pandas) label May 18, 2018
@mroeschke added the Needs Tests (unit test(s) needed to prevent regressions) label and removed the Testing label Oct 6, 2019
@jreback changed the milestone from Contributions Welcome to 1.0 Dec 26, 2019