MultiIndex smart repr fmt doesn't work well for complex examples #3347

Closed
ghost opened this issue Apr 14, 2013 · 15 comments · Fixed by #4935
Labels
Output-Formatting __repr__ of pandas objects, to_string
Milestone

Comments

@ghost

ghost commented Apr 14, 2013

In [42]: labels = [('foo', 1),
    ...:  ('foo', 2),
    ...:  ('bar', 3),
    ...:  ('bar', 4)]
    ...: pd.MultiIndex.from_tuples(labels)
Out[42]: 
MultiIndex
[foo  1,      2, bar  3,      4]

Hiding the repeated foo and bar is nice for small MultiIndexes, but
with more levels and more complex structures the output becomes
harder to read. The blanks also keep their width, and the result
isn't very attractive when presented as a single line:

labels = [('foo', '2012-07-26T00:00:00', 'b5c2700'),
    ...:  ('foo', '2012-08-06T00:00:00', '900b2ca'),
    ...:  ('foo', '2012-08-15T00:00:00', '07f1ce0'),
    ...:  ('foo', '2012-09-25T00:00:00', '5c93e83'),
    ...:  ('foo', '2012-09-25T00:00:00', '9345bba')]
    ...: pd.MultiIndex.from_tuples(labels)

MultiIndex
[append_frame_single_homogenous  2012-07-26  b5c2700,                     2012-08-06  00b2ca,                                 2012-08-15  07f1ce0,                                 2012-09-25  5c93e83,                                             9345bba]
@cpcloud
Member

cpcloud commented Apr 17, 2013

Why not have it repr as if it were the columns of a DataFrame?

@ghost
Author

ghost commented Apr 17, 2013

Interesting. Along with a couple of similar changes queued for 0.12 though,
I'd opt for having the repr() output be valid Python code that recreates the object. That's
pythonic and useful.

If that sort of view is wanted, it's not difficult to convert the MultiIndex tuples
into a DataFrame explicitly.

@cpcloud
Member

cpcloud commented Apr 17, 2013

@y-p Oh, I meant as if it were the actual column index of a DataFrame (e.g., a DataFrame with a column MultiIndex, showing only the column index), but I like your suggestion too.

I've done that in the past with DataFrame(np.row_stack(mi), columns=mi.names) where mi is a MultiIndex.
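
For readers who want to try the conversion, here is a minimal sketch in Python 3 syntax. The sample labels and the level names `key` and `num` are assumptions made for illustration, and `MultiIndex.to_frame` only exists in pandas releases much newer than the one discussed in this thread.

```python
import pandas as pd

# Assumed sample data; the level names are hypothetical.
mi = pd.MultiIndex.from_tuples(
    [("foo", 1), ("foo", 2), ("bar", 3), ("bar", 4)],
    names=["key", "num"],
)

# Explicit conversion: one row per index entry, one column per level.
explicit = pd.DataFrame(list(mi), columns=list(mi.names))

# Built-in equivalent in recent pandas versions.
via_to_frame = mi.to_frame(index=False)

print(explicit)
```

Building the frame from `list(mi)` lets pandas infer a dtype for each column, whereas stacking the tuples into a single NumPy array first forces all levels into one common dtype.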

@hayd
Contributor

hayd commented Apr 18, 2013

+1 for the repr being valid Python that recreates the object.

In [1]: pd.Index([1, '2'])
Out[1]: Index([1, 2], dtype=object)

In [2]: pd.Index([1, 2])
Out[2]: Int64Index([1, 2], dtype=int64)

I suppose the possible caveat could be if the index is very long (the ...s are kind of useful?):

In [3]: pd.Index(xrange(100000))
Out[3]: Int64Index([0, 1, 2, ..., 99997, 99998, 99999], dtype=int64)

@jreback
Contributor

jreback commented Sep 22, 2013

@jtratner didn't we have a recent issue related to multi-index formatting? can you link it here (and possibly close this one if a dupe)

@cpcloud
Member

cpcloud commented Sep 22, 2013

related to #3941

@jtratner
Contributor

how about we just change the MultiIndex repr to just look like how it looks with columns?

So if you have a Series with a MultiIndex like this:

C0       R0
C_l0_g0  R_l0_g0    R0C0
         R_l0_g1    R1C0
         R_l0_g2    R2C0
         R_l0_g3    R3C0
         R_l0_g4    R4C0
C_l0_g1  R_l0_g0    R0C1
         R_l0_g1    R1C1
         R_l0_g2    R2C1
         R_l0_g3    R3C1
         R_l0_g4    R4C1
C_l0_g2  R_l0_g0    R0C2
         R_l0_g1    R1C2
         R_l0_g2    R2C2
         R_l0_g3    R3C2
         R_l0_g4    R4C2
C_l0_g3  R_l0_g0    R0C3
         R_l0_g1    R1C3
         R_l0_g2    R2C3
         R_l0_g3    R3C3
         R_l0_g4    R4C3
C_l0_g4  R_l0_g0    R0C4
         R_l0_g1    R1C4
         R_l0_g2    R2C4
         R_l0_g3    R3C4
         R_l0_g4    R4C4
dtype: object

then the MultiIndex repr should just be this (or maybe its transpose?)

C0       R0
C_l0_g0  R_l0_g0
         R_l0_g1
         R_l0_g2
         R_l0_g3
         R_l0_g4
C_l0_g1  R_l0_g0
         R_l0_g1
         R_l0_g2
         R_l0_g3
         R_l0_g4
C_l0_g2  R_l0_g0
         R_l0_g1
         R_l0_g2
         R_l0_g3
         R_l0_g4
C_l0_g3  R_l0_g0
         R_l0_g1
         R_l0_g2
         R_l0_g3
         R_l0_g4
C_l0_g4  R_l0_g0
         R_l0_g1
         R_l0_g2
         R_l0_g3
         R_l0_g4

dtype: object

Way better.

@hayd
Contributor

hayd commented Sep 22, 2013

Can we have a str/unicode as @jtratner suggests, but a repr which is valid python to recreate itself?
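
The split being proposed here (a readable `str`, a `repr` that evaluates back to the object) can be sketched in plain Python 3. `TupleIndex` is a hypothetical stand-in written for this illustration; it is not a pandas class and does not reflect how pandas implements either method.

```python
class TupleIndex:
    """Hypothetical stand-in for a multi-level index (not a pandas class)."""

    def __init__(self, tuples):
        self.tuples = [tuple(t) for t in tuples]

    def __eq__(self, other):
        return isinstance(other, TupleIndex) and self.tuples == other.tuples

    def __repr__(self):
        # Valid Python: evaluating this string rebuilds an equal object.
        return f"TupleIndex({self.tuples!r})"

    def __str__(self):
        # Human-oriented view: one entry per line, columns aligned.
        widths = [max(len(str(v)) for v in col) for col in zip(*self.tuples)]
        lines = []
        for row in self.tuples:
            cells = [str(v).ljust(w) for v, w in zip(row, widths)]
            lines.append("  ".join(cells).rstrip())
        return "\n".join(lines)


idx = TupleIndex([("foo", 1), ("foo", 2), ("bar", 3)])
assert eval(repr(idx)) == idx  # the repr round-trips
print(idx)                     # the str is the readable table
```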

@jtratner
Contributor

sure, but repr should just be this:

>>> mi = MultiIndex.from_arrays([[1, 1, 1, 1], [1, 3, 5, 7], [9, 9, 1, 1]])
>>> print repr(mi)
MultiIndex(levels=[[1], [1, 3, 5, 7], [1, 9]],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3], [1, 1, 0, 0]],
           names=[None, None, None],
           sortorder=None )

But probably wouldn't show the ones that are all None.

@jtratner
Contributor

oh how nice, we can already just use MultiIndex.format to do this. and if it's beyond max rows, add a ...

So if max rows were 10, it'd be:

C0       R0
C_l0_g0  R_l0_g0
         R_l0_g1
         R_l0_g2
         R_l0_g3
         R_l0_g4
C_l0_g1  R_l0_g0
         R_l0_g1
         ...
         R_l0_g4
C_l0_g2  R_l0_g0

Are we going to do something special for HTML?
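
As a rough illustration of the "sparsify, then elide the middle" idea in the comment above, here is a pure-Python sketch. It is not the `MultiIndex.format` implementation; the helper names `sparsify` and `format_rows` and the head/tail split are assumptions.

```python
def sparsify(rows):
    """Blank out leading values that repeat the previous row."""
    out, prev = [], None
    for row in rows:
        cells, same = [], prev is not None
        for i, value in enumerate(row):
            same = same and value == prev[i]
            # Never blank the last level, so every line shows something.
            blank = same and i < len(row) - 1
            cells.append("" if blank else str(value))
        out.append(cells)
        prev = row
    return out


def format_rows(rows, max_rows=10):
    """Return display lines, replacing the middle with '...' if too long."""
    rows = [tuple(r) for r in rows]
    if len(rows) > max_rows:
        head = max_rows // 2
        tail = max_rows - head
        chunks = [rows[:head], rows[len(rows) - tail:]]
    else:
        chunks = [rows]
    # Sparsify each chunk separately so the row after "..." shows every level.
    sparse = [sparsify(chunk) for chunk in chunks]
    all_cells = [cells for chunk in sparse for cells in chunk]
    widths = [max(len(c) for c in col) for col in zip(*all_cells)]
    lines = []
    for n, chunk in enumerate(sparse):
        if n:
            lines.append("...")
        for cells in chunk:
            padded = (c.ljust(w) for c, w in zip(cells, widths))
            lines.append("  ".join(padded).rstrip())
    return lines


print("\n".join(format_rows([("a", i) for i in range(6)], max_rows=4)))
```

Sparsifying the head and tail independently means the first row after the ellipsis repeats its outer labels, which keeps the tail readable without the elided context.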

@jtratner
Contributor

Got this working - yay!

In [1]: import pandas as pd

In [2]: pd.MultiIndex.from_arrays([[1, 1, 1, 1], [1, 3, 5, 7], [9, 9, 1, 1]])
Out[2]:
MultiIndex(levels=[[1], [1, 3, 5, 7], [1, 9]]
           labels=[[0, 0, 0, 0], [0, 1, 2, 3], [1, 1, 0, 0]])

In [3]: mi = _

In [4]: mi.names = list('abc')

In [5]: print mi
a  b  c
1  1  9
   3  9
   5  1
   7  1

And with too many rows:

In [10]: pd.set_option('display.max_rows', 15)

In [11]: lst1 = [1] * 3 + [2] * 5 + [3] * 2

In [12]: lst1
Out[12]: [1, 1, 1, 2, 2, 2, 2, 2, 3, 3]

In [13]: lst2 = ['a'] * 6 + ['b'] * 3 + ['c'] * 1

In [14]: lst2
Out[14]: ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'c']

In [15]: mi = pd.MultiIndex.from_arrays([lst1 * 10, lst2 * 10, range(100)]); mi
Out[15]:
MultiIndex(levels=[[1, 2, 3], [u'a', u'b', u'c'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]]
           labels=[[0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2], [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

In [16]: mi = _

In [17]: print mi

1  a  0
      1
      2
2  a  3
      4
      5
  ...
2  a  93
      94
      95
   b  96
      97
3  b  98
   c  99

@ghost ghost assigned jtratner Sep 22, 2013
@jtratner
Contributor

Not sure how to handle too wide -- wrap? format() doesn't try to sparsify after a column with values everywhere.

In [19]: mi = pd.MultiIndex.from_arrays([lst1 * 10, lst2 * 10, range(100)] * 10)
In [20]: print mi

1  a  0   1  a  0   1  a  0   1  a  0   1  a  0   1  a  0   1  a  0   1  a  0   1  a  0   1  a  0
      1   1  a  1   1  a  1   1  a  1   1  a  1   1  a  1   1  a  1   1  a  1   1  a  1   1  a  1
      2   1  a  2   1  a  2   1  a  2   1  a  2   1  a  2   1  a  2   1  a  2   1  a  2   1  a  2
2  a  3   2  a  3   2  a  3   2  a  3   2  a  3   2  a  3   2  a  3   2  a  3   2  a  3   2  a  3
      4   2  a  4   2  a  4   2  a  4   2  a  4   2  a  4   2  a  4   2  a  4   2  a  4   2  a  4
      5   2  a  5   2  a  5   2  a  5   2  a  5   2  a  5   2  a  5   2  a  5   2  a  5   2  a  5
                                               ...
2  a  93  2  a  93  2  a  93  2  a  93  2  a  93  2  a  93  2  a  93  2  a  93  2  a  93  2  a  93
      94  2  a  94  2  a  94  2  a  94  2  a  94  2  a  94  2  a  94  2  a  94  2  a  94  2  a  94
      95  2  a  95  2  a  95  2  a  95  2  a  95  2  a  95  2  a  95  2  a  95  2  a  95  2  a  95
   b  96  2  b  96  2  b  96  2  b  96  2  b  96  2  b  96  2  b  96  2  b  96  2  b  96  2  b  96
      97  2  b  97  2  b  97  2  b  97  2  b  97  2  b  97  2  b  97  2  b  97  2  b  97  2  b  97
3  b  98  3  b  98  3  b  98  3  b  98  3  b  98  3  b  98  3  b  98  3  b  98  3  b  98  3  b  98
   c  99  3  c  99  3  c  99  3  c  99  3  c  99  3  c  99  3  c  99  3  c  99  3  c  99  3  c  99

@jtratner
Contributor

And here's @y-p's more complicated thing:

In [9]: labels = [('foo', '2012-07-26T00:00:00', 'b5c2700'),
   ...: ('foo', '2012-08-06T00:00:00', '900b2ca'),
   ...: ('foo', '2012-08-15T00:00:00', '07f1ce0'),
   ...: ('foo', '2012-09-25T00:00:00', '5c93e83'),
   ...: ('foo', '2012-09-25T00:00:00', '9345bba')]

In [10]: print pd.MultiIndex.from_tuples(labels)

foo  2012-07-26T00:00:00  b5c2700
     2012-08-06T00:00:00  900b2ca
     2012-08-15T00:00:00  07f1ce0
     2012-09-25T00:00:00  5c93e83
                          9345bba

@naught101

I don't know what happened to this, or whether the solution is unrelated to my problem, but I have a DataFrame like this:

In [44]: met_data
Out[44]: 
variable                        SWdown        Tair
time                site                          
2002-01-01 00:30:00 Tumba     0.000000  282.920013
2002-01-01 01:00:00 Tumba     0.000000  282.898590
2002-01-01 01:30:00 Tumba     0.000000  282.739990
2002-01-01 02:00:00 Tumba     0.000000  282.565002
2002-01-01 02:30:00 Tumba     0.000000  282.390015
2002-01-01 03:00:00 Tumba     0.000000  281.714996
2002-01-01 03:30:00 Tumba     0.000000  281.040009
2002-01-01 04:00:00 Tumba     0.000000  280.420013
2002-01-01 04:30:00 Tumba     0.000000  279.799988
2002-01-01 05:00:00 Tumba    14.580000  279.890015
2002-01-01 05:30:00 Tumba    29.160000  279.980011
2002-01-01 06:00:00 Tumba    97.154999  280.255005
2002-01-01 06:30:00 Tumba   165.149994  280.529999
2002-01-01 07:00:00 Tumba   328.265015  280.690002
2002-01-01 07:30:00 Tumba   491.380005  280.850006
2002-01-01 08:00:00 Tumba   487.464996  280.875000
2002-01-01 08:30:00 Tumba   483.549988  280.899994
2002-01-01 09:00:00 Tumba   568.200012  281.315002
2002-01-01 09:30:00 Tumba   652.849976  281.730011
2002-01-01 10:00:00 Tumba   644.500000  282.095001
2002-01-01 10:30:00 Tumba   636.150024  282.459991
2002-01-01 11:00:00 Tumba   669.989990  283.075012
2002-01-01 11:30:00 Tumba   703.830017  283.690002
2002-01-01 12:00:00 Tumba   749.554993  284.095001
2002-01-01 12:30:00 Tumba   795.280029  284.500000
2002-01-01 13:00:00 Tumba   644.255005  284.779999
2002-01-01 13:30:00 Tumba   493.230011  285.059998
2002-01-01 14:00:00 Tumba   322.489990  284.934998
2002-01-01 14:30:00 Tumba   151.750000  284.809998
2002-01-01 15:00:00 Tumba   125.099998  284.744995
...                                ...         ...
2005-12-31 09:30:00 Tumba   878.200012  298.390015
2005-12-31 10:00:00 Tumba   940.695007  299.079987
2005-12-31 10:30:00 Tumba  1003.190002  299.769989
2005-12-31 11:00:00 Tumba  1040.464966  300.190002
2005-12-31 11:30:00 Tumba  1077.739990  300.609985
2005-12-31 12:00:00 Tumba  1080.974976  301.005005
2005-12-31 12:30:00 Tumba  1084.209961  301.399994
2005-12-31 13:00:00 Tumba  1058.364990  301.834991
2005-12-31 13:30:00 Tumba  1032.520020  302.269989
2005-12-31 14:00:00 Tumba   978.469971  302.519989
2005-12-31 14:30:00 Tumba   924.419983  302.769989
2005-12-31 15:00:00 Tumba   838.005005  302.980011
2005-12-31 15:30:00 Tumba   751.590027  303.190002
2005-12-31 16:00:00 Tumba   649.864990  303.195007
2005-12-31 16:30:00 Tumba   548.140015  303.200012
2005-12-31 17:00:00 Tumba   441.904999  303.165009
2005-12-31 17:30:00 Tumba   335.670013  303.130005
2005-12-31 18:00:00 Tumba   215.830002  302.665009
2005-12-31 18:30:00 Tumba    95.989998  302.200012
2005-12-31 19:00:00 Tumba    49.535000  301.565002
2005-12-31 19:30:00 Tumba     3.080000  300.929993
2005-12-31 20:00:00 Tumba     1.540000  300.885010
2005-12-31 20:30:00 Tumba     0.000000  300.839996
2005-12-31 21:00:00 Tumba     0.000000  300.864990
2005-12-31 21:30:00 Tumba     0.000000  300.890015
2005-12-31 22:00:00 Tumba     0.000000  300.420013
2005-12-31 22:30:00 Tumba     0.000000  299.950012
2005-12-31 23:00:00 Tumba     0.000000  299.966675
2005-12-31 23:30:00 Tumba     0.000000  299.529999
2006-01-01 00:00:00 Tumba          NaN         NaN

[70128 rows x 2 columns]

In [41]: len(str(met_data.index))
Out[41]: 2162949

In [42]: print(str(met_data.index)[0:100], ' ... ', str(met_data.index)[-100:])
MultiIndex(levels=[[2002-01-01 00:30:00, 2002-01-01 01:00:00, 2002-01-01 01:30:00, 2002-01-01 02:00:  ...   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
           names=['time', 'site'])

In [43]: print(str(met_data.index)[0:200], '\n...\n', str(met_data.index)[-200:])
MultiIndex(levels=[[2002-01-01 00:30:00, 2002-01-01 01:00:00, 2002-01-01 01:30:00, 2002-01-01 02:00:00, 2002-01-01 02:30:00, 2002-01-01 03:00:00, 2002-01-01 03:30:00, 2002-01-01 04:00:00, 2002-01-01 0 
...
 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
           names=['time', 'site'])

2 million characters is a lot to print... it would be really nice to have a shorter representation.
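
Until a shorter representation is available, one workaround is to summarize the index by hand instead of printing it. The sketch below is only an illustration: `summarize_index` is a hypothetical helper, not a pandas function, and the sample index is made up rather than taken from the data above.

```python
import pandas as pd


def summarize_index(index, n=3):
    """Hypothetical helper: a short description of a large MultiIndex."""
    head = list(index[:n])
    tail = list(index[-n:]) if len(index) > n else []
    return {
        "length": len(index),
        "names": list(index.names),
        "head": head,
        "tail": tail,
    }


# Made-up index with the same level names as the example above.
mi = pd.MultiIndex.from_product(
    [range(1000), ["Tumba"]], names=["time", "site"]
)
summary = summarize_index(mi)
print(summary["length"], summary["names"])
print(summary["head"], "...", summary["tail"])
```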

@jreback
Contributor

jreback commented May 15, 2017

@naught101 see #12423
