pm.Data transforms its contents into floats inappropriately #3493

rpgoldman · 2019-05-24T17:11:10Z

Whatever the data type that it receives, the pm.Data constructor turns it into a float, using pm.model.pandas_to_array which, in turn, calls pm.floatX. @ferrine says there's a substitute function, pm.theanof.smartfloatX that would be appropriate to substitute.

I need to clear a couple of existing PRs before I can do anything with this. If anyone else would like to grab it in the meantime, that would be great.

Note that we need to check pm.set_data() to see if it does the same thing. I don't understand the other functions in pm.data.py that deal with DataFrames, etc., so it's possible something lurks in there, as well.

The text was updated successfully, but these errors were encountered:

rpgoldman · 2019-08-16T14:55:11Z

I'm assigning this to me, because I think this is fixed per @ferrine 's suggestion, but I need to check.

jmloyola · 2019-08-27T20:02:04Z

Can I give it a try?

I think if we want to support different types of variables for pm.Data as in the observed random variables, we should change the function pm.model.pandas_to_array to do so.
We could add a flag to the function like force_cast=True and later do return pm.floatX(ret) if force_cast else pm.smartfloatX(ret). What do you think?

lucianopaz · 2019-08-27T20:21:56Z

@jmloyola, yes, feel free to give it a try. As you said, we only need to look at pm.model.pandas_to_array. My suggestion is to always cast either to floatX or to intX. We would only need to know which of the two to cast to.

rpgoldman · 2019-08-27T20:32:11Z

That sounds fine, but I think it would be best if force_cast defaulted to False, rather than True, so that we get smartfloatX instead of the draconian floatX by default.
The naive user of pm.Data won't expect their integers to magically turn into floats...
I can't imagine a case where I would want floatX instead of smartfloatX, but probably that's just me.
If you are going to go to the trouble of making a force_cast, why not let the programmer specify what type they want, instead of limiting them to "what smartfloatX will do or what floatX will do"?
In that case it might be better to just have dtype instead of force_cast.

lucianopaz · 2019-08-27T20:40:16Z

I never meant to convert integers to floats. That's exactly the issue raised in this discourse thread. What we would want is to cast any float like to floatX and any int like into intX

jmloyola · 2019-08-27T20:57:53Z

The user does not use the function pm.model.pandas_to_array, it is only used internally by PyMC (as far as I know). Currently this function always casts to floatX, that is why I would put force_cast=True as default.
What I propose is to add the flag to consider the case for pm.Data being of different type.
In this way, we only need to change the line:

shared_object = theano.shared(pm.model.pandas_to_array(value), name)

to

shared_object = theano.shared(pm.model.pandas_to_array(value, force_cast=False), name)

for example. Same situation with this line.

The already existent calls to the function pandas_to_array will remain unchanged since they will still cast to floatX.

Does this make sense?

rpgoldman · 2019-08-27T21:50:34Z

I thought I had swapped the translation to use smartfloatX instead of floatX. I can't say that I recall. But I do recall having a case where a pm.Data full of array indices got magically turned into floats. That was not a happy moment for me.... So when you say

The already existent calls to the function pandas_to_array will remain unchanged since they will still cast to floatX.

I think that may not be a change, but it also may not be a change from The Wrong Thing. Having integers turn into floats seems like a Bad Thing, even if it is the current behavior. I'd prefer to see that particular current behavior change. I can't think of a case where any working code that works with floatX would not work with smartfloatX, unless someone is counting on integers turning into floats, and that seems like a bad thing to do.

Actually, maybe that didn't change -- I see a bunch of places where I constructed Data objects using .to_numpy(dtype="int").

jmloyola · 2019-08-28T12:15:57Z

In the code base, pandas_to_array it is used to transform the data in the class ObservedRV. Thus, I didn't want to change the function to always use smartfloatX because I was not sure how it is used by the users.
But in the other hand,

I can't think of a case where any working code that works with floatX would not work with smartfloatX, unless someone is counting on integers turning into floats, and that seems like a bad thing to do.

I agree with you in this.

I will change the return of the function to return pm.smartfloatX(ret) and I will run the tests locally on my machine to see if they pass. If that is the case I will make a pull request with the changes. Can you help me reviewing it?

If you think of a case that this might fail or a model you have run that didn't work previously (or that you had to .to_numpy(dtype="int")), please copy it here so I can test it. Thank you for the comments 😄.

rpgoldman · 2019-08-28T14:35:04Z

I don't have a model that's easy to pull out of context (it's in the middle of a big pipeline). But my model's inspiration came from one by @junpenglao : Code 10 Schizophrenic case study. This one has variables like subj_id, which (I may be forgetting the details) act as array indices. They are fixed in his model, but when I put this idea into my model, I made the lookup variables be pm.Data, because my subjects' IDs could change between uses of the model.
So I think it would be possible to take this variable, and maybe some others in that notebook, and put them inside pm.Data. If I am right, that should show the problem.
Let me know if that isn't clear, and I'll try to be more explicit.

jmloyola · 2019-08-29T20:28:12Z

I understood. I think the model is similar to this one. I checked it and the proposal we have (changing the return to pm.smartfloatX(ret)) works.
The only problem is that the test test_model_helpers fails in the generator case:

func = pm.model.pandas_to_array
...
generator_output = func(square_generator)
wrapped = generator_output.owner.inputs[0] # IndexError: list index out of range

This is related to the fact that this line in the function pandas_to_array returns a variable of type pymc3.data.GenTensorVariable which inherits from the theano class Variable.
Finally, the function pandas_to_array tries to cast that to floatX. As we used pm.smartfloatX(ret) this depends on the type of ret:

if str(ret.dtype).startswith('float') is True the function casts it to floatX. Then instead of returning something of type pymc3.data.GenTensorVariable it returns something of type theano.tensor.var.TensorVariable. This is because we applied an operation to the theano variable. Now to get the GenTensorVariable we have to use ret.owner.inputs[0] (in the same way as it is in the test).
if str(ret.dtype).startswith('float') is False the function does not cast it to floatX. Then it returns something of type pymc3.data.GenTensorVariable.

The same error can happen with the current code of PyMC if in the test we change this line to:

# The only difference is the dtype
square_generator = (np.array([i*2.0], dtype=float) for i in range(100))

This happens because the function pm.floatX does not need to cast something that it is already of the type desired. So, it returns something of type pymc3.data.GenTensorVariable.

To summarize, the function pandas_to_array sometimes returns a variable of type pymc3.data.GenTensorVariable and sometimes of type theano.tensor.var.TensorVariable. Is this behaviour intended? Can it cause problems? @ferrine @ColCarroll

I will start a pull request with these changes so more people can chip in. Maybe I am overthinking this 😅.

AlexAndorra · 2020-05-26T14:31:16Z

Closing as this was taken care of by #3925

NowanIlfideme · 2020-06-03T11:15:27Z

I just ran into this problem as well. For folks wanting to use a temporary fix, this class will work to BUILD the model and sample, though I assume pm.set_data() won't work:

import theano

class IntData(pm.Data):
    
    def __new__(self, name, value):
        if isinstance(value, list):
            value = np.array(value)

        # Add data container to the named variables of the model.
        try:
            model = pm.Model.get_context()
        except TypeError:
            raise TypeError(
                "No model on context stack, which is needed to instantiate a data container. "
                "Add variable inside a 'with model:' block."
            )
        name = model.name_for(name)

        # `pm.model.pandas_to_array` takes care of parameter `value` and
        # transforms it to something digestible for pymc3
        _val = pm.model.pandas_to_array(value)
        _val = (_val).astype('int64')
        shared_object = theano.shared(_val, name)

        # To draw the node for this variable in the graphviz Digraph we need
        # its shape.
        shared_object.dshape = tuple(shared_object.shape.eval())

        model.add_random_variable(shared_object)

        return shared_object

If you want to do something with set_data() pre-patch, then you'll need to monkey-patch this function in:

https://github.com./pymc-devs/pymc3/pull/3925/files#diff-1f25198dab6a35cd27fffe043b7e1b9dL1485

AlexAndorra · 2020-06-03T11:22:52Z

Hi Anatoly,
Thanks for the suggestion. Are you running on PyMC master though? This should already be fixed by the PR I mentionned above, which allows to use both pm.Data and pm.set_data.
This new feature will be shiped with PyMC 3.9 🖖

NowanIlfideme · 2020-06-03T11:28:02Z

Yes, sorry if I didn't make it clear - I'm running 3.8 (can't use a master version), but my temp fix is for folks who need to run "in production" RIGHT NOW and can't use "unstable versions" (or wait until 3.9 release 😉).

Thanks for your work!

AlexAndorra · 2020-06-03T14:08:07Z

Ow ok, thanks for the clarification!
FWIW, 3.9 should come soon 😉

aseyboldt self-assigned this Jul 22, 2019

rpgoldman self-assigned this Aug 16, 2019

jmloyola mentioned this issue Aug 30, 2019

Do not cast the content of pm.Data to float inappropriately #3611

Closed

AlexAndorra closed this as completed May 26, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pm.Data transforms its contents into floats inappropriately #3493

pm.Data transforms its contents into floats inappropriately #3493

rpgoldman commented May 24, 2019

rpgoldman commented Aug 16, 2019

jmloyola commented Aug 27, 2019

lucianopaz commented Aug 27, 2019

rpgoldman commented Aug 27, 2019

lucianopaz commented Aug 27, 2019

jmloyola commented Aug 27, 2019

rpgoldman commented Aug 27, 2019 •

edited

Loading

jmloyola commented Aug 28, 2019

rpgoldman commented Aug 28, 2019

jmloyola commented Aug 29, 2019

AlexAndorra commented May 26, 2020

NowanIlfideme commented Jun 3, 2020

AlexAndorra commented Jun 3, 2020

NowanIlfideme commented Jun 3, 2020 •

edited

Loading

AlexAndorra commented Jun 3, 2020

pm.Data transforms its contents into floats inappropriately #3493

pm.Data transforms its contents into floats inappropriately #3493

Comments

rpgoldman commented May 24, 2019

rpgoldman commented Aug 16, 2019

jmloyola commented Aug 27, 2019

lucianopaz commented Aug 27, 2019

rpgoldman commented Aug 27, 2019

lucianopaz commented Aug 27, 2019

jmloyola commented Aug 27, 2019

rpgoldman commented Aug 27, 2019 • edited Loading

jmloyola commented Aug 28, 2019

rpgoldman commented Aug 28, 2019

jmloyola commented Aug 29, 2019

AlexAndorra commented May 26, 2020

NowanIlfideme commented Jun 3, 2020

AlexAndorra commented Jun 3, 2020

NowanIlfideme commented Jun 3, 2020 • edited Loading

AlexAndorra commented Jun 3, 2020

rpgoldman commented Aug 27, 2019 •

edited

Loading

NowanIlfideme commented Jun 3, 2020 •

edited

Loading