Skip to content

pm.Data transforms its contents into floats inappropriately #3493

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rpgoldman opened this issue May 24, 2019 · 15 comments
Closed

pm.Data transforms its contents into floats inappropriately #3493

rpgoldman opened this issue May 24, 2019 · 15 comments
Assignees

Comments

@rpgoldman
Copy link
Contributor

Whatever the data type that it receives, the pm.Data constructor turns it into a float, using pm.model.pandas_to_array which, in turn, calls pm.floatX. @ferrine says there's a substitute function, pm.theanof.smartfloatX that would be appropriate to substitute.

I need to clear a couple of existing PRs before I can do anything with this. If anyone else would like to grab it in the meantime, that would be great.

Note that we need to check pm.set_data() to see if it does the same thing. I don't understand the other functions in pm.data.py that deal with DataFrames, etc., so it's possible something lurks in there, as well.

@aseyboldt aseyboldt self-assigned this Jul 22, 2019
@rpgoldman rpgoldman self-assigned this Aug 16, 2019
@rpgoldman
Copy link
Contributor Author

I'm assigning this to me, because I think this is fixed per @ferrine 's suggestion, but I need to check.

@jmloyola
Copy link
Contributor

Can I give it a try?

I think if we want to support different types of variables for pm.Data as in the observed random variables, we should change the function pm.model.pandas_to_array to do so.
We could add a flag to the function like force_cast=True and later do return pm.floatX(ret) if force_cast else pm.smartfloatX(ret). What do you think?

@lucianopaz
Copy link
Contributor

@jmloyola, yes, feel free to give it a try. As you said, we only need to look at pm.model.pandas_to_array. My suggestion is to always cast either to floatX or to intX. We would only need to know which of the two to cast to.

@rpgoldman
Copy link
Contributor Author

That sounds fine, but I think it would be best if force_cast defaulted to False, rather than True, so that we get smartfloatX instead of the draconian floatX by default.
The naive user of pm.Data won't expect their integers to magically turn into floats...
I can't imagine a case where I would want floatX instead of smartfloatX, but probably that's just me.
If you are going to go to the trouble of making a force_cast, why not let the programmer specify what type they want, instead of limiting them to "what smartfloatX will do or what floatX will do"?
In that case it might be better to just have dtype instead of force_cast.

@lucianopaz
Copy link
Contributor

I never meant to convert integers to floats. That's exactly the issue raised in this discourse thread. What we would want is to cast any float like to floatX and any int like into intX

@jmloyola
Copy link
Contributor

The user does not use the function pm.model.pandas_to_array, it is only used internally by PyMC (as far as I know). Currently this function always casts to floatX, that is why I would put force_cast=True as default.
What I propose is to add the flag to consider the case for pm.Data being of different type.
In this way, we only need to change the line:

shared_object = theano.shared(pm.model.pandas_to_array(value), name)

to

shared_object = theano.shared(pm.model.pandas_to_array(value, force_cast=False), name)

for example. Same situation with this line.

The already existent calls to the function pandas_to_array will remain unchanged since they will still cast to floatX.

Does this make sense?

@rpgoldman
Copy link
Contributor Author

rpgoldman commented Aug 27, 2019

I thought I had swapped the translation to use smartfloatX instead of floatX. I can't say that I recall. But I do recall having a case where a pm.Data full of array indices got magically turned into floats. That was not a happy moment for me.... So when you say

The already existent calls to the function pandas_to_array will remain unchanged since they will still cast to floatX.

I think that may not be a change, but it also may not be a change from The Wrong Thing. Having integers turn into floats seems like a Bad Thing, even if it is the current behavior. I'd prefer to see that particular current behavior change. I can't think of a case where any working code that works with floatX would not work with smartfloatX, unless someone is counting on integers turning into floats, and that seems like a bad thing to do.

Actually, maybe that didn't change -- I see a bunch of places where I constructed Data objects using .to_numpy(dtype="int").

@jmloyola
Copy link
Contributor

In the code base, pandas_to_array it is used to transform the data in the class ObservedRV. Thus, I didn't want to change the function to always use smartfloatX because I was not sure how it is used by the users.
But in the other hand,

I can't think of a case where any working code that works with floatX would not work with smartfloatX, unless someone is counting on integers turning into floats, and that seems like a bad thing to do.

I agree with you in this.

I will change the return of the function to return pm.smartfloatX(ret) and I will run the tests locally on my machine to see if they pass. If that is the case I will make a pull request with the changes. Can you help me reviewing it?

If you think of a case that this might fail or a model you have run that didn't work previously (or that you had to .to_numpy(dtype="int")), please copy it here so I can test it. Thank you for the comments 😄.

@rpgoldman
Copy link
Contributor Author

I don't have a model that's easy to pull out of context (it's in the middle of a big pipeline). But my model's inspiration came from one by @junpenglao : Code 10 Schizophrenic case study. This one has variables like subj_id, which (I may be forgetting the details) act as array indices. They are fixed in his model, but when I put this idea into my model, I made the lookup variables be pm.Data, because my subjects' IDs could change between uses of the model.
So I think it would be possible to take this variable, and maybe some others in that notebook, and put them inside pm.Data. If I am right, that should show the problem.
Let me know if that isn't clear, and I'll try to be more explicit.

@jmloyola
Copy link
Contributor

I understood. I think the model is similar to this one. I checked it and the proposal we have (changing the return to pm.smartfloatX(ret)) works.
The only problem is that the test test_model_helpers fails in the generator case:

func = pm.model.pandas_to_array
...
generator_output = func(square_generator)
wrapped = generator_output.owner.inputs[0] # IndexError: list index out of range

This is related to the fact that this line in the function pandas_to_array returns a variable of type pymc3.data.GenTensorVariable which inherits from the theano class Variable.
Finally, the function pandas_to_array tries to cast that to floatX. As we used pm.smartfloatX(ret) this depends on the type of ret:

  • if str(ret.dtype).startswith('float') is True the function casts it to floatX. Then instead of returning something of type pymc3.data.GenTensorVariable it returns something of type theano.tensor.var.TensorVariable. This is because we applied an operation to the theano variable. Now to get the GenTensorVariable we have to use ret.owner.inputs[0] (in the same way as it is in the test).
  • if str(ret.dtype).startswith('float') is False the function does not cast it to floatX. Then it returns something of type pymc3.data.GenTensorVariable.

The same error can happen with the current code of PyMC if in the test we change this line to:

# The only difference is the dtype
square_generator = (np.array([i*2.0], dtype=float) for i in range(100))

This happens because the function pm.floatX does not need to cast something that it is already of the type desired. So, it returns something of type pymc3.data.GenTensorVariable.

To summarize, the function pandas_to_array sometimes returns a variable of type pymc3.data.GenTensorVariable and sometimes of type theano.tensor.var.TensorVariable. Is this behaviour intended? Can it cause problems? @ferrine @ColCarroll

I will start a pull request with these changes so more people can chip in. Maybe I am overthinking this 😅.

@AlexAndorra
Copy link
Contributor

Closing as this was taken care of by #3925

@NowanIlfideme
Copy link

I just ran into this problem as well. For folks wanting to use a temporary fix, this class will work to BUILD the model and sample, though I assume pm.set_data() won't work:

import theano

class IntData(pm.Data):
    
    def __new__(self, name, value):
        if isinstance(value, list):
            value = np.array(value)

        # Add data container to the named variables of the model.
        try:
            model = pm.Model.get_context()
        except TypeError:
            raise TypeError(
                "No model on context stack, which is needed to instantiate a data container. "
                "Add variable inside a 'with model:' block."
            )
        name = model.name_for(name)

        # `pm.model.pandas_to_array` takes care of parameter `value` and
        # transforms it to something digestible for pymc3
        _val = pm.model.pandas_to_array(value)
        _val = (_val).astype('int64')
        shared_object = theano.shared(_val, name)

        # To draw the node for this variable in the graphviz Digraph we need
        # its shape.
        shared_object.dshape = tuple(shared_object.shape.eval())

        model.add_random_variable(shared_object)

        return shared_object

If you want to do something with set_data() pre-patch, then you'll need to monkey-patch this function in:

https://github.com./pymc-devs/pymc3/pull/3925/files#diff-1f25198dab6a35cd27fffe043b7e1b9dL1485

@AlexAndorra
Copy link
Contributor

Hi Anatoly,
Thanks for the suggestion. Are you running on PyMC master though? This should already be fixed by the PR I mentionned above, which allows to use both pm.Data and pm.set_data.
This new feature will be shiped with PyMC 3.9 🖖

@NowanIlfideme
Copy link

NowanIlfideme commented Jun 3, 2020

Yes, sorry if I didn't make it clear - I'm running 3.8 (can't use a master version), but my temp fix is for folks who need to run "in production" RIGHT NOW and can't use "unstable versions" (or wait until 3.9 release 😉).

Thanks for your work!

@AlexAndorra
Copy link
Contributor

Ow ok, thanks for the clarification!
FWIW, 3.9 should come soon 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants