Multiple Entrypoints and Shared State implementation; Request for comments/feedback. #8030
Comments
@JacobSzwejbka @kimishpatel Thoughts?
Ah yeah, this is the recommended way to do this for now. I've brought it up with export before (cc @angelayi) because it has an obvious failure mode: what happens if one of the functions you are monkey-patching internally calls forward?
Yeah, I was thinking about this from your comment earlier. At least for the use case of trying to export Hugging Face models, there needs to be a wrapper anyway, in order to make the static-kv cache implementation part of the exported model state. If you go with the assumption that we will be writing a lightweight wrapper for the model class, and we just make sure that the wrapper does not have a forward method, then I am pretty sure it should be safe. Alternatively, one could auto-generate a wrapper class, placing whatever model you want into self.model, and then have methods that call into it.
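A minimal sketch of that idea (class and method names are hypothetical, not executorch or transformers API): the wrapper deliberately defines no forward, so temporarily patching forward onto it for each export can never collide with an internal self.forward() call inside the wrapped model.

```python
class ExportWrapper:
    """Lightweight wrapper that intentionally has NO forward method.

    Each entrypoint to export is a named method delegating to the
    wrapped model, so patching `forward` onto the wrapper class for
    one export cannot be reached from the wrapped model's own code.
    """

    def __init__(self, model):
        self.model = model

    def encode(self, tokens):
        return self.model.encoder(tokens)

    def decode(self, tokens):
        return self.model.decoder(tokens)


def with_patched_forward(wrapper, method_name, export_fn):
    """Patch `forward` to the chosen method, run the exporter, restore.

    `export_fn` stands in for a call like torch.export.export(...).
    """
    cls = type(wrapper)
    cls.forward = getattr(cls, method_name)
    try:
        return export_fn(wrapper)
    finally:
        del cls.forward  # the wrapper has no forward again
```

The restore in the `finally` block keeps the "no forward" invariant even if the exporter raises.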
My initial impression is that the individual PRs all seem good. The C++ pattern looks good (SharedMemoryManager). I'll take a deeper look at the Python exporter tomorrow. I'm a little hesitant to introduce a new wrapper on top of things in Python; I'd like to see if there's a way we can take the solutions in your exporter and just integrate them into EdgeProgramManager.
This is very cool, and thank you so much for writing up your process and the hurdles you had to work through during export -- it was a great read!
I agree that export does not have a good model for multiple entrypoints, but for now, having a wrapper class generally works for exporting methods.
### Summary

This is a set of changes that I have been using to get executorch working with multiple entrypoints with shared state. This is related to: #8030

### Issue & Changes

Currently, trying to export a function like:

```python
def init(self):
    self.shared_state.fill_(0)
```

will fail, because in `to_edge` (in exir's `program.py`) the lifted constants are assumed to be placed before the first user input, which causes weird behavior when there are no user inputs. I changed it to place the lifted constant after the last buffer when there are no user inputs to place it before. However, this relies on my understanding of the implicit layout of the graph inputs.

Later, after the memory planning phase of `to_executorch`, it validates that memory planning was correct based on whether the `graph_input_allocated` flag is set. This flag only applies to user inputs, of which we have none, so the check errors out. I added a check to bypass this error if there are no user inputs, but I honestly do not understand enough of the validation check to know if that is appropriate.

### Comments

In the current executorch, with no support for shared state, this case does not make sense, but #8030 is my attempt at adding that capability, and having initialization methods that init the buffers from constants is useful, especially since their initial state is undefined.

Currently this is not ready: I have no tests, my random notes are left in the commit as comments, and other than validating that it worked while I was working on the shared-state export I have not done anything in depth. But I would like feedback on whether my solution seems correct, or whether I am missing something -- particularly regarding my understanding of placeholder ordering and signature logic, and whether bypassing the `graph_input_allocated` validation is appropriate. Out of all the changes I had to make for shared state, this is the one I am least sure about.

cc @JacobSzwejbka @angelayi
🚀 The feature, motivation and pitch
Sorry this is kind-of long...
Overview
Hi, I am not really sure what the process for this is, but I made some changes and a wrapper around executorch to be able to have shared state between multiple exported methods of a torch.nn.Module.
For now I have implemented this in my own repo: https://github.com./cptspacemanspiff/execu-tools
My initial motivation for this is that when exporting Encoder/Decoder models one would really like to use a precalculated cross attention cache, but this is hard to do without conditional branching (via torch.cond)/rewriting or extraneous copies/calculations.
My high-level thought process is that, for the most part, for a given model class (sequence-to-sequence, decoder-only, etc.), the locations of forced graph breaks in the torch export process (due to data-dependent branching or the like) are mostly in the same place. If you export the individual parts in Python and then glue them together with minimal code on the C++ side, the C++ runtime implementation should be able to generalize to other models of the same class. From what I can tell, this is already the case for decoder-only models, but for other architectures that still have state it is less well defined.
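As a toy illustration of the "export the parts, glue with minimal runtime code" idea (all names here are illustrative stand-ins, not the repo's API), a greedy decode loop can treat the exported encoder and decoder as opaque callables, keeping the data-dependent loop and stopping condition outside the graphs:

```python
def greedy_generate(encoder, decoder, prompt, eos, max_len=16):
    """Glue code around two exported graph parts.

    `encoder` and `decoder` stand in for loaded executorch methods;
    the data-dependent loop and stop condition live out here, which is
    exactly what torch.export cannot capture in one graph.
    """
    cross_cache = encoder(prompt)      # run once, reused every step
    tokens = list(prompt)
    for _ in range(max_len):
        next_tok = decoder(tokens, cross_cache)
        tokens.append(next_tok)
        if next_tok == eos:
            break
    return tokens
```

Because the encoder runs once and its output is reused, there is no need for conditional branching inside the exported graphs to skip recomputing the cross-attention cache.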
Currently, I have 2 examples exported and working with greedy search, and there were changes to the model's attention to support non-fixed batch sizes and encoder sequence lengths that are still not merged into transformers.
What I am asking for:
I have been working on this on my own, but it seems like it is something that would be more generally useful. I just started really digging into the torch.export process/ executorch in the past month so may have some fundamental misunderstandings.
Aside from the changes required in executorch (of which there are surprisingly few), most of my changes were implemented in my own repository, mainly because it was easier to set up and I was able to avoid futzing around with the executorch build system. That being said, if this is functionality that aligns with executorch's goals, I would like to get it merged in -- at least the exporter class that manages everything, alongside the C++ memory manager that handles things on the runtime side -- though I would need advice on that.
Also, some specific things I have done to get this working are somewhere on the scale between somewhat janky and very janky. I would appreciate feedback/thoughts on alternatives/advice.
This is also providing a pointer/context issue for the executorch changes/pull requests (see bottom).
I am making a few pull requests that are not fully fleshed out with regards to testing/linting/etc, but illustrate what changes I needed to make to the executorch repository.
Alternatives
I looked at using torch.cond to explicitly implement the required branching logic. The issue I ran across was that each branch must be fully functional, with the same graph signature for each submodule on each branch. This forces the overall method signature to be the same in all cases, which leads to extra copy-backs even when data is not modified in a branch.
The other alternative is writing a lot of C++ code, which adds complexity and bugs when inevitably it does not exactly match the python implementation.
RFC (Optional)
More details are here:
README: https://github.com./cptspacemanspiff/execu-tools
And a more in-depth rationale/build log here: https://github.com./cptspacemanspiff/execu-tools/blob/main/DesignReasoning.md
What this looks like:
For the python export:
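The original snippet did not survive here, so as a hedged sketch only (MultiEntryPointExporter and its method names are hypothetical, modeled loosely on the linked repo, not a confirmed API), the export side might look like a small registry that exports each named entrypoint and tracks which buffers they share:

```python
class MultiEntryPointExporter:
    """Hypothetical sketch: collect named entrypoints, export each,
    and record which fully-qualified buffer names the methods share."""

    def __init__(self, model):
        self.model = model
        self.methods = {}
        self.shared = set()

    def register(self, name, fn):
        self.methods[name] = fn

    def register_shared_buffer(self, fqn):
        self.shared.add(fqn)

    def export(self, export_fn):
        # export_fn stands in for a call like torch.export.export;
        # here it is any callable taking (model, method_name).
        return {name: export_fn(self.model, name) for name in self.methods}
```

The shared-buffer names would then feed the memory pass that pins those buffers to a common address across all exported methods.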
C++
High level:
Currently most of the wrapping logic is sitting in my own custom repo here: https://github.com./cptspacemanspiff/execu-tools
This is somewhat based on comments from #7458. While there were some changes to executorch itself, it is mostly a wrapper with a custom memory pass that places the shared state buffers at a common memory address.
Specifically, the main steps in the process are:
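The list of steps did not survive the page scrape, but the core memory trick can be sketched as a toy planner (all names illustrative, not the actual pass): shared buffers get one fixed offset in a dedicated arena, reused identically in every method's plan, while ordinary tensors are planned per method:

```python
def plan_memory(methods, shared_buffers, align=16):
    """Toy planner: returns {method: {tensor: (arena_id, offset)}}.

    Arena 0 holds shared state at offsets fixed across ALL methods,
    so every entrypoint sees the same bytes; arena 1 is scratch,
    planned independently per method.
    """
    shared_offsets, cursor = {}, 0
    for name, size in sorted(shared_buffers.items()):
        shared_offsets[name] = cursor
        cursor += -(-size // align) * align   # round size up to alignment
    plans = {}
    for method, tensors in methods.items():
        plan, scratch = {}, 0
        for tname, size in tensors.items():
            if tname in shared_offsets:
                plan[tname] = (0, shared_offsets[tname])  # same everywhere
            else:
                plan[tname] = (1, scratch)
                scratch += -(-size // align) * align
        plans[method] = plan
    return plans
```

At runtime the equivalent of arena 0 would be one backing allocation handed to every method (roughly the SharedMemoryManager's job), so an `init` method's writes are visible to `decode`.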
Related issues/pull requests (I am still in the process of uploading these and will add issue numbers as I get them):
#7515 -> #7810 -> Fix: copy-back buffers are not the same
Export of function with no user input fails: #8031
Allow for multi-method exports to use the ETRecord and the inspector-cli: #8336
Add ability to pass hierarchical allocator to runtime Module method: #8032
Allow to override default event tracing names: #8033
cc @JacobSzwejbka @angelayi