-
Notifications
You must be signed in to change notification settings - Fork 111
Distributed compute refactoring #1047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Deploying datachain-documentation with
|
Latest commit: |
c86c9a6
|
Status: | ✅ Deploy successful! |
Preview URL: | https://73b23262.datachain-documentation.pages.dev |
Branch Preview URL: | https://distributed-refactoring.datachain-documentation.pages.dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a refactoring of the distributed compute components along with several improvements and cleanups across tests, UDF dispatching, SQL query handling, and CLI commands. Key changes include:
- Consolidation and cleanup of test functions and internal UDF dispatch APIs.
- Refinements in error handling, queue management, and process parallelism to improve robustness.
- Updates in SQL query constructions, particularly in the metastore update, that require careful validation.
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
tests/unit/test_dispatch.py | Consolidation of imports and removal of a redundant test case. |
tests/test_query_e2e.py | Updated expected process return codes in the interruption test. |
src/datachain/utils.py | Minor improvements in parallel processing configuration. |
src/datachain/query/utils.py | Stricter type checking for query column retrieval. |
src/datachain/query/udf.py | Removal of the unused abstract static method run_worker. |
src/datachain/query/dispatch.py | Refactoring of UDF dispatch logic with new catalog property and queue handling. |
src/datachain/query/dataset.py | Improved exception handling and use of walrus operator in dataset population. |
src/datachain/query/batch.py | Deferred import for SELECT_BATCH_SIZE moved into the function. |
src/datachain/lib/listing.py | Updated list function naming to better reflect its purpose. |
src/datachain/data_storage/warehouse.py | New method added to select dataset rows from a list of IDs. |
src/datachain/data_storage/metastore.py | Revised dataset version update with a new SQL WHERE clause. |
src/datachain/cli/parser/utils.py | Updated internal commands for argument parsing. |
src/datachain/cli/init.py | Adjustments to handle UDF dispatch commands. |
Comments suppressed due to low confidence (5)
tests/unit/test_dispatch.py:38
- The removal of the test_send_stop_signal_to_workers function reduces the test coverage for UDFDispatcher's behavior when sending stop signals. Consider reintroducing a test or ensuring coverage through alternative tests.
- def test_send_stop_signal_to_workers():
tests/test_query_e2e.py:200
- The expected process return code has been changed from 1 to 11. Please verify that the update to (interrupt_exit_code, 11) is intentional and consistent with the underlying interruption handling logic.
- if process.returncode not in (interrupt_exit_code, 1):
src/datachain/query/udf.py:44
- The abstract static method run_worker has been removed. Ensure that its removal aligns with the overall UDF design and that no consumers rely on this method for worker execution.
- @staticmethod
src/datachain/query/dispatch.py:166
- The call to self.run_udf_parallel within run_udf may unintentionally recurse into the same method rather than invoking the separate parallel processing logic defined later. Consider reviewing the method naming or call structure to avoid potential confusion or recursion issues.
+ self.run_udf_parallel(n_workers, input_rows, download_cb, processed_cb, generated_cb)
src/datachain/query/utils.py:25
- The updated check now raises an error if the column is not an instance of Column, which might exclude valid ColumnElement types like TextClause. Confirm that this stricter type check is intended.
+ if col is None or not isinstance(col, Column):
dv = self._datasets_versions | ||
self.db.execute( | ||
self._datasets_versions_update() | ||
.where(dv.c.dataset_id == dataset.id and dv.c.version == version) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the Python 'and' operator in the SQL WHERE clause may not build the intended SQL expression. Consider replacing it with sqlalchemy.and_(dv.c.dataset_id == dataset.id, dv.c.version == version) to ensure proper SQL clause formation.
.where(dv.c.dataset_id == dataset.id and dv.c.version == version) | |
.where(and_(dv.c.dataset_id == dataset.id, dv.c.version == version)) |
Copilot uses AI. Check for mistakes.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1047 +/- ##
==========================================
- Coverage 88.19% 88.09% -0.10%
==========================================
Files 146 146
Lines 12507 12570 +63
Branches 1742 1755 +13
==========================================
+ Hits 11030 11073 +43
- Misses 1053 1065 +12
- Partials 424 432 +8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
for more information, see https://pre-commit.ci
Draft, description to be added soon.