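Welcome to the first post in the LETSQL tutorial series!
In this blog post, we take a detour from our usual data pipeline topics to show how to create and publish a Python package with Poetry, using DataFusion as an example.
Harlequin is a TUI client for SQL databases, known for its lightweight yet broad database support. It is a versatile tool for data exploration and analysis workflows. Harlequin provides an interactive SQL editor with features such as autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets. However, Harlequin did not previously have a DataFusion adapter. Thankfully, adding one was really easy.
In this post, we demonstrate these concepts by building a Harlequin adapter for DataFusion. Along the way, we also cover Poetry's essential features, project setup, and the steps for publishing a package on PyPI.
To get the most out of this guide, you should have a basic understanding of virtual environments, Python packages and modules, and pip.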
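Our goals are to introduce Poetry and its essential features, set up a project with it, build a Harlequin adapter for DataFusion, and publish the resulting package on PyPI.
By the end, you will have hands-on experience with Poetry and an understanding of modern Python package management.
The code implemented in this post is available on GitHub, and the package is available on PyPI.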
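Harlequin is a SQL IDE that runs in the terminal. It offers a powerful, feature-rich alternative to traditional command-line database tools, which makes it well suited to data exploration and analysis workflows.
Some key things to know about Harlequin: it talks to each database through an adapter, and it pairs an interactive SQL editor (autocomplete, syntax highlighting, query history) with a results viewer that can handle large result sets.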
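DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.
DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
It also ships with its own CLI; more information is available in the DataFusion documentation.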
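Poetry is a modern, feature-rich tool that streamlines dependency management and packaging for Python projects, making development more deterministic and efficient.
From the documentation:
Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and it will manage (install/update) them for you.
Poetry offers a lockfile to ensure repeatable installs, and can build your project for distribution.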
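A Harlequin adapter is a Python package that allows Harlequin to work with a database system.
An adapter is a Python package that declares an entry point in the harlequin.adapter group. This entry point should reference a subclass of the HarlequinAdapter abstract base class.
This allows Harlequin to discover installed adapters and instantiate the selected adapter at run time.
In addition to the HarlequinAdapter class, the package must also provide implementations of HarlequinConnection and HarlequinCursor. A more detailed description can be found in the Harlequin adapter guide.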
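The discovery step can be sketched with the standard library alone. This is only an illustration of the entry-point mechanism, not Harlequin's actual implementation: the helper name `discover_adapters` is hypothetical, and the group name is the one used in the template's pyproject.toml.

```python
from importlib.metadata import entry_points


def discover_adapters(group: str = "harlequin.adapter") -> dict:
    """Return {entry point name: "module:object" target} for one group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes select() for filtering by group.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    # Prints an empty dict when no adapter is installed in the environment.
    print(discover_adapters())
```

Once an adapter package is installed in the same environment, its name and target should appear in the returned dictionary.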
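The first step in developing a Harlequin adapter is to generate a new repository from the existing harlequin-adapter-template.
GitHub templates are repositories that serve as starting points for new projects. They provide preconfigured files, structure, and settings that are copied into the new repository, so a project can be set up quickly without the overhead of forking.
This feature streamlines the creation of consistent, well-structured projects based on established patterns.
The harlequin-adapter-template comes with a poetry.lock file and a pyproject.toml file, along with some boilerplate code that defines the required classes.
Before getting into the coding details, let's look at the essential files needed for package distribution.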
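The pyproject.toml file is now the standard for configuring Python packages for publication and for other tooling. Introduced in PEP 518 and PEP 621, this TOML-formatted file consolidates multiple configuration files into one. It improves dependency management by making it more robust and standardized.
Poetry uses pyproject.toml to handle the project's virtual environment, resolve dependencies, and build packages.
The template's pyproject.toml is as follows: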

    [tool.poetry]
    name = "harlequin-myadapter"
    version = "0.1.0"
    description = "A Harlequin adapter for <my favorite database>."
    authors = ["Ted Conbeer <tconbeer@users.noreply.github.com>"]
    license = "MIT"
    readme = "README.md"
    packages = [
        { include = "harlequin_myadapter", from = "src" },
    ]

    [tool.poetry.plugins."harlequin.adapter"]
    my-adapter = "harlequin_myadapter:MyAdapter"

    [tool.poetry.dependencies]
    python = ">=3.8.1,<4.0"
    harlequin = "^1.7"

    [tool.poetry.group.dev.dependencies]
    ruff = "^0.1.6"
    pytest = "^7.4.3"
    mypy = "^1.7.0"
    pre-commit = "^3.5.0"
    importlib_metadata = { version = ">=4.6.0", python = "<3.10.0" }

    [build-system]
    requires = ["poetry-core"]
    build-backend = "poetry.core.masonry.api"

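As you can see:
The [tool.poetry] section of the pyproject.toml file is where you define your Python package's metadata, such as its name, version, description, and authors.
The [tool.poetry.dependencies] subsection is where you declare the runtime dependencies your project needs. Running poetry add <package> adds a new dependency to this section.
The [tool.poetry.group.dev.dependencies] subsection is where you declare development-only dependencies, such as testing frameworks and linters.
The [build-system] section stores build-related data. In this case, it specifies the build backend as "poetry.core.masonry.api". In a narrow sense, the core responsibility of a build backend is to build wheels and sdists.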
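Because pyproject.toml is plain TOML, these sections can be inspected with a few lines of Python. The sketch below parses a trimmed copy of the template's file. It assumes Python 3.11+ for tomllib (with a fallback to the third-party tomli package on older versions, which is an assumption about your environment), and the helper name `summarize` is hypothetical.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # assumption: tomli is installed on older Pythons
    import tomli as tomllib

# Trimmed copy of the template's pyproject.toml, embedded for illustration.
PYPROJECT = '''
[tool.poetry]
name = "harlequin-myadapter"
version = "0.1.0"

[tool.poetry.plugins."harlequin.adapter"]
my-adapter = "harlequin_myadapter:MyAdapter"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
harlequin = "^1.7"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
'''


def summarize(text: str) -> dict:
    """Pull out the fields discussed above from a pyproject.toml string."""
    data = tomllib.loads(text)
    poetry = data["tool"]["poetry"]
    return {
        "name": poetry["name"],
        "version": poetry["version"],
        "runtime_deps": sorted(poetry["dependencies"]),
        "plugins": poetry["plugins"]["harlequin.adapter"],
        "build_backend": data["build-system"]["build-backend"],
    }


if __name__ == "__main__":
    for key, value in summarize(PYPROJECT).items():
        print(f"{key}: {value}")
```

Note that the python entry under dependencies is a constraint on the interpreter rather than a package, which is why it shows up alongside harlequin.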
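The repository also includes a poetry.lock file, a Poetry-specific component generated by running poetry install or poetry update. This lock file records the exact versions of all the project's dependencies and sub-dependencies, ensuring reproducible installs across environments.
It is crucial to avoid editing the poetry.lock file by hand, as this can cause inconsistencies and installation problems. Instead, make your changes in the pyproject.toml file and let Poetry update the lock file automatically by running poetry lock.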
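Per Poetry's installation warning:
::: {.warning}
Poetry should always be installed in a dedicated virtual environment to isolate it from the rest of your system. In no case should it be installed in the environment of the project that is to be managed by Poetry.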
:::
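Here we assume you have access to Poetry, installed by running pipx install poetry.
With the file structure clear, let's start development by setting up the environment. Since our project already includes pyproject.toml and poetry.lock files, we can activate our environment with the poetry shell command.
This command activates the virtual environment linked to the current Poetry project, ensuring that all subsequent operations happen within the project's dependency context. If no virtual environment exists, poetry shell creates and activates one automatically.
poetry shell detects your current shell and launches a new instance inside the virtual environment. Because Poetry centralizes virtual environments by default, this command removes the need to find or remember the specific path to the activate script.
To verify which Python environment Poetry is currently using, run: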
poetry env list --full-path
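This shows all virtual environments associated with your project and indicates which one is currently active.
Alternatively, you can get just the full path of the current environment: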
poetry env info -p
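The same question can be answered from inside Python, without Poetry. A minimal sketch (the function name is hypothetical): in an environment created by venv or a recent virtualenv, sys.prefix points at the environment while sys.base_prefix points at the base interpreter.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("environment:", sys.prefix)
    print("virtual environment active:", in_virtualenv())
```

Running the script through poetry run python should report the path of the environment that Poetry manages for the project.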
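Once the environment is active, use poetry install to install the required dependencies. When a poetry.lock file is present, as it is here, the command installs the exact versions recorded in it; otherwise it resolves the dependencies declared in pyproject.toml and writes a new lock file.
To finish setting up the environment, we need to add the datafusion library to our dependencies. Run the following command: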
poetry add datafusion
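This command adds the datafusion package to the pyproject.toml file and installs it. If you don't specify a version, Poetry automatically picks a suitable one based on the available package versions.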
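To confirm which version ended up installed, the distribution metadata can be queried from Python. A minimal sketch, with a hypothetical helper name, that returns None instead of raising when the package is missing:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("datafusion", "harlequin"):
        print(name, "->", installed_version(name))
```

Run it with poetry run python so that the project's environment is the one being inspected.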
To create a Harlequin Adapter, you need to implement three interfaces defined as abstract classes in the harlequin.adapter module.
The first one is the HarlequinAdapter.

    from __future__ import annotations

    from typing import Any, Sequence

    from harlequin.adapter import HarlequinAdapter


    # Implementation of HarlequinAdapter
    class DataFusionAdapter(HarlequinAdapter):
        def __init__(self, conn_str: Sequence[str], **options: Any) -> None:
            self.conn_str = conn_str
            self.options = options

        def connect(self) -> DataFusionConnection:
            # DataFusionConnection is the HarlequinConnection implementation.
            conn = DataFusionConnection(self.conn_str, self.options)
            return conn

The second one is the HarlequinConnection, particularly the methods execute and get_catalog.

    # Implementation of execute, a method of the HarlequinConnection subclass
    def execute(self, query: str) -> HarlequinCursor | None:
        try:
            cur = self.conn.sql(query)  # type: ignore
            # An empty logical plan means there is no result set to display.
            if str(cur.logical_plan()) == "EmptyRelation":
                return None
        except Exception as e:
            raise HarlequinQueryError(
                msg=str(e),
                title="Harlequin encountered an error while executing your query.",
            ) from e
        else:
            if cur is not None:
                return DataFusionCursor(cur)
            else:
                return None

For brevity, we've omitted the implementation of the get_catalog function. You can find the full code in the adapter.py file within our GitHub repository.
Finally, a HarlequinCursor implementation must be provided as well:

    # Implementation of HarlequinCursor
    class DataFusionCursor(HarlequinCursor):
        def __init__(self, *args: Any, **kwargs: Any) -> None:
            self.cur = args[0]
            self._limit: int | None = None

        def columns(self) -> list[tuple[str, str]]:
            # _mapping (defined elsewhere in adapter.py) maps field types to labels.
            return [
                (field.name, _mapping.get(field.type, "?"))
                for field in self.cur.schema()
            ]

        def set_limit(self, limit: int) -> DataFusionCursor:
            self._limit = limit
            return self

        def fetchall(self) -> AutoBackendType:
            try:
                if self._limit is None:
                    return self.cur.to_arrow_table()
                else:
                    return self.cur.limit(self._limit).to_arrow_table()
            except Exception as e:
                raise HarlequinQueryError(
                    msg=str(e),
                    title="Harlequin encountered an error while executing your query.",
                ) from e

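The cursor's contract can be illustrated without DataFusion or Harlequin installed. The sketch below uses hypothetical stand-ins (FakeFrame plays the role of a lazy dataframe, SketchCursor the role of the cursor; neither is a real DataFusion or Harlequin class) to show why set_limit returns self and why the limit is only applied inside fetchall.

```python
from typing import List, Optional


class FakeFrame:
    """Hypothetical stand-in for a lazy dataframe; not a DataFusion class."""

    def __init__(self, rows: List[tuple]) -> None:
        self._rows = rows

    def limit(self, n: int) -> "FakeFrame":
        # Return a new frame restricted to the first n rows.
        return FakeFrame(self._rows[:n])

    def collect(self) -> List[tuple]:
        return list(self._rows)


class SketchCursor:
    """Mirrors the shape of the cursor above, minus the error handling."""

    def __init__(self, frame: FakeFrame) -> None:
        self.frame = frame
        self._limit: Optional[int] = None

    def set_limit(self, limit: int) -> "SketchCursor":
        # Only remember the limit; returning self allows call chaining.
        self._limit = limit
        return self

    def fetchall(self) -> List[tuple]:
        # The limit is applied at fetch time, just before materializing rows.
        if self._limit is None:
            return self.frame.collect()
        return self.frame.limit(self._limit).collect()


rows = [(1, "a"), (2, "b"), (3, "c")]
print(SketchCursor(FakeFrame(rows)).set_limit(2).fetchall())  # [(1, 'a'), (2, 'b')]
```

Deferring the limit this way lets the caller decide how many rows to materialize after the query has been planned, which is useful when a result set is too large to display in full.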
Your adapter must register an entry point in the harlequin.adapter group using the packaging software you use to build your project.
If you use Poetry, you can define the entry point in your pyproject.toml file:

    [tool.poetry.plugins."harlequin.adapter"]
    datafusion = "harlequin_datafusion:DataFusionAdapter"

An entry point is a mechanism for code to advertise components it provides to be discovered and used by other code.
Notice that registering a plugin with Poetry is equivalent to the following pyproject.toml specification for entry points:

    [project.entry-points."harlequin.adapter"]
    datafusion = "harlequin_datafusion:DataFusionAdapter"

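The value on the right-hand side follows the "module:object" form. As a rough illustration of how such a reference is resolved, the sketch below builds an entry point by hand and loads it. It points at json:dumps from the standard library purely so that it runs without the adapter installed, and the group name example.group is made up.

```python
import json
from importlib.metadata import EntryPoint

# Same shape as: datafusion = "harlequin_datafusion:DataFusionAdapter"
ep = EntryPoint(name="dumps", value="json:dumps", group="example.group")

# load() imports the module before the colon and returns the named attribute.
func = ep.load()

print(func is json.dumps)
print(func({"adapter": "datafusion"}))
```

For the adapter, the same call would import harlequin_datafusion and return the DataFusionAdapter class, which Harlequin can then instantiate.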
The template provides a set of pre-configured tests, some of which are applicable to DataFusion while others may not be relevant. One test that's pretty cool checks if the plugin can be discovered, which is crucial for ensuring proper integration:

    import sys

    from harlequin.adapter import HarlequinAdapter
    from harlequin_datafusion import DataFusionAdapter

    # entry_points(group=...) needs the backport before Python 3.10.
    if sys.version_info < (3, 10):
        from importlib_metadata import entry_points
    else:
        from importlib.metadata import entry_points


    def test_plugin_discovery() -> None:
        PLUGIN_NAME = "datafusion"
        eps = entry_points(group="harlequin.adapter")
        assert eps[PLUGIN_NAME]
        adapter_cls = eps[PLUGIN_NAME].load()
        assert issubclass(adapter_cls, HarlequinAdapter)
        assert adapter_cls == DataFusionAdapter

To make sure the tests are passing, run:
poetry run pytest
The run command executes the given command inside the project’s virtualenv.
With the tests passing, we're nearly ready to publish our project. Let's enhance our pyproject.toml file to make our package more discoverable and appealing on PyPI. We'll add key metadata, including Trove classifiers, the README file, and the repository URL.
These additions will help potential users find and understand our package more easily.

    classifiers = [
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Developers",
        "Topic :: Software Development :: User Interfaces",
        "Topic :: Database :: Database Engines/Servers",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: Implementation :: CPython"
    ]
    readme = "README.md"
    repository = "https://github.com/mesejo/datafusion-adapter"

For reference:
We're now ready to build our library and verify its functionality by installing it in a clean virtual environment. Let's start with the build process:
poetry build
This command will create distribution packages (both source and wheel) in the dist directory.
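The wheel file should have a name like harlequin_datafusion-0.1.1-py3-none-any.whl. This follows the standard wheel naming convention, {distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl: here py3 means any Python 3 interpreter, none means no ABI requirement, and any means any platform.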
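To make the fields of a wheel filename concrete, the name can be split programmatically. This is a small sketch with a hypothetical helper name; it also accepts the optional build tag that the wheel specification allows between the version and the Python tag, and it assumes a well-formed filename.

```python
from typing import Dict, Optional


def parse_wheel_filename(filename: str) -> Dict[str, Optional[str]]:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return {
        "distribution": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


if __name__ == "__main__":
    print(parse_wheel_filename("harlequin_datafusion-0.1.1-py3-none-any.whl"))
```

A py3-none-any wheel like this one is pure Python, so the same file can be installed on any platform with a compatible interpreter.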
To test the installation, create a new directory called test_install. Then, set up a fresh virtual environment with the following command:
python -m venv .venv
To activate the virtual environment on macOS or Linux:
source .venv/bin/activate
After running this command, you should see the name of your virtual environment (.venv) prepended to your command prompt, indicating that the virtual environment is now active.
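For reference, the standard library exposes the functionality behind python -m venv through the venv module. The sketch below creates a throwaway environment in a temporary directory. The helper name is hypothetical, and with_pip=False is chosen only to keep the example fast and offline (python -m venv installs pip by default).

```python
import os
import tempfile
import venv
from pathlib import Path


def create_env(parent: Path) -> Path:
    """Create a virtual environment named .venv under the given directory."""
    env_dir = parent / ".venv"
    # Symlink the interpreter on POSIX systems, copy it on Windows.
    venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = create_env(Path(tmp))
        # pyvenv.cfg is the marker file that identifies a virtual environment.
        print((env_dir / "pyvenv.cfg").is_file())
```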
To install the wheel file we just built, use pip as follows:
pip install /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl
Replace /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl with the actual path to the wheel file you want to install.
If everything works fine, you should see some dependencies being installed, and you should be able to run:
harlequin -a datafusion
Congrats! You have built a Python library. Now it is time to share it with the world.
Before publishing to PyPI, it is best practice to publish to the Test Python Package Index (TestPyPI) first.
To publish a package to TestPyPI using Poetry, follow these steps:
Create an account at TestPyPI if you haven't already.
Generate an API token on your TestPyPI account page.
Register the TestPyPI repository with Poetry by running:
poetry config repositories.test-pypi https://test.pypi.org/legacy/
To publish your package, run:
poetry publish -r test-pypi --username __token__ --password <token>
Replace <token> with your TestPyPI API token. To check the upload, install the package from TestPyPI:
python -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple <package-name>
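This command uses two key arguments: --index-url, which makes TestPyPI the primary index, and --extra-index-url, which lets pip fall back to PyPI for dependencies that are not available on TestPyPI.
Replace <package-name> with the name of your package (harlequin-datafusion in our case).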
To publish to the actual Python Package Index (PyPI) instead:
Create an account at https://pypi.org/ if you haven't already.
Generate an API token on your PyPI account page.
Run:
poetry publish --username __token__ --password <token>
The default repository is PyPI, so there's no need to specify it.
It is worth noting that Poetry only supports the Legacy Upload API when publishing your project.
Publishing manually each time is repetitive and error-prone, so let us create a GitHub Action that publishes the package each time we create a release.
Here are the key steps to publish a Python package to PyPI using GitHub Actions and Poetry:
Set up PyPI authentication: Provide your PyPI API token as a GitHub secret so that the GitHub Actions workflow can access it. The workflow below reads the secret as PYPI_TOKEN, so use that name or adjust the workflow to match.
Create a GitHub Actions workflow file: In your project's .github/workflows directory, create a new file like publish.yml with the following content:

    name: Build and publish python package

    on:
      release:
        types: [ published ]

    jobs:
      publish-package:
        runs-on: ubuntu-latest
        permissions:
          contents: write
        steps:
          - uses: actions/checkout@v3
          - uses: actions/setup-python@v4
            with:
              python-version: '3.10'
          - name: Install Poetry
            uses: snok/install-poetry@v1
          - run: poetry config pypi-token.pypi "${{ secrets.PYPI_TOKEN }}"
          - name: Publish package
            run: poetry publish --build --username __token__

The key is to leverage GitHub Actions to automate the publishing process and use Poetry to manage your package's dependencies and metadata.
Poetry is a user-friendly Python package management tool that simplifies project setup and publication. Its intuitive command-line interface streamlines environment management and dependency installation. It supports plugin development, integrates with other tools, and emphasizes testing for robust code. With straightforward commands for building and publishing packages, Poetry makes it easier for developers to share their work with the Python community.
At LETSQL, we're committed to contributing to the developer community. We hope this blog post serves as a straightforward guide to developing and publishing Python packages, emphasizing best practices and providing valuable resources.
To subscribe to our newsletter, visit letsql.com.
As we continue to refine the adapter, we would like to provide better autocompletion and direct reading from files (parquet, csv) as in the DataFusion-cli. This requires a tighter integration with the Rust library without going through the Python bindings.
Your thoughts and feedback are invaluable as we navigate this journey. Share your experiences, questions, or suggestions in the comments below or on our community forum. Let's redefine the boundaries of data science and machine learning integration.
Thanks to Dan Lovell and Hussain Sultan for the comments and the thorough review.