# PostgreSQL Partitioning for Zabbix

This is a declarative partitioning implementation for the Zabbix `history*`, `trends*`, and `auditlog` tables on PostgreSQL. It is intended to replace standard Zabbix housekeeping for the configured tables. Partitioning is especially useful in large environments because it takes the housekeeper out of the cleanup path for these tables: instead of huge `DELETE` queries over several million rows, fast DDL statements (`ALTER TABLE`) drop an entire partition at once.

> [!WARNING]
> 1. **Data Visibility**: After enabling partitioning, old data remains in `*_old` tables and is **NOT visible** in Zabbix. You must migrate data manually if needed.
> 2. **Disable Housekeeping**: You **MUST** disable Zabbix Housekeeper for History and Trends in *Administration -> Housekeeping*.

## Table of Contents

- [Architecture](#architecture)
  - [Components](#components)
- [Installation](#installation)
- [Configuration](#configuration)
  - [Modifying Retention](#modifying-retention)
- [Maintenance](#maintenance)
  - [Scheduling Maintenance](#scheduling-maintenance)
- [Monitoring & Permissions](#monitoring--permissions)
  - [Versioning](#versioning)
  - [Least Privilege Access (`zbxpart_monitor`)](#least-privilege-access-zbxpart_monitor)
- [Implementation Details](#implementation-details)
  - [`auditlog` Table](#auditlog-table)
  - [Converting Existing Tables](#converting-existing-tables)
- [PostgreSQL Tuning](#postgresql-tuning)
- [Uninstall / Reverting](#uninstall--reverting)
- [Upgrades](#upgrades)

## Architecture

The solution uses PostgreSQL native declarative partitioning (`PARTITION BY RANGE`).
All procedures, information, statistics, and configuration are stored in the `partitions` schema to maintain full separation from the Zabbix schema.
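
As a rough illustration of what range partitioning on `clock` looks like, here is a minimal sketch. The table `demo_history` and the partition name are made up for this example and are not the DDL produced by the enablement script; the only assumption carried over from Zabbix is that `clock` is an integer Unix timestamp.

```sql
-- Illustrative stand-in, not a real Zabbix table.
CREATE TABLE demo_history (
    itemid bigint           NOT NULL,
    clock  integer          NOT NULL,  -- Unix timestamp
    value  double precision NOT NULL
) PARTITION BY RANGE (clock);

-- One daily partition: 2024-01-01 00:00:00 UTC up to (not including) 2024-01-02.
CREATE TABLE demo_history_p2024_01_01
    PARTITION OF demo_history
    FOR VALUES FROM (1704067200) TO (1704153600);

-- Retention becomes a metadata operation instead of a row-by-row DELETE.
DROP TABLE demo_history_p2024_01_01;

-- Clean up the demo parent table.
DROP TABLE demo_history;
```

The upper bound of a range partition is exclusive, so consecutive partitions can share a boundary value without overlapping.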

### Components
1. **Configuration Table**: `partitions.config` defines retention policies.
2. **Maintenance Procedure**: `partitions.run_maintenance()` manages partition lifecycle.
3. **Monitoring View**: `partitions.monitoring` provides system state visibility.
4. **Version Table**: `partitions.version` records the installed version of the partitioning solution.

## Installation

> [!IMPORTANT]
> **Please refer to the [MANUAL.md](MANUAL.md) for the complete, step-by-step, foolproof installation instructions.**
> The manual contains critical safety procedures, backup warnings, and copy-pasteable commands for a safe deployment.

## Configuration

Partitioning policies are defined in the `partitions.config` table.

| Column | Type | Description |
|--------|------|-------------|
| `table_name` | text | Name of the Zabbix table (e.g., `history`, `trends`). |
| `period` | text | Partition interval: `day`, `week`, or `month`. |
| `keep_history` | interval | Data retention period (e.g., `30 days`, `12 months`). |
| `future_partitions` | integer | Number of future partitions to pre-create (buffer). Default: `5`. |
| `last_updated` | timestamp | Timestamp of the last successful maintenance run. |
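
To review the current policies before changing anything, a plain query over the columns listed above is usually enough. This is only a convenience sketch; the ordering is arbitrary.

```sql
-- Show the retention policy for every managed table.
SELECT table_name,
       period,
       keep_history,
       future_partitions,
       last_updated
FROM partitions.config
ORDER BY table_name;
```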

### Modifying Retention
To change the retention period for a table, update the configuration:

```sql
UPDATE partitions.config
SET keep_history = '60 days'
WHERE table_name = 'history';
```
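
Because the maintenance procedure drops partitions older than `keep_history`, shortening the value removes data on the next run. It can help to preview the approximate cutoff first. The query below is a rough sketch: it assumes the cutoff is simply `now() - keep_history`, which may not match exactly how `partitions.run_maintenance()` rounds to partition boundaries.

```sql
-- Approximate point in time before which partitions become eligible for dropping.
SELECT table_name,
       keep_history,
       now() - keep_history AS approx_drop_before
FROM partitions.config
ORDER BY table_name;
```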

## Maintenance

The maintenance procedure `partitions.run_maintenance()` is responsible for:
1. Creating future partitions (current period + `future_partitions` buffer).
2. Creating past partitions (backward coverage based on `keep_history`).
3. Dropping partitions older than `keep_history`.

This procedure should be scheduled to run periodically (e.g., daily via `pg_cron` or system cron).

```sql
CALL partitions.run_maintenance();
```
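
After a run, the partitions that currently exist can be checked directly in the PostgreSQL catalogs. The following is a sketch that assumes the partitioned `history` table lives in the `public` schema; adjust the name for other tables or schemas.

```sql
-- List every partition attached to public.history with its bounds and size.
SELECT child.relname                                     AS partition_name,
       pg_get_expr(child.relpartbound, child.oid)        AS bounds,
       pg_size_pretty(pg_total_relation_size(child.oid)) AS total_size
FROM pg_inherits AS i
JOIN pg_class AS child ON child.oid = i.inhrelid
WHERE i.inhparent = 'public.history'::regclass
ORDER BY child.relname;
```

The `bounds` column shows the `FOR VALUES FROM (...) TO (...)` clause of each partition, which makes gaps in coverage easy to spot.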

### Scheduling Maintenance

To ensure partitions are created in advance and old data is cleaned up, schedule the maintenance procedure to run automatically.

It is recommended to run the maintenance **twice a day**, away from round hours (e.g., at 05:30 and 23:30), because of the way the housekeeper works.
* **Primary Run**: Creates new future partitions and drops old ones.
* **Secondary Run**: Acts as a safety check. Since the procedure is idempotent (safe to run multiple times), a second run ensures everything is consistent if the first run failed or was interrupted.

You can schedule this using one of the following methods:

#### Option 1: `pg_cron` (Recommended)
`pg_cron` is a cron-based job scheduler that runs inside the database as an extension. It is especially useful for cloud-based databases such as AWS RDS, Aurora, Azure, and GCP, because it is available there as a managed extension and handles authentication and connections for you. On these services you do **not** need to install OS packages. On RDS, for example, modify the Parameter Group to include `shared_preload_libraries = 'pg_cron'` and `cron.database_name = 'zabbix'`, reboot the instance, and execute `CREATE EXTENSION pg_cron;`.

**Setup `pg_cron` (Self-Hosted):**
1. Install the package via your OS package manager (e.g., `postgresql-15-cron` on Debian/Ubuntu, or `pg_cron_15` on RHEL/CentOS).
2. Configure it by editing `postgresql.conf`:
   ```ini
   shared_preload_libraries = 'pg_cron'
   cron.database_name = 'zabbix'
   ```
3. Restart PostgreSQL:
   ```bash
   systemctl restart postgresql
   ```
4. Connect to your `zabbix` database as a superuser and create the extension:
   ```sql
   CREATE EXTENSION pg_cron;
   ```
5. Schedule the job to run:
   ```sql
   SELECT cron.schedule('zabbix_partition_maintenance', '30 5,23 * * *', 'CALL partitions.run_maintenance();');
   ```

**⚠️ Troubleshooting `pg_cron` Connection Errors:**
If your cron jobs fail to execute and you see `FATAL: password authentication failed` in your PostgreSQL logs, it is because `pg_cron` attempts to connect via TCP (`localhost`) by default, which usually requires a password.

**Solution A: Use Local Unix Sockets (Easier)**
Edit your `postgresql.conf` to force `pg_cron` to use the local Unix socket (which uses passwordless `peer` authentication):
```ini
cron.host = '/var/run/postgresql' # Or '/tmp', depending on your OS
```
*(Restart PostgreSQL after making this change).*

**Solution B: Provide a Password (`.pgpass`)**
If you *must* connect via TCP with a specific database user and password, the `pg_cron` background worker needs a way to authenticate. You provide this by creating a `.pgpass` file for the OS `postgres` user.
1. Switch to the OS database user:
   ```bash
   sudo su - postgres
   ```
2. Create or append your database credentials to `~/.pgpass` using the format `hostname:port:database:username:password`:
   ```bash
   echo "localhost:5432:zabbix:zabbix:my_secure_password" >> ~/.pgpass
   ```
3. Set strict permissions (PostgreSQL will ignore the file if permissions are too loose):
   ```bash
   chmod 0600 ~/.pgpass
   ```

**Managing `pg_cron` Jobs:**
To verify or manage your scheduled jobs, run the following as a superuser:
- To **list all active schedules**: `SELECT * FROM cron.job;`
- To **view execution logs/history**: `SELECT * FROM cron.job_run_details;`
- To **remove/unschedule** the job: `SELECT cron.unschedule('zabbix_partition_maintenance');`
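
To narrow the execution history down to the maintenance job, the two tables can be joined. This sketch assumes a reasonably recent `pg_cron` release in which `cron.job` has a `jobname` column and `cron.job_run_details` records `status` and `return_message`.

```sql
-- Last ten runs of the maintenance job, newest first.
SELECT j.jobname,
       d.status,
       d.return_message,
       d.start_time,
       d.end_time
FROM cron.job_run_details AS d
JOIN cron.job AS j USING (jobid)
WHERE j.jobname = 'zabbix_partition_maintenance'
ORDER BY d.start_time DESC
LIMIT 10;
```

A `status` of `failed` together with the text in `return_message` is usually the quickest pointer to authentication or permission problems.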

#### Option 2: Systemd Timers
Systemd timers provide better logging and error handling than standard cron.

1. Create a service file **`/etc/systemd/system/zabbix-partitions.service`**:
   ```ini
   [Unit]
   Description=Zabbix PostgreSQL Partition Maintenance
   After=network.target postgresql.service

   [Service]
   Type=oneshot
   User=postgres
   ExecStart=/usr/bin/psql -d zabbix -c "CALL partitions.run_maintenance();"
   ```

2. Create a timer file **`/etc/systemd/system/zabbix-partitions.timer`**:
   ```ini
   [Unit]
   Description=Run Zabbix Partition Maintenance Twice Daily

   [Timer]
   OnCalendar=*-*-* 05:30:00
   OnCalendar=*-*-* 23:30:00
   Persistent=true

   [Install]
   WantedBy=timers.target
   ```

3. Enable and start the timer:
   ```bash
   systemctl daemon-reload
   systemctl enable --now zabbix-partitions.timer
   ```

#### Option 3: System Cron (`crontab`)
Standard system cron is a simple fallback.

**Example Crontab Entry (`crontab -e`):**
```bash
# Run Zabbix partition maintenance twice daily (5:30 AM and 11:30 PM)
30 5,23 * * * psql -U zabbix -d zabbix -c "CALL partitions.run_maintenance();" >> /var/log/zabbix_maintenance.log 2>&1
```

**Docker Environment:**
If running in Docker, you can execute it via the host's cron by targeting the container:
```bash
30 5,23 * * * docker exec zabbix-db-test psql -U zabbix -d zabbix -c "CALL partitions.run_maintenance();"
```

## Monitoring & Permissions

System state can be monitored via the `partitions.monitoring` view. It reports the number of future partitions, the time of the last maintenance run, and the total size of each partitioned table in bytes.

```sql
SELECT * FROM partitions.monitoring;
```
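
For a quick manual health check, the view can be filtered down to tables that need attention. This is a sketch only: the column names are the ones used by the Zabbix Agent query in this document, and the one-day staleness threshold is an arbitrary assumption chosen to match a twice-daily schedule.

```sql
-- Tables whose future-partition buffer is short or whose maintenance looks stale.
SELECT table_name,
       configured_future_partitions,
       actual_future_partitions,
       now() - last_updated AS time_since_maintenance
FROM partitions.monitoring
WHERE actual_future_partitions < configured_future_partitions
   OR last_updated IS NULL
   OR last_updated < now() - interval '1 day';
```

An empty result means every managed table has its full buffer and was maintained within the last day.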

### Zabbix Agent Integration
To monitor the state of the partitions directly from Zabbix, you need to provide the Zabbix Agent with the SQL query used to fetch this data. You can generate the required `partitions.get_all.sql` file on your agent host with this command:

```bash
cat << 'EOF' | sudo tee /etc/zabbix/zabbix_agent2.d/partitions.get_all.sql > /dev/null
SELECT
    table_name,
    period,
    keep_history::text AS keep_history,
    configured_future_partitions,
    actual_future_partitions,
    total_size_bytes,
    EXTRACT(EPOCH FROM (now() - last_updated)) AS age_seconds
FROM partitions.monitoring;
EOF
```
*(Make sure to adjust the destination path according to your Zabbix Agent template directory)*

### Versioning
To check the installed version of the partitioning solution:
```sql
SELECT * FROM partitions.version ORDER BY installed_at DESC LIMIT 1;
```

### Least Privilege Access (`zbxpart_monitor`)
For monitoring purposes, it is highly recommended to create a dedicated user with read-only access to the monitoring view instead of using the `zbxpart_admin` owner account.

```sql
CREATE USER zbxpart_monitor WITH PASSWORD 'secure_password';
GRANT USAGE ON SCHEMA partitions TO zbxpart_monitor;
GRANT SELECT ON partitions.monitoring TO zbxpart_monitor;
```

> [!WARNING]
> Because `03_monitoring_view.sql` uses a `DROP VIEW` command to apply updates, re-running the script will destroy all previously assigned `GRANT` permissions. If you ever update the view script, you **must** manually re-run the `GRANT SELECT` command above to restore access for the `zbxpart_monitor` user!
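
Whether the grants are still in place after a view update can be checked with PostgreSQL's built-in privilege functions. This is a small sketch using the role and object names from this section.

```sql
-- Both columns should be true for the monitoring user to work.
SELECT has_schema_privilege('zbxpart_monitor', 'partitions', 'USAGE')             AS schema_usage,
       has_table_privilege('zbxpart_monitor', 'partitions.monitoring', 'SELECT') AS view_select;
```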

## Implementation Details

### `auditlog` Table
The standard Zabbix `auditlog` table has a primary key on `(auditid)`. Partitioning by `clock` requires the partition key to be part of the primary key.
To avoid taking a heavy, blocking lock on the `auditlog` table while altering its primary key, the enablement script (`02_enable_partitioning.sql`) detects it and handles it exactly like the history tables: it renames the live, existing table to `auditlog_old` and immediately creates a new, empty partitioned `auditlog` table with the required `(auditid, clock)` composite primary key.
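
The primary key constraint can be seen in isolation with a throwaway table. This is an illustrative sketch: `demo_auditlog` is a simplified stand-in with only two columns, and the column types are assumptions rather than the exact Zabbix schema.

```sql
-- Simplified stand-in for auditlog; the real table has many more columns.
CREATE TABLE demo_auditlog (
    auditid varchar(25) NOT NULL,
    clock   integer     NOT NULL,
    PRIMARY KEY (auditid, clock)   -- the partition key "clock" must be included
) PARTITION BY RANGE (clock);

-- Declaring PRIMARY KEY (auditid) alone on a table partitioned by "clock"
-- is rejected by PostgreSQL, which is why the composite key is needed.

DROP TABLE demo_auditlog;
```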

### Converting Existing Tables
The enablement script keeps downtime close to zero by renaming the existing tables to `table_name_old` and creating new partitioned tables matching the exact schema.
* **Note**: To minimize downtime, data from the old tables is NOT migrated automatically.
* New data flows into the new partitioned tables immediately.
* Old data remains accessible in `table_name_old` for manual lookup or migration if required.
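
If recent data from an old table is needed in Zabbix again, it can be copied back manually. The statement below is only a sketch under several assumptions: the new `history` table has the same column order as `history_old`, `clock` is an integer Unix timestamp, and partitions covering the whole retention window already exist (the maintenance procedure creates past partitions based on `keep_history`). A row whose `clock` falls outside every partition makes the whole statement fail.

```sql
-- Copy only rows that fall inside the configured retention window.
INSERT INTO history
SELECT *
FROM history_old
WHERE clock >= (
    SELECT extract(epoch FROM now() - c.keep_history)::integer
    FROM partitions.config AS c
    WHERE c.table_name = 'history'
);
```

On large installations a single statement like this can run for a long time, so copying in smaller `clock` ranges is worth considering.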

### Housekeeper Interceptor
Even when Zabbix Housekeeping is disabled in the UI for History and Trends, the Zabbix Server daemon may still generate and insert tasks into the `housekeeper` table (e.g., when an item or trigger is deleted, it schedules the deletion of its historical data). Without intervention, the `housekeeper` table bloats over time, leading to slow sequential scans and `autovacuum` overhead.

To prevent this, this extension installs a `BEFORE INSERT` trigger on the `housekeeper` table.
* When Zabbix attempts to insert a housekeeper task, the trigger intercepts it and checks if the target table is managed in `partitions.config`.
* If the table is partitioned (like `history`), the trigger **silently discards the insert** (`RETURN NULL`), preventing disk I/O and table bloat entirely.
* If the table is not partitioned (like `events` or `sessions`), the task is allowed to be recorded and is cleaned up naturally by Zabbix.
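
The logic described above could look roughly like the following. This is a reading aid, not the shipped implementation: the function and trigger names are hypothetical, it assumes the target table name is stored in the `housekeeper.tablename` column, and it should not be installed next to the real trigger. `EXECUTE FUNCTION` requires PostgreSQL 11 or later.

```sql
-- Hypothetical names; the trigger installed by this project may differ in detail.
CREATE OR REPLACE FUNCTION partitions.demo_skip_housekeeper()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    -- Is the task aimed at a table managed by the partitioning config?
    IF EXISTS (SELECT 1
               FROM partitions.config AS c
               WHERE c.table_name = NEW.tablename) THEN
        RETURN NULL;  -- discard the row; dropping partitions handles cleanup
    END IF;
    RETURN NEW;       -- unmanaged table: let Zabbix record the task as usual
END;
$$;

CREATE TRIGGER demo_skip_housekeeper
    BEFORE INSERT ON housekeeper
    FOR EACH ROW
    EXECUTE FUNCTION partitions.demo_skip_housekeeper();
```

Returning `NULL` from a row-level `BEFORE INSERT` trigger makes PostgreSQL skip the insert for that row without raising an error, so the Zabbix Server sees a successful statement.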

## PostgreSQL Tuning

Before or immediately after enabling partitioning, you should tune your `postgresql.conf`. The default configuration is not optimized for partitioned tables and might cause performance degradation or out-of-memory errors.

| Parameter | Recommended | Description |
|-----------|-------------|-------------|
| `max_locks_per_transaction` | `512` (or higher) | **Requires DB Restart.** Default is `64`, which is far too low. PostgreSQL takes a lock on every partition an operation touches. With many partitions (e.g., `history` x 30 days), operations like `pg_dump`, `VACUUM`, or queries crossing multiple partition boundaries can fail with *“out of shared memory”*. |
| `jit` | `off` | **Highly Recommended.** JIT compilation adds per-query overhead. With many partitions, JIT can drastically increase CPU usage as PostgreSQL attempts to optimize simple queries across dozens of partitions. |

**Default parameters to verify:**
The following are usually set correctly by default, but you should verify them:
* `enable_partition_pruning = on` : **Critical.** Ensures PostgreSQL only queries the necessary partitions instead of scanning everything.
* `enable_partitionwise_join = off` : Zabbix does not do massive joins on history tables; enabling this only wastes planner CPU time.
* `enable_partitionwise_aggregate = off` : Zabbix doesn't perform complex DB-side `GROUP BY` aggregations on history. Leave it disabled.
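
On a self-hosted server, the two recommended values can also be applied from a superuser session instead of editing `postgresql.conf` by hand. This is a sketch; managed services such as RDS generally do not allow `ALTER SYSTEM`, so use their parameter groups instead.

```sql
ALTER SYSTEM SET max_locks_per_transaction = 512;  -- takes effect only after a restart
ALTER SYSTEM SET jit = off;                        -- takes effect after a reload
SELECT pg_reload_conf();

-- Verify current values; pending_restart = true means a restart is still needed.
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('max_locks_per_transaction',
               'jit',
               'enable_partition_pruning',
               'enable_partitionwise_join',
               'enable_partitionwise_aggregate')
ORDER BY name;
```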

## Uninstall / Reverting

If you wish to stop using partitioning and revert to standard, unpartitioned tables without data loss, carefully follow these steps.

> [!CAUTION]
> Reverting partitioning replaces your partitioned tables with standard empty tables. If you need to retain data from the partitioned period, you must manually migrate it before dropping the partitioned tables. **Always stop Zabbix Server before proceeding.**

1. **Stop Zabbix Server** to prevent new data from being inserted during the transition.
2. **Execute Undo Script:** Run the `04_undo_partitioning.sql` script to recreate non-partitioned tables matching your original Zabbix schema. This script renames your current partitioned tables to `*_part` (`history_part`, `trends_part`, etc.) and creates clean, native tables (`history`, `trends`) in their place.
   ```bash
   psql -h "$DB_HOST" -U zbxpart_admin -d zabbix -f 04_undo_partitioning.sql
   ```
3. **Data Migration (Optional):** If you want to keep the metrics collected during the partitioned period, you must manually insert them into the newly created regular tables. This step can take hours depending on table sizes.
   ```sql
   INSERT INTO history SELECT * FROM history_part;
   INSERT INTO trends SELECT * FROM trends_part;
   -- Repeat for all tables you wish to restore
   ```
4. **Cleanup:** Once you have migrated the data you need (or if you don't need it at all), you can drop the heavy partitioned tables and remove the partitioning infrastructure completely.
   ```sql
   DROP TABLE history_part CASCADE;
   DROP TABLE history_uint_part CASCADE;
   -- Repeat for all *_part tables ...

   -- To drop the automatic maintenance infrastructure:
   DROP SCHEMA partitions CASCADE;
   ```
5. **Start Zabbix Server & Re-enable Housekeeper:** Once the tables are replaced, you can start the server. *Don't forget to re-enable Housekeeping for History and Trends in the Zabbix UI!*

## Upgrades
1. **Backup**: Ensure a full database backup exists.
2. **Compatibility**: Zabbix upgrade scripts may attempt to `ALTER` tables. PostgreSQL supports `ALTER TABLE` on partitioned tables for adding columns, which propagates to partitions.
3. **Failure Scenarios**: If an upgrade script fails due to partitioning, the table may need to be temporarily reverted or the partition structure manually adjusted.
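
The column propagation mentioned in the compatibility note can be tried safely on a throwaway table. This sketch uses made-up names (`demo_trends`, `demo_trends_p1`, `note`) and does not touch any Zabbix table.

```sql
CREATE TABLE demo_trends (
    itemid bigint  NOT NULL,
    clock  integer NOT NULL
) PARTITION BY RANGE (clock);

CREATE TABLE demo_trends_p1
    PARTITION OF demo_trends
    FOR VALUES FROM (0) TO (1000);

-- Adding a column on the parent also adds it to every partition.
ALTER TABLE demo_trends ADD COLUMN note text;

SELECT column_name
FROM information_schema.columns
WHERE table_name = 'demo_trends_p1'
ORDER BY ordinal_position;
-- expected columns: itemid, clock, note

-- Dropping the parent removes its partitions as well.
DROP TABLE demo_trends;
```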