trinosqlwindow-functionudfdata-platform

Practical Trino SQL Patterns — Window Functions, UNNEST, Lambdas, and UDFs

A roundup of SQL patterns we use daily for analytics in Trino: ranking and sessionization with window functions, flattening arrays/maps with UNNEST, transforming collections with lambdas and higher-order functions, federated joins, and SQL/Python UDFs.

Data DynamicsJune 5, 20268 min read

Trino follows ANSI SQL faithfully, yet it also ships powerful extensions for semi-structured data — arrays, maps, JSON — and higher-order functions. Once you know these tools, work you used to post-process in application code can be done in a single SQL statement.

This post collects the SQL patterns we reach for most often in real-world analytics on Trino — in the order window functions → UNNEST → lambdas/higher-order functions → federation → UDFs — with copy-paste-ready examples.

1. Window Functions — Computing Across Rows

Window functions compute over a "set of related rows" for each row, without collapsing rows the way GROUP BY does.

Per-Group Ranking / Top-N

-- Top 3 products by revenue per category
SELECT category, product, revenue
FROM (
  SELECT category, product, revenue,
         row_number() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
  FROM sales
)
WHERE rn <= 3;

Function	Behavior
`row_number()`	No ties — always 1, 2, 3, …
`rank()`	Allows ties, skips the next rank (1, 1, 3)
`dense_rank()`	Allows ties, no gaps (1, 1, 2)

Running Totals / Moving Averages

SELECT
  d,
  revenue,
  sum(revenue) OVER (ORDER BY d ROWS UNBOUNDED PRECEDING) AS running_total,
  avg(revenue) OVER (ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma7
FROM daily_sales;

A ROWS BETWEEN ... PRECEDING AND CURRENT ROW frame defines a sliding window. This is essential for time-series metrics like 7-day moving averages and running totals.

LAG / LEAD — Referencing the Previous/Next Row

-- Time gap since the previous event
SELECT
  user_id, event_time,
  event_time - lag(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS gap
FROM events;

The Sessionization Pattern

With LAG plus a conditional running sum, you can implement session boundaries like "start a new session after 30 minutes of inactivity" in pure SQL.

WITH marked AS (
  SELECT user_id, event_time,
         CASE WHEN event_time - lag(event_time)
                   OVER (PARTITION BY user_id ORDER BY event_time)
                   > INTERVAL '30' MINUTE
              THEN 1 ELSE 0 END AS is_new_session
  FROM events
)
SELECT user_id, event_time,
       sum(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM marked;

2. UNNEST — Flattening Arrays/Maps into Rows

Trino treats composite types like ARRAY, MAP, and ROW as first-class citizens. UNNEST expands a collection into rows.

-- Turn rows where one order holds multiple products (an array) into per-product rows
SELECT order_id, item
FROM orders
CROSS JOIN UNNEST(items) AS t(item);
 
-- Expand with an index (ordinal position) via WITH ORDINALITY
SELECT order_id, idx, item
FROM orders
CROSS JOIN UNNEST(items) WITH ORDINALITY AS t(item, idx);
 
-- Expand a MAP (key, value)
SELECT id, k, v
FROM tbl
CROSS JOIN UNNEST(attributes) AS t(k, v);

Going the other way — collecting rows into an array — use array_agg:

SELECT order_id, array_agg(item ORDER BY item) AS items
FROM order_items
GROUP BY order_id;

3. Working with JSON

Log and event payloads frequently arrive as JSON strings.

SELECT
  json_extract_scalar(payload, '$.user.id')      AS user_id,
  json_extract_scalar(payload, '$.event.type')   AS event_type,
  cast(json_extract(payload, '$.items') AS ARRAY(VARCHAR)) AS items
FROM raw_events;

Function	Returns
`json_extract_scalar`	A scalar value (as a string)
`json_extract`	A JSON fragment (for further parsing)
`json_parse` / `cast(... AS JSON)`	String → JSON type

4. Lambdas and Higher-Order Functions — Transforming Collections in SQL

One of Trino's real strengths is its higher-order functions that apply a function to an array. You can transform, filter, and aggregate arrays in place — no UNNEST required.

SELECT
  transform(prices, x -> x * 1.1)                 AS with_tax,   -- map
  filter(scores, x -> x >= 60)                    AS passed,     -- filter
  reduce(nums, 0, (s, x) -> s + x, s -> s)        AS total,      -- fold/sum
  array_sort(tags)                                AS sorted_tags,
  any_match(flags, x -> x = true)                 AS has_true,
  cardinality(filter(scores, x -> x >= 60))       AS pass_count
FROM t;

Higher-order function	Purpose
`transform(arr, x -> ...)`	Element transformation (map)
`filter(arr, x -> ...)`	Conditional filtering
`reduce(arr, init, (s,x)->..., s->...)`	Accumulation (fold)
`any_match` / `all_match` / `none_match`	Condition checks
`zip_with(a, b, (x,y)->...)`	Combine two arrays

Example: "count of scores ≥ 60 in an array" — which used to require UNNEST + GROUP BY — becomes a one-liner with cardinality(filter(...)).

5. Federation — Joins Across Catalogs

This is the feature that defines Trino's identity: joining data from different systems in a single query.

-- Events on object storage (Iceberg) × customer master in an RDBMS (PostgreSQL)
SELECT e.event_type, c.tier, count(*) AS cnt
FROM iceberg.analytics.events e
JOIN postgresql.crm.customers c ON e.user_id = c.id
WHERE e.event_time >= TIMESTAMP '2026-06-01 00:00:00 UTC'
GROUP BY e.event_type, c.tier;
 
-- Load data from one system into another (federated CTAS)
CREATE TABLE iceberg.analytics.customer_snapshot AS
SELECT * FROM postgresql.crm.customers;

Performance tip: RDBMS connectors push WHERE clauses and aggregations down to the source database via predicate/aggregate pushdown. Small dimension tables are a natural fit for broadcast joins, while joins between large tables need table statistics so the CBO can pick the right distribution strategy.

6. A Toolbox of Everyday Functions

-- Safe casting (NULL on failure)
try_cast(value AS INTEGER)
 
-- NULL-safe comparison / substitution
coalesce(a, b, 0)
nullif(a, 0)             -- NULL when a=0 (guards against division by zero)
 
-- Dates (the unit comes first!)
date_add('day', 7, current_date)
date_diff('hour', t1, t2)
date_trunc('month', event_time)
format_datetime(event_time, 'yyyy-MM-dd')
 
-- Strings
split(path, '/')                      -- to an array
regexp_extract(ua, '(\d+)\.(\d+)', 1)
url_extract_host(url)
 
-- Approximate aggregation (fast on large data)
approx_distinct(user_id)              -- approximate distinct count
approx_percentile(latency, 0.95)      -- p95

On large datasets, approx_distinct / approx_percentile are far faster and use much less memory than an exact count(distinct). Use them liberally wherever a small error margin is acceptable, such as dashboard metrics.

7. UDFs — When Built-In Functions Aren't Enough

You can encapsulate recurring logic in a function.

SQL UDFs (Inline / Session or Catalog Registration)

-- Define on the fly inside a query (WITH FUNCTION)
WITH FUNCTION to_won(usd DOUBLE)
  RETURNS DOUBLE
  RETURN usd * 1330.0
SELECT product, to_won(price_usd) AS price_krw
FROM catalog.schema.products;

Even complex logic can be expressed as a SQL UDF, and registering it permanently in a catalog lets the whole team reuse it.

CREATE FUNCTION catalog.schema.mask_email(email VARCHAR)
RETURNS VARCHAR
RETURN regexp_replace(email, '(^.).*(@.*$)', '$1***$2');

Python UDFs

Logic that's awkward to express with built-in or SQL functions can be written as a Python UDF (executed in a sandbox). This is handy for tasks where procedural code feels natural, such as string normalization or custom parsing.

CREATE FUNCTION catalog.schema.py_slug(s VARCHAR)
RETURNS VARCHAR
LANGUAGE PYTHON
WITH (handler = 'slug')
AS $$
def slug(s):
    return "-".join(s.lower().split()) if s else None
$$;

Caution: UDFs — Python UDFs especially — are powerful, but they're a black box to the optimizer. Wrapping a column in a UDF inside a WHERE clause can break pushdown and pruning. For filters over large data, prefer built-in functions and keep UDFs in the projection (SELECT) stage.

8. Anti-Patterns — SQL That Wrecks Performance

Anti-pattern	Problem	Alternative
`WHERE CAST(ts AS DATE) = ...`	Breaks pushdown and pruning	Range comparison `ts >= ... AND ts < ...`
`WHERE year(ts) = 2026`	Function wrapping → full scan	Range comparison
`SELECT *`	Scans/transfers columns you don't need	Only the columns you need
Overusing `count(distinct big_col)`	Memory and runtime blow up	`approx_distinct`
Joining the big table first	Intermediate results explode	Statistics + CBO, small side first
Wrapping UDFs in WHERE	Optimizer black box	Move to the SELECT stage

9. Summary

Tool	Use case	Key functions/syntax
Window functions	Ranking, running totals, moving averages, sessionization	`row_number`, `sum() OVER`, `lag/lead`
UNNEST	Arrays/maps → rows	`CROSS JOIN UNNEST ... WITH ORDINALITY`
Higher-order functions	Array transform/filter/aggregate	`transform`, `filter`, `reduce`
JSON	Parsing semi-structured data	`json_extract_scalar`
Federation	Joining heterogeneous sources	`catalog.schema.table` joins
Approximate aggregation	Fast metrics on large data	`approx_distinct`, `approx_percentile`
UDFs	Encapsulating logic	`WITH FUNCTION`, `CREATE FUNCTION`

The essence of Trino SQL is pulling post-processing that used to live in application code back into SQL. Just getting comfortable with window functions and higher-order functions shrinks your pipeline code dramatically, and federation plus approximate aggregation turn analyses that are tedious on other engines into one-liners. Just remember that wrapping functions or UDFs around columns in WHERE clauses breaks pruning — "filter with built-ins, transform freely."

This article is based on the Trino 440 series. If you need help designing complex analytical queries or tuning SQL performance, feel free to reach out.

— The Data Dynamics Engineering Team