Why PostgreSQL?

Nguyễn Tuấn Anh

ubolonton@gmail.com

Data Modeling

Powerful DDL

  • DEFAULT can be an expression (e.g. custom primary key generation)
  • Custom logic when changing a column's data type
  • CHECK constraints
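A quick sketch of all three features together (the `invoices` table and its columns are made up for illustration):

```sql
CREATE SEQUENCE invoice_seq;

CREATE TABLE invoices (
    -- DEFAULT can be any expression, e.g. a custom key format
    id TEXT PRIMARY KEY DEFAULT 'INV-' || nextval('invoice_seq'),
    -- CHECK constraint
    amount NUMERIC NOT NULL CHECK (amount > 0),
    created TIMESTAMP DEFAULT now()
);

-- Custom logic when changing a column's data type:
-- USING tells PostgreSQL how to convert the existing values
ALTER TABLE invoices
    ALTER COLUMN created TYPE TIMESTAMP WITH TIME ZONE
    USING created AT TIME ZONE 'UTC';
```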

Transactional

  • Freedom to try out schemas
  • Migrations don't make the system unavailable
  • No inconsistent state on failed migration
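Because DDL is transactional, a migration either fully applies or leaves no trace. A minimal sketch (table and column names are hypothetical):

```sql
BEGIN;
ALTER TABLE orders ADD COLUMN note TEXT;
ALTER TABLE orders RENAME COLUMN amount TO total_amount;
-- Changed your mind? The schema is untouched:
ROLLBACK;
-- With COMMIT instead, both changes would become visible atomically.
```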

Rich Data Types

Strongly typed

Base

  • Text, numeric types
  • Boolean
  • Timestamp, time interval

Advanced

  • Geometric types
  • Bit string
  • Money
  • Network address

Array

  • Type-checked
  • Queryable: comparison, containment, concatenation, aggregation
CREATE TABLE events (
    weekday INTEGER[]
);
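For example, some of the queries the `weekday` array supports, using standard PostgreSQL array operators:

```sql
-- Containment: events that happen (at least) on Monday (1)
SELECT * FROM events WHERE weekday @> ARRAY[1];

-- Overlap: events on Monday or Friday
SELECT * FROM events WHERE weekday && ARRAY[1, 5];

-- Concatenation
SELECT weekday || ARRAY[6] FROM events;
```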

Range

  • Integers, floats, timestamps, dates
  • Inclusive, exclusive, infinity
  • Queryable: containment, overlap, union…
  • Indexable
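A sketch using `tsrange` (the `reservations` table is made up; the exclusion constraint needs the `btree_gist` extension):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE reservations (
    room INTEGER,
    -- '[' = inclusive bound, ')' = exclusive bound
    during TSRANGE,
    -- Indexable: no two overlapping bookings for the same room
    EXCLUDE USING gist (room WITH =, during WITH &&)
);

INSERT INTO reservations VALUES
    (1, '[2014-02-13 10:00, 2014-02-13 12:00)');

-- Containment
SELECT * FROM reservations
WHERE during @> '2014-02-13 11:00'::timestamp;
```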

JSON

  • Queryable: field, nested field, to array, length
  • Indexable

Can be used as a document store!

CREATE TABLE json_log(
    item JSON
);

INSERT INTO json_log (item)
VALUES
('{"remote_addr": "130.185.109.239","remote_user": "-","time_local": "13/Feb/2014:07:13:17 -0500","request": "GET /robots.txt HTTP/1.1","status": "200","body_bytes_sent": "0","http_referer": "-","http_user_agent": "PagesInventory (robot http://www.pagesinvenotry.com)"}'),
...
('{"remote_addr": "46.137.98.52","remote_user": "-","time_local": "13/Feb/2014:11:37:47 -0500","request": "HEAD / HTTP/1.0","status": "200","body_bytes_sent": "0","http_referer": "-","http_user_agent": "-"}');

SELECT item->>'remote_addr', item->>'request', item->>'http_user_agent'
FROM json_log
WHERE item->>'http_user_agent' LIKE '%robot%';

Composite

CREATE TYPE log_item AS (
    remote_addr TEXT,
    time_local TIMESTAMP,
    request TEXT,
    http_user_agent TEXT
);

CREATE TABLE server_log (
    id SERIAL PRIMARY KEY,
    item log_item
);

INSERT INTO server_log (item)
  SELECT ROW(item->>'remote_addr', (item->>'time_local')::timestamp, item->>'request', item->>'http_user_agent')::log_item
  FROM json_log;

SELECT (item).*
FROM server_log
WHERE (item).http_user_agent LIKE '%robot%';

Data Querying

Window Functions

  • Like aggregation, but doesn't collapse rows into groups
  • Each row has access to the aggregated result of the rows related to it (its window)

Examples

Users and their first orders

SELECT DISTINCT u.id, u.name
     , FIRST_VALUE(o.id) OVER creation AS order_id
     , FIRST_VALUE(o.amount) OVER creation AS order_amount
FROM users u
JOIN orders o ON o.user_id = u.id
WINDOW creation AS (PARTITION BY o.user_id ORDER BY o.created);

Users and their biggest orders

SELECT DISTINCT u.id, u.name
     , FIRST_VALUE(o.id) OVER by_amount AS order_id
     , FIRST_VALUE(o.amount) OVER by_amount AS order_amount
FROM users u
JOIN orders o ON o.user_id = u.id
WINDOW by_amount AS (PARTITION BY o.user_id ORDER BY o.amount DESC);

Order's ordinal number per-user

SELECT u.id, u.name, o.id AS order_id, o.amount
     , ROW_NUMBER() OVER creation AS user_order_number
FROM users u
JOIN orders o ON o.user_id = u.id
WINDOW creation AS (PARTITION BY o.user_id ORDER BY o.created);

Common Table Expressions

More readable queries

Break complicated nested logic into sequential steps

WITH regional_sales AS (
        SELECT region, SUM(amount) AS total_sales
        FROM orders
        GROUP BY region
     ), top_regions AS (
        SELECT region
        FROM regional_sales
        WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
     )
SELECT region,
       product,
       SUM(quantity) AS product_units,
       SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;

Recursive queries (hierarchical data)

CREATE TABLE category (
    id SERIAL PRIMARY KEY,
    description TEXT,
    parent_id INTEGER REFERENCES category(id)
);

WITH RECURSIVE subs(id, parent_id, ids, descriptions) AS (
    SELECT id, parent_id
         , ARRAY[id], ARRAY[description]
    FROM category
    UNION ALL
    SELECT s.id, c.parent_id
         , ARRAY_PREPEND(c.id, s.ids)
         , ARRAY_PREPEND(c.description, s.descriptions)
    FROM subs s, category c
    WHERE s.parent_id = c.id
)
SELECT id, ids, descriptions
FROM subs;

Performance

Indexing

  • Partial (e.g. unique if not null)
  • On expression
  • Can tune for sorting

Example

Fast queries on date components (e.g. hour, weekday)

CREATE INDEX bid_created_hour ON bid (date_part('hour'::text, created));

CREATE INDEX bid_created_weekday ON bid (date_part('dow'::text, created));
-- Equivalent, using the SQL-standard syntax:
-- CREATE INDEX bid_created_weekday ON bid (EXTRACT(DOW FROM created));
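The other two bullets, sketched on a hypothetical `users` table and the `bid` table:

```sql
-- Partial index: e.g. unique only among rows that have a value
CREATE UNIQUE INDEX users_email ON users (email)
WHERE email IS NOT NULL;

-- Tuned for sorting: serves "latest bids first" without a sort step
CREATE INDEX bid_created_desc ON bid (created DESC);
```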

Query Planner

  • Whole-query planning (views/sub-queries are not optimization boundaries)
  • Well-documented tuning knobs, no black magic or hidden knowledge
  • Informative EXPLAIN, EXPLAIN ANALYZE
  • For newbies: http://explain.depesz.com/

Example

                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=44203.48..44203.49 rows=1 width=0) (actual time=1621.464..1621.465 rows=1 loops=1)
   ->  Bitmap Heap Scan on bid  (cost=656.38..44126.42 rows=30824 width=0) (actual time=344.487..1430.674 rows=2896367 loops=1)
         Recheck Cond: ((date_part('dow'::text, created) >= 3::double precision) AND (date_part('dow'::text, created) <= 5::double precision))
         Rows Removed by Index Recheck: 7866
         ->  Bitmap Index Scan on bid_created_weekday  (cost=0.00..648.68 rows=30824 width=0) (actual time=342.796..342.796 rows=2896367 loops=1)
               Index Cond: ((date_part('dow'::text, created) >= 3::double precision) AND (date_part('dow'::text, created) <= 5::double precision))
 Total runtime: 1621.679 ms

Extensibility and Other Stuff

Foreign Data Wrapper

  • Query csv log file
  • Query json log file
  • Query MySQL!
  • Query MongoDB!

Example - JSON Log File

CREATE FOREIGN TABLE nginx_log_2
(
    remote_addr TEXT,
    remote_user TEXT,
    time_local TIMESTAMP WITH TIME ZONE,
    request TEXT,
    status TEXT,
    body_bytes_sent TEXT,
    http_referer TEXT,
    http_user_agent TEXT
)
SERVER json_server
OPTIONS (filename '/home/ubolonton/Programming/projects/postgresql-examples/nginx-log/access.log.2.gz', max_error_count '99999');

Example - JSON Log File (continued)

-- Request counts by status
SELECT status, COUNT(*)
FROM nginx_log_cleaned
GROUP BY status
ORDER BY status;

-- Average response size
SELECT AVG(body_bytes_sent)
FROM nginx_log_cleaned;

-- All requests not from me
SELECT COUNT(*)
FROM nginx_log_cleaned
WHERE http_user_agent NOT LIKE '%conkeror%';

-- Requests by hour
WITH tmp AS (
    SELECT *, EXTRACT(HOUR FROM (time_local AT TIME ZONE '+07:00')) AS hour
    FROM nginx_log_cleaned
)
SELECT hour, COUNT(*)
FROM tmp
GROUP BY hour
ORDER BY hour;

json/hstore Data Type

  • Document DB
  • But with powerful query capability
  • And powerful indexing

http://thebuild.com/presentations/pg-as-nosql-pgday-fosdem-2013.pdf
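For instance, with `hstore` (the `docs` table is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE docs (
    attrs HSTORE
);

-- GIN index supports key/value containment queries
CREATE INDEX docs_attrs ON docs USING gin (attrs);

-- Which documents have color = red? Uses the index.
SELECT * FROM docs WHERE attrs @> 'color => red';
```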

Scripting Languages

  • Python
  • Javascript
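E.g. a function written in PL/Python (assumes the `plpythonu` language is installed; `plv8` offers the Javascript equivalent):

```sql
CREATE FUNCTION url_domain(url TEXT) RETURNS TEXT AS $$
    from urlparse import urlparse
    return urlparse(url).netloc
$$ LANGUAGE plpythonu;

SELECT url_domain('http://www.pagesinventory.com/robots.txt');
-- www.pagesinventory.com
```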

Documentation

  • One of the best pieces of technical writing
  • To-the-point, clearly written, extensive

Links

Years of working with MySQL don't give as much understanding as one month of reading the PostgreSQL documentation.

Flame-baiting

A Few MySQL WTFs

  • Views are optimization boundaries
  • Most of the time, subqueries are optimization boundaries (conditions from the outer query cannot be pushed into the inner query; generated in-memory tables have no indexes)
  • Silent conversion between data types, silent truncation (http://dev.mysql.com/doc/refman/5.5/en/silent-column-changes.html)
  • CHECK is ignored (http://dev.mysql.com/doc/refman/5.5/en/create-table.html)
  • DEFAULT can only be a constant (except for TIMESTAMP)
  • Only a single index per table is used in most queries
  • Bad documentation. For example, try finding out the behavior of the transaction isolation levels
  • Always uses a temporary file if the query selects a TEXT/BLOB column (even if the column is not used for filtering/sorting)
  • Silently converts non-aggregated, non-GROUP-BY columns to ANY(…)
  • NOT NULL is ignored for DATE & ENUM in most versions (https://github.com/philipsoutham/py-mysql2pgsql/)

Contact

ubolonton on Github, LinkedIn, Skype, Facebook, Twitter, Google+…

Any question, comment, feedback, or correction is welcome.