Cybertec Training: Data Analysis
Hans-Jürgen Schönig
www.postgresql-support.de
Introduction
Scope of this training
- Importing data
- Simple aggregations
- Windowing and analytics
- Analyzing time series
- Managing incomplete data
- Writing custom aggregates
Importing data
Loading data
Things to consider when importing data
- There are many ways to import data
- Avoid mini-transactions for performance reasons
- In case of large data sets, speed is a major issue
- There is a life after importing
Importing a simple data set
A simple data structure . . .
test=# CREATE TABLE t_test (a int, b int);
CREATE TABLE
Let us add 10,000 rows now . . .
INSERT INTO t_test VALUES (1, 2);
...
INSERT INTO t_test VALUES (1, 2);
Using one transaction to import things
BEGIN;
INSERT INTO t_test VALUES (1, 2);
...
INSERT INTO t_test VALUES (1, 2);
COMMIT;
Observations
- Performance can vary depending on hardware
- Longer transactions can be WAY faster
- PostgreSQL has to flush every transaction to disk
- Most of the time is burned by flushing
Changing durability requirements
- Performance will skyrocket . . .
SET synchronous_commit TO off;
INSERT INTO t_test VALUES (1, 2);
...
INSERT INTO t_test VALUES (1, 2);
- The reason is that PostgreSQL does not have to flush every transaction anymore.
- Trading “durability” for performance
Use bulk loads
- Loading single rows is usually a bad idea
- Use COPY to do bulk loading
- COPY can load data A LOT faster than INSERT due to significantly smaller overhead
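A practical aside: when the input file lives on the client rather than on the database server, psql's \copy meta-command offers the same bulk path over the connection (the file name here is just a placeholder):

\copy t_test FROM 'data.txt'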
A simple COPY
This time 10 million rows are imported (10,000 rows are not enough):
COPY t_test FROM stdin;
1 2
1 2
...
\.
Note the performance difference (rows per second). There is no need to check column lists, existence of the table, etc. anymore -> higher throughput
COPY: Observations to be made
- In the default configuration (checkpoint_segments = 3) you will see a steady up and down of I/O speed
- This is caused by checkpoints happening in the background
- Data has to go to the transaction log to “repair” data files in case of a crash
- Performance is limited by writing data twice
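These swings can be flattened by increasing the distance between checkpoints before a bulk load. The values below are illustrative only; note that checkpoint_segments was replaced by max_wal_size in PostgreSQL 9.5:

# postgresql.conf -- illustrative values, tune for your hardware
checkpoint_segments = 32             # pre-9.5 releases
# max_wal_size = 4GB                 # 9.5 and later
checkpoint_completion_target = 0.9   # spread checkpoint I/O over time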
COPY: Using a transaction log bypass
- Writing to the transaction log can be avoided in some cases:
BEGIN;
TRUNCATE t_test;
COPY t_test FROM stdin;
...
COMMIT;
COPY: Why the bypass works
- TRUNCATE will schedule the removal of the data file on COMMIT
- COPY will start writing to a new data file
- Concurrency is not an issue because TRUNCATE locks the table
- PostgreSQL can take the old or the new data file on COMMIT or ROLLBACK
- No need to actually repair a data file anymore
COPY: More on WAL-bypassing
- Creating the table in the same transaction also allows the bypass:

BEGIN;
CREATE TABLE ...

- WAL bypassing only works if you are not using streaming replication
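Spelled out, the pattern might look like this (table name and data are placeholders):

BEGIN;
CREATE TABLE t_import (a int, b int); -- created in this transaction
COPY t_import FROM stdin;             -- these rows can skip the WAL
1 2
...
\.
COMMIT;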
Freezing rows
- Before you import data, run . . .

ALTER TABLE t_test
SET (autovacuum_enabled = off);

- (Disabling autovacuum ensures that nothing sets the bits in the background while we test.)
- Compare the timing of the first

SELECT count(*) FROM t_test;

with the second one.
Observations
- The first run is a lot slower
- During the first run, writes will happen
- No more writes from the second run on
- PostgreSQL sets bits in the background
The purpose of hint bits
- When a row is read for the first time, PostgreSQL checks whether it can be seen by everybody.
- This bit is set to make sure that PostgreSQL does not have to go through expensive visibility checks next time.
- This is an issue for big data sets
Fixing after-import performance
- To set hint bits straight away do a

test=# COPY t_test FROM '/tmp/file.txt' FREEZE;

- Be careful. It only works in certain cases . . .

ERROR: cannot perform FREEZE because the table
was not created or truncated in the
current subtransaction
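A working variant therefore truncates (or creates) the table in the same transaction, for example:

BEGIN;
TRUNCATE t_test;
COPY t_test FROM '/tmp/file.txt' FREEZE;
COMMIT;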
VACUUM and hint bits
- Bits can be set for an entire block as well (not just for rows)
- VACUUM will set those hint bits
- However, block-level bits usually do not speed things up as much as row-level bits
- For a heavily used read-only database system, vacuuming data can actually make sense (not to reclaim space)
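On a freshly loaded, read-mostly table an explicit run is therefore worth considering:

VACUUM t_test;   -- sets the bits ahead of time instead of during the first reads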
Importing some test data:
Here is some test data:
CREATE TABLE t_oil (country text,
year int,
production int);
COPY t_oil FROM PROGRAM
'curl www.cybertec.at/secret/oil.txt';
Simple aggregations
Basic aggregation
test=# SELECT country, avg(production)
FROM t_oil
GROUP BY 1;
country | avg
---------------+-----------------------
USA | 9141.3478260869565217
Saudi Arabien | 7641.8260869565217391
(2 rows)
GROUP BY is needed for aggregates
- A GROUP BY clause is needed because otherwise groups cannot be built:
test=# SELECT country, avg(production) FROM t_oil;
ERROR: column "t_oil.country" must appear in the
GROUP BY clause or be used in an
aggregate function
LINE 1: SELECT country, avg(production) FROM t_oil;
HAVING: Filtering on aggregated data
test=# SELECT country, avg(production)
FROM t_oil
GROUP BY 1
HAVING avg(production) > 8000;
country | avg
---------+-----------------------
USA | 9141.3478260869565217
(1 row)
- NOTE that an alias is not allowed in a HAVING clause
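For instance, the following is rejected; the alias (chosen here purely for illustration) is not visible to HAVING:

SELECT country, avg(production) AS avg_prod
FROM t_oil
GROUP BY 1
HAVING avg_prod > 8000;
ERROR: column "avg_prod" does not exist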
Windowing and analytics
The purpose of windowing
An analogy:

- Your car is not a valuable car because it is good
- It is valuable because it is better than the ones driven by your friends

This is what windowing does: it puts the current row in relation to all rows in the reference group.
Windowing vs GROUP BY
- GROUP BY has been designed to reduce the amount of data and turn it into aggregated values
- Windowing is used to compare values and put them into relation
- Windowing is used along with aggregate functions (e.g. sum, count, avg, min, max, . . . )
A simple aggregate: Average values
SELECT *, avg(production) OVER () FROM t_oil ;
country | year | production | avg
---------------+------+------------+---------------
USA | 1965 | 9014 | 8391.58695652
USA | 1966 | 9579 | 8391.58695652
USA | 1967 | 10219 | 8391.58695652
USA | 1968 | 10600 | 8391.58695652
USA | 1969 | 10828 | 8391.58695652
...
What does the result mean?
- ‘Give me all rows and the average “over” all rows in the table’
- Logically it is the same as . . .
SELECT *, (SELECT avg(production) FROM t_oil) AS avg
FROM t_oil;
- However, subselects can be very nasty if the task is a more complex one
OVER()-clauses can define order
- Calculate max production up to a certain point
SELECT *, max(production) OVER (ORDER BY year)
FROM t_oil
WHERE country = 'Saudi Arabien';
- Saudi Arabia is a so-called ‘swing producer’.
- Note that max stays up even if production declines
OVER()-clauses can form groups
- Averages for each country
SELECT *, avg(production)
OVER (PARTITION BY country)
FROM t_oil;
country | year | production | avg
---------------+------+------------+---------------
Saudi Arabien | 1965 | 2219 | 7641.82608695
Saudi Arabien | 1966 | 2615 | 7641.82608695
...
USA | 1965 | 9014 | 9141.34782608
USA | 1966 | 9579 | 9141.34782608
...
Forming groups
- Data is split into groups
- Each row shows the average of all rows in its group
- Note that we got one group (= window) per country
OVER() can contain order and groups
SELECT *, max(production)
OVER (PARTITION BY country ORDER BY year)
FROM t_oil;
- In this case we get the maximum up to a given point
- This is done for each country
Abstracting window-clauses
SELECT *,
min(production) OVER (w),
max(production) OVER (w),
count(production) OVER (w)
FROM t_oil
WINDOW w AS (PARTITION BY country ORDER BY year);

- The same clause can be used for many columns
- Many window-clauses may exist (w, w2, w3, etc.)
rank() and dense_rank()

- Data can be ranked according to some order
- In case of duplicates:
  - rank gives 1, 2, 2, 2, 5
  - dense_rank gives 1, 2, 2, 2, 3
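A minimal sketch against t_oil showing both functions side by side (output omitted):

SELECT year, production,
    rank() OVER (ORDER BY production DESC),
    dense_rank() OVER (ORDER BY production DESC)
FROM t_oil WHERE country = 'USA';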
Moving rows: lag

- ORDER BY defines into which direction to “move” the row
- The number defines the offset
SELECT *, lag(production, 1) OVER (ORDER BY year)
FROM t_oil WHERE country = 'USA';
country | year | production | lag
---------+------+------------+-------
USA | 1965 | 9014 |
USA | 1966 | 9579 | 9014
USA | 1967 | 10219 | 9579
USA | 1968 | 10600 | 10219
Calculating the change in production
- This is now a very easy thing to do
SELECT *, production - lag(production, 1)
OVER (ORDER BY year)
FROM t_oil WHERE country = 'USA';
country | year | production | ?column?
---------+------+------------+----------
USA | 1965 | 9014 |
USA | 1966 | 9579 | 565
USA | 1967 | 10219 | 640
USA | 1968 | 10600 | 381
lead is the opposite of lag
- lag is the same as ‘lead(. . . , -1)’
- lag pushes elements down
- lead pushes elements up
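For symmetry with the lag query above, a quick sketch: lead shows next year's production on the current row:

SELECT *, lead(production, 1) OVER (ORDER BY year)
FROM t_oil WHERE country = 'USA';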
moving entire rows
SELECT *, lag(t_oil, 1) OVER (ORDER BY year)
FROM t_oil WHERE country = 'USA';
country | year | production | lag
---------+------+------------+------------------
USA | 1965 | 9014 |
USA | 1966 | 9579 | (USA,1965,9014)
USA | 1967 | 10219 | (USA,1966,9579)
USA | 1968 | 10600 | (USA,1967,10219)
- The composite type can then be dissected using a subselect in the current query
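A sketch of such a dissection; the alias prev is chosen for illustration, and the parentheses around it are required to access the composite's fields:

SELECT (prev).year, (prev).production
FROM (SELECT *, lag(t_oil, 1) OVER (ORDER BY year) AS prev
      FROM t_oil WHERE country = 'USA') AS x;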
works for more than just one column
SELECT *, lag((year, production), 1)
OVER (ORDER BY year)
FROM t_oil WHERE country = 'USA';
country | year | production | lag
---------+------+------------+--------------
USA | 1965 | 9014 |
USA | 1966 | 9579 | (1965,9014)
USA | 1967 | 10219 | (1966,9579)
USA | 1968 | 10600 | (1967,10219)
- This is the perfect foundation to build custom aggregates to solve complex problems
Splitting data into equal parts
- ntile can split your data into n equally sized blocks
- ntile(4) will therefore give you a nice quantile distribution
- Order is needed to achieve that
Here is how it works . . .
SELECT year, production, ntile(4)
OVER (ORDER BY production)
FROM t_oil WHERE country = 'USA' ORDER BY 3, 2 DESC;
year | production | ntile
------+------------+-------
2000 | 7733 | 1
1999 | 7731 | 1
...
1966 | 9579 | 2
1989 | 9159 | 2
...
1972 | 11185 | 4
Work can proceed from there
SELECT ntile, min(production), max(production)
FROM ( SELECT year, production, ntile(4)
OVER (ORDER BY production)
FROM t_oil WHERE country = 'USA') AS x
GROUP BY 1 ORDER BY 1;
The query returns nice quantiles
ntile | min | max
-------+-------+-------
1 | 6734 | 7733
2 | 8011 | 9579
3 | 9736 | 10231
4 | 10247 | 11297
(4 rows)
Moving averages
- More sophisticated frame-clauses are needed
- The average is done for 2 years = current + previous one
SELECT *, avg(production) OVER (ORDER BY year ROWS
BETWEEN 1 PRECEDING AND 0 FOLLOWING)
FROM t_oil WHERE country = 'Saudi Arabien';
country | year | production | avg
---------------+------+------------+------------
Saudi Arabien | 1965 | 2219 | 2219.0000
Saudi Arabien | 1966 | 2615 | 2417.0000
Saudi Arabien | 1967 | 2825 | 2720.0000
Combining joins, aggregates, and windowing
Combining data
- To combine data we need to import some more data
CREATE TABLE t_president
(name text,
start_year int,
end_year int,
party text);
Some input data
- A list of American presidents and their presidencies
test=# COPY t_president FROM PROGRAM
'curl www.cybertec.at/secret/president.txt';
COPY 9
- The format is not too nice for analysis
Input data: American presidents
SELECT * FROM t_president ;
name | start_year | end_year | party
-------------------+------------+----------+------------
Lyndon B. Johnson | 1963 | 1969 | Democrat
Richard M. Nixon | 1969 | 1974 | Republican
Gerald Ford | 1974 | 1977 | Republican
Jimmy Carter | 1977 | 1981 | Democrat
Ronald W. Reagan | 1981 | 1989 | Republican
George H. W. Bush | 1989 | 1993 | Republican
Bill Clinton | 1993 | 2001 | Democrat
George W. Bush | 2001 | 2009 | Republican
Barack Obama | 2009 | 2017 | Democrat
The challenge: Adjust the format
- LATERAL can come to the rescue
SELECT name, party, year
FROM t_president AS x,
LATERAL (SELECT * FROM
generate_series(x.start_year, x.end_year - 1)
AS year) AS y
LIMIT 10;
The output is:
name | party | year
-------------------+------------+------
Lyndon B. Johnson | Democrat | 1963
Lyndon B. Johnson | Democrat | 1964
Lyndon B. Johnson | Democrat | 1965
Lyndon B. Johnson | Democrat | 1966
Lyndon B. Johnson | Democrat | 1967
Lyndon B. Johnson | Democrat | 1968
Richard M. Nixon | Republican | 1969
Richard M. Nixon | Republican | 1970
Richard M. Nixon | Republican | 1971
Richard M. Nixon | Republican | 1972
Which party is better for oil?
- The following way to solve the problem is definitely not the only one.
- There might be other factors than the party of the president when it comes to this kind of data.
- Keep in mind: it is just an SQL exercise
Putting things together (1)
CREATE VIEW v AS
WITH b AS (
SELECT name, party, year
FROM t_president AS x,
LATERAL (SELECT * FROM generate_series(
x.start_year,
x.end_year - 1) AS year) AS y)
SELECT a.*, party,
production - lag(production, 1)
OVER (ORDER BY a.year) AS lag
FROM t_oil AS a, b
WHERE a.year = b.year AND country = ’USA’;
What we got so far
SELECT * FROM v;
country | year | production | party | lag
---------+------+------------+------------+------
USA | 1965 | 9014 | Democrat |
USA | 1966 | 9579 | Democrat | 565
USA | 1967 | 10219 | Democrat | 640
USA | 1968 | 10600 | Democrat | 381
USA | 1969 | 10828 | Republican | 228
USA | 1970 | 11297 | Republican | 469
Making use of NULL
- Remember: NULL is ignored by aggregate functions
- We can use that to do ‘partial counts’
SELECT party, lag,
CASE WHEN lag > 0 THEN 1 END AS up,
CASE WHEN lag < 0 THEN 1 END AS down
FROM v
ORDER BY year;
Which gives us . . .
party | lag | up | down
------------+------+----+------
Democrat | | |
Democrat | 565 | 1 |
Democrat | 640 | 1 |
Democrat | 381 | 1 |
Republican | 228 | 1 |
Republican | 469 | 1 |
Republican | -141 | | 1
Republican | 29 | 1 |
Republican | -239 | | 1
Republican | -485 | | 1
We can move on from there easily
SELECT party,
count(CASE WHEN lag > 0 THEN 1 END) AS up,
count(CASE WHEN lag < 0 THEN 1 END) AS down
FROM v
GROUP BY party;
party | up | down
------------+----+------
Democrat | 9 | 8
Republican | 10 | 18
(2 rows)
Handling missing data
Preparing our sample data
test=# UPDATE t_oil
SET production = NULL
WHERE year IN (1998, 1999)
AND country = 'USA' RETURNING *;
country | year | production
---------+------+------------
USA | 1998 |
USA | 1999 |
(2 rows)
Challenges ahead
- How can we make lead and lag work again?
- How can we fill the gaps?
- How can we control the behavior in a more efficient way?
Turning to frame-clauses once again
- One idea is to just use the average of some previous values
- However, you might also want to turn to interpolation or outright guesswork
- A custom aggregate might help
A ‘lazy’ idea
- Creating an array with some historic values
- Applying a function on this array
SELECT year, production, array_agg(production)
OVER (ORDER BY year ROWS BETWEEN 3 PRECEDING
AND 0 FOLLOWING)
FROM t_oil
WHERE country = 'USA';
Which gives us . . .
... snip ...
1995 | 8322 | {8868,8583,8389,8322}
1996 | 8295 | {8583,8389,8322,8295}
1997 | 8269 | {8389,8322,8295,8269}
1998 | | {8322,8295,8269,NULL}
1999 | | {8295,8269,NULL,NULL}
2000 | 7733 | {8269,NULL,NULL,7733}
2001 | 7669 | {NULL,NULL,7733,7669}
2002 | 7626 | {NULL,7733,7669,7626}
2003 | 7400 | {7733,7669,7626,7400}
... snip ...
Applying a function
- A simple function could look like this:
SELECT avg(x)
FROM unnest('{8295,8269,NULL,NULL}'::int4[]) AS x;
avg
-----------------------
8282.0000000000000000
(1 row)
A query could therefore look like this
SELECT *, (SELECT avg(x) FROM unnest(array_agg) AS x)
FROM (SELECT year, production, array_agg(production)
OVER (ORDER BY year ROWS BETWEEN 3 PRECEDING
AND 0 FOLLOWING)
FROM t_oil WHERE country = 'USA') AS y
OFFSET 32 LIMIT 4;
year | production | array_agg | avg
------+------------+-----------------------+-------------
1997 | 8269 | {8389,8322,8295,8269} | 8318.750000
1998 | | {8322,8295,8269,NULL} | 8295.333333
1999 | | {8295,8269,NULL,NULL} | 8282.000000
2000 | 7733 | {8269,NULL,NULL,7733} | 8001.000000
Defining an aggregate
- Defining an aggregate is really the more desirable way
- It is much cleaner
- CREATE AGGREGATE is your friend
A simple example
- The aggregate can be created like this:
CREATE FUNCTION my_final(int[]) RETURNS numeric AS
$$
SELECT avg(x) FROM unnest($1) AS x;
$$ LANGUAGE sql;
CREATE AGGREGATE artificial_avg(int) (
SFUNC = array_append,
STYPE = int[],
INITCOND = '{}',
FINALFUNC = my_final
);
Using our new aggregate
SELECT year, production, artificial_avg(production)
OVER (ORDER BY year
ROWS BETWEEN 3 PRECEDING AND 0 FOLLOWING)
FROM t_oil WHERE country = 'USA';
- The aggregate can be used just like any other aggregate in the system
Finally
Thank you for your attention
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
www.postgresql-support.de