Cybertec Training: Data Analysis
Hans-Jürgen Schönig
www.postgresql-support.de
Introduction
Scope of this training
- Importing data
- Simple aggregations
- Windowing and analytics
- Analyzing time series
- Managing incomplete data
- Writing custom aggregates
Importing data
Loading data
Things to consider when importing data
- There are many ways to import data
- Avoid mini-transactions for performance reasons
- In case of large data sets, speed is a major issue
- There is a life after importing
Importing a simple data set
A simple data structure . . .
test=# CREATE TABLE t_test (a int, b int);
CREATE TABLE
Let us add 10,000 rows now . . .
INSERT INTO t_test VALUES (1, 2);
...
INSERT INTO t_test VALUES (1, 2);
Using one transaction to import things
BEGIN;
INSERT INTO t_test VALUES (1, 2);
...
INSERT INTO t_test VALUES (1, 2);
COMMIT;
Observations
- Performance can vary depending on hardware
- Longer transactions can be WAY faster
- PostgreSQL has to flush every transaction to disk
- Most of the time is burned by flushing
Changing durability requirements
- Performance will skyrocket . . .
SET synchronous_commit TO off;
INSERT INTO t_test VALUES (1, 2);
...
INSERT INTO t_test VALUES (1, 2);
- The reason is that PostgreSQL does not have to flush every transaction anymore.
- Trading “durability” for performance
Use bulk loads
- Loading single rows is usually a bad idea
- Use COPY to do bulk loading
- COPY can load data A LOT faster than INSERT due to significantly smaller overhead
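A practical aside: when the input file lives on the client rather than on the database server, psql's \copy meta-command offers the same bulk path over the connection (the file name here is just a placeholder):

\copy t_test FROM 'data.txt'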
A simple COPY
This time 10 million rows are imported (10,000 rows are not enough):
COPY t_test FROM stdin;
1 2
1 2
...
\.
Note the performance difference (rows per second). There is no need to check column lists, existence of the table, etc. anymore -> higher throughput
COPY: Observations to be made
- In the default configuration (checkpoint_segments = 3) you will see a steady up and down of I/O speed
- This is caused by checkpoints happening in the background
- Data has to go to the transaction log to “repair” data files in case of a crash
- Performance is limited by writing data twice
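These swings can be flattened by increasing the distance between checkpoints before a bulk load. The values below are illustrative only; note that checkpoint_segments was replaced by max_wal_size in PostgreSQL 9.5:

# postgresql.conf -- illustrative values, tune for your hardware
checkpoint_segments = 32             # pre-9.5 releases
# max_wal_size = 4GB                 # 9.5 and later
checkpoint_completion_target = 0.9   # spread checkpoint I/O over time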
COPY: Using a transaction log bypass
- Writing to the transaction log can be avoided in some cases:
BEGIN;
TRUNCATE t_test;
COPY t_test FROM stdin;
...
COMMIT;
COPY: Why the bypass works
- TRUNCATE will schedule the removal of the data file on COMMIT
- COPY will start writing to a new data file
- Concurrency is not an issue because TRUNCATE locks the table
- PostgreSQL can take the old or the new data file on COMMIT or ROLLBACK
- No need to actually repair a data file anymore
COPY: More on WAL-bypassing
- Creating the table in the same transaction also allows the bypass:

BEGIN;
CREATE TABLE ...

- WAL bypassing only works if you are not using streaming replication
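Spelled out, the pattern might look like this (table name and data are placeholders):

BEGIN;
CREATE TABLE t_import (a int, b int); -- created in this transaction
COPY t_import FROM stdin;             -- these rows can skip the WAL
1 2
...
\.
COMMIT;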
Freezing rows
- Before you import data, run . . .

ALTER TABLE t_test
SET (autovacuum_enabled = off);

- (Disabling autovacuum ensures that nothing sets the bits in the background while we test.)
- Compare the timing of the first

SELECT count(*) FROM t_test;

with the second one.
Observations
- The first run is a lot slower
- During the first run, writes will happen
- No more writes from the second run on
- PostgreSQL sets bits in the background
The purpose of hint bits
- When a row is read for the first time, PostgreSQL checks whether it can be seen by everybody.
- This bit is set to make sure that PostgreSQL does not have to go through expensive visibility checks next time.
- This is an issue for big data sets
Fixing after-import performance
- To set hint bits straight away do a

test=# COPY t_test FROM '/tmp/file.txt' FREEZE;

- Be careful. It only works in certain cases . . .

ERROR: cannot perform FREEZE because the table
was not created or truncated in the
current subtransaction
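A working variant therefore truncates (or creates) the table in the same transaction, for example:

BEGIN;
TRUNCATE t_test;
COPY t_test FROM '/tmp/file.txt' FREEZE;
COMMIT;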
VACUUM and hint bits
- Bits can be set for an entire block as well (not just for rows)
- VACUUM will set those hint bits
- However, block-level bits usually do not speed things up as much as row-level bits
- For a heavily used read-only database system, vacuuming data can actually make sense (not to reclaim space)
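On a freshly loaded, read-mostly table an explicit run is therefore worth considering:

VACUUM t_test;   -- sets the bits ahead of time instead of during the first reads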
Importing some test data:
Here is some test data:
CREATE TABLE t_oil (country text,
year int,
production int);
COPY t_oil FROM PROGRAM
'curl www.cybertec.at/secret/oil.txt';
Simple aggregations
Basic aggregation
test=# SELECT country, avg(production)
FROM t_oil
GROUP BY 1;
country | avg
---------------+-----------------------
USA | 9141.3478260869565217
Saudi Arabien | 7641.8260869565217391
(2 rows)
GROUP BY is needed for aggregates
- A GROUP BY clause is needed because otherwise groups cannot be built:
test=# SELECT country, avg(production) FROM t_oil;
ERROR: column "t_oil.country" must appear in the
GROUP BY clause or be used in an
aggregate function
LINE 1: SELECT country, avg(production) FROM t_oil;
HAVING: Filtering on aggregated data
test=# SELECT country, avg(production)
FROM t_oil
GROUP BY 1
HAVING avg(production) > 8000;
country | avg
---------+-----------------------
USA | 9141.3478260869565217
(1 row)
- NOTE that an alias is not allowed in a HAVING clause
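For instance, the following is rejected; the alias (chosen here purely for illustration) is not visible to HAVING:

SELECT country, avg(production) AS avg_prod
FROM t_oil
GROUP BY 1
HAVING avg_prod > 8000;
ERROR: column "avg_prod" does not exist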
Windowing and analytics
The purpose of windowing
An analogy:

- Your car is not a valuable car because it is good
- It is valuable because it is better than the ones driven by your friends

This is what windowing does: it puts the current row in relation to all rows in the reference group.
Windowing vs GROUP BY
- GROUP BY has been designed to reduce the amount of data and turn it into aggregated values
- Windowing is used to compare values and put them into relation
- Windowing is used along with aggregate functions (e.g. sum, count, avg, min, max, . . . )
A simple aggregate: Average values
SELECT *, avg(production) OVER () FROM t_oil ;
country | year | production | avg
---------------+------+------------+---------------
USA | 1965 | 9014 | 8391.58695652
USA | 1966 | 9579 | 8391.58695652
USA | 1967 | 10219 | 8391.58695652
USA | 1968 | 10600 | 8391.58695652
USA | 1969 | 10828 | 8391.58695652
...
What does the result mean?
- ‘Give me all rows and the average “over” all rows in the table’
- Logically it is the same as . . .
SELECT *, (SELECT avg(production) FROM t_oil) AS avg
FROM t_oil;
- However, subselects can be very nasty if the task is a more complex one
OVER()-clauses can define order
- Calculate max production up to a certain point
SELECT *, max(production) OVER (ORDER BY year)
FROM t_oil
WHERE country = 'Saudi Arabien';
- Saudi Arabia is a so-called ‘swing producer’.
- Note that max stays up even if production declines
OVER()-clauses can form groups
- Averages for each country
SELECT *, avg(production)
OVER (PARTITION BY country)
FROM t_oil;
country | year | production | avg
---------------+------+------------+---------------
Saudi Arabien | 1965 | 2219 | 7641.82608695
Saudi Arabien | 1966 | 2615 | 7641.82608695
...
USA | 1965 | 9014 | 9141.34782608
USA | 1966 | 9579 | 9141.34782608
...
Forming groups
- Data is split into groups
- Each row shows the average of all rows in its group
- Note that we got one group (= window) per country
OVER() can contain order and groups
SELECT *, max(production)
OVER (PARTITION BY country ORDER BY year)
FROM t_oil;
- In this case we get the maximum up to a given point
- This is done for each country
Abstracting window-clauses
SELECT *,
min(production) OVER (w),
max(production) OVER (w),
count(production) OVER (w)
FROM t_oil
WINDOW w AS (PARTITION BY country ORDER BY year);

- The same clause can be used for many columns
- Many window-clauses may exist (w, w2, w3, etc.)
rank() and dense_rank()

- Data can be ranked according to some order
- In case of duplicates:
  - rank gives 1, 2, 2, 2, 5
  - dense_rank gives 1, 2, 2, 2, 3
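A minimal sketch against t_oil showing both functions side by side (output omitted):

SELECT year, production,
    rank() OVER (ORDER BY production DESC),
    dense_rank() OVER (ORDER BY production DESC)
FROM t_oil WHERE country = 'USA';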
Moving rows: lag

- ORDER BY defines into which direction to “move” the row
- The number defines the offset
SELECT *, lag(production, 1) OVER (ORDER BY year)
FROM t_oil WHERE country = 'USA';
country | year | production | lag
---------+------+------------+-------
USA | 1965 | 9014 |
USA | 1966 | 9579 | 9014
USA | 1967 | 10219 | 9579
USA | 1968 | 10600 | 10219
Calculating the change in production
- This is now a very easy thing to do
SELECT *, production - lag(production, 1)
OVER (ORDER BY year)
FROM t_oil WHERE country = 'USA';
country | year | production | ?column?
---------+------+------------+----------
USA | 1965 | 9014 |
USA | 1966 | 9579 | 565
USA | 1967 | 10219 | 640
USA | 1968 | 10600 | 381
lead is the opposite of lag
- lag is the same as ‘lead(. . . , -1)’
- lag pushes elements down
- lead pushes elements up
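For symmetry with the lag query above, a quick sketch: lead shows next year's production on the current row:

SELECT *, lead(production, 1) OVER (ORDER BY year)
FROM t_oil WHERE country = 'USA';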
moving entire rows
SELECT *, lag(t_oil, 1) OVER (ORDER BY year)
FROM t_oil WHERE country = 'USA';
country | year | production | lag
---------+------+------------+------------------
USA | 1965 | 9014 |
USA | 1966 | 9579 | (USA,1965,9014)
USA | 1967 | 10219 | (USA,1966,9579)
USA | 1968 | 10600 | (USA,1967,10219)
- The composite type can then be dissected using a subselect in the current query
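A sketch of such a dissection; the alias prev is chosen for illustration, and the parentheses around it are required to access the composite's fields:

SELECT (prev).year, (prev).production
FROM (SELECT *, lag(t_oil, 1) OVER (ORDER BY year) AS prev
      FROM t_oil WHERE country = 'USA') AS x;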
works for more than just one column
SELECT *, lag((year, production), 1)
OVER (ORDER BY year)
FROM t_oil WHERE country = 'USA';
country | year | production | lag
---------+------+------------+--------------
USA | 1965 | 9014 |
USA | 1966 | 9579 | (1965,9014)
USA | 1967 | 10219 | (1966,9579)
USA | 1968 | 10600 | (1967,10219)
- This is the perfect foundation to build custom aggregates to solve complex problems
Splitting data into equal parts
- ntile can split your data into n equally sized blocks
- ntile(4) will therefore give you a nice quantile distribution
- Order is needed to achieve that
Here is how it works . . .
SELECT year, production, ntile(4)
OVER (ORDER BY production)
FROM t_oil WHERE country = 'USA' ORDER BY 3, 2 DESC;
year | production | ntile
------+------------+-------
2000 | 7733 | 1
1999 | 7731 | 1
...
1966 | 9579 | 2
1989 | 9159 | 2
...
1972 | 11185 | 4
Work can proceed from there
SELECT ntile, min(production), max(production)
FROM ( SELECT year, production, ntile(4)
OVER (ORDER BY production)
FROM t_oil WHERE country = 'USA') AS x
GROUP BY 1 ORDER BY 1;
The query returns nice quantiles
ntile | min | max
-------+-------+-------
1 | 6734 | 7733
2 | 8011 | 9579
3 | 9736 | 10231
4 | 10247 | 11297
(4 rows)
Moving averages
- More sophisticated frame-clauses are needed
- The average is done for 2 years = current + previous one
SELECT *, avg(production) OVER (ORDER BY year ROWS
BETWEEN 1 PRECEDING AND 0 FOLLOWING)
FROM t_oil WHERE country = 'Saudi Arabien';
country | year | production | avg
---------------+------+------------+------------
Saudi Arabien | 1965 | 2219 | 2219.0000
Saudi Arabien | 1966 | 2615 | 2417.0000
Saudi Arabien | 1967 | 2825 | 2720.0000
Combining joins, aggregates, and windowing
Combining data
- To combine data we need to import some more data
CREATE TABLE t_president
(name text,
start_year int,
end_year int,
party text);
Some input data
- A list of American presidents and their presidencies
test=# COPY t_president FROM PROGRAM
'curl www.cybertec.at/secret/president.txt';
COPY 9
- The format is not too nice for analysis
Input data: American presidents
SELECT * FROM t_president ;
name | start_year | end_year | party
-------------------+------------+----------+------------
Lyndon B. Johnson | 1963 | 1969 | Democrat
Richard M. Nixon | 1969 | 1974 | Republican
Gerald Ford | 1974 | 1977 | Republican
Jimmy Carter | 1977 | 1981 | Democrat
Ronald W. Reagan | 1981 | 1989 | Republican
George H. W. Bush | 1989 | 1993 | Republican
Bill Clinton | 1993 | 2001 | Democrat
George W. Bush | 2001 | 2009 | Republican
Barack Obama | 2009 | 2017 | Democrat
The challenge: Adjust the format
- LATERAL can come to the rescue
SELECT name, party, year
FROM t_president AS x,
LATERAL (SELECT * FROM
generate_series(x.start_year, x.end_year - 1)
AS year) AS y
LIMIT 10;
The output is:
name | party | year
-------------------+------------+------
Lyndon B. Johnson | Democrat | 1963
Lyndon B. Johnson | Democrat | 1964
Lyndon B. Johnson | Democrat | 1965
Lyndon B. Johnson | Democrat | 1966
Lyndon B. Johnson | Democrat | 1967
Lyndon B. Johnson | Democrat | 1968
Richard M. Nixon | Republican | 1969
Richard M. Nixon | Republican | 1970
Richard M. Nixon | Republican | 1971
Richard M. Nixon | Republican | 1972
Which party is better for oil?
- The following way to solve the problem is definitely not the only one.
- There might be other factors than the party of the president when it comes to this kind of data.
- Keep in mind: it is just an SQL exercise
Putting things together (1)
CREATE VIEW v AS
WITH b AS (
SELECT name, party, year
FROM t_president AS x,
LATERAL (SELECT * FROM generate_series(
x.start_year,
x.end_year - 1) AS year) AS y)
SELECT a.*, party,
production - lag(production, 1)
OVER (ORDER BY a.year) AS lag
FROM t_oil AS a, b
WHERE a.year = b.year AND country = ’USA’;
What we got so far
SELECT * FROM v;
country | year | production | party | lag
---------+------+------------+------------+------
USA | 1965 | 9014 | Democrat |
USA | 1966 | 9579 | Democrat | 565
USA | 1967 | 10219 | Democrat | 640
USA | 1968 | 10600 | Democrat | 381
USA | 1969 | 10828 | Republican | 228
USA | 1970 | 11297 | Republican | 469
Making use of NULL
- Remember: NULL is ignored by aggregate functions
- We can use that to do ‘partial counts’
SELECT party, lag,
CASE WHEN lag > 0 THEN 1 END AS up,
CASE WHEN lag < 0 THEN 1 END AS down
FROM v
ORDER BY year;
Which gives us . . .
party | lag | up | down
------------+------+----+------
Democrat | | |
Democrat | 565 | 1 |
Democrat | 640 | 1 |
Democrat | 381 | 1 |
Republican | 228 | 1 |
Republican | 469 | 1 |
Republican | -141 | | 1
Republican | 29 | 1 |
Republican | -239 | | 1
Republican | -485 | | 1
We can move on from there easily
SELECT party,
count(CASE WHEN lag > 0 THEN 1 END) AS up,
count(CASE WHEN lag < 0 THEN 1 END) AS down
FROM v
GROUP BY party;
party | up | down
------------+----+------
Democrat | 9 | 8
Republican | 10 | 18
(2 rows)
Handling missing data
Preparing our sample data
test=# UPDATE t_oil
SET production = NULL
WHERE year IN (1998, 1999)
AND country = 'USA' RETURNING *;
country | year | production
---------+------+------------
USA | 1998 |
USA | 1999 |
(2 rows)
Challenges ahead
- How can we make lead and lag work again?
- How can we fill the gaps?
- How can we control the behavior in a more efficient way?
Turning to frame-clauses once again
- One idea is to just use the average of some previous values
- However, you might also want to turn to interpolation or outright guesswork
- A custom aggregate might help
A ‘lazy’ idea
- Creating an array with some historic values
- Applying a function on this array
SELECT year, production, array_agg(production)
OVER (ORDER BY year ROWS BETWEEN 3 PRECEDING
AND 0 FOLLOWING)
FROM t_oil
WHERE country = 'USA';
Which gives us . . .
... snip ...
1995 | 8322 | {8868,8583,8389,8322}
1996 | 8295 | {8583,8389,8322,8295}
1997 | 8269 | {8389,8322,8295,8269}
1998 | | {8322,8295,8269,NULL}
1999 | | {8295,8269,NULL,NULL}
2000 | 7733 | {8269,NULL,NULL,7733}
2001 | 7669 | {NULL,NULL,7733,7669}
2002 | 7626 | {NULL,7733,7669,7626}
2003 | 7400 | {7733,7669,7626,7400}
... snip ...
Applying a function
- A simple function could look like this:
SELECT avg(x)
FROM unnest('{8295,8269,NULL,NULL}'::int4[]) AS x;
avg
-----------------------
8282.0000000000000000
(1 row)
A query could therefore look like this
SELECT *, (SELECT avg(x) FROM unnest(array_agg) AS x)
FROM (SELECT year, production, array_agg(production)
OVER (ORDER BY year ROWS BETWEEN 3 PRECEDING
AND 0 FOLLOWING)
FROM t_oil WHERE country = 'USA') AS y
OFFSET 32 LIMIT 4;
year | production | array_agg | avg
------+------------+-----------------------+-------------
1997 | 8269 | {8389,8322,8295,8269} | 8318.750000
1998 | | {8322,8295,8269,NULL} | 8295.333333
1999 | | {8295,8269,NULL,NULL} | 8282.000000
2000 | 7733 | {8269,NULL,NULL,7733} | 8001.000000
Defining an aggregate
- Defining an aggregate is really the more desirable way
- It is much cleaner
- CREATE AGGREGATE is your friend
A simple example
- The aggregate can be created like this:
CREATE FUNCTION my_final(int[]) RETURNS numeric AS
$$
SELECT avg(x) FROM unnest($1) AS x;
$$ LANGUAGE sql;
CREATE AGGREGATE artificial_avg(int) (
SFUNC = array_append,
STYPE = int[],
INITCOND = '{}',
FINALFUNC = my_final
);
Using our new aggregate
SELECT year, production, artificial_avg(production)
OVER (ORDER BY year
ROWS BETWEEN 3 PRECEDING AND 0 FOLLOWING)
FROM t_oil WHERE country = 'USA';
- The aggregate can be used just like any other aggregate in the system
Finally
Thank you for your attention
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
www.postgresql-support.de