Alexei Krasner, Nov 2015
PostgreSQL as MSSQL Alternative
What is PostgreSQL
▪ Powerful, open source object-relational database system.
▪ More than 15 years of active development and a strong reputation.
▪ Runs on all major operating systems (Linux, Unix, Mac OS, Windows…).
▪ Enterprise class database.
▪ Large and responsive community.
▪ Winner of the 2015 Database Trends and Applications Readers' Choice Awards:
  – The most advanced open source database.
  – Best relational database.
Let's Start With Standards
▪ Fully ACID compliant.
▪ Includes most of SQL:2008 data types along with storage of binary objects.
▪ Conforms to the ANSI-SQL:2008 standard:
  – Full support for subqueries (including sub-selects).
  – Read-Committed and serializable transaction isolation levels.
  – Full support for Primary Keys, Foreign Keys, Joins, Views, Triggers, Stored Procedures, Constraints (check, unique and not null) and Cascading.
  – Fully relational system catalog – multiple schemas per database.
▪ Native programming interfaces: Java, .NET, C/C++, Perl, Python, ODBC
Continue With a Little Splurging
▪ Multi-Version Concurrency Control (MVCC).
▪ Asynchronous Replication, Load Balancing and Online/Hot Backups with Point in Time Recovery.
▪ Write Ahead Logging – fault tolerance.
▪ Performance:
  – Sophisticated Query Planner/Optimizer.
  – Compound, Unique, Partial and Functional indexes.
▪ Supports:
  – International character sets, multi-byte encodings, Unicode, locale awareness.
  – Built-in Types – Geospatial, XML, JSON/JSONB, Ranges and Arrays!
  – NoSQL – Key-Value store with incredible performance and Full Text Search.
▪ Highly customizable and extensible.
Before We Dive – Generalized Search Tree (GiST)
▪ Advanced indexing system – different sorting and searching algorithms:
  – B-tree, B+-tree, R-tree, Partial Sum trees, ranked B+-trees etc.
  – API for creating custom data types and extensible query methods for search.
▪ Decide WHAT to persist, HOW to persist and a way to SEARCH for it.
▪ Goes beyond the general search algorithms of standard B-trees and R-trees.
▪ Foundation for many public projects – OpenFTS and PostGIS
Features Deep Dive
▪ MVCC
▪ Partitioning
▪ Useful Data Types
  – Date and Time
  – Interval
  – Array
  – Ranges
  – JSON
  – HSTORE
  – XML
▪ PostGIS – Geographic
▪ Full Text Search
▪ Server Side Programming
▪ Backup and Restore
▪ High Availability, Load Balancing and Replication
  – Sharding
▪ Big Data Readiness
Multi Version Concurrency Control - MVCC
▪ Reads should never block writes and vice versa.
▪ Each transaction sees a snapshot of data (version).
  – Protection from viewing inconsistent data – transaction isolation.
▪ Avoidance of explicit locking solutions – minimizes lock contention.
▪ Table- and row-level locking is still available, although proper MVCC usage generally performs better.
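The snapshot idea can be sketched with a toy model. This is a simplified illustration, not PostgreSQL's actual visibility logic: it assumes every transaction commits, that transaction ids are assigned in commit order, and that a snapshot is identified by a single transaction id. `RowVersion`, `visible`, `read` and `update` are hypothetical names; `xmin` and `xmax` borrow the names of PostgreSQL's row system columns.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowVersion:
    value: str
    xmin: int                   # id of the transaction that created this version
    xmax: Optional[int] = None  # id of the transaction that superseded it, if any


def visible(version: RowVersion, snapshot_xid: int) -> bool:
    """True if the version existed, and was not yet superseded, at the snapshot."""
    created = version.xmin <= snapshot_xid
    still_current = version.xmax is None or version.xmax > snapshot_xid
    return created and still_current


def read(versions: List[RowVersion], snapshot_xid: int) -> Optional[str]:
    """Return the value of the row as seen by the given snapshot."""
    for version in versions:
        if visible(version, snapshot_xid):
            return version.value
    return None


def update(versions: List[RowVersion], new_value: str, xid: int) -> None:
    """Writers never overwrite in place: they close the old version and add a new one."""
    for version in versions:
        if version.xmax is None:
            version.xmax = xid
    versions.append(RowVersion(new_value, xmin=xid))


row = [RowVersion("draft", xmin=1)]
update(row, "final", xid=5)
old_view = read(row, snapshot_xid=3)  # a reader that started before the update
new_view = read(row, snapshot_xid=5)  # a reader that started after it
```

Because the old version is kept rather than overwritten, the earlier reader still gets a consistent answer without waiting on the writer.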
Partitioning – Table Inheritance
▪ Support of basic table partitioning via the table inheritance concept.
  – Includes known partitioning benefits:
    ▪ Improved heavy load query performance (on a single partition).
    ▪ Sequential scan of a partition instead of index usage.
    ▪ Bulk loads and deletes accomplished by adding or removing partitions.
    ▪ Infrequently used data can be migrated to a cheaper/slower storage solution.
  – Range Partitioning:
    ▪ Table partitioned into “ranges” defined by a key column or set of columns (e.g. dates).
  – List Partitioning:
    ▪ Table partitioned by a list of discrete values used as partitioning keys.
  – About a hundred partitions is an acceptable limit; thousands of partitions will severely harm performance.
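Range partitioning boils down to mapping a key value to the one partition whose range contains it. A minimal sketch of that routing step follows, assuming monthly partitions with half-open date ranges; the partition names and the `partition_for` helper are hypothetical and do not reflect how PostgreSQL implements it internally.

```python
from bisect import bisect_right
from datetime import date

# Lower bounds of each partition, in ascending order. The last entry is only
# the upper bound of the final partition.
BOUNDS = [date(2015, 1, 1), date(2015, 2, 1), date(2015, 3, 1), date(2015, 4, 1)]
# Hypothetical child tables: NAMES[i] covers BOUNDS[i] <= day < BOUNDS[i + 1].
NAMES = ["measurements_2015_01", "measurements_2015_02", "measurements_2015_03"]


def partition_for(day: date) -> str:
    """Return the name of the partition whose range contains the given day."""
    index = bisect_right(BOUNDS, day) - 1
    if index < 0 or index >= len(NAMES):
        raise ValueError(f"no partition covers {day}")
    return NAMES[index]


target = partition_for(date(2015, 2, 14))
```

A query restricted to February only has to touch `measurements_2015_02`, which is where the single-partition performance benefit comes from.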
Useful Data Types
▪ Date and Time – Date, Time, TimeStamp and TimeStamp with time zone.
  – Converted to and from Unix time.
  – Supports the INTERVAL type.
  – Very convenient casting and conversion to text.
  – Efficient searching and sorting (including zone/offset).
▪ INTERVAL – representation of a period of time.
  – Negative interval values are possible (e.g. a year ago).
  – Intuitive arithmetic and persistence of time durations.
  – Easy casting and converting to relevant types.
  – Efficient searching and sorting on intervals.
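For readers more at home in application code, the same ideas can be illustrated with Python's standard `datetime` module. This is only an analogy: `timedelta` is not PostgreSQL's INTERVAL (it has no month or year units), and the variable names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# A timezone-aware timestamp (analogous to a timestamp with time zone).
created_at = datetime(2015, 11, 1, 12, 0, tzinfo=timezone.utc)

# Convert to Unix time (seconds since 1970-01-01 UTC) and back again.
unix_seconds = created_at.timestamp()
restored = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# A negative duration: "one week ago" relative to created_at.
one_week_ago = created_at + timedelta(weeks=-1)

# Subtracting two timestamps yields a duration.
elapsed = created_at - one_week_ago

# Durations compare and sort like ordinary values, negative ones included.
durations = sorted([timedelta(hours=3), timedelta(days=-1), timedelta(minutes=90)])
```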
Useful Data Types Cont.
▪ Array – supported as a first-class datatype (actual field in a table).
  – Contains any datatype (nested arrays too).
  – Arrays can be passed as parameters to functions.
  – Usages – function results, aggregations, getting/setting arrays of data from/in the application.
▪ Range – supported as a first-class datatype.
  – Store a range over TIME, INT or NUMERIC values as a single data value.
  – Dedicated indexes can support queries utilizing ranges.
  – Exposed methods to define custom ranges.
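What "a range as a single data value" buys you is easiest to see with containment and overlap checks. The sketch below is a hypothetical `IntRange` class with an inclusive lower bound and an exclusive upper bound; it illustrates the concept only and is not PostgreSQL's range implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntRange:
    lower: int  # inclusive bound
    upper: int  # exclusive bound

    def is_empty(self) -> bool:
        return self.lower >= self.upper

    def contains(self, value: int) -> bool:
        """True if the value lies inside the half-open range."""
        return self.lower <= value < self.upper

    def overlaps(self, other: "IntRange") -> bool:
        """True if the two ranges share at least one value."""
        if self.is_empty() or other.is_empty():
            return False
        return self.lower < other.upper and other.lower < self.upper


morning = IntRange(8, 12)    # e.g. a room booked from 8:00 up to, not including, 12:00
afternoon = IntRange(12, 17)
late_meeting = IntRange(11, 13)

back_to_back = morning.overlaps(afternoon)  # adjacent bookings do not collide
conflict = morning.overlaps(late_meeting)
```

Half-open bounds make adjacent ranges fit together without counting the shared endpoint twice, which is why they are a common choice for bookings and time slots.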
Useful Data Types Cont.
▪ JSON – full support along with a large dedicated set of utility functions.
  – Known JSON/JSONB benefits – data transfer and integration standard.
  – Transformation from/to types and tables.
  – Retrieval and construction of JSON data.
  – Parsing, casting and conversion.
▪ HSTORE – Fast key-value store as a datatype.
  – NoSQL capabilities – flexibility of a schema-less data store.
  – Still ACID compliant.
  – Interchange data between JSON and HSTORE.
Useful Data Types Cont.
▪ XML – Supported as a first-class datatype.
  – Well-formedness checks and type-safe operations.
  – Querying using XPath.
  – Producing XML content, Predicates, Processing, Mapping tables to XML etc.
PostGIS
▪ Fully featured, reliable geospatial database project based on GiST (following ISO/OGC standards).
▪ SQL types and functions to manage vector geometries (spatial data).
▪ Capabilities:
  – Support for three dimensional data.
  – Support for geospatial formats (KML, GeoJSON).
  – Processing and analytics functions for vector and raster data.
  – Map “rastering” and geo queries.
  – Geo searches and reverse geo searches.
▪ Hugely popular and respected extension module – often compared to ArcGIS
Full Text Search
▪ Online indexing of data and relevance ranking for database searches.
▪ Good Enough:
  – Stemming
  – Ranking
  – Multilingual
  – Fuzzy searches (misspellings) and accent handling.
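Stemming and ranking can be shown in miniature. The following is a deliberately naive sketch: a suffix-stripping `stem` function and a term-count `rank` function, both hypothetical and far cruder than the dictionaries and ranking functions a real full text search engine uses.

```python
import re
from collections import Counter
from typing import Dict, List


def stem(word: str) -> str:
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_terms(text: str) -> List[str]:
    """Lowercase the text, split it into words and stem each word."""
    return [stem(word) for word in re.findall(r"[a-z]+", text.lower())]


def rank(query: str, documents: Dict[int, str]) -> List[int]:
    """Return ids of matching documents, best match first."""
    query_terms = set(to_terms(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(to_terms(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Higher score first; ties broken by document id for a stable order.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = {
    1: "Indexing indexes quickly",
    2: "An index on a table",
    3: "Backups and restores",
}
results = rank("indexes", docs)
```

Because the query and the documents go through the same stemmer, a search for "indexes" also finds "index" and "indexing".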
Server Side Programming
▪ Super Extensible – functions, data types, procedural languages, operators, aggregates etc.
  – Embedding Functions and Stored Procedures using procedural languages: PL/pgSQL, PL/Tcl, PL/Perl, PL/Python
▪ Triggers – tables, views and foreign tables.
▪ Event Triggers – database global trigger.
▪ Rule System – Query modification based on given rules.
Backup and Restore
▪ Extremely flexible dump utility – migration, replication and backups become more reliable, controllable and configurable.
  – Compressed format or plain SQL (human readable).
  – Single table or whole database cluster.
▪ Approaches:
  – SQL Dump – file with generated SQL commands. On restore, the backed-up commands are replayed.
  – File system level backup – direct copy of PostgreSQL data files. Restore reattaches the data files.
  – Continuous archiving – backing up Write Ahead Log (WAL) files. On restore, the logged changes are replayed.
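The continuous archiving approach amounts to "base backup plus replayed log", and stopping the replay early gives point in time recovery. Here is a toy sketch of that idea, assuming a key-value state and a log of `(sequence number, operation, key, value)` tuples. The record format and the `replay` function are invented for illustration; real WAL records describe low-level changes to data pages, not logical key/value operations.

```python
from typing import Dict, List, Optional, Tuple

LogRecord = Tuple[int, str, str, Optional[int]]  # (sequence number, op, key, value)


def replay(base_backup: Dict[str, int],
           log: List[LogRecord],
           until: Optional[int] = None) -> Dict[str, int]:
    """Rebuild state from a base backup by applying log records in order.

    If `until` is given, stop after the record with that sequence number,
    which recovers the state as of that point.
    """
    state = dict(base_backup)  # never modify the backup itself
    for sequence, op, key, value in log:
        if until is not None and sequence > until:
            break
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


backup = {"a": 1}
log = [(1, "put", "b", 2), (2, "put", "a", 10), (3, "delete", "b", None)]

latest = replay(backup, log)              # full recovery
before_delete = replay(backup, log, until=2)  # recovery to just before record 3
```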
High Availability, Load Balancing and Replication

| Feature | Shared Disk Failover | File System Replication | Transaction Log Shipping | Trigger-Based Master-Standby Replication | Statement-Based Replication Middleware | Asynchronous Multimaster Replication | Synchronous Multimaster Replication |
|---|---|---|---|---|---|---|---|
| Most Common Implementation | NAS | DRBD | Streaming Repl. | Slony | pgpool-II | Bucardo | |
| Communication Method | shared disk | disk blocks | WAL | table rows | SQL | table rows | table rows and row locks |
| No special hardware required | | X | X | X | X | X | X |
| Allows multiple master servers | | | | | X | X | X |
| No master server overhead | X | | X | | X | | |
| No waiting for multiple servers | X | | with sync off | X | | X | |
| Master failure will never lose data | X | X | with sync on | | X | | X |
| Standby accept read-only queries | | | with hot standby | X | X | X | X |
| Per-table granularity | | | | X | | X | X |
| No conflict resolution necessary | X | X | X | X | | | X |
Sharding and Replication
▪ Pure Sharding:
  – pg_shard – popular sharding extension for PostgreSQL.
    ▪ Running on Linux!
  – BDR/UDR Project – Bi-Directional Replication which adds multi-master replication to PostgreSQL.
    ▪ Running on Linux! Migration to Windows is not expected in the near future.
    ▪ Forked from the main PostgreSQL source.
  – Postgres-XL – all-purpose, fully ACID, open source scale-out DB solution.
    ▪ Running on Linux!
    ▪ Forked from the main PostgreSQL source.
Sharding and Replication Cont.
▪ Via Replication:
  – Hot Standby – offloading read load from the Master to slaves (horizontal scale).
  – Streaming (or Bucardo, or another available option) replication to slaves.
  – Load balancing: “write” queries go to the Master, “read” queries go to slaves.
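The read/write split can be sketched as a small router. This is a hypothetical `QueryRouter` class, not the behavior of pgpool-II or any other real middleware, and it makes a strong simplifying assumption: a statement is a read only if it starts with SELECT. Real routing has to handle cases such as SELECT ... FOR UPDATE, functions with side effects and replication lag.

```python
from itertools import cycle
from typing import List


class QueryRouter:
    """Send writes to the master and spread reads over the replicas."""

    def __init__(self, master: str, replicas: List[str]) -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        self.master = master
        self._replicas = cycle(replicas)  # simple round-robin load balancing

    def route(self, sql: str) -> str:
        """Return the name of the server that should run the statement."""
        words = sql.split()
        if not words:
            raise ValueError("empty statement")
        if words[0].upper() == "SELECT":
            return next(self._replicas)
        return self.master


router = QueryRouter("master", ["replica1", "replica2"])
targets = [
    router.route("SELECT * FROM orders"),
    router.route("select count(*) from orders"),
    router.route("INSERT INTO orders VALUES (1)"),
    router.route("  SELECT 1"),
]
```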
PostgreSQL and Big Data
▪ PostgreSQL was used for large data volumes and complex analytics a decade before Hadoop launched (as the only purely open source option).
▪ Today heavily used in mid-sized warehouses and data-marts (1-10 TB).
▪ Code base for many big data systems:
  – Netezza (IBM).
  – Greenplum (Pivotal) – Open Source Massively Parallel Data Warehouse.
  – PipelineDB – open source, runs SQL queries continuously on streaming data.
  – EnterpriseDB and CitusDB (commercial license) – fully scaled out Postgres.
  – Redshift (Amazon).
▪ The PostgreSQL project continuously provides new features and better performance to support big data usage.
PostgreSQL and Big Data – Features
▪ Serious NoSQL database competitor.
  – Advanced JSON/JSONB features and an ongoing, massive development plan.
  – Extensions that provide a NoSQL-like API.
▪ Faster Sorts – text and long numeric sorting improvements.
▪ TABLESAMPLE – returns a pseudo-random sample of rows to provide a data glimpse for further analysis.
▪ Cubes, Rollups and Grouping Sets – summarizing and exploring huge data sets in the OLAP way.
▪ BRIN indexes – very small and fast to build; suited to TB-sized tables indexed on incrementally increasing fields (like timestamps or integers).
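The idea behind a block range index can be sketched in a few lines: keep only the minimum and maximum value per block of rows, and skip every block whose range cannot contain the value being searched. The functions below are hypothetical and operate on a plain Python list, so they show the principle rather than PostgreSQL's BRIN implementation.

```python
from typing import List, Tuple

Summary = Tuple[int, int, int]  # (start position of block, min value, max value)


def build_block_summaries(values: List[int], block_size: int) -> List[Summary]:
    """Summarize each block of rows by its minimum and maximum value."""
    summaries = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summaries.append((start, min(block), max(block)))
    return summaries


def search(values: List[int], summaries: List[Summary],
           block_size: int, target: int) -> Tuple[List[int], int]:
    """Return matching positions and the number of blocks actually scanned."""
    hits = []
    scanned_blocks = 0
    for start, low, high in summaries:
        if low <= target <= high:  # only this block can contain the target
            scanned_blocks += 1
            for offset, value in enumerate(values[start:start + block_size]):
                if value == target:
                    hits.append(start + offset)
    return hits, scanned_blocks


# Steadily increasing values, like an ever-growing timestamp or id column.
column = list(range(0, 1000, 2))
index = build_block_summaries(column, block_size=100)
positions, blocks_read = search(column, index, block_size=100, target=300)
```

With steadily increasing values the block ranges do not overlap, so a lookup reads one block out of five here while the index stores only five small summaries. On randomly ordered data most ranges would overlap and the benefit would largely disappear.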
PostgreSQL and Big Data – Features Cont.
▪ Foreign Data Wrappers – linking external data so it can be queried like local tables, for hybrid solutions.
  – Foreign schema import.
  – JOIN pushdowns.
▪ Vacuum (garbage collection of deleted rows) – became parallel with a multi-process mode (maintaining several large tables at once).
▪ Scaling UP – Multicore scalability improvements.
Enterprise Wise
▪ Open Source
▪ Reliability
▪ Authentication
▪ Logging
▪ Documentation
▪ Support
▪ Maintenance
Open Source
▪ Available under an open source license – the PostgreSQL License.
▪ Using, modifying and distributing in any open/closed form.
▪ Extending and patching the relational database per project/client etc.
▪ Variety of modules, extensions and tools based on its open source license.
Reliability
▪ PostgreSQL is relatively bug-free (compared to MSSQL).
▪ Very large community reporting bugs and providing fixes/workarounds.
▪ Constantly growing community
Authentication
▪ Trust Authentication.
▪ Password Authentication.
▪ GSSAPI/SSPI Authentication – using Kerberos.
▪ Ident Authentication.
▪ Peer Authentication.
▪ LDAP Authentication.
▪ RADIUS Authentication.
▪ Certificate Authentication.
▪ Pluggable Authentication Modules.
Logging
▪ Logs in one place.
  – Unlike MSSQL – error logs, event log, profiler log, agent log…
▪ Easily configurable logging level.
▪ Easily redirected to CSV files and shipped to tables.
▪ Easily redirected to System Log, Windows Event Log.
▪ Logs are human readable, with great sysadmin value.
Documentation
▪ There is nothing more to add than a link: http://www.postgresql.org/docs/
Support
▪ Community based support – and apparently a fast one too.
▪ Numerous companies specialized in enterprise support: http://www.postgresql.org/support/professional_support/
▪ Enterprise database management companies like: EnterpriseDB
▪ Total Cost of Ownership is significantly lower even with enterprise support (based on reports, e.g. Gartner 2015).
vs. MySQL
▪ Fully ACID compliant!
▪ Subqueries and Joins.
▪ Better locking mechanism.
▪ JSON/JSONB support.
▪ NoSQL and Key-Value store.
▪ Advanced GIS abilities.
▪ Full Text Search abilities.
▪ Advanced and attractive data types.
▪ Far better and more useful extensibility patterns.
▪ Licensing issues.
vs. PostgreSQL
▪ Partitioning based on table inheritance (pros and cons).
▪ Can be overkill for simple read-heavy operations. (Improved in newer versions.)
▪ Replication and Clustering (especially multi-master). Not “there” yet, but on the right track.
▪ Popularity – not as popular as MySQL (for example), but steadily gaining popularity, as opposed to MySQL.
▪ Expertise issues – different syntax and administration (compared to MSSQL).
THANK YOU