Forgive my stupid question.
I want to know what are the differences between postgreSQL database cluster (1 server, N databases, N ports) vs one server hosting multiple databases (1 server, N databases, 1 port)?
I have read the clustering here and also here and this too. I just don't get it. Even weirder to me (due to my lack of knowledge), on OpDash it says cluster can run different versions of postgres at the same time. :(
Any samples of database clustering usage in the real-world? Suppose I want to do High Availibility with Master - Slave replication, do I need database clustering for this?
please enlighten me or point me to a direction/article.
Thank you
-
1In Postgres terminology "one server with many databases" is called a cluster. "1 server, N databases, N ports" implies multiple instances and therefore multiple clusters. This has absolutely nothing to do with HA or replication. Please clarify what you're asking.mustaccio– mustaccio2022年10月19日 16:34:37 +00:00Commented Oct 19, 2022 at 16:34
-
@mustaccio . so, if I have 1 server with N databases (just normal databases, like HR, Accounting etc.. no Template0 nor Template1) in it and all databases sharing the same port (all databases are accessed via port 5432), is it called clustering too ? Thankspadjee– padjee2022年10月20日 03:30:34 +00:00Commented Oct 20, 2022 at 3:30
-
@padjee - I put in some extra bits to my answer which explain the cluster/database/schema setup in PostgreSQL a bit more - I've also included some links on HA.Vérace– Vérace2022年10月20日 14:32:18 +00:00Commented Oct 20, 2022 at 14:32
1 Answer 1
You ask:
I want to know what are the differences between postgreSQL database cluster (1 server, N databases, N ports) vs one server hosting multiple databases (1 server, N databases, 1 port)?
Short answer:
a PostgreSQL cluster is an instance (sensu Oracle) which can have 1 or many databases - a single instance operates on a single port - 1 server, 1-N databases, 1 port).
a given server can have multiple instances provided that:
(a) they are on different ports - 1 server, N instances, N ports - many databases (N times x if you wish)
(b) the instances have different data directories - potential corruption if not!
Long answer:
A PostgreSQL cluster is an "instance", (this is not the "normal" definition of cluster - see below) working on one machine.
You can have a PostgreSQL instance (cluster) with just one database, apart from the two templates (see below) and a "working" database called postgres
(created on cluster/instance initialisation).
After running initdb
, you'll have 3 databases - two templates and one "working" database called postgres
- some 3rd party utilities expect it to be there, but you can delete it if required or desirable (security).
You should never touch template0
- it can render your system inoperable - here's blog about how to copy databases from templates. The templates are "skeletons" - from which you create databases - all of the settings (from postgresql.conf
) and system catalogs are there, but no ordinary tables!
If you issue a \d
from within template1, you'll receive the message: Did not find any relations.
It does have system catalogs (AKA system tables AKA data dictionary AKA db meta-data) - issue a \d *
to see these. If you always use, say, PostGis and Python, you might install them on template1, and then every new database you create will have these installed by default. Template0 can be seen as a backup for template1 - never to be changed, only copied!
All of the PostgreSQL databases you create can have one or more "schemas" (a further sub-division - cluster ⊃ database(s) ⊃ schema(s)) that you may also create. Any database can have multiple schemas - logical separation of functions - hr, accts, stock &c., i.e within the same db.
Create a new database - CREATE DATABASE test;
and then run \c test
(prompt will change) and then \dn
to list its schemas - there will be one - public
- this may be deleted if required.
So, overall:
On the same machine, you can have 1 or as many clusters (PostgreSQL definition - see discussion of clusters below) as you want (within reason) using different ports and a different data directory for each cluster.
Production machines would typically use port 5432 and dev/UAT machines might have a few clusters (i.e. instances) using different ports - running small test databases isn't very resource intensive!
All of these databases can have their own (set of) schema(s) - so you could have (for example) 3 (PostgreSQL definition of) clusters running on ports 5432, 5433 and 5434, each with an hr schema, an accounting schema (as many schemas as you want - within reason).
You are not obliged to create (a) schema(s) - it can be helpful for the logical separation of large databases into their constituent sections (c.f. hr/accts...)
Re clusters!
I think I see the reason for the confusion re clusters/databases/schemas!
PostgreSQL's origins date back quite a while - it derives from Ingres:
Ingres began as a research project at UC Berkeley, starting in the early 1970s and ending in 1985.
That's almost a full decade before Oracle's first release in 1979. It uses an older vocabulary than most systems do.
Notice the terms that I've used:
System
catalogs
(rather than the more usual system "tables" - AKA )PostgreSQL uses the term
relations
(rather than "tables" - PostgreSQL makes a distinction between system tables (catalogs) and ordinary tables (relations)).
PostgreSQL people are fond of using other terms i.e. tuple
which has largely been replaced by "record" and/or "row" and attribute
which has been replaced by "column" in other systems and in general usage. These changes have possibly been driven by the ubiquity of spreadsheets!
These terms derive from relational calculus & relational algebra which (in the field of db's) largely derives from a paper written by Ted Codd which uses mathematical language. The originator of the Ingres system was Michael Stonebraker, an academic (and commercial success), hence the retention of (what might be considered overly) academic terms.
Nowadays, a "cluster
" is considered to be:
A computer cluster is a set of computers that work together so that they can be viewed as a single system.
Not the PostgreSQL definition - some history!
The best definition of a cluster for PostgreSQL is PostgreSQL's own definition:
A database cluster is a collection of databases that is managed by a single instance of a running database server.
Note, there is nothing there about multiple machines - it is a single instance of a running database server! One can have many (PostgreSQL) clusters (i.e. instances) on a single machine - the PG definition is, in some ways, the inverse of what is more commonly accepted as the definition of a cluster (paraphrasing):
- Normal definition: Many machines, one system
- PostgreSQL definiton: Many systems, one machine
Re HA
Doing replication (e.g. with Master/Slave) will have two PostgreSQL clusters located on different machines - which may involve more than one database, but as I said, in PROD, it's normally dedicated to one database (alongside your skeletal templates, which you shouldn't delete).
if you have failover provision and then you have a cluster in the modern sense - many machines, one system - a full copy of the main db on each server.
A full discussion of PostgreSQL High Availability would be an answer in itself and there are many different options - I would read what PostgreSQL themselves have to say about this and also, this post by PostgresPro (big hitters in the PostgreSQL world) which provides a list of 4 systems which can do this job:
Finally, there is Percona, (and see here) and SeveralNines - both big in the (Open Source) database world.
You need to read all of these posts, follow the links and ensure that you understand the pros and cons of each system and what compromises you and your stakeholders can/want to make (budget, RTO/RPO, expertise).
Last word on "clusters":
Finally, and to add a bit of complication to the mix, there are now PostgreSQL systems that are "natively" distributed. There's TimescaleDB and Citusdata - which are "distributed PostgreSQL". These work by sharding - i.e. different chunks of data on different machines, while maintaining a (user-specified, normally prime number) level of redundancy - the consensus algorithms work by "voting" - with a prime number of copies of the data, ties are impossible.
It is worth nothing that both (source available) are based on the amazingly powerful PostgreSQL system of extensions! You might want to take a look there also (out of scope for this question).
Finally, from another set of heavy hitters comes this (modern) definition (top of page 2) of a cluster:
A database cluster consists of N database instances running on N physically separate machines sharing no components and connected to each other by a network. Each instance contains a complete copy of the data, and is able to start and maintain arbitrary point-in-time snapshot transactions.
So, these guys' definition is different to all of the previous ones... go figure...
Deep dive:
Finally, if you wish to do a deep dive on this, you could do worse than look at this article - it's a review (2024) of all of the major db systems out there by the biggest guy in databases (both theory and commerical) - i.e. Michael Stonebraker (see above) and another big academic hitter - Andy Pavlo.
You might want to pay particular attention to the NewSQL systems (CockroachDB, Yugabyte & TiDB) to see how SQL can be combined with "clusters" (slightly different for each) and having multiple copies of data (not necessarily the entire db) on different machines - the data is divided into "ranges", "tablets" or "pods" and consistency is maintained by the Raft or Paxos consensus algorithms.
May you live in interesting times!
-
so, If I have a database that was started as a cluster and now the cluster is corrupted, how do I make it back to normal again (non-cluster) ? Di I have to remove and reinstall ? @Véracepadjee– padjee2022年10月21日 06:01:11 +00:00Commented Oct 21, 2022 at 6:01
-
A database will always be in a cluster - it can't be anywhere else but in a cluster. When you say the "cluster is corrupted" - do you mean that your server is no longer up and running (what is the output of
ps -ef | grep post
) or the database itself is corrupt? If it's the latter, then you'll have to restore from your last backup and replay the WAL logs as far as you can. Corruption is usually a hardware issue so I'd try doing it on another machine and see how that goes? Failing that, I'd consult a company specialising in PostgreSQL.Vérace– Vérace2022年10月21日 14:21:27 +00:00Commented Oct 21, 2022 at 14:21 -
what if
systemctl status [email protected]
shows this error○しろまる [email protected] - PostgreSQL Cluster 12-main Loaded: loaded (/lib/systemd/system/[email protected]; disabled; vendor preset: enabled) Active: inactive (dead)
whereassudo service postgresql status
showspsql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory Is the server running locally and accepting connections on that socket?
so, is the database service already started or not ? @Véracepadjee– padjee2022年10月22日 12:21:58 +00:00Commented Oct 22, 2022 at 12:21 -
ps -ef | grep post
shows : root 6099 6093 0 19:30 pts/1 00:00:00 su - postgres postgres 6100 6099 0 19:30 pts/1 00:00:00 -bash postgres 6148 6100 0 19:34 pts/1 00:00:00 ps -ef postgres 6149 6100 0 19:34 pts/1 00:00:00 grep post` @Véracepadjee– padjee2022年10月22日 12:37:15 +00:00Commented Oct 22, 2022 at 12:37 -
Please use comments to get my (or whoever's) attention that you have responded to my (their) request and that you have put the info requested into the question itself! The question is easier to read that way - thanks! What you have written is barely legible and it's very difficult to understand - it appears that you're not running postgresql at the moment - put all of the above as formatted text into the question and we'll take things from there!Vérace– Vérace2022年10月22日 14:48:41 +00:00Commented Oct 22, 2022 at 14:48