I'm trying to read a JSON API with the Postgres copy from program command.
When I call
copy _temp from program 'curl -uri "https://url"';
it works fine but the API has paging and I need to loop through multiple calls.
When I call it like this:
_command := 'curl -uri "https://url?&page=' || (10::text) || '"';
copy _temp from program _command;
I get
ERROR: syntax error at or near "_command"
You can't even concatenate in place, like:
copy _temp from program 'curl -uri "https://url?&page='||(10::text)||'"';
Percent-style parameter substitution, as in RAISE NOTICE, doesn't work either.
What gives? The PROGRAM argument is a literal string in single quotes, so what is the difference between specifying a literal string and using a text or varchar variable? There doesn't seem to be any program data type (a ::program cast does nothing), so what am I missing?
In the docs it says 'it is best to use a fixed command string', not that you can only use a fixed command string...
How do I use a dynamic command string?
2 Answers
If you want to use dynamic commands, you need the right context for them, and plain SQL is not one.
One possibility is PL/pgSQL and its EXECUTE functionality. Use DO to get into that context:
DO $$
DECLARE page integer;
BEGIN
FOR page IN SELECT i FROM generate_series(1, 10) AS t(i)
LOOP
EXECUTE format($cmd$ COPY _temp
FROM PROGRAM 'curl -uri "https://url?&page=%s"'$cmd$,
page::text);
END LOOP;
END; $$;
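For completeness, a minimal sketch of the staging table the example writes into; the single text column is an assumption, since the real layout depends on what the API returns:

-- Hypothetical staging table: with COPY's default text format, each line of
-- curl's output lands in one row. Adjust columns and types to the real payload.
CREATE TABLE IF NOT EXISTS _temp (line text);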
The other way is to run the loop on the shell side. The following example uses bash:
for i in $(seq 1 10); do
  psql -c "COPY _temp FROM PROGRAM 'curl -uri \"https://url?&page=${i}\"'"
done
Notes:
- I am using two levels of dollar-quoting ($$ and $cmd$) to prevent quote multiplication.
- I am also using format() instead of the messy concatenation.
- In the shell example the double quotes around the URI have to be escaped (\"), because the psql -c argument is already double-quoted; without quoting the URI, the server's shell would misinterpret the & in the query string.
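As a quick sanity check, you can look at the command string format() builds for a single page before handing it to EXECUTE (the URL is the same placeholder as above):

SELECT format($cmd$COPY _temp FROM PROGRAM 'curl -uri "https://url?&page=%s"'$cmd$, 5);
-- => COPY _temp FROM PROGRAM 'curl -uri "https://url?&page=5"'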
On the matter of crawling
If the website has paging, you're not going to want to do this in psql or the shell. The reason for that is simple: you (normally) won't know where to stop. That's one of the tricks of paging.
- Some servers return 404 when you go too far.
- Some servers return 500 (though they shouldn't).
- Some servers return 200 and tell you there are no more results.
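If you do stay on the database side, the last case can sometimes be turned into a stop condition. A minimal sketch, assuming the API returns an empty body (no lines at all) once you are past the last page, so the COPY imports zero rows; GET DIAGNOSTICS picks up the rows processed by the dynamic COPY:

DO $$
DECLARE
    page integer := 0;
    imported bigint;
BEGIN
    LOOP
        page := page + 1;
        EXECUTE format($cmd$COPY _temp
            FROM PROGRAM 'curl -uri "https://url?&page=%s"'$cmd$, page);
        GET DIAGNOSTICS imported = ROW_COUNT;  -- rows the last COPY pulled in
        EXIT WHEN imported = 0                 -- empty page: assume no more results
              OR page >= 1000;                 -- arbitrary safety cap
    END LOOP;
END; $$;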
At the point that you're doing this, you need to start looking into web crawling. Factors to consider:
- Crawling in serial, or parallel
- Creating an HTTP request
- Parsing the response header (unless your framework provides it)
- Parsing the response body (if you need to process HTML)
- Handling failure (retries, etc.)
I'm of the opinion that the best compromise in this department is Node/Rx.js/request-promise/Cheerio: I do it a lot.
JSON API?
I think there would be a lot of disagreement about what a "proper JSON API" is. And there are other drawbacks, regardless of what you want to call it:
- Your DB session will be tied up waiting for curl.
- The transaction is subject to any server-wide timeout limits.
- The user functions execute serially within the session, so you won't get to take advantage of parallel requests (asynchronous crawling).
- You'll be opening a new TCP connection each time you run curl.