
Moving 5,000,000 rows to another PostgreSQL DB with Clojure & JDBC


I think the key observation here is that while your query is lazily streaming results from one DB, your insert is just one giant write to the other DB. With regard to memory usage, I don't think it makes much difference whether you've streamed the query results lazily or not if you're collecting all those results (in-memory) for a single write operation at the end.
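For contrast, here is roughly what I understand the non-batched version to look like (a sketch assuming the same answer-sql, source-db, and dct-db as in the snippet below): the :result-set-fn hands every row to a single insert-multi! call, so the entire result set has to be realized in memory first.

(db/with-db-transaction [tx {:datasource source-db}]
  (db/query tx
            [(db/prepare-statement (:connection tx)
                                   answer-sql
                                   {:result-type :forward-only
                                    :concurrency :read-only
                                    :fetch-size 2000})]
            {:as-arrays? true
             ;; collects every row for one giant insert
             :result-set-fn (fn [result-set]
                              (db/insert-multi! {:datasource dct-db}
                                                :dim_answers
                                                (first result-set)
                                                (rest result-set)))}))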

One way to balance memory usage with throughput is to batch the writes:

(db/with-db-transaction [tx {:datasource source-db}]
  (db/query tx
            [(db/prepare-statement (:connection tx)
                                   answer-sql
                                   {:result-type :forward-only
                                    :concurrency :read-only
                                    :fetch-size 2000})]
            {:as-arrays? true
             :result-set-fn (fn [result-set]
                              (let [keys (first result-set)
                                    values (rest result-set)]
                                (doseq [batch (partition-all 2000 values)]
                                  (db/insert-multi! {:datasource dct-db}
                                                    :dim_answers
                                                    keys
                                                    batch))))}))

The difference is that this uses partition-all to insert the values in batches (the same size as :fetch-size, though this could surely be tuned). Compare the performance and memory usage of this approach with the other by setting the JVM max heap size to something like -Xmx1g; I couldn't get the non-batched version to complete with that heap size.
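For example, with a Leiningen project the heap cap can be set in project.clj (project name and dependency versions here are placeholders), or passed directly to java as -Xmx1g:

(defproject db-migration "0.1.0-SNAPSHOT"   ; hypothetical project name
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [org.clojure/java.jdbc "0.7.12"]
                 [org.postgresql/postgresql "42.2.18"]]
  ;; cap the max heap at 1 GB for the comparison
  :jvm-opts ["-Xmx1g"])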

I was able to migrate 6 million small-ish rows between local PostgreSQL DBs on my laptop in about a minute, with the java process using less than 400 MB of memory. I also used HikariCP.
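For reference, a pooled datasource can be set up roughly like this; this is a sketch assuming the hikari-cp Clojure wrapper, with placeholder connection details:

(require '[hikari-cp.core :as hikari])

(def source-db
  ;; pooled datasource for the source database (placeholder credentials)
  {:datasource (hikari/make-datasource
                 {:adapter       "postgresql"
                  :server-name   "localhost"
                  :port-number   5432
                  :database-name "source"
                  :username      "user"
                  :password      "secret"})})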

If you do insert in batches, you may want to consider wrapping all the inserts in a single transaction if it suits your use case. I left the additional transaction out here for brevity.
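A minimal sketch of that variation, reusing the keys/values bindings from the snippet above, so the batches commit (or roll back) together:

(db/with-db-transaction [write-tx {:datasource dct-db}]
  (doseq [batch (partition-all 2000 values)]
    (db/insert-multi! write-tx :dim_answers keys batch)))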

If I don't use the :max-size parameter, memory explodes

I can't find any reference (besides a spec) to this option in the latest clojure.java.jdbc, and it didn't affect my testing. I do see a :max-rows but surely you don't want that.

I think it is because of the :as-arrays? parameter.

I would expect this to be beneficial to memory usage; the row vectors should be more space-efficient than row maps.
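To illustrate the shape difference (a toy example, with hypothetical columns):

;; default: each row is a map that repeats the column keywords
(db/query {:datasource source-db} ["SELECT id, label FROM dim_answers LIMIT 2"])
;; => ({:id 1, :label "a"} {:id 2, :label "b"})

;; :as-arrays? true: the keywords appear once, then plain row vectors
(db/query {:datasource source-db} ["SELECT id, label FROM dim_answers LIMIT 2"]
          {:as-arrays? true})
;; => [[:id :label] [1 "a"] [2 "b"]]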


This solution works best for me, and it also seems faster than Taylor's solution. Huge thanks for helping me.

It doesn't commit until the transaction is done. I haven't experienced any problems yet, so I'll see whether I need to tune it further, but I'm happy for now. I tried replacing the first transaction with with-db-connection, but that made the records load straight into RAM (the PostgreSQL JDBC driver only streams rows with a cursor when autocommit is off, which the read transaction provides).

(defn data->transfer2 [sql table]
  ;; read transaction keeps autocommit off so :fetch-size streams rows
  (jdbc/with-db-transaction [read-tx {:datasource dag-db}]
    ;; write transaction commits all batches together at the end
    (jdbc/with-db-transaction [tx {:datasource dct-db}]
      (jdbc/query read-tx
                  [(jdbc/prepare-statement (:connection read-tx)
                                           sql
                                           {:result-type :forward-only
                                            :concurrency :read-only
                                            :fetch-size 100000})]
                  {:as-arrays? true
                   :result-set-fn (fn [result-set]
                                    (let [keys (first result-set)
                                          values (rest result-set)]
                                      (doseq [batch (partition-all 100000 values)]
                                        (jdbc/insert-multi! tx
                                                            table
                                                            keys
                                                            batch))))}))))
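Called like this, for example (answer-sql being the same query string as before):

(data->transfer2 answer-sql :dim_answers)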