We have an application that we use for network monitoring, and it uses a Cassandra database. Our program is ending, and the customer wants us to save our trending database for a period of time after the contract ends. I doubt the customer has any DBAs, and I am the only software person on the program. My thought, for ease of data retention in case the customer ever wants the data, is to export the Cassandra keyspace and then import it into MySQL, which may be an easier database for the poor schlub who gets tasked with using the data. So far I have been able to export the data into JSON files using cassandra-exporter. But when I try to import the JSON files (>1 GB), MySQL Workbench crashes. I am not a DBA, I just google a lot haha. Is there a better way to import the data into MySQL? Should I even try this path? Should I just leave it as JSON? Or should I leave the data in Cassandra and just turn over the database file(s?) for the customer to deal with?
Tried cassandra-exporter to get the tables into JSON files. Tried importing them with MySQL Workbench, but the application crashes on large files.
-
MySQL Workbench is not the best app ever built, but all GUI apps would struggle parsing 1 GB of JSON. However, at the moment I don't see a programming question here, and SO is for programming-related questions only. Questions about what you should do with the data are also beyond the scope of SO. – Shadow (May 12, 2025 at 17:50)
-
"When I try to import the JSON files (>1 GB) MySQL Workbench crashes": use the CLI, not Workbench. "Is there a better way to import the data into MySQL?": move your data via plain CSV. – Akina (May 12, 2025 at 20:05)
-
OTOH, using COPY TO and using the result as a source for Scriptella should do the trick, since it has a CSV driver. – Markus W Mahlberg (May 18, 2025 at 10:58)
1 Answer
What you want is batch operations.
You could split your JSON exports, but then importing becomes pretty tedious.
So what we want is automated batch import based on the exported data.
I made several assumptions here:
- The dataset is too large to import via mysqlimport.
- The export and dump have to happen multiple times, so convenience is of the essence.
- The export is not continuous.
Here is my proposal:
- Create a CSV export on the cqlsh shell via COPY TO (a concrete sketch follows this list).
- Use the venerable Scriptella for automated batch import into an ephemeral MySQL/MariaDB instance.
- Dump it from there.
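For the export step, a minimal sketch of what that could look like from the shell; host, credentials, keyspace, table, and column names are placeholders for your own schema:

# Run from any host that can reach the cluster; adjust credentials as needed.
cqlsh <cassandra-host> -u <user> -p <password> \
  -e "COPY yourkeyspace.yourtable (id, col_a, col_b) TO '/tmp/export.csv' WITH HEADER = true;"
# HEADER writes the column names as the first CSV row, which is what lets the
# Scriptella CSV driver later reference the columns by name (?fname, ?lname, ...).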
I have prepared a little demo via docker-compose. All the required files are inlined and commented, so you should be able to simply run:
docker compose pull
docker compose build #optional
docker compose up
The demo docker-compose.yaml
This is a demo, complete with a Cassandra instance, to show the workflow. It is thoroughly commented; make sure you read the details.
configs:
cassandra_keyspace.cql:
# setup script for our demo.
content: |
CREATE KEYSPACE IF NOT EXISTS sample WITH REPLICATION =
{ 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
AND DURABLE_WRITES = true;
cassandra_users.cql:
# We insert some demo users into the users table.
content: |
CREATE TABLE IF NOT EXISTS sample.users (
id UUID PRIMARY KEY,
lname text,
fname text );
INSERT INTO sample.users (id, fname, lname) VALUES (uuid(), 'John', 'Doe');
INSERT INTO sample.users (id, fname, lname) VALUES (uuid(), 'Jane', 'Doe');
INSERT INTO sample.users (id, fname, lname) VALUES (uuid(), 'Jim', 'Beam');
INSERT INTO sample.users (id, fname, lname) VALUES (uuid(), 'Jack', 'Daniels');
INSERT INTO sample.users (id, fname, lname) VALUES (uuid(), 'Johnny', 'Walker');
cassandra_export.cql:
# This is the first interesting part of the demo.
# We export the data from the Cassandra database to a CSV file.
# See https://docs.datastax.com/en/cql/hcd/reference/cqlsh-commands/copy.html
# for more information.
# The CSV file is created in the shared volume, which is mounted to /mnt/data
# in both the cassandra and loader containers.
content: |
COPY sample.users (id, fname, lname) TO '/mnt/data/export.csv' WITH HEADER = true;
scriptella_etl.properties:
# Scriptella ETL Configuration Properties
# https://scriptella.org/reference/index.html#%3Cproperties%3E
# Basically, we define variables that are used in the script.
content: |
#Scriptella ETL Configuration Properties
driver=mysql
url=jdbc:mysql://db:3306/test
user=username
password=password
classpath=/usr/local/lib/scriptella/lib/mysql-connector-j-9.3.0.jar
ops=3
scriptella_etl.xml:
# Scriptella ETL Configuration XML
# Our actual conversion script.
# For each row in the CSV file, we insert a new row into the MySQL database.
# In this demo, the CSV file is created by the Cassandra export script.
# The CSV file is created in the shared volume, which is mounted to /mnt/data
# in both the cassandra and loader containers.
#
# Note that the double dollar signs are only here to escape the dollar sign
# within the docker-compose.yaml file.
#
# ALSO note that you will definitely want to adjust the
# ops property in the property file to your needs.
# More info on batching in scriptella:
# http://scriptella.org/docs/api/scriptella/jdbc/package-summary.html#batching
content: |
<!DOCTYPE etl SYSTEM "http://scriptella.org/dtd/etl.dtd">
<etl>
<description>Import of arbitrary length CSV file into MySQL</description>
<properties>
<include href="etl.properties"/> <!--Load from external properties file-->
</properties>
<connection id="in" driver="csv" url="/mnt/data/export.csv">
statement.fetchSize=$$ops
</connection>
<connection driver="$$driver" id="out" url="$$url" user="$$user"
password="$$password" classpath="$$classpath">
statement.batchSize=$$ops
</connection>
<query connection-id="in">
<!-- Empty query means select all -->
<script connection-id="out">
INSERT INTO users (firstname,lastname) VALUES (?fname,?lname);
</script>
</query>
</etl>
mariadb_init.sql:
# Just a simple SQL script to create the table.
# You will need to adjust that to your data model(s).
content: |
CREATE TABLE IF NOT EXISTS users (
id INT AUTO_INCREMENT PRIMARY KEY,
firstname VARCHAR(255) NOT NULL,
lastname VARCHAR(255) NOT NULL
);
volumes:
shared:
# This is a shared volume that is used by the cassandra and loader containers.
# It is used to store the CSV file that is created by the cassandra export script.
# The loader container uses this file to import the data into the MySQL database.
# This is only needed for this demo.
# For actual usage, you would create the csv file on your cassandra cluster,
# copy it to your local machine, and then bind-mount it to the loader container.
# driver: local
# driver_opts:
# o: "uid=1001"
services:
volume-permissions:
# This is a simple container that is used to set the permissions on the shared volume.
# It is needed because the cassandra container runs as root and the loader container
# runs as a non-root user.
image: alpine:latest
volumes:
- type: volume
source: shared
target: /mnt/data
entrypoint:
- /bin/sh
- -c
- |
echo "Setting permissions on shared volume..."
chown -R 1001:1001 /mnt/data
cassandra:
# The cassandra container is only used in this demo.
# For actual usage, you can remove this container and instead
# bind-mount the CSV file you created on your cassandra cluster
# into the loader container.
image: bitnami/cassandra:5.0
depends_on:
# We want to ensure that the volume-permissions container has finished
# before we start the cassandra container.
volume-permissions:
condition: service_completed_successfully
configs:
# Setup the keyspace and users table and force the export.
- source: cassandra_keyspace.cql
target: /docker-entrypoint-initdb.d/01_keyspace.cql
- source: cassandra_users.cql
target: /docker-entrypoint-initdb.d/02_users.cql
- source: cassandra_export.cql
target: /docker-entrypoint-initdb.d/03_export.cql
volumes:
# The shared volume is used to store the CSV file that is created by the cassandra export script.
- type: volume
source: shared
target: /mnt/data
healthcheck:
# We want to ensure that the cassandra container is healthy before we start the loader container.
test: ["CMD-SHELL", "nodetool status"]
interval: 15s
start_period: 2m
timeout: 10s
retries: 3
db:
# The db container is the MySQL database that we want to import the data into.
# However, it is only temporary: we just want to load the data into it and then dump it to a file.
# Note that we do not use a volume for this container.
# This allows you to simply down the stack and start over with a clean database.
image: mariadb:11-ubi
environment:
MARIADB_DATABASE: test
MARIADB_USER: username
MARIADB_PASSWORD: password
MARIADB_ROOT_PASSWORD: t0Ps3cr3t
configs:
- source: mariadb_init.sql
target: /docker-entrypoint-initdb.d/01_init.sql
healthcheck:
# We want to ensure that the db container is healthy before we start the loader container.
test: ["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]
start_period: 10s
interval: 10s
timeout: 5s
retries: 3
loader:
image: mwmahlberg/scriptella:8-jre-1.2-rel0
build:
# Scriptella is a weee bit old and pretty much abandoned.
# So we need to build our own image with the MySQL connector included.
# This is a multi-stage build: we first download the scriptella and mysql connector
# and then copy them into the final image.
context: .
dockerfile_inline: |
FROM alpine:latest AS downloader
ARG SCRIPTELLA_VERSION=1.2
ARG MYSQL_CONNECTOR_VERSION=9.3.0
WORKDIR /usr/local/src/
RUN apk add curl unzip && \
curl -LO https://github.com/scriptella/scriptella-etl/releases/download/scriptella-parent-$${SCRIPTELLA_VERSION}/scriptella-$${SCRIPTELLA_VERSION}.zip && \
unzip scriptella-$${SCRIPTELLA_VERSION}.zip && \
curl -qsL https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}.tar.gz \
| tar -xzv --strip-components 1 -C scriptella-1.2/lib/ mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}/mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}.jar
FROM eclipse-temurin:11-jre-alpine-3.21
ARG SCRIPTELLA_VERSION=1.2
ENV PATH="$$PATH:/usr/local/lib/scriptella/bin/"
COPY --from=downloader /usr/local/src/scriptella-$${SCRIPTELLA_VERSION} /usr/local/lib/scriptella
WORKDIR /opt/export
ENTRYPOINT [ "$$JAVA_HOME/bin/java","-jar","/usr/local/lib/scriptella/scriptella.jar"]
depends_on:
# We want to ensure that both our source and target
# databases are healthy before we start the loader container.
# For actual usage, you can remove the depends_on section for cassandra.
cassandra:
condition: service_healthy
db:
condition: service_healthy
configs:
- source: scriptella_etl.properties
target: /opt/export/etl.properties
- source: scriptella_etl.xml
target: /opt/export/etl.xml
volumes:
# The shared volume is used to store the CSV file that is created by the cassandra export script.
# The loader container uses this file to import the data into the MySQL database.
# This is only needed for this demo.
# For actual usage, you would create the csv file on your cassandra cluster,
# copy it to your local machine, and then bind-mount it here into the loader container.
# eg:
# - type: bind
# source: path/on/host/to/your-csv-file.csv
# target: /mnt/data/export.csv
- type: volume
source: shared
target: /mnt/data
# The entrypoint is where the magic happens.
# We wait for the cassandra export to become available and then run the scriptella ETL job.
# The scriptella job will read the CSV file and insert the data into the MySQL database
# line by line.
entrypoint:
- /bin/sh
- -c
- |
until [ -f /mnt/data/export.csv ]
do
echo "Waiting for Cassandra to finish the export..."
sleep 5
done
$$JAVA_HOME/bin/java -jar /usr/local/lib/scriptella/scriptella.jar
dumper:
# The final step is to dump the MySQL database to a file.
image: alpine:latest
depends_on:
# We want to ensure that the db container is healthy...
db:
condition: service_healthy
# ...and that the loader container has finished before we start the dumper container.
loader:
condition: service_completed_successfully
volumes:
# We bind-mount the local directory ./dump to /mnt/data in the container.
# This is where the MySQL dump will be stored.
- type: bind
source: ./dump
target: /mnt/data
entrypoint:
- /bin/sh
- -c
- |
echo "Installing MySQL client..."
apk add --no-cache mysql-client
echo "Dumping MySQL data..."
mysqldump --skip-ssl-verify-server-cert -h db -u username -ppassword test | bzip2 -v9 > /mnt/data/dump.sql.bz2
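Once the stack has run, you can sanity-check what the dumper produced straight from the host (assuming you started the stack from the directory containing the dump folder):

# Peek at the first statements of the compressed dump without extracting it.
bzcat dump/dump.sql.bz2 | head -n 40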
Now that we have seen that the workflow works, we can move on to
The real life
(also available as a gist)
The docker-compose.yaml only needs slight adjustments.
configs:
scriptella_etl.properties:
# Scriptella ETL Configuration Properties
# https://scriptella.org/reference/index.html#%3Cproperties%3E
# Basically, we define variables that are used in the script.
file: etl.properties
scriptella_etl.xml:
# Scriptella ETL Configuration XML
# Our actual conversion script.
# For each row in the CSV file, we insert a new row into the MySQL database.
# In this demo, the CSV file is created by the Cassandra export script.
# The CSV file is created in the shared volume, which is mounted to /mnt/data
# in both the cassandra and loader containers.
#
# Note that the double dollar signs are only here to escape the dollar sign
# within the docker-compose.yaml file.
#
# ALSO note that you will definitely want to adjust the
# ops property in the property file to your needs.
# More info on batching in scriptella:
# http://scriptella.org/docs/api/scriptella/jdbc/package-summary.html#batching
content: |
<!DOCTYPE etl SYSTEM "http://scriptella.org/dtd/etl.dtd">
<etl>
<description>Import of arbitrary length CSV file into MySQL</description>
<properties>
<include href="etl.properties"/> <!--Load from external properties file-->
</properties>
<connection id="in" driver="csv" url="/mnt/data/export.csv">
statement.fetchSize=$$ops
</connection>
<connection driver="$$driver" id="out" url="$$url" user="$$user"
password="$$password" classpath="$$classpath">
statement.batchSize=$$ops
</connection>
<query connection-id="in">
<!-- Empty query means select all -->
<script connection-id="out">
INSERT INTO users (firstname,lastname) VALUES (?fname,?lname);
</script>
</query>
</etl>
mariadb_init.sql:
# Just a simple SQL script to create the table.
# You will need to adjust that to your data model(s).
content: |
CREATE TABLE IF NOT EXISTS users (
id INT AUTO_INCREMENT PRIMARY KEY,
firstname VARCHAR(255) NOT NULL,
lastname VARCHAR(255) NOT NULL
);
services:
db:
# The db container is the MySQL database that we want to import the data into.
# However, it is only temporary: we just want to load the data into it and then dump it to a file.
# Note that we do not use a volume for this container.
# This allows you to simply down the stack and start over with a clean database.
image: mariadb:11-ubi
environment:
MARIADB_DATABASE: test
MARIADB_USER: username
MARIADB_PASSWORD: password
MARIADB_ROOT_PASSWORD: t0Ps3cr3t
configs:
- source: mariadb_init.sql
target: /docker-entrypoint-initdb.d/01_init.sql
healthcheck:
# We want to ensure that the db container is healthy before we start the loader container.
test: ["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]
start_period: 10s
interval: 10s
timeout: 5s
retries: 3
loader:
image: mwmahlberg/scriptella:8-jre-1.2-rel0
build:
# Scriptella is a weee bit old and pretty much abandoned.
# So we need to build our own image with the MySQL connector included.
# This is a multi-stage build: we first download the scriptella and mysql connector
# and then copy them into the final image.
context: .
dockerfile_inline: |
FROM alpine:latest AS downloader
ARG SCRIPTELLA_VERSION=1.2
ARG MYSQL_CONNECTOR_VERSION=9.3.0
WORKDIR /usr/local/src/
RUN apk add curl unzip && \
curl -LO https://github.com/scriptella/scriptella-etl/releases/download/scriptella-parent-$${SCRIPTELLA_VERSION}/scriptella-$${SCRIPTELLA_VERSION}.zip && \
unzip scriptella-$${SCRIPTELLA_VERSION}.zip && \
curl -qsL https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}.tar.gz \
| tar -xzv --strip-components 1 -C scriptella-1.2/lib/ mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}/mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}.jar
FROM eclipse-temurin:11-jre-alpine-3.21
ARG SCRIPTELLA_VERSION=1.2
ENV PATH="$$PATH:/usr/local/lib/scriptella/bin/"
COPY --from=downloader /usr/local/src/scriptella-$${SCRIPTELLA_VERSION} /usr/local/lib/scriptella
WORKDIR /opt/export
ENTRYPOINT [ "$$JAVA_HOME/bin/java","-jar","/usr/local/lib/scriptella/scriptella.jar"]
depends_on:
# We want to ensure that the db container is healthy.
db:
condition: service_healthy
configs:
- source: scriptella_etl.properties
target: /opt/export/etl.properties
- source: scriptella_etl.xml
target: /opt/export/etl.xml
volumes:
- type: bind
source: ./export.csv
target: /mnt/data/export.csv
# The entrypoint is where the magic happens.
# We wait for the cassandra export to become available and then run the scriptella ETL job.
# The scriptella job will read the CSV file and insert the data into the MySQL database
# line by line.
entrypoint:
- /bin/ash
- -c
- |
until [ -f /mnt/data/export.csv ]
do
echo "Waiting for Cassandra to finish the export..."
sleep 5
done
$$JAVA_HOME/bin/java -jar /usr/local/lib/scriptella/scriptella.jar
dumper:
# The final step is to dump the MySQL database to a file.
image: alpine:latest
depends_on:
# We want to ensure that the db container is healthy...
db:
condition: service_healthy
# ...and that the loader container has finished before we start the dumper container.
loader:
condition: service_completed_successfully
volumes:
# We bind-mount the local directory ./dump to /mnt/data in the container.
# This is where the MySQL dump will be stored.
- type: bind
source: ./dump
target: /mnt/data
entrypoint:
- /bin/sh
- -c
- |
echo "Installing MySQL client..."
apk add --no-cache mysql-client
echo "Dumping MySQL data..."
mysqldump --skip-ssl-verify-server-cert -h db -u username -ppassword test | bzip2 -v9 > /mnt/data/dump.sql.bz2
The workflow goes like this:
1. Adjust scriptella_etl.xml and mariadb_init.sql to fit your data.
2. Create a dump of the data via COPY TO.
3. Copy/move it to the directory of the docker-compose.yaml above under the name export.csv.
4. Create a file named etl.properties and adjust it to your needs. Unless you want to use a preexisting MySQL/MariaDB server, you basically only have to play with ops:
#Scriptella ETL Configuration Properties
driver=mysql
url=jdbc:mysql://db:3306/test
user=username
password=password
classpath=/usr/local/lib/scriptella/lib/mysql-connector-j-9.3.0.jar
ops=3
5. Run docker compose up.
6. Press Ctrl+C when the dumper container exits.
7. You should find your bzipped MySQL dump in the dump directory.
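Whoever later needs the data can restore that dump into any MySQL/MariaDB instance using the stock client tools; a minimal sketch, with host, user, and database names as placeholders:

# Decompress the dump on the fly and pipe it into an existing (empty) database.
bunzip2 -c dump/dump.sql.bz2 | mysql -h <host> -u <user> -p <database>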
Multiple tables
In case you need to export multiple tables, simply repeat the process with an adjusted config. Alternatively, you can add more loaders and dumpers to the compose file.
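A rough sketch of what that repetition could look like from the shell, assuming one CSV and one ETL file per table. The devices table and the etl_*.xml file names here are made up; each ETL file would point its "in" connection at the matching CSV and its INSERT at the matching MySQL table:

# Export each table to its own CSV file ...
cqlsh -e "COPY sample.users (id, fname, lname) TO '/path/to/users.csv' WITH HEADER = true;"
cqlsh -e "COPY sample.devices (id, hostname, ip) TO '/path/to/devices.csv' WITH HEADER = true;"
# ... then run one Scriptella job per table, passing each ETL file to the launcher.
scriptella etl_users.xml
scriptella etl_devices.xml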