
We have an application that we use for network monitoring and it uses a Cassandra database. Our program is ending and the customer wants us to save our trending database for a period of time after the contract ends. I doubt the customer has any DBAs and I am the only software person on the program. My thought, for ease of data retention in case the customer should want the data, is to export the Cassandra keyspace and then import it into MySQL, which may be an easier database for the poor schlub that gets tasked with using the data. So far I have been able to export the data into JSON files using cassandra-exporter. But when I try to import the JSON files (>1 GB), MySQL Workbench crashes. I am not a DBA, just google a lot haha. Is there a better way to import the data into MySQL? Should I even try this path? Should I just leave it as JSON? Or should I leave the data in Cassandra and just turn over the database file(s?) for the customer to deal with?

Tried cassandra-exporter to get the tables into JSON files. Tried importing using MySQL Workbench, but the application crashes on large files.

asked May 12 at 17:15
  • MySQL Workbench is not the best app ever built, but all GUI apps would struggle parsing 1 GB of JSON. However, at the moment I don't see a programming question here and SO is for programming-related questions only. Questions about what you should do with the data are also beyond the scope of SO. Commented May 12 at 17:50
  • "when I try to import the JSON files (>1 GB) MySQL Workbench crashes": use the CLI, not Workbench. "Is there a better way to import the data to MySQL?": move your data via plain CSV. Commented May 12 at 20:05
  • OTOH, using COPY TO and feeding the result to Scriptella should do the trick, since it has a CSV driver. Commented May 18 at 10:58

1 Answer


What you want is batch operations.

You could split your JSON exports, but then importing them becomes pretty tedious.
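
For the record, splitting would come down to something like this (a rough sketch, assuming GNU coreutils and a line-delimited JSON export; file names and chunk size are made up):

    # cut the export into ~100 MB pieces without breaking lines
    split -C 100m --additional-suffix=.json users.json users_part_

You would then still have to feed every chunk through Workbench or a script by hand, which is exactly the tedium we want to avoid.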

So what we want is automated batch import based on the exported data.

I made several assumptions here:

  1. The dataset is too large to import comfortably via mysqlimport (if it were not, the plain LOAD DATA route sketched right after this list might already be enough).
  2. The export and dump have to happen multiple times, so convenience is of the essence.
  3. The export is not continuous.
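
For completeness: the simplest path, as the comments point out, is to skip Workbench and feed a CSV straight to the server with LOAD DATA from the mysql CLI. A minimal sketch, assuming local_infile is enabled on client and server and that the CSV columns line up with the target table (table and file names are placeholders):

    -- run from the CLI, e.g.: mysql --local-infile=1 -u username -p test
    LOAD DATA LOCAL INFILE 'export.csv'
    INTO TABLE users
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;  -- skip the header row written by COPY TO ... WITH HEADER = true

If that is enough for your data volume, you can stop there; the rest of this answer automates the same idea for repeated runs.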

Here is my proposal:

  1. Create a CSV export via COPY TO in cqlsh
  2. Use the venerable Scriptella for automated batch-import into an ephemeral MySQL/MariaDB instance.
  3. Dump it from there.

I have prepared a little demo via docker-compose. All the required files are inlined and commented, so you should be able to simply run:

docker compose pull
docker compose build #optional
docker compose up

The demo docker-compose.yaml

available as a gist

This is a demo complete with a Cassandra instance to show the workflow. It is thoroughly commented, so make sure you read the details.

configs:
  cassandra_keyspace.cql:
    # setup script for our demo.
    content: |
      CREATE KEYSPACE IF NOT EXISTS sample WITH REPLICATION =
        { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
        AND DURABLE_WRITES = true;
  cassandra_users.cql:
    # We insert some demo users into the users table.
    content: |
      CREATE TABLE IF NOT EXISTS sample.users (
        id UUID PRIMARY KEY,
        lname text,
        fname text );
      INSERT INTO sample.users (id, fname, lname) VALUES (uuid(), 'John', 'Doe');
      INSERT INTO sample.users (id, fname, lname) VALUES (uuid(), 'Jane', 'Doe');
      INSERT INTO sample.users (id, fname, lname) VALUES (uuid(), 'Jim', 'Beam');
      INSERT INTO sample.users (id, fname, lname) VALUES (uuid(), 'Jack', 'Daniels');
      INSERT INTO sample.users (id, fname, lname) VALUES (uuid(), 'Johnny', 'Walker');
  cassandra_export.cql:
    # This is the first interesting part of the demo.
    # We export the data from the Cassandra database to a CSV file.
    # See https://docs.datastax.com/en/cql/hcd/reference/cqlsh-commands/copy.html
    # for more information.
    # The CSV file is created in the shared volume, which is mounted to /mnt/data
    # in both the cassandra and loader containers.
    content: |
      COPY sample.users (id, fname, lname) TO '/mnt/data/export.csv' WITH HEADER = true;
  scriptella_etl.properties:
    # Scriptella ETL Configuration Properties
    # https://scriptella.org/reference/index.html#%3Cproperties%3E
    # Basically, we define variables that are used in the script.
    content: |
      #Scriptella ETL Configuration Properties
      driver=mysql
      url=jdbc:mysql://db:3306/test
      user=username
      password=password
      classpath=/usr/local/lib/scriptella/lib/mysql-connector-j-9.3.0.jar
      ops=3
  scriptella_etl.xml:
    # Scriptella ETL Configuration XML
    # Our actual conversion script.
    # For each row in the CSV file, we insert a new row into the MySQL database.
    # In this demo, the CSV file is created by the Cassandra export script.
    # The CSV file is created in the shared volume, which is mounted to /mnt/data
    # in both the cassandra and loader containers.
    #
    # Note that the double dollar signs are only here to escape the dollar sign
    # within the docker-compose.yaml file.
    #
    # ALSO note that you will definitely want to adjust the
    # ops property in the property file to your needs.
    # More info on batching in scriptella:
    # http://scriptella.org/docs/api/scriptella/jdbc/package-summary.html#batching
    content: |
      <!DOCTYPE etl SYSTEM "http://scriptella.org/dtd/etl.dtd">
      <etl>
        <description>Import of arbitrary length CSV file into MySQL</description>
        <properties>
          <include href="etl.properties"/> <!--Load from external properties file-->
        </properties>
        <connection id="in" driver="csv" url="/mnt/data/export.csv">
          statement.fetchSize=$$ops
        </connection>
        <connection driver="$$driver" id="out" url="$$url" user="$$user"
                    password="$$password" classpath="$$classpath">
          statement.batchSize=$$ops
        </connection>
        <query connection-id="in">
          <!--Empty query means select all -->
          <script connection-id="out">
            INSERT INTO users (firstname,lastname) VALUES (?fname,?lname);
          </script>
        </query>
      </etl>

  mariadb_init.sql:
    # Just a simple SQL script to create the table.
    # You will need to adjust that to your data model(s).
    content: |
      CREATE TABLE IF NOT EXISTS users (
        id INT AUTO_INCREMENT PRIMARY KEY,
        firstname VARCHAR(255) NOT NULL,
        lastname VARCHAR(255) NOT NULL
      );
volumes:
  shared:
    # This is a shared volume that is used by the cassandra and loader containers.
    # It is used to store the CSV file that is created by the cassandra export script.
    # This is only needed for this demo.
    # For actual usage, you would create the csv file on your cassandra cluster,
    # copy it to your local machine, and then bind-mount it to the loader container.
    # driver: local
    # driver_opts:
    #   o: "uid=1001"
services:
  volume-permissions:
    # This is a simple container that is used to set the permissions on the shared volume.
    # It is needed because the cassandra container runs as root and the loader container
    # runs as a non-root user.
    image: alpine:latest
    volumes:
      - type: volume
        source: shared
        target: /mnt/data
    entrypoint:
      - /bin/sh
      - -c
      - |
        echo "Setting permissions on shared volume..."
        chown -R 1001:1001 /mnt/data
  cassandra:
    # The cassandra container is only used in this demo.
    # For actual usage, you can remove this container and instead
    # bind-mount the CSV file you created on your cassandra cluster
    # into the loader container.
    image: bitnami/cassandra:5.0
    depends_on:
      # We want to ensure that the volume-permissions container has finished
      # before we start the cassandra container.
      volume-permissions:
        condition: service_completed_successfully
    configs:
      # Setup the keyspace and users table and force the export.
      - source: cassandra_keyspace.cql
        target: /docker-entrypoint-initdb.d/01_keyspace.cql
      - source: cassandra_users.cql
        target: /docker-entrypoint-initdb.d/02_users.cql
      - source: cassandra_export.cql
        target: /docker-entrypoint-initdb.d/03_export.cql
    volumes:
      # The shared volume is used to store the CSV file that is created by the cassandra export script.
      - type: volume
        source: shared
        target: /mnt/data
    healthcheck:
      # We want to ensure that the cassandra container is healthy before we start the loader container.
      test: ["CMD-SHELL", "nodetool status"]
      interval: 15s
      start_period: 2m
      timeout: 10s
      retries: 3
  db:
    # The db container is the MySQL database that we want to import the data into.
    # However, it is only temporary: we just want to load the data into it and then dump it to a file.
    # Note that we do not use a volume for this container.
    # This allows you to simply down the stack and start over with a clean database.
    image: mariadb:11-ubi
    environment:
      MARIADB_DATABASE: test
      MARIADB_USER: username
      MARIADB_PASSWORD: password
      MARIADB_ROOT_PASSWORD: t0Ps3cr3t
    configs:
      - source: mariadb_init.sql
        target: /docker-entrypoint-initdb.d/01_init.sql
    healthcheck:
      # We want to ensure that the db container is healthy before we start the loader container.
      test: ["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]
      start_period: 10s
      interval: 10s
      timeout: 5s
      retries: 3
  loader:
    image: mwmahlberg/scriptella:8-jre-1.2-rel0
    build:
      # Scriptella is a wee bit old and pretty much abandoned.
      # So we need to build our own image with the MySQL connector included.
      # This is a multi-stage build: we first download scriptella and the mysql connector
      # and then copy them into the final image.
      context: .
      dockerfile_inline: |
        FROM alpine:latest AS downloader
        ARG SCRIPTELLA_VERSION=1.2
        ARG MYSQL_CONNECTOR_VERSION=9.3.0
        WORKDIR /usr/local/src/
        RUN apk add curl unzip && \
            curl -LO https://github.com/scriptella/scriptella-etl/releases/download/scriptella-parent-$${SCRIPTELLA_VERSION}/scriptella-$${SCRIPTELLA_VERSION}.zip && \
            unzip scriptella-$${SCRIPTELLA_VERSION}.zip && \
            curl -qsL https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}.tar.gz \
            | tar -xzv --strip-components 1 -C scriptella-1.2/lib/ mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}/mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}.jar
        FROM eclipse-temurin:11-jre-alpine-3.21
        ARG SCRIPTELLA_VERSION=1.2
        ENV PATH="$PATH:/usr/local/lib/scriptella/bin/"
        COPY --from=downloader /usr/local/src/scriptella-$${SCRIPTELLA_VERSION} /usr/local/lib/scriptella
        WORKDIR /opt/export
        ENTRYPOINT [ "$$JAVA_HOME/bin/java","-jar","/usr/local/lib/scriptella/scriptella.jar"]
    depends_on:
      # We want to ensure that both our source and target
      # databases are healthy before we start the loader container.
      # For actual usage, you can remove the depends_on section for cassandra.
      cassandra:
        condition: service_healthy
      db:
        condition: service_healthy
    configs:
      - source: scriptella_etl.properties
        target: /opt/export/etl.properties
      - source: scriptella_etl.xml
        target: /opt/export/etl.xml
    volumes:
      # The shared volume is used to store the CSV file that is created by the cassandra export script.
      # The loader container uses this file to import the data into the MySQL database.
      # This is only needed for this demo.
      # For actual usage, you would create the csv file on your cassandra cluster,
      # copy it to your local machine, and then bind-mount it here into the loader container.
      # eg:
      # - type: bind
      #   source: path/on/host/to/your-csv-file.csv
      #   target: /mnt/data/export.csv
      - type: volume
        source: shared
        target: /mnt/data
    # The entrypoint is where the magic happens.
    # We wait for the cassandra export to become available and then run the scriptella ETL job.
    # The scriptella job will read the CSV file and insert the data into the MySQL database
    # line by line.
    entrypoint:
      - /bin/sh
      - -c
      - |
        until [ -f /mnt/data/export.csv ]
        do
          echo "Waiting for Cassandra to finish the export..."
          sleep 5
        done
        $$JAVA_HOME/bin/java -jar /usr/local/lib/scriptella/scriptella.jar
  dumper:
    # The final step is to dump the MySQL database to a file.
    image: alpine:latest
    depends_on:
      # We want to ensure that the db container is healthy...
      db:
        condition: service_healthy
      # ...and that the loader container has finished before we start the dumper container.
      loader:
        condition: service_completed_successfully
    volumes:
      # We bind-mount the local directory ./dump to /mnt/data in the container.
      # This is where the MySQL dump will be stored.
      - type: bind
        source: ./dump
        target: /mnt/data
    entrypoint:
      - /bin/sh
      - -c
      - |
        echo "Installing MySQL client..."
        apk add --no-cache mysql-client
        echo "Dumping MySQL data..."
        mysqldump --skip-ssl-verify-server-cert -h db -u username -ppassword test | bzip2 -v9 > /mnt/data/dump.sql.bz2
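
Once the stack has run through, you can sanity-check the result straight away; a sketch, using the paths from the compose file above:

    # peek at the generated dump
    bzcat ./dump/dump.sql.bz2 | head -n 40
    # restore it later into any MySQL/MariaDB instance
    bzcat ./dump/dump.sql.bz2 | mysql -h somehost -u someuser -p somedatabase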

Now that we have seen that the workflow works, we can move on to

The real life

also available as a gist

The docker-compose.yaml only needs slight adjustments.

configs:
  scriptella_etl.properties:
    # Scriptella ETL Configuration Properties
    # https://scriptella.org/reference/index.html#%3Cproperties%3E
    # Basically, we define variables that are used in the script.
    file: etl.properties
  scriptella_etl.xml:
    # Scriptella ETL Configuration XML
    # Our actual conversion script.
    # For each row in the CSV file, we insert a new row into the MySQL database.
    #
    # Note that the double dollar signs are only here to escape the dollar sign
    # within the docker-compose.yaml file.
    #
    # ALSO note that you will definitely want to adjust the
    # ops property in the property file to your needs.
    # More info on batching in scriptella:
    # http://scriptella.org/docs/api/scriptella/jdbc/package-summary.html#batching
    content: |
      <!DOCTYPE etl SYSTEM "http://scriptella.org/dtd/etl.dtd">
      <etl>
        <description>Import of arbitrary length CSV file into MySQL</description>
        <properties>
          <include href="etl.properties"/> <!--Load from external properties file-->
        </properties>
        <connection id="in" driver="csv" url="/mnt/data/export.csv">
          statement.fetchSize=$$ops
        </connection>
        <connection driver="$$driver" id="out" url="$$url" user="$$user"
                    password="$$password" classpath="$$classpath">
          statement.batchSize=$$ops
        </connection>
        <query connection-id="in">
          <!--Empty query means select all -->
          <script connection-id="out">
            INSERT INTO users (firstname,lastname) VALUES (?fname,?lname);
          </script>
        </query>
      </etl>

  mariadb_init.sql:
    # Just a simple SQL script to create the table.
    # You will need to adjust that to your data model(s).
    content: |
      CREATE TABLE IF NOT EXISTS users (
        id INT AUTO_INCREMENT PRIMARY KEY,
        firstname VARCHAR(255) NOT NULL,
        lastname VARCHAR(255) NOT NULL
      );
services:
  db:
    # The db container is the MySQL database that we want to import the data into.
    # However, it is only temporary: we just want to load the data into it and then dump it to a file.
    # Note that we do not use a volume for this container.
    # This allows you to simply down the stack and start over with a clean database.
    image: mariadb:11-ubi
    environment:
      MARIADB_DATABASE: test
      MARIADB_USER: username
      MARIADB_PASSWORD: password
      MARIADB_ROOT_PASSWORD: t0Ps3cr3t
    configs:
      - source: mariadb_init.sql
        target: /docker-entrypoint-initdb.d/01_init.sql
    healthcheck:
      # We want to ensure that the db container is healthy before we start the loader container.
      test: ["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]
      start_period: 10s
      interval: 10s
      timeout: 5s
      retries: 3
  loader:
    image: mwmahlberg/scriptella:8-jre-1.2-rel0
    build:
      # Scriptella is a wee bit old and pretty much abandoned.
      # So we need to build our own image with the MySQL connector included.
      # This is a multi-stage build: we first download scriptella and the mysql connector
      # and then copy them into the final image.
      context: .
      dockerfile_inline: |
        FROM alpine:latest AS downloader
        ARG SCRIPTELLA_VERSION=1.2
        ARG MYSQL_CONNECTOR_VERSION=9.3.0
        WORKDIR /usr/local/src/
        RUN apk add curl unzip && \
            curl -LO https://github.com/scriptella/scriptella-etl/releases/download/scriptella-parent-$${SCRIPTELLA_VERSION}/scriptella-$${SCRIPTELLA_VERSION}.zip && \
            unzip scriptella-$${SCRIPTELLA_VERSION}.zip && \
            curl -qsL https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}.tar.gz \
            | tar -xzv --strip-components 1 -C scriptella-1.2/lib/ mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}/mysql-connector-j-$${MYSQL_CONNECTOR_VERSION}.jar
        FROM eclipse-temurin:11-jre-alpine-3.21
        ARG SCRIPTELLA_VERSION=1.2
        ENV PATH="$PATH:/usr/local/lib/scriptella/bin/"
        COPY --from=downloader /usr/local/src/scriptella-$${SCRIPTELLA_VERSION} /usr/local/lib/scriptella
        WORKDIR /opt/export
        ENTRYPOINT [ "$$JAVA_HOME/bin/java","-jar","/usr/local/lib/scriptella/scriptella.jar"]
    depends_on:
      # We want to ensure that the db container is healthy.
      db:
        condition: service_healthy
    configs:
      - source: scriptella_etl.properties
        target: /opt/export/etl.properties
      - source: scriptella_etl.xml
        target: /opt/export/etl.xml
    volumes:
      - type: bind
        source: ./export.csv
        target: /mnt/data/export.csv
    # The entrypoint is where the magic happens.
    # We wait for the CSV export to become available and then run the scriptella ETL job.
    # The scriptella job will read the CSV file and insert the data into the MySQL database
    # line by line.
    entrypoint:
      - /bin/ash
      - -c
      - |
        until [ -f /mnt/data/export.csv ]
        do
          echo "Waiting for the export to become available..."
          sleep 5
        done
        $$JAVA_HOME/bin/java -jar /usr/local/lib/scriptella/scriptella.jar
  dumper:
    # The final step is to dump the MySQL database to a file.
    image: alpine:latest
    depends_on:
      # We want to ensure that the db container is healthy...
      db:
        condition: service_healthy
      # ...and that the loader container has finished before we start the dumper container.
      loader:
        condition: service_completed_successfully
    volumes:
      # We bind-mount the local directory ./dump to /mnt/data in the container.
      # This is where the MySQL dump will be stored.
      - type: bind
        source: ./dump
        target: /mnt/data
    entrypoint:
      - /bin/sh
      - -c
      - |
        echo "Installing MySQL client..."
        apk add --no-cache mysql-client
        echo "Dumping MySQL data..."
        mysqldump --skip-ssl-verify-server-cert -h db -u username -ppassword test | bzip2 -v9 > /mnt/data/dump.sql.bz2

The workflow goes like this:

  1. Adjust scriptella_etl.xml and mariadb_init.sql to fit your data.

  2. Create a CSV export of the data via COPY TO (see the sketch after this list)

  3. Copy or move it into the directory containing the docker-compose.yaml above, under the name export.csv

  4. Create a file named etl.properties and adjust it to your needs. Unless you want to use a preexisting MySQL/MariaDB server, basically you only have to play with ops.

    #Scriptella ETL Configuration Properties
    driver=mysql
    url=jdbc:mysql://db:3306/test
    user=username
    password=password
    classpath=/usr/local/lib/scriptella/lib/mysql-connector-j-9.3.0.jar
    ops=3
    
  5. Run docker-compose up

  6. Press Ctrl+C when the dumper container exits.

  7. You should find your bzipped MySQL dump in the dump directory.
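
The COPY TO in step 2 is a one-liner in cqlsh; a sketch, with keyspace, table, column names, and path standing in for your own schema:

    COPY your_keyspace.your_table (col1, col2, col3)
      TO '/path/to/export.csv' WITH HEADER = true;

You can run it interactively, or non-interactively via cqlsh -e "COPY ..." or cqlsh -f export.cql, and then copy the resulting file next to the docker-compose.yaml as export.csv.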

Multiple tables

In case you need to export multiple tables, simply repeat the process with an adjusted config. Alternatively, you can add more loaders and dumpers to the compose file, or let a single ETL file handle several tables, as sketched below.
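
A sketch of how the ETL file could drive two tables in one run (note: in a standalone etl.xml you use single dollar signs; the devices table and its CSV file are made-up examples you would replace with your own):

    <connection id="in-users" driver="csv" url="/mnt/data/users.csv"/>
    <connection id="in-devices" driver="csv" url="/mnt/data/devices.csv"/>
    <connection id="out" driver="$driver" url="$url" user="$user"
                password="$password" classpath="$classpath"/>

    <query connection-id="in-users">
      <script connection-id="out">
        INSERT INTO users (firstname,lastname) VALUES (?fname,?lname);
      </script>
    </query>
    <query connection-id="in-devices">
      <script connection-id="out">
        INSERT INTO devices (name,ip) VALUES (?name,?ip);
      </script>
    </query>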

answered May 19 at 15:06
