
I have a CSV file containing the names of tables in a SQL Server instance running on Ubuntu, as shown below.

Tables
TableName1
TableName2
TableName3
.
.
.

I want to read this CSV file, fetch the data of each table, and store it as ${table_name}.csv

How can I achieve this using Pentaho? I tried one approach, but I want to know if there are any built-in methods or a more efficient way of doing it. I'm new to Pentaho, so any advice is appreciated.

These are the details of the job I already tried (job screenshot):

  • The first Set Variables step initializes the variable used for the loop counter.
  • The csv reader job entry runs a bash script that reads the CSV file, counts the total number of lines, and stores the current table name and the line count as variables in a config.properties file:
#!/bin/bash
# Paths to the input CSV and to the properties file consumed by the later steps
csv_file="/home/ubuntuv2204/taskDir/tables.csv"
property_file="/home/ubuntuv2204/taskDir/dwconfig.properties"
# Total number of lines in the CSV file (the header line is included in this count)
total_rows=$(wc -l < "$csv_file")
# Read line ${NEW_LOOP} of the CSV file (NEW_LOOP is the loop counter variable,
# substituted by the job before the script runs) and take the first column as the table name
table_name=$(sed '${NEW_LOOP}q;d' "$csv_file" | cut -d ',' -f 1)
# Check that the table name is not empty
if [ -n "$table_name" ]; then
 # Print the table name
 echo "Table Name: $table_name"
else
 echo "Table name is empty or CSV file is not formatted correctly."
fi
# Write the current table name and the total row count (used as loop_break) to the properties file
echo "table_name=$table_name" > "$property_file"
echo "loop_break=$total_rows" >> "$property_file"
  • The next step is the loop transformation, which increments the loop value on every iteration.
  • The set dw transformation reads the config.properties file and sets variables for the table name and the total number of lines.
  • rw_ktr has a Table Input step that reads the table and writes the result with a text file output step (a sketch of its query follows this list).
  • A Simple Evaluation step checks whether the loop value equals the total number of lines in the CSV; when it does, the job ends.
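
For reference, here is a minimal sketch of the query the Table Input step in rw_ktr presumably uses (an assumption: "Replace variables in script?" is enabled so that ${table_name}, read from dwconfig.properties, is substituted at run time):

SELECT *
FROM ${table_name};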

This is working as per my requirement; however, I don't think it's that good and I need a more efficient solution.

asked May 30, 2024 at 12:39
  • Please elaborate on: "I don't think it's that good and I need a more efficient solution". What do you want improved? Why do you need to export the tables' data into CSV files? Commented May 30, 2024 at 16:13
  • My version of the job is complex; I don't think I need this many steps. The reason I'm exporting to CSV is that if anyone wants the table data for a certain time period, say the month of March, I can take their requirement as a CSV file and get the data out of the tables. I don't want to give read permission to the tables on the server. I'm planning to implement this for multiple projects too, so I can't run a job that takes a long time; I want to simplify this as much as possible. If this works, I can later change the CSV output to an S3 output or anything else that matches the requirement. Commented May 31, 2024 at 7:19
  • If you do not want to give people read access to your database, why not create a database copy (i.e. a reporting database) and give users access to that? CSV export will not work for any non-trivial amount of data. Consuming this data is also more challenging than connecting to a DB. Commented May 31, 2024 at 8:33
  • That is already implemented; there is a separate instance for BI. This data needs to be shared with certain people once a month for reference, and the same will happen for other projects, which is why I thought of doing this. If I create an ETL that works for all of them, I can run it, say, once a month and get the result. Most tables are similar across projects, and CSV is just for testing. All I want is to read an input list of multiple tables, loop through them, and perform a certain step (this can always be changed). Commented May 31, 2024 at 9:02
  • I am not familiar with Pentaho. In SQL Server you can use SSIS to do what you want. It is considered very efficient. Whether it will be any better than your current solution I don't know. Commented Jun 1, 2024 at 4:07

2 Answers


It is definitely possible to achieve what you want, but I would have to know the exact steps to give you a 100% answer. What I can do is point you to most of them.

Use variables and the "Execute for every row" option on the job execution. You can use variables from almost anywhere inside the transformations, even to change database connections, instances, users and passwords. You would use those variables in a loop job, along the lines of "for each database connection, do this". I have solved this exact case with this approach. If you have trouble with the "Execute for every input row" part, I have this link as well to help.
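
As a sketch of the row-producing side (an assumption on my part, since your list currently comes from a CSV): if the table list can be taken from SQL Server itself, a Table Input step with a query like the one below, followed by a Copy rows to result step, gives the "Execute for every input row" option one row per table to loop over.

SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_NAME;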

answered Jun 4, 2024 at 20:29

1 Comment

Thanks. I have a doubt: if I'm using a CSV or an XML file with table names, I can use Text file input -> Copy rows to result -> Set variables -> Table input. The problem I face is how to set the variable for each row using the loop (i.e. how to read the first row in the first execution, and so on). If you could help me with a sample transformation step, that would be great.

I am assuming you are trying to use this to make some kind of export of your database?

Have a job call a single transformation. In that transformation, have a Data Grid step which maintains all the table names (this can be replaced with an SQL input which retrieves all table names directly from the database if you want to process a full database). Then, inside the same transformation, have a Transformation Executor with the table name as an input variable (for example ${table}). Set the Transformation Executor to "for each row" so it loops through all the tables.

Inside the transformation run by the Transformation Executor, retrieve the columns of the table from the database system (how to get them depends on the DB system used). Use the column names to retrieve the data, for example by building a select statement that uses an input variable field created by concatenating all columns of the table with a separator in between.

(For example, in PostgreSQL: "select ? as result from ${TABLE}", where you beforehand prepare ? = "name||','||age||','||gender" in the previous step.)
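
Since the question targets SQL Server, here is a comparable, hedged sketch for building that concatenation expression there (assumptions: SQL Server 2017 or later so STRING_AGG is available, and each column wrapped in ISNULL(CAST(... AS NVARCHAR(MAX)), '') so a NULL does not blank out the whole row):

-- Build the "col1 + ',' + col2 + ..." expression for one table
SELECT STRING_AGG(
         'ISNULL(CAST(' + QUOTENAME(COLUMN_NAME) + ' AS NVARCHAR(MAX)), '''')',
         ' + '','' + '
       ) WITHIN GROUP (ORDER BY ORDINAL_POSITION) AS concat_expr
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TableName1';  -- or the ${TABLE} variable inside the transformation
-- For very wide tables, cast the aggregated pieces to NVARCHAR(MAX) to avoid
-- STRING_AGG's 4000-character limit on non-MAX inputs.
-- The generated expression is then used in the data query:
-- SELECT <concat_expr> AS result FROM TableName1;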

Afterwards, write the data, which is now in a single column "result", to the file ${table}.csv.

This should loop over all tables, retrieve the column names of each table, load the data from that table, write it to a table-specific file, and move on to the next table.

Tried it on a local system, and it runs at about 20k records per second for a 10-column table (mid-range hardware).

An alternative (and in my view better) solution requires a bit more knowledge of Pentaho. It would mean writing the whole process as a metadata injection of a generalized transformation that loads a database table and stores it as a CSV. But that requires more explanation than can quickly be given on Stack Overflow. More info here: Pentaho documentation - metadata injection

answered Jun 27, 2024 at 10:26

