I have a CSV file containing the names of tables in a SQL Server instance running on Ubuntu, as shown below.
Tables
TableName1
TableName2
TableName3
.
.
.
I want to read this CSV file, fetch each table's data, and store it as ${table_name}.csv.
How can I achieve this using Pentaho? I tried one approach, but I want to know whether there are built-in steps or a more efficient way of doing it. I'm new to Pentaho, so any advice is appreciated.
These are the details of the job I already tried (see job screenshot):
- The first Set variables step initializes a variable for the loop.
- The csv reader job entry runs a bash script (below) that reads the CSV file, counts the total number of lines, and stores the values in a config.properties file.
#!/bin/bash
# CSV file path and target properties file
csv_file="/home/ubuntuv2204/taskDir/tables.csv"
property_file="/home/ubuntuv2204/taskDir/dwconfig.properties"
# Get the total number of lines in the CSV file (header included)
total_rows=$(wc -l < "$csv_file")
# Read line ${NEW_LOOP} of the CSV file (the job's loop counter, substituted by Pentaho)
# and take the first column as the table name
table_name=$(sed '${NEW_LOOP}q;d' "$csv_file" | cut -d ',' -f 1)
# Check that the table name is not empty
if [ -n "$table_name" ]; then
    # Print the table name
    echo "Table Name: $table_name"
else
    echo "Table name is empty or CSV file is not formatted correctly."
fi
# Write the table name and the total number of rows (used as loop_break) to the properties file
echo "table_name=$table_name" > "$property_file"
echo "loop_break=$total_rows" >> "$property_file"
- The next step is the loop transformation, which increments the loop value on each iteration.
- The set dw transformation reads the config.properties file and sets variables for table_name and the total number of lines.
- rw_ktr has a Table input step that reads the table and writes it with a Text file output step.
- A Simple evaluation step checks whether the loop value equals the total number of lines in the CSV; if so, the job ends.
This works as per my requirement, but I don't think it's a good design and I'm looking for a more efficient solution.
-
Please elaborate on: "I don't think it's that much good and I need an efficient solution". What do you want improved? Why do you need to export the tables' data into CSV files? – Alex, May 30, 2024
-
My version of the job is complex; I don't think I need this many steps. The reason I'm exporting to CSV is that if anyone wants the table data for a certain time period, say the month of March, I can take their requirement in a CSV file and pull the data out of the table. I don't want to grant read permission on the tables in the server. I'm planning to implement this for multiple projects too, so I can't run a job that takes a long time; I want to simplify it as much as possible. Later I can change the CSV output to S3 output or whatever matches the requirement. – Aravind S, May 31, 2024
-
If you do not want to give people read access to your database, why not create a database copy (i.e. a reporting database) and give users access to that? CSV export will not work for any non-trivial amount of data, and consuming this data is also more challenging than connecting to a DB. – Alex, May 31, 2024
-
That is already implemented; there is a separate instance for BI. This data needs to be shared monthly with certain people for reference. The same will happen for other projects, which is why I thought of doing this: if I create an ETL that works for all of them, I can run it, say, monthly and get the result. Most tables are similar across projects, and CSV is just for testing. All I want is to read a list of multiple tables, loop through them, and perform certain steps (these can always be changed). – Aravind S, May 31, 2024
-
I am not familiar with Pentaho. In SQL Server you can use SSIS to do what you want; it is considered very efficient. Whether it will be any better than your current solution I don't know. – Alex, Jun 1, 2024
2 Answers
It is definitely possible to achieve what you want, but I would need to know your exact steps to give you a complete answer. What I can do is point you to most of them.
Use variables together with the Execute for every row option on job execution. You can use a whole host of variables from anywhere inside the transformations, even to change database connections, instances, users and passwords. You would use those variables in a loop job, kind of like "for each database connection, do this". I have solved this exact case with that approach. If you have trouble with the "Execute for every input row" part, I have this link as well to help.
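For illustration, a minimal sketch of the query inside the per-row transformation's Table input step, assuming the table name is passed in a variable called ${TABLE_NAME} and "Replace variables in script?" is ticked:
-- ${TABLE_NAME} is replaced by Pentaho with the current row's value before the query is sent to SQL Server
SELECT *
FROM ${TABLE_NAME};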
I am assuming you are trying to use this to make some kind of export of your database?
Have a job call a single transformation. In that transformation, use a Data grid step that maintains all table names (this can be replaced with a Table input step which retrieves all table names directly from the database if you want to process a full database; see the sketch below). Then, inside the same transformation, add a Transformation executor that receives the table name as an input variable (for example ${table}). Set the Transformation executor to "Execute for every input row" so it loops through all the tables.
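As a sketch of that replacement, assuming SQL Server (as in the question), a Table input query listing all base tables could look like this:
-- List all user tables from SQL Server's INFORMATION_SCHEMA catalog view
SELECT TABLE_SCHEMA + '.' + TABLE_NAME AS table_name
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE';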
Inside the transformation run by the Transformation executor, retrieve the columns of the table from the database system (how to get them depends on the DB system used). Use the column names to retrieve the data, for example by building a SELECT statement with an input-variable field that is created by concatenating all the columns of the table with a separator in between.
(For example, in PostgreSQL: "select ? as result from ${TABLE}", where you have prepared ? = "name||','||age||','||gender" in the previous step.)
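A rough SQL Server equivalent of that PostgreSQL example, since the question targets SQL Server (a sketch only; the column names are placeholders, and CONCAT_WS requires SQL Server 2017 or later):
-- Concatenate the previously retrieved columns into a single "result" field
SELECT CONCAT_WS(',', name, age, gender) AS result
FROM ${TABLE};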
Afterwards, write the data, which is now in a single column "result", to the file ${table}.csv.
This loops over all tables: for each table it retrieves the column names, loads the data, writes it to that table's file, and moves on to the next table.
I tried it on a local system, and it runs at about 20k records per second for a 10-column table (mid-range hardware).
An alternative (and in my view better) solution requires a bit more knowledge of Pentaho: write the whole process as a metadata injection into a generalized transformation that loads a database table and stores it as a CSV. But that requires more explanation than can quickly be given on Stack Overflow. More info here: Pentaho documentation - metadata injection