5,958 questions
Best practices
1
vote
2
replies
109
views
T-SQL ETL update evaluation recommendation: What is the most elegant & performant way to evaluate a source value to update a target value?
What recommendations might be offered for the most elegant and performant way in T-SQL to evaluate whether a source value should update a target value, as part of an ETL update process in which ...
1
vote
1
answer
63
views
AWS Glue PySpark job taking 4 hours to process small JSON files from S3
I have an AWS Glue job that processes thousands of small JSON files from S3 (historical data load for Adobe Experience Platform). The job is taking approximately 4 hours to complete, which is ...
Best practices
0
votes
0
replies
45
views
How can I set up an ETL process for DataFrames (dbt or alternative tools)?
Question: I'm currently working on a dashboard prototype and storing data from a 22-page PDF document as 22 separate DataFrames. These DataFrames should undergo an ETL process (especially data type ...
0
votes
1
answer
87
views
Why are my data quality validation rules not triggering for null values in my dataset?
I’m working on a data quality workflow where I validate incoming records for null or missing values.
Even when a column clearly contains nulls, my rule doesn’t trigger and the record passes validation....
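The excerpt does not show the rule or the tool in use, so the following is only a minimal sketch in plain Python of one common cause: a check that tests for `None` alone lets NaN values and blank strings pass. The helper names `is_missing` and `failed_columns` are hypothetical and not taken from the question.

```python
import math


def is_missing(value):
    """Return True for None, float NaN, and empty or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False


def failed_columns(record, required):
    """List the required columns whose value is missing in a dict record.

    A column that is absent from the record counts as missing too,
    because dict.get returns None for it.
    """
    return [col for col in required if is_missing(record.get(col))]


# Hypothetical sample record: "name" is NaN, "city" is blank, "zip" is absent.
record = {"id": 1, "name": float("nan"), "city": "  "}
problems = failed_columns(record, ["id", "name", "city", "zip"])
```

Whether blank strings should count as missing depends on the dataset, so that branch is an assumption to adjust rather than a rule.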
0
votes
1
answer
61
views
DataStage XML export modified via Python — new stage not appearing after re-import
I’m working with IBM InfoSphere DataStage 11.7.
I exported several jobs as XML files using istool export.
Then, using a Python script, I modified the XML to add another database stage in parallel to ...
1
vote
1
answer
69
views
Why aren’t my changes reflected after modifying and reimporting an IBM DataStage job XML export?
I’m trying to programmatically modify IBM DataStage jobs to add a new database connector stage in parallel to an existing Database stage.
Here’s my workflow:
Export a job from DataStage Designer as ...
2
votes
0
answers
100
views
Using Prefect with FastAPI is still displaying old logs
I tried using Prefect with a FastAPI project. After I updated the logs and redeployed the repo as well as the Prefect deployments and flows, it runs and displays the logs (basically, Prefect is still ...
-4
votes
1
answer
83
views
Programmatically modifying IBM DataStage job XML – changes not reflected after reimport [closed]
I’m trying to programmatically add a new database stage in parallel to an existing DataStage job by modifying its exported XML. I export the job from DataStage Designer, modify the XML via a Python ...
0
votes
0
answers
52
views
How to use data pre-computed in previous ETL SSIS Nodes?
I'm building ETL packages in SSIS. My data comes from an OLE DB Source that calls a stored procedure in SQL Server. I want to add a new Lookup (or a similar transformation) that uses some of the input ...
0
votes
0
answers
198
views
Unable to start worker on prefect - httpx.connecterror: all connection attempts failed
I have started prefect server on Remote Desktop using
prefect server start --host 0.0.0.0 --port 8080
After this I am able to access the UI from different computers present on this network. I create a ...
1
vote
2
answers
185
views
Power Query – Cancel last N positives based on N negatives
I have a table in Power Query like this:
PO - Purchase Order
SID - Ship ID
QTY - Quantity
PO    SID   QTY
1001  A001  2000
1001  A001  2000
1001  A001  -2000 (this line cancels the previous one)
1002  A002  3000
...
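The full rule is cut off in the excerpt, so this is only a sketch of one reading of the title, written in Python rather than Power Query M: within each PO and SID, every negative row cancels the most recent uncancelled positive row with the same absolute quantity, and both rows are dropped. The function name `cancel_positives` is hypothetical.

```python
from collections import defaultdict


def cancel_positives(rows):
    """Drop each negative row together with the latest matching positive row.

    rows is a list of (po, sid, qty) tuples in their original order.
    A negative row with no earlier matching positive is kept as-is.
    """
    keep = [True] * len(rows)
    # (po, sid, qty) -> indexes of positive rows not yet cancelled
    open_positives = defaultdict(list)
    for i, (po, sid, qty) in enumerate(rows):
        if qty > 0:
            open_positives[(po, sid, qty)].append(i)
        elif qty < 0:
            stack = open_positives[(po, sid, -qty)]
            if stack:
                keep[stack.pop()] = False  # cancel the most recent positive
                keep[i] = False            # and drop the negative row itself
    return [row for row, kept in zip(rows, keep) if kept]


# The sample rows from the question excerpt.
rows = [
    (1001, "A001", 2000),
    (1001, "A001", 2000),
    (1001, "A001", -2000),
    (1002, "A002", 3000),
]
result = cancel_positives(rows)
```

With the sample rows, the second 2000 row and the -2000 row are removed, leaving one row for PO 1001 and the row for PO 1002. If the negatives are meant to match by count only and not by quantity, the dictionary key would need to drop `qty`.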
0
votes
1
answer
148
views
Can't connect to Ollama hosted locally from python script
I am building an ETL pipeline that uses an LLM to extract some information.
I have Ollama installed locally. I am on a MacBook with an M4 Max.
I don't understand why I have this error from my worker.
ads-worker-1 | 2025-08-28 15:...
0
votes
0
answers
99
views
Apache Flink FileSink compaction extremely slow with many hot buckets/paths
I have a Flink ETL job that reads from ~13 Kafka topics and writes data into HDFS using a FileSink with compaction enabled.
Right now, we have around 40 different output paths (buckets), and roughly ...
0
votes
0
answers
56
views
Error loading data: 'Engine' object has no attribute 'cursor': chan="stdout": source="task"
I am trying to run a batch process using Apache Airflow. The Extract and Transform stages work fine, but the Load stage gives an error. Here is my code:
from airflow.decorators import dag, ...
0
votes
0
answers
92
views
Airflow ModuleNotFoundError: No module named 'pyarrow'
I'm trying Apache Airflow for the first time and built a simple ETL. But after loading the data and proceeding to the transform phase, it throws an error saying pyarrow was not found. I'm ...