70 questions
Advice
0 votes · 1 reply · 36 views
What are the validation checks that are made in order to push data to the Reject folder in a Medallion architecture?
In my dataset, I noticed that the actual data type of a column differs from the expected data type. In this situation, should the data be type-cast during processing, or should such records be moved to ...
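A minimal sketch of one common pattern behind questions like this (the schema, column names, and reject routing below are illustrative, not from the question): check each record's column types against an expected schema and route failures to a reject bucket instead of silently casting.

```python
# Sketch: route records whose column types don't match an expected schema
# to a "reject" bucket instead of type-casting them in place.
expected_schema = {"order_id": int, "amount": float, "country": str}

def partition_records(records, schema):
    """Split records into (valid, rejected) based on exact type checks."""
    valid, rejected = [], []
    for rec in records:
        ok = all(isinstance(rec.get(col), typ) for col, typ in schema.items())
        (valid if ok else rejected).append(rec)
    return valid, rejected

records = [
    {"order_id": 1, "amount": 9.99, "country": "DE"},
    {"order_id": "2", "amount": 5.0, "country": "FR"},  # order_id is a str -> reject
]
valid, rejected = partition_records(records, expected_schema)
```

Whether a mismatched record is cast or rejected is a policy decision; keeping the check explicit like this makes the policy easy to change per column.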
Best practices
0 votes · 1 reply · 35 views
When should data go to Archive vs Reject in Bronze layer (Medallion Architecture)?
Can anybody help me understand the Archive and Reject folders in the Bronze layer of a Medallion architecture? Let's say I have 4 folders in Bronze, namely Raw, Stage, Archive and Reject. To what extent a ...
Best practices
0 votes · 0 replies · 61 views
Materialising tables for multiple end user profiles in Redshift
Imagine there's a reporting tool for which users might have the permission 'Admin' or 'User'.
We have a dimension in our models called admin_view, and if the value is true then only users with Admin ...
Advice
4 votes · 1 reply · 77 views
Parquet vs ORC in Iceberg
Hi, I have been interested lately in learning Iceberg. There is something I was not able to get, so I thought I would ask here.
I really want to know why Apache Parquet is the native file format used when ...
Advice
0 votes · 4 replies · 97 views
Ways to Improve Bulk-Insert Throughput in Azure SQL
I’m attempting high-volume bulk inserts into Azure SQL, but the performance is lower than expected. One known factor is the Max Log Rate (MiB/s) limit, which depends on the service tier (see Microsoft’...
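The Azure SQL specifics here (log-rate governance, service tiers) can't be shown in a snippet, but the batching idea behind most bulk-insert throughput advice can be sketched with the stdlib sqlite3 module (table and column names are made up): executemany submits one statement with many parameter sets, and committing per batch rather than per row keeps transaction overhead low.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

rows = [(i, f"payload-{i}") for i in range(10_000)]

# One executemany call per batch instead of one execute() per row,
# and a single commit per batch to amortize transaction overhead.
BATCH = 1_000
for start in range(0, len(rows), BATCH):
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows[start:start + BATCH])
    conn.commit()

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Against Azure SQL the same shape applies with a bulk-capable driver, but the log-rate cap still bounds throughput regardless of batching.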
Best practices
0 votes · 5 replies · 96 views
Pushing down filters in RDBMS with Java Spark
I have been working as a Data Engineer and ran into this issue.
I came across a use case where I have a view (let's name it inputView) which is created by reading data from some source.
Now somewhere ...
1 vote · 0 answers · 84 views
How to obtain BigQuery Dataform metadata for dependencies/dependents info?
Is there any solution to use Python to extract BigQuery Dataform metadata, or something else to get the dependencies/dependents of each action in a repository? The purpose is that I want to collect the ...
Best practices
0 votes · 0 replies · 36 views
How would one draw an ERD for this question?
How would this relational schema be drawn as an ERD? My attempt is shown above, though it is incorrect. I do not understand why. Here is the relational schema:
CREATE TABLE student
(
name TEXT,
...
-4 votes · 1 answer · 81 views
Programmatically modifying IBM DataStage job XML – changes not reflected after reimport [closed]
I’m trying to programmatically add a new database stage in parallel to an existing DataStage job by modifying its exported XML. I export the job from DataStage Designer, modify the XML via a Python ...
1 vote · 1 answer · 86 views
Reconfigure a Pandas Dataframe [duplicate]
Our old ERP system generates orphaned HTML reports with the following format, which I import into Pandas:
   Work Order  Item Type  Material  Labor
0      552603     Budget     71119   4567
1      552603  ...
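The target layout isn't visible in the truncated excerpt, but for data of this shape the usual "reconfigure" moves are melt (wide cost columns to long form) or pivot_table (one row per work order, item types as columns). A sketch with hypothetical values mirroring the columns above:

```python
import pandas as pd

# Hypothetical data mirroring the shape shown in the excerpt
df = pd.DataFrame({
    "Work Order": [552603, 552603],
    "Item Type": ["Budget", "Actual"],
    "Material": [71119, 68000],
    "Labor": [4567, 4800],
})

# Long form: one row per (work order, item type, cost category)
long = df.melt(id_vars=["Work Order", "Item Type"],
               value_vars=["Material", "Labor"],
               var_name="Cost Type", value_name="Amount")

# Wide form: one row per work order, item types spread into columns
wide = df.pivot_table(index="Work Order", columns="Item Type",
                      values=["Material", "Labor"])
```

Which of the two fits depends on the report layout the asker is reproducing.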
0 votes · 0 answers · 65 views
How to automate manual data downloading in Python/R?
I work with clinical data at a company that, until I arrived, didn't have a data policy. Currently, raw data extraction relies solely on manually downloading CSV/Excel files from an internal portal ...
0 votes · 0 answers · 89 views
Airflow ModuleNotFoundError: No module named 'pyarrow'
I'm trying Apache Airflow for the first time and built a simple ETL. But after loading the data and proceeding to the transform phase, it throws an error saying pyarrow was not found. I'm ...
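The usual cause of this error is that pyarrow was installed into one Python environment while Airflow's workers run a different one. A small stdlib check (the only assumption is the module name being probed) makes this visible: print which interpreter is running and test whether the module is importable there.

```python
import importlib.util
import sys

def module_available(name: str) -> bool:
    """Return True if `name` can be imported in *this* interpreter."""
    return importlib.util.find_spec(name) is not None

# Airflow workers often use a different venv than the shell where
# `pip install pyarrow` was run, so log which interpreter this is.
print(sys.executable)

ok = module_available("json")                 # stdlib module: always importable
missing = module_available("no_such_module_xyz")
```

If `module_available("pyarrow")` is False inside the task, install pyarrow into that environment (e.g. into the worker image or the requirements file the workers use).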
0 votes · 0 answers · 315 views
dbt snapshots with `check_cols: all` still fail when adding new columns in v1.9.8 (Regression)
Just wanted to flag a frustrating issue I've run into with dbt snapshots that seems to be a regression. Any ideas for workarounds?
TL;DR: If you use the check strategy with check_cols: all,...
-1 votes · 1 answer · 112 views
How to Use Filter Activity Output as a Source in Copy Activity in Azure Data Factory Pipeline
I'm fairly new to Azure Data Factory and need help with a pipeline I'm building. My goal is to read data from a CSV file stored in an Amazon S3 bucket, filter out records where the Status column is '...
0 votes · 1 answer · 274 views
dbt run with compiled models without re-compilation
In my current dbt project, each time a dbt model is run, a new container is created, which runs the command dbt run --select <model name>. So, each time it runs, the whole dbt project needs to ...