70 questions
Advice
0 votes · 1 reply · 36 views
What are the validation checks that are made in order to push data to the Reject folder in a Medallion architecture?
In my dataset, I noticed that the actual data type of a column differs from the expected data type. In this situation, should the data be type-cast during processing, or should such records be moved to ...
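A minimal sketch of one common pattern behind questions like this (the schema, column names, and reject routing below are illustrative, not from the question): check each record's column types against an expected schema and route failures to a reject bucket instead of silently casting.

```python
# Sketch: route records whose column types don't match an expected schema
# to a "reject" bucket instead of type-casting them in place.
expected_schema = {"order_id": int, "amount": float, "country": str}

def partition_records(records, schema):
    """Split records into (valid, rejected) based on exact type checks."""
    valid, rejected = [], []
    for rec in records:
        ok = all(isinstance(rec.get(col), typ) for col, typ in schema.items())
        (valid if ok else rejected).append(rec)
    return valid, rejected

records = [
    {"order_id": 1, "amount": 9.99, "country": "DE"},
    {"order_id": "2", "amount": 5.0, "country": "FR"},  # order_id is a str -> reject
]
valid, rejected = partition_records(records, expected_schema)
```

Whether a mismatched record is cast or rejected is a policy decision; keeping the check explicit like this makes the policy easy to change per column.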
Best practices
0 votes · 1 reply · 35 views
When should data go to Archive vs Reject in Bronze layer (Medallion Architecture)?
Can anybody help me understand the Archive and Reject folders in the Bronze layer of a Medallion architecture? Let's say I have 4 folders in Bronze, namely Raw, Stage, Archive and Reject. To what extent a ...
Best practices
0 votes · 0 replies · 61 views
Materialising tables for multiple end user profiles in Redshift
Imagine there's a reporting tool for which users might have the permission 'Admin' or 'User'.
We have a dimension in our models called admin_view, and if the value is true then only users with Admin ...
Advice
4 votes · 1 reply · 77 views
Parquet vs ORC in Iceberg
Hi, I have been interested lately in learning Iceberg. There is something I was not able to get, so I thought I would ask here.
I really want to know why Apache Parquet is the native file format used when ...
Advice
0 votes · 4 replies · 97 views
Ways to Improve Bulk-Insert Throughput in Azure SQL
I’m attempting high-volume bulk inserts into Azure SQL, but the performance is lower than expected. One known factor is the Max Log Rate (MiB/s) limit, which depends on the service tier (see Microsoft’...
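The Azure SQL specifics here (log-rate governance, service tiers) can't be shown in a snippet, but the batching idea behind most bulk-insert throughput advice can be sketched with the stdlib sqlite3 module (table and column names are made up): executemany submits one statement with many parameter sets, and committing per batch rather than per row keeps transaction overhead low.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

rows = [(i, f"payload-{i}") for i in range(10_000)]

# One executemany call per batch instead of one execute() per row,
# and a single commit per batch to amortize transaction overhead.
BATCH = 1_000
for start in range(0, len(rows), BATCH):
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows[start:start + BATCH])
    conn.commit()

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Against Azure SQL the same shape applies with a bulk-capable driver, but the log-rate cap still bounds throughput regardless of batching.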
Best practices
0 votes · 5 replies · 96 views
Pushing down filters in RDBMS with Java Spark
I have been working as a Data Engineer and ran into this issue.
I came across a use case where I have a view (let's name it inputView) which is created by reading data from some source.
Now somewhere ...
1 vote · 0 answers · 84 views
How to obtain BigQuery Dataform metadata for dependencies/dependents info?
Is there any solution to use Python to extract BigQuery Dataform metadata, or something else to get the dependencies/dependents of each action in a repository? The purpose is that I want to collect the ...
Best practices
0 votes · 0 replies · 36 views
How would one draw an ERD for this question?
How would this relational schema be drawn as an ERD? My attempt is shown above, though it is incorrect. I do not understand why. Here is the relational schema:
CREATE TABLE student
(
name TEXT,
...
-4 votes · 1 answer · 81 views
Programmatically modifying IBM DataStage job XML – changes not reflected after reimport [closed]
I’m trying to programmatically add a new database stage in parallel to an existing DataStage job by modifying its exported XML. I export the job from DataStage Designer, modify the XML via a Python ...
1 vote · 1 answer · 86 views
Reconfigure a Pandas Dataframe [duplicate]
Our old ERP system generates orphaned HTML reports with the following format, which I import into Pandas:
   Work Order  Item Type  Material  Labor
0      552603     Budget     71119   4567
1      552603  ...
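The target layout isn't visible in the truncated excerpt, but for data of this shape the usual "reconfigure" moves are melt (wide cost columns to long form) or pivot_table (one row per work order, item types as columns). A sketch with hypothetical values mirroring the columns above:

```python
import pandas as pd

# Hypothetical data mirroring the shape shown in the excerpt
df = pd.DataFrame({
    "Work Order": [552603, 552603],
    "Item Type": ["Budget", "Actual"],
    "Material": [71119, 68000],
    "Labor": [4567, 4800],
})

# Long form: one row per (work order, item type, cost category)
long = df.melt(id_vars=["Work Order", "Item Type"],
               value_vars=["Material", "Labor"],
               var_name="Cost Type", value_name="Amount")

# Wide form: one row per work order, item types spread into columns
wide = df.pivot_table(index="Work Order", columns="Item Type",
                      values=["Material", "Labor"])
```

Which of the two fits depends on the report layout the asker is reproducing.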
0 votes · 0 answers · 65 views
How to automate manual data downloading in Python/R?
I work with clinical data at a company that, until I arrived, didn't have a data policy. Currently, raw data extraction relies solely on manually downloading CSV/Excel files from an internal portal ...
0 votes · 0 answers · 89 views
Airflow ModuleNotFoundError: No module named 'pyarrow'
I'm trying Apache Airflow for the first time and built a simple ETL. But after loading the data and proceeding to the transform phase, it throws an error saying pyarrow was not found. I'm ...
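The usual cause of this error is that pyarrow was installed into one Python environment while Airflow's workers run a different one. A small stdlib check (the only assumption is the module name being probed) makes this visible: print which interpreter is running and test whether the module is importable there.

```python
import importlib.util
import sys

def module_available(name: str) -> bool:
    """Return True if `name` can be imported in *this* interpreter."""
    return importlib.util.find_spec(name) is not None

# Airflow workers often use a different venv than the shell where
# `pip install pyarrow` was run, so log which interpreter this is.
print(sys.executable)

ok = module_available("json")                 # stdlib module: always importable
missing = module_available("no_such_module_xyz")
```

If `module_available("pyarrow")` is False inside the task, install pyarrow into that environment (e.g. into the worker image or the requirements file the workers use).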
0 votes · 0 answers · 315 views
dbt snapshots with `check_cols: all` still fail when adding new columns in v1.9.8 (Regression)
Just wanted to flag a frustrating issue I've run into with dbt snapshots that seems to be a regression. Any ideas for workarounds?
TL;DR: If you use the check strategy with check_cols: all,...
-1 votes · 1 answer · 112 views
How to Use Filter Activity Output as a Source in Copy Activity in Azure Data Factory Pipeline
I'm fairly new to Azure Data Factory and need help with a pipeline I'm building. My goal is to read data from a CSV file stored in an Amazon S3 bucket, filter out records where the Status column is '...
0 votes · 1 answer · 274 views
dbt run with compiled models without re-compilation
In my current dbt project, each time a dbt model is run, a new container is created, which runs the command dbt run --select <model name>. So, each time it runs, the whole dbt project needs to ...