Newest 'bigdata' Questions


7,945 questions

-6 votes

0 answers

39 views

Is Apache Spark Structured Streaming suitable for dynamic segmentation with 5000+ segments and CDC? [closed]

I’m evaluating Apache Spark Structured Streaming for a large-scale dynamic contact segmentation use case and would like guidance on feasibility and recommended design patterns. Scenario We have: ~35 ...

Sagardevd

asked 2 days ago

Advice

0 votes

0 replies

36 views

Modular production for "see-through" factories and data center shells

I’ve been thinking about modular production for "transparent factories" and data center outer frames/shells. Do you think this approach actually improves data center construction, and how much ...

Alexmily

asked Dec 26, 2025 at 8:48

Best practices

0 votes

5 replies

98 views

Pushing down filters in RDBMS with Java Spark

I work as a Data Engineer and ran into this issue. I came across a use case where I have a view (let's call it inputView) which is created by reading data from some source. Now somewhere ...

Parth Sarthi Roy

asked Nov 14, 2025 at 6:13

6 votes

0 answers

129 views

partially decode, stream and filter big data with tensorflow_datasets (tfds)

I have two issues (note that this code was generated in Google Colab): Issue 1 I want to stream the droid dataset, which is almost 2 TB in size. I want to use only the data that matches my filter conditions. ...

user31865617

asked Nov 12, 2025 at 23:43

1 vote

3 answers

205 views

Pandas DataFrame with a hundred million entries and counting the number of identical characters in strings

I have a pandas DataFrame (df) with two columns (namely Tuple and Set) and approximately 100,000,000 entries. The Tuple column data is a string of exactly 9 characters. The Set column data is an ...

Max Pierini

2,355

asked Oct 28, 2025 at 20:21

1 vote

1 answer

51 views

DataprocSparkSession package in python error - "RuntimeError: Error while creating Dataproc Session"

I am using the code below to create a Dataproc Spark session to run a job: from google.cloud.dataproc_spark_connect import DataprocSparkSession from google.cloud.dataproc_v1 import Session session = Session(...

Siddiq Syed

asked Oct 2, 2025 at 8:07

0 votes

0 answers

85 views

How to merge small parquet files in Hudi into larger files

I use Spark + Hudi to write data into S3. I was writing data in bulk_insert mode, which caused many small parquet files to accumulate in the Hudi table. Then I tried to schedule clustering on the Hudi table: ...

Rinze

asked Sep 17, 2025 at 10:35

0 votes

1 answer

54 views

Power BI connection to DorisDB fails with "Character set 'utf8mb3' is not supported by .NET Framework"

I am trying to connect Power BI Desktop to our Apache Doris database (which is the VeloDB-Doris distribution). I am using the standard MySQL data source connector in Power BI, as Doris is compatible ...

Michael

asked Jun 27, 2025 at 8:47

1 vote

1 answer

83 views

How to break down a column which contains several different features, so that a new column is built for each feature

I want to break down a column which contains several different features, so that a new column is built for each feature, also taking as column name the feature name. I already tried with: data = {'...

coridefe

asked Jun 12, 2025 at 13:25

0 votes

1 answer

79 views

Geowave or S2 index for squares and rectangles

GeoWave, GeoMesa and S2 Geometry offer a Hilbert index that seems suitable for a quadrilateral grid, with a unique 64-bit cell_ID per cell, for all grid levels... However, I don't see how to use ...

Peter Krauss

14.1k

asked Jun 10, 2025 at 12:34

0 votes

1 answer

55 views

How can I immediately reclaim disk space after dropping a table (or quickly purge its tablets)?

I’m running an Apache Doris 2.1.7 cluster (3 FEs + 6 BEs) on CentOS 7. After issuing DROP TABLE big_fact, the table disappears from the information_schema, but the underlying tablets remain on every ...

user8589466

asked Jun 10, 2025 at 10:42

0 votes

0 answers

18 views

"Error starting FE or unit test locally Cannot find external parser table action_table.dat"

I encountered an error while setting up and using Doris during unit testing: "Error starting FE or unit test locally Cannot find external parser table action_table.dat". I searched the community and ...

xyf

asked Jun 9, 2025 at 10:45

1 vote

1 answer

144 views

Apache Doris FE Cluster: "Clock delta: xxxx ms between Feeder: xxxx and this Replica exceeds max permitted delta: xxxx ms" causes BDB

I encountered an issue while running an Apache Doris FE cluster, where the fe.log file shows the following error: 2024-01-09 14:46:23,840 WARN (UNKNOWN fe_f78cf069_b094_4d9d_ac9c_ddc521dd494d(-1)|1) [...

user8589466

asked Jun 9, 2025 at 9:48

0 votes

0 answers

57 views

Apache Doris query fails with error: [E-230]missed_versions is empty - How to diagnose and fix?

We are intermittently encountering a query failure on our Apache Doris cluster. The query fails completely with the following error message: "Query error: [E-230]missed_versions is empty". This error ...

Michael

asked Jun 9, 2025 at 6:40

0 votes

0 answers

73 views

Query error: "Failed to get scan range, no queryable replica found in tablet: xxxx"

While setting up and using Doris, I encountered a query error: "Failed to get scan range, no queryable replica found in tablet: xxxx". This error seems to be a scanning error for the ...

xyf

asked Jun 4, 2025 at 10:50
