7,945 questions
-6
votes
0
answers
37
views
Is Apache Spark Structured Streaming suitable for dynamic segmentation with 5000+ segments and CDC? [closed]
I’m evaluating Apache Spark Structured Streaming for a large-scale dynamic contact segmentation use case and would like guidance on feasibility and recommended design patterns.
Scenario
We have:
~35 ...
Advice
0
votes
0
replies
36
views
Modular production for "see-through" factories and data center shells
I’ve been thinking about modular production for "transparent factories" and data center outer frames/shells. Do you think this approach actually improves data center construction, and how much ...
Best practices
0
votes
5
replies
97
views
Pushing down filters in RDBMS with Java Spark
I work as a Data Engineer and ran into this issue.
I came across a use case where I have a view (let's call it inputView) that is created by reading data from some source.
Now somewhere ...
6
votes
0
answers
129
views
partially decode, stream and filter big data with tensorflow_datasets (tfds)
I have two issues (note that this code was generated in Google Colab):
Issue 1: I want to stream the droid dataset, which is almost 2 TB in size. I only want to use data that matches my filter conditions. ...
1
vote
3
answers
205
views
Pandas DataFrame with a hundred million entries and counting the number of identical characters in strings
I have a pandas DataFrame (df) with two columns (namely Tuple and Set) and approximately 100,000,000 entries. The Tuple column data is a string of exactly 9 characters. The Set column data is an ...
1
vote
1
answer
51
views
DataprocSparkSession package in python error - "RuntimeError: Error while creating Dataproc Session"
I am using the code below to create a Dataproc Spark session to run a job:
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session = Session(...
0
votes
0
answers
85
views
How to merge small parquet files in Hudi into larger files
I use Spark + Hudi to write data to S3. I was writing data in bulk_insert mode, which left many small parquet files in the Hudi table.
Then I tried to schedule clustering on the Hudi table:
...
0
votes
1
answer
54
views
Power BI connection to DorisDB fails with "Character set 'utf8mb3' is not supported by .NET Framework"
I am trying to connect Power BI Desktop to our Apache Doris database (which is the VeloDB-Doris distribution). I am using the standard MySQL data source connector in Power BI, as Doris is compatible ...
1
vote
1
answer
83
views
How to break down a column which contains several different features, so that a new column is built for each feature
I want to break down a column which contains several different features, so that a new column is built for each feature, also taking as column name the feature name. I already tried with:
data = {'...
0
votes
1
answer
79
views
Geowave or S2 index for squares and rectangles
GeoWave, GeoMesa and S2 Geometry offer a Hilbert index that seems suitable for a quadrilateral grid, with a unique 64-bit cell_ID per cell, for all grid levels...
However, I don't see how to use ...
0
votes
1
answer
55
views
How can I immediately reclaim disk space after dropping a table (or quickly purge its tablets)?
I’m running an Apache Doris 2.1.7 cluster (3 FEs + 6 BEs) on CentOS 7.
After issuing DROP TABLE big_fact, the table disappears from information_schema, but the underlying tablets remain on every ...
0
votes
0
answers
18
views
"Error starting FE or unit test locally Cannot find external parser table action_table.dat"
I encountered an error while setting up and using Doris during unit testing:
Error starting FE or unit test locally Cannot find external parser table action_table.dat
I searched the community and ...
1
vote
1
answer
144
views
Apache Doris FE Cluster: "Clock delta: xxxx ms between Feeder: xxxx and this Replica exceeds max permitted delta: xxxx ms" causes BDB
I encountered an issue while running an Apache Doris FE cluster, where the fe.log file shows the following error:
2024-01-09 14:46:23,840 WARN (UNKNOWN fe_f78cf069_b094_4d9d_ac9c_ddc521dd494d(-1)|1) [...
0
votes
0
answers
57
views
Apache Doris query fails with error: [E-230]missed_versions is empty - How to diagnose and fix?
We are intermittently encountering a query failure on our Apache Doris cluster. The query fails completely with the following error message:
Query error: [E-230]missed_versions is empty
This error ...
0
votes
0
answers
73
views
Query error: "Failed to get scan range, no queryable replica found in tablet: xxxx"
While setting up and using Doris, I encountered a query error:
Failed to get scan range, no queryable replica found in tablet: xxxx
This error seems to be a scanning error for the ...