BigData Ecosystem Architecture

Internal working of Big Data and its ecosystems, such as:

  • The background processes of resource allocation and database connections.
  • How data is distributed across the nodes.
  • The execution life-cycle when a job is submitted.

**Note: Refer to the links mentioned below under each ecosystem for a detailed explanation.**

1. HDFS 🐘

The various underlying processes that take place when a file is stored in HDFS, such as (a minimal sketch follows the list):

  • Type of scheduler

  • Block & Rack information

  • File size

  • File location

  • Replication information about the file (over-replicated blocks, under-replicated blocks, ...)

  • Health status of the file
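
A minimal Scala sketch of how this per-file metadata can be inspected through Hadoop's `FileSystem` API; the path `/data/sample.txt` is a hypothetical placeholder, and the cluster configuration (core-site.xml, hdfs-site.xml) is assumed to be on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsFileInfo {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // Hypothetical file; replace with a real HDFS path.
    val path   = new Path("/data/sample.txt")
    val status = fs.getFileStatus(path)

    println(s"File size   : ${status.getLen} bytes")
    println(s"Block size  : ${status.getBlockSize} bytes")
    println(s"Replication : ${status.getReplication}")

    // Each block lists the hosts (and hence racks) holding its replicas.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { b =>
      println(s"block offset=${b.getOffset} len=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }

    fs.close()
  }
}
```

Health status and over-/under-replicated blocks are typically reported by running `hdfs fsck <path>` against the same file.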

Please click on the link below to learn about the execution and flow process.

🔗 HDFS Architecture in Depth

2. SQOOP :octocat:

Sqoop is used to perform 2 main operations.

  • Sqoop Import:

    • To ingest data from sources such as traditional databases into the Hadoop file system (HDFS)
  • Sqoop Export:

    • To export data from the Hadoop file system (HDFS) to traditional databases

To support the above two operations, a CodeGen is used internally (a conceptual sketch of the import flow follows the list).

  • Sqoop CodeGen:

    • To compile metadata and other related information into a Java class file & create a Jar
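
As a rough illustration (not Sqoop's actual code), the Scala sketch below mimics what a single-mapper import boils down to: read rows over JDBC and write them as delimited text into HDFS. The connection URL, credentials, table `employees`, and output path are all hypothetical placeholders:

```scala
import java.sql.DriverManager
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SqoopStyleImport {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection details; Sqoop takes these via --connect / --table.
    // The JDBC driver (e.g. mysql-connector-java) must be on the classpath.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://dbhost:3306/sales", "user", "password")
    val rows = conn.createStatement().executeQuery("SELECT id, name FROM employees")

    val fs  = FileSystem.get(new Configuration())
    val out = fs.create(new Path("/user/hadoop/employees/part-00000"))

    // Sqoop runs this loop in parallel mappers; here it is one sequential pass.
    while (rows.next()) {
      out.write(s"${rows.getLong("id")},${rows.getString("name")}\n".getBytes("UTF-8"))
    }

    out.close(); fs.close(); conn.close()
  }
}
```

Real Sqoop splits this work across parallel mappers on a key column, and the class produced by CodeGen handles the per-row serialization instead of manual string formatting.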

Please click on the link below to learn about the execution and flow process.

🔗 SQOOP Architecture in Depth

3. HIVE 🐝

It mainly has 4 components (a client-side sketch follows the list):

  • Hadoop core components (HDFS, MapReduce)

  • Metastore

  • Driver

  • Hive Clients
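
A minimal sketch of the Hive Clients component: a JDBC client submitting a query to HiveServer2. The host, port, and table name are hypothetical, and the `hive-jdbc` driver is assumed to be on the classpath:

```scala
import java.sql.DriverManager

object HiveJdbcClient {
  def main(args: Array[String]): Unit = {
    // HiveServer2 JDBC driver, shipped in the hive-jdbc artifact.
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Hypothetical host; 10000 is HiveServer2's conventional default port.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hivehost:10000/default", "user", "")

    // The Driver parses and plans this query; the Metastore supplies the
    // table's schema; execution runs as MapReduce (or Tez/Spark) jobs over HDFS.
    val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM web_logs")
    while (rs.next()) println(s"row count = ${rs.getLong(1)}")

    conn.close()
  }
}
```

The query thus touches all four components: the client submits it, the Driver compiles it, the Metastore resolves the metadata, and the Hadoop core components execute it.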

Please click on the link below to learn about the execution and flow process.

🔗 HIVE Architecture in Depth

4. SPARK 💥

The various phases involved before and during the execution of a Spark job (a minimal sketch follows the list):

  • Spark Context

    • It is the heart of a Spark application.
  • Yarn Resource Manager, Application Master & launching of executors (containers).

  • Setting up environment variables, job resources.

  • CoarseGrainedExecutorBackend & Netty-based RPC.

  • SparkListeners.

    • LiveListenerBus
    • StatsReportListener
    • EventLoggingListener
  • Execution of a job

    • Logical Plan (Lineage)
    • Physical Plan (DAG)
  • Spark-WebUI.
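
A minimal Scala sketch tying these phases together, assuming a local master and a hypothetical input file; `toDebugString` prints the lineage (logical plan) that the DAGScheduler turns into stages:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkJobLifecycle {
  def main(args: Array[String]): Unit = {
    // Spark Context: the heart of the application. On YARN, creating it is
    // where the Application Master and executor containers get negotiated.
    val conf = new SparkConf().setAppName("lifecycle-demo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Transformations only build up lineage (the logical plan); nothing runs yet.
    val counts = sc.textFile("/tmp/input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The lineage graph the DAGScheduler will cut into stages (physical plan).
    println(counts.toDebugString)

    // An action triggers actual execution; inspect it on the Web UI (port 4040).
    counts.collect().foreach(println)

    sc.stop()
  }
}
```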

Please click on the link below to learn about the execution and flow process.

🔗 SPARK Architecture in Depth

4.1 SPARK Abstraction Layers & Internal Optimization Techniques Used 💥

It has 3 different variants (contrasted in the sketch below):

  • RDD (Resilient Distributed Datasets)

    • Lineage Graph
    • DAG Scheduler
  • DataFrames

    • Catalyst Optimizer
    • Tungsten Engine
    • Default source or Base relation
  • Datasets

    • Optimized Tungsten Engine - V2
    • Whole Stage Code Generation
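
The sketch below contrasts the three abstractions on the same data; operators prefixed with `*` in the `explain()` output were fused by whole-stage code generation into a single generated function:

```scala
import org.apache.spark.sql.SparkSession

object AbstractionLayers {
  // Datasets need a case class so Spark can derive an encoder for it.
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("abstractions-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("ana", 34), Person("bo", 19))

    // RDD: opaque objects; optimized only through the lineage/DAG scheduler.
    val rdd = spark.sparkContext.parallelize(people).filter(_.age > 21)

    // DataFrame: untyped rows; Catalyst optimizes the plan, Tungsten manages memory.
    val df = people.toDF().filter($"age" > 21)

    // Dataset: typed like an RDD, but still planned by Catalyst/Tungsten.
    val ds = people.toDS().filter(_.age > 21)

    df.explain() // '*' marks stages fused by whole-stage code generation
    println(s"rdd=${rdd.count()} ds=${ds.count()}")

    spark.stop()
  }
}
```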

5. HBASE 🐋
