Introduction to Big Data Tools

Learn why big data processing exists, how distributed computing works, and why tools like Apache Spark handle the heavy lifting so you can focus on business logic.

Pulkit
15 min read
System Design · Big Data · Distributed Systems · Apache Spark · Data Engineering

Big data processing is a set of tools that help you process data in a distributed fashion. When one machine can't handle a huge amount of data, you divide and conquer.

That's it. That's big data processing.

TL;DR

  • Big data processing = divide and conquer when one machine isn't enough
  • Commodity hardware: Regular general-purpose machines, not specialized expensive hardware
  • The problem: Data too large to fit on one machine, or computation too slow on a single machine
  • The solution: Distribute the workload across multiple machines
  • Coordinator pattern: One node coordinates, others execute
  • Why tools exist: They handle failures, recovery, scaling, completion, and cleanup
  • You focus on: Business logic only
  • Start with: Apache Spark. It's the most essential big data tool

The Problem

Companies use big data tools to process massive amounts of data and extract insights. Train machine learning models. Move data across databases. Feed data into sales systems, advertisement platforms, and everything else.

Big data processing isn't just about sums, counts, mins, and maxes. It's about copying data to other places, transforming data, enriching data. All of this on commodity hardware.

Commodity hardware means normal general-purpose computation machines you get on any cloud. Not specialized hardware. Specialized hardware is expensive. Commodity hardware is everywhere.

Word Frequency: A Simple Example

Let's count word frequency. You have a 1 TB text dataset. For each word, you want to know how many times it occurred.

Approach 1: Simple

The simple approach:

  1. Load the entire 1 TB dataset on one machine
  2. Read character by character
  3. When you encounter a space, update an in-memory hash table
TYPESCRIPT
const wordFreq: Map<string, number> = new Map();

function processWord(word: string): void {
  const count = wordFreq.get(word) || 0;
  wordFreq.set(word, count + 1);
}

This runs in O(n) time. You traverse the entire file exactly once. Build the hash map. Done.
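The same single-machine approach, sketched in a few lines of plain Python (a toy illustration with made-up input, not tied to any particular library):

```python
from collections import Counter

def word_freq(text: str) -> Counter:
    # One pass over the text: split on whitespace, count each word.
    return Counter(text.split())

freq = word_freq("the quick brown fox jumps over the lazy dog the end")
print(freq["the"])  # 3
```

`Counter` is just a hash map with a convenient increment, exactly the in-memory table the steps above describe.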

But can you make it faster? Can you parallelize it?

Approach 2: Threads

You can easily parallelize this code. Each thread handles a chunk of the file.

TYPESCRIPT
// getWords(start, end) is an assumed helper that yields the words in
// one byte range of the file, aligned to word boundaries.
async function processChunk(
  start: number,
  end: number,
): Promise<Map<string, number>> {
  const localFreq: Map<string, number> = new Map();

  for (const word of getWords(start, end)) {
    localFreq.set(word, (localFreq.get(word) || 0) + 1);
  }

  return localFreq;
}

const chunkSize = Math.ceil(fileSize / threadCount);
const results = await Promise.all(
  Array.from({ length: threadCount }, (_, i) =>
    processChunk(i * chunkSize, (i + 1) * chunkSize),
  ),
);

// Sum the per-chunk maps into one final map.
const finalFreq = mergeResults(results);

If you have 10 threads, each handles 100 GB, and you get roughly a 10x speedup.
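Here is a minimal Python sketch of the same idea: split the input into chunks, count each chunk independently in a thread pool, then merge the partial maps. (Chunking by words rather than byte offsets keeps the toy example simple; all names here are illustrative.)

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(words):
    # Each thread builds its own local frequency map for one chunk.
    return Counter(words)

def parallel_word_freq(text, threads=4):
    words = text.split()
    size = max(1, len(words) // threads)
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_chunk, chunks))
    # Merge step: sum the per-chunk counters into one.
    total = Counter()
    for partial in partials:
        total += partial
    return total

print(parallel_word_freq("to be or not to be")["to"])  # 2
```

The merge step is the important part: each worker only knows its own chunk, so someone has to combine the partial results.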

But what if data is 100 TB? Something that doesn't fit on one machine?

Or what if the computation itself is slow, and you only have 2 CPUs? What will 100 threads do on 2 CPUs? Nothing useful.

But if you have 10 machines, each with 2 threads, you get 20 threads actually working in parallel. Much faster.

This is distributed computing.

Approach 3: Distributed Computing

Distributed computing splits data into partitions across multiple servers

The idea:

  1. Split the file into partitions (say 10 partitions of 100 GB each)
  2. Distribute partitions across multiple servers
  3. Each server computes word frequency independently
  4. Send results to one coordinator machine
  5. Coordinator merges results
  6. Return final result
TYPESCRIPT
interface Partition {
  id: number;
  data: Buffer;
}

interface Worker {
  id: string;
  process(partition: Partition): void;
}

interface WorkerResult {
  workerId: string;
  frequencies: Map<string, number>;
}

function distributeWork(
  partitions: Partition[],
  workers: Worker[],
): void {
  // Round-robin assignment: partition i goes to worker i mod N.
  partitions.forEach((partition, i) => {
    const worker = workers[i % workers.length];
    worker.process(partition);
  });
}

function mergeResults(
  results: WorkerResult[],
): Map<string, number> {
  const final: Map<string, number> = new Map();

  for (const result of results) {
    for (const [word, count] of result.frequencies) {
      const existing = final.get(word) || 0;
      final.set(word, existing + count);
    }
  }

  return final;
}

The Coordinator Pattern

User submits job to coordinator which distributes work to workers

The flow looks like this:

  1. You submit a job to the coordinator
  2. Coordinator splits the file into partitions
  3. Coordinator assigns partitions to workers
  4. Workers process their chunks
  5. Workers send results back to coordinator
  6. Coordinator merges everything
  7. Coordinator returns the final result

The coordinator orchestrates the entire execution across multiple machines, gathers results, and sends them back to you.
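A toy simulation of that flow in plain Python. Names like `coordinate` and `worker` are made up for illustration; a real coordinator runs workers on separate machines, but the shape of the logic is the same:

```python
from collections import Counter

def worker(partition):
    # A worker counts words in its own partition only.
    return Counter(partition)

def coordinate(words, num_workers):
    # Steps 1-3: split into partitions and assign one per worker.
    partitions = [words[i::num_workers] for i in range(num_workers)]
    # Steps 4-5: each worker processes its chunk and returns a partial result.
    partials = [worker(p) for p in partitions]
    # Steps 6-7: merge partial results into the final answer.
    final = Counter()
    for partial in partials:
        final += partial
    return final

result = coordinate("a b a c b a".split(), num_workers=3)
print(result["a"])  # 3
```

Everything the workers know is local; only the coordinator ever sees the whole answer.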

Why We Need Big Data Tools

Now think about what can go wrong:

| Challenge | Question |
| --- | --- |
| Failures | What if a worker crashes while processing its 100 GB partition? |
| Recovery | Who takes care of re-processing the failed chunk? |
| Completion | What if a worker partially computed but never sent a response? |
| Scaling | What if we need to add more workers? |
| Error Handling | Who reports what went wrong and where? |

Someone has to take ownership. The coordinator does.

But do we have to write this coordinator every time? Can't a good open-source tool handle this once and for all?

This is exactly what big data tools do.

Big data tools handle distribution, failures, recovery so you focus on business logic

You write the business logic. The tool handles:

  • Distribution across machines
  • Failure detection and recovery
  • Ensuring completion
  • Scaling up and down
  • Error reporting and logging
  • Data transformation
  • Cleanup after job completion

When the job is complete, the tool cleans up temporary files, prepares for the next job. You don't worry about any of it.

Apache Spark

Spark master node coordinates worker nodes for distributed processing

Spark is the most popular big data processing tool. It does large-scale data processing on commodity hardware.

One Spark node is the master. Everyone else is a worker. They do what you ask them to do.

Spark handles:

  • Distribution of work
  • Completion guarantees
  • Fault tolerance (if a worker goes down, reassign its chunk to another worker)
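A rough sketch of what "reassign on failure" means, with a simulated flaky worker. All names here are hypothetical; Spark does this internally through task retries, you never write this loop yourself:

```python
from collections import Counter

def run_with_retries(partitions, process, max_attempts=3):
    """Process every partition, retrying a failed attempt on a fresh try."""
    results = []
    for partition in partitions:
        for attempt in range(max_attempts):
            try:
                results.append(process(partition))
                break
            except RuntimeError:
                continue  # reassign: hand the partition out again
        else:
            raise RuntimeError(f"partition failed after {max_attempts} attempts")
    return results

calls = {"n": 0}

def flaky_count(partition):
    calls["n"] += 1
    if calls["n"] == 1:  # first attempt simulates a crashed worker
        raise RuntimeError("worker died")
    return Counter(partition)

partials = run_with_retries([["a", "b"], ["a"]], flaky_count)
total = sum(partials, Counter())
print(total["a"])  # 2
```

The point is ownership: no partition's result is lost just because one attempt died mid-way.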

Example: Data Warehouse ETL

Say you want to combine data from:

  • Users database (MySQL)
  • Orders database (PostgreSQL)
  • Payments database (MongoDB)
  • Logistics database (something else)

And put everything into a data warehouse like Amazon Redshift.

One machine reading from all these databases, merging, transforming, and writing? Very slow.

Doing it in a distributed way? Fast.

Spark has connectors for almost all popular databases. You don't have to know the dialect of each database. You use simple DataFrames:

PYTHON
# JDBC reads need a table (or query), not just the URL.
users_df = spark.read.format("jdbc").option("url", mysql_url).option("dbtable", "users").load()
orders_df = spark.read.format("jdbc").option("url", postgres_url).option("dbtable", "orders").load()
payments_df = spark.read.format("mongodb").load()

combined = users_df.join(orders_df, "user_id").join(payments_df, "order_id")

combined.write.format("jdbc").option("url", redshift_url).option("dbtable", "combined").save()

Spark ensures no row or document is missed. That's its job.

Example: Event Enrichment

You're getting user events into Kafka. A blog is published. You want to enrich those events with:

  • Who published the blog?
  • Is this a paid user?

Then send enriched events to Elasticsearch for visualization.

One machine making two database calls per event, enriching, and writing? Not scalable.

Multiple machines reading from Kafka, making calls, enriching, writing to Elasticsearch? That scales.

PYTHON
# Kafka values arrive as raw bytes and must be parsed into
# columns (e.g. with from_json) before the join below.
events = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "blog_published") \
    .load()

enriched = events.join(users_df, "user_id")
enriched = enriched.withColumn("is_paid", enriched["subscription"] == "paid")

# "es" is the Elasticsearch connector's format; streaming writes need a checkpoint.
enriched.writeStream.format("es") \
    .option("checkpointLocation", "/tmp/checkpoints") \
    .start()

Big data processing isn't just about aggregations. It's about any computation you want to do in a distributed fashion.

Other Big Data Tools

Spark is the most famous, but there are many others:

| Tool | Use Case |
| --- | --- |
| Apache Spark | General-purpose batch/stream |
| Apache Flink | Stream processing |
| Apache Kafka | Event streaming |
| Apache Airflow | Workflow orchestration |
| Airbyte | Data integration |
| Apache Hadoop | Distributed storage + compute |
| HDFS | Distributed file system |
| MapReduce | Batch processing (legacy) |
| Apache Pinot | Real-time analytics |
| Apache NiFi | Data flow automation |
| DuckDB | In-process analytics |
| Presto | Distributed SQL queries |

Each tool solves a specific niche. Spark is the essential one. Everything else is optional but useful depending on your use case.

The Core Concept

No matter which tool you use, the concept remains the same:

  1. One node can't do it → distribute the work
  2. Who handles distribution? → the tool
  3. Who handles completion? → the tool
  4. Who handles failures? → the tool
  5. Who handles recovery? → the tool
  6. What do you do? → write business logic

The tool takes care of all the mundane things. People implemented it once, open-sourced it, and everyone uses it.

Practice

To understand Spark:

  1. Set up Spark locally

  2. Process sample data: Take a sales dataset and write Spark jobs to generate insights

PYTHON
sales_df = spark.read.csv("sales.csv", header=True, inferSchema=True)

sales_df.groupBy("product").agg({"amount": "sum"}).show()

sales_df.groupBy("region").count().show()
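To sanity-check what those Spark jobs compute, here is the same group-and-aggregate done with plain Python dicts (toy rows and column values invented for illustration):

```python
from collections import defaultdict

sales = [
    {"product": "pen", "region": "east", "amount": 5},
    {"product": "pen", "region": "west", "amount": 7},
    {"product": "book", "region": "east", "amount": 20},
]

# Equivalent of groupBy("product").agg({"amount": "sum"})
totals = defaultdict(int)
for row in sales:
    totals[row["product"]] += row["amount"]

# Equivalent of groupBy("region").count()
counts = defaultdict(int)
for row in sales:
    counts[row["region"]] += 1

print(totals["pen"], counts["east"])  # 12 2
```

Spark runs exactly this logic, just partitioned across workers and merged for you.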
  3. Learn Spark Streaming: Connect Spark with Kafka. Process events as they arrive
PYTHON
events = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "blog_published") \
    .load()

enriched = events.join(users_df, "user_id")

query = enriched.writeStream \
    .format("json") \
    .option("path", "/output/enriched") \
    .option("checkpointLocation", "/output/checkpoints") \
    .start()

For each event, enrich it with user details and flush to disk.

You don't need multiple machines to learn. Do it on your local machine to build understanding.

Key Takeaways

  1. Big data = divide and conquer: When one machine isn't enough

  2. Commodity hardware: Regular machines, not specialized expensive ones

  3. Coordinator pattern: One node coordinates, others execute

  4. Tools handle the hard stuff: Distribution, failures, recovery, completion, cleanup

  5. You focus on business logic: The tool does everything else

  6. Start with Spark: It's the most essential big data tool. Spark Streaming for real-time processing

Conclusion

Big data processing exists because one machine can't handle everything. You distribute the work across multiple machines. But managing that distribution, handling failures, ensuring completion, that's complex.

Big data tools solve this. You write business logic. The tool handles distribution, failures, recovery, and cleanup.

| Without Big Data Tools | With Big Data Tools |
| --- | --- |
| Write coordinator from scratch | Coordinator built-in |
| Handle failures manually | Automatic failure recovery |
| Ensure completion yourself | Guaranteed completion |
| Clean up temporary files manually | Automatic cleanup |
| Write connectors for each DB | Connectors for all major databases |
| Reinvent the wheel every time | Focus only on business logic |

Start with Apache Spark. Learn DataFrames. Learn Spark Streaming. That foundation will serve you well across the entire big data ecosystem.
