Open source big data platforms

Big data platforms are so complex. Therefore, we rely a lot of open source services and systems to build big data platforms. Such open sources have been developed since many years and they play a crucial role in the big data ecosystems.

Building the big data infrastructure

The big data infrastructure has several components. At the high-level we need to:

Combine dedicated servers + VMs
- Establish elastic infrastructure using elasticity and virtualization
- Utilize containers, VMs with different orchestration systems, like Kubernetes or OpenStack
High performance computing systems
- Enable running parallel and concurrent programs for database, messaging, and analytics
Broker/messaging services: these services serve for different purposes. First of all, they can be used for transferring "raw big data", such as streaming IoT data/event logs, or disseminating analytics results, such as anomaly detection outcome for a time window. Second, they can be used for sending/receiving commands that orchestrate/trigger tasks used for big data platforms. Often we need to combine some of them within a platform.
- Large-scale, high performance with standard protocols, such as AMQP and MQTT
- Support different messaging systems, e.g., RabbitMQ, Apache Kafka, and NSQ
Datastore technologies: data storage, databases and data services.
Processing frameworks

many concepts are difficult and require fulltime training from the university in the computer science programme.

A reference of Big Data Platform for our study

We will follow the following Big Data Platform reference as the key one for our study in this book.

The particular systems to be used are dependent on requirements. The systems mentioned are state-of-the-art ones. We expect that new systems will come and old systems will disappear. However, key interactions and components required for big data platforms would remain.

Build your personal techradar for big data platforms

Since there are many technologies that you can choose, it would be better to build a techradar for your big data expertise. In your personal techradar you can basically focus on a subset of tools/software that you can use for building your big data platforms and you continue to evaluate and revise your radar, based on your expertise, customer requirements, data to be handled, etc.

Select an application domain, a dataset and have simple scenarios for your big data platform
Decide which technologies could be used, based on your constraints and knowhow
Evaluate software and services for building your big data platform

If we look at a snapshot of existing technologies available for big data platforms, we are overwhelming with the number of technologies and solutions that have been evolved in a short time. So would we focus our study in a particular set of technologies, some well-known software stack? No. We should not!. As a fundamental study for building big data platforms, we should learn basic principles and techniques. But we will need to apply these principles and techniques with a subset of selected technologies (based on their popularity and easy to study).

Potential software technologies and services for big data platforms

Provider	Technologies
Google Cloud Platform	You can scan the list of services from a perspective of solutions. You can also review services/technologies based on the data-lifecycle of ingestion, storing, analysis and visualization](https://cloud.google.com/solutions/data-lifecycle-cloud-platform)
Microsoft Azure Cloud	Available services can be viewed from a service catalog for analytics or from a solution catalog
Amazon Web Services	Serverals Database services and Analytics services
Apache	Apache provides several important free software for big data studies. Many of these software are also optimized and offered by existing providers. Examples are: Hadoop,Spark,Cassandra,HBase, Kafka and Airflow
ELK stack	ELK Stack, ELK - ElasticSearch, Kibana, Logstash, includes a software technology stack for solutions as well as for end-to-end data lifecycle services.
TICK stack	The TICK Stack - Telegraf, Infuxdb, Chronograf, Kapacitor - also offers a software technology stack for end-to-end data lifecycle services, focused on time series data

When exploring these technologies/services, you can note the following things:

Which services have similar functionality (and built from similar/the same software)?
How a service is coupled with underlying cloud infrastructures?
What are the coupling models/dependencies among services?
What are constraints/properties w.r.t. price, privacy, security, programming support, etc.

Then try to address the following points:

Why do you think that the tools in your personal techradar are suitable for your work/study?
What would possible issues when we use a single software stack?

Learning from the real industries

Since the industry has a lot of data, many companies deal with big data daily and they report various interesting situations. We should try to follow what happens in the real-world. Examples:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Open source big data platforms

Building the big data infrastructure

A reference of Big Data Platform for our study

Build your personal techradar for big data platforms

Potential software technologies and services for big data platforms

Learning from the real industries

FilesExpand file tree

practicebdp.md

Latest commit

History

practicebdp.md

File metadata and controls

Open source big data platforms

Building the big data infrastructure

A reference of Big Data Platform for our study

Build your personal techradar for big data platforms

Potential software technologies and services for big data platforms

Learning from the real industries