The Data Engineering Handbook

What is the project about?

This project is a comprehensive resource repository for aspiring and current data engineers. It provides a curated collection of materials, including roadmaps, tutorials, projects, interview advice, book recommendations, community links, company lists, blogs, whitepapers, social media accounts, podcasts, newsletters, glossaries, design patterns, courses, and certification information.

What problem does it solve?

Information about data engineering is scattered across many sources and can be overwhelming. The project centralizes essential resources, making it easier to learn, stay updated, and advance in the field. It gives beginners a clear path and gives experienced professionals a reference to draw on.

What are the features of the project?

  • Learning Roadmap: A structured guide for breaking into data engineering.
  • Bootcamp Materials: Resources for a free 6-week YouTube bootcamp.
  • Hands-on Projects: Practical examples for applied learning.
  • Interview Advice: Guidance on passing data engineering interviews.
  • Curated Lists: Extensive lists of high-quality books, communities, newsletters, and companies.
  • Company Resources: Links to data engineering blogs and whitepapers from leading tech companies.
  • Social Media Directory: A comprehensive list of data engineering influencers and creators.
  • Podcast Recommendations: A selection of relevant podcasts for staying informed.
  • Glossaries: Definitions of key data engineering terms.
  • Design Patterns: Examples of common data engineering design patterns.
  • Course and Certification Links: Information on relevant courses and certifications.

What are the technologies used in the project?

The project itself is a curated list of resources, so it has no technology stack in the way a software project would. However, it covers a wide range of data engineering technologies, including (but not limited to):

  • Orchestration: Mage, Astronomer, Prefect, Dagster, Airflow, Kestra, Shipyard, Hamilton.
  • Data Lake/Cloud: Tabular, Microsoft, Databricks, Onehouse, Delta Lake.
  • Data Warehouse: Snowflake, Firebolt, Databend.
  • Data Quality: dbt, Gable, Great Expectations, Streamdal, Coalesce, Soda, DQOps, HEDDA.IO.
  • Analytics/Visualization: Preset, Starburst, Metabase, Looker Studio, Tableau, Power BI, Hex, Apache Superset, Evidence, Redash, Lightdash.
  • Data Integration: Cube, Fivetran, Airbyte, dlt, Sling, Meltano.
  • Modern OLAP: Apache Druid, ClickHouse, Apache Pinot, Apache Kylin, DuckDB, QuestDB, StarRocks.
  • LLM Libraries: AdalFlow, LangChain, LlamaIndex.
  • Real-Time Data: Aggregations.io, Responsive, RisingWave, Striim.
  • Cloud Platforms: AWS, Azure, GCP.

What are the benefits of the project?

  • Centralized Information: Gathers data engineering resources in a single place.
  • Structured Learning: Offers a clear roadmap for beginners.
  • Practical Application: Includes hands-on projects for skill development.
  • Career Advancement: Helps with interview preparation and professional growth.
  • Community Building: Connects users with relevant communities and influencers.
  • Staying Updated: Provides access to the latest trends and technologies through newsletters, blogs, and podcasts.
  • Time Saving: Curated lists save time searching for quality resources.

What are the use cases of the project?

  • Aspiring Data Engineers: Individuals looking to start a career in data engineering.
  • Current Data Engineers: Professionals seeking to expand their knowledge, learn new technologies, or advance their careers.
  • Students: Learners enrolled in data engineering courses or programs.
  • Educators: Instructors looking for resources to supplement their teaching.
  • Anyone interested in learning about data engineering.