Data Engineering

Collect and transform data on a large scale. Build data pipelines, work with a horizontally scalable architecture, or simply scrape and collect data.

Google BigQuery's Python SDK: Creating Tables Programmatically

Use Google Cloud's Python SDK to insert large datasets into Google BigQuery, take advantage of schema detection, and manipulate data programmatically.

Todd Birchard
Google Cloud
Scrape Structured Data with Python and Extruct

Supercharge your scraper to extract quality page metadata by parsing JSON-LD data via Python's extruct library.

Todd Birchard
Python
Simplify BigQuery ETL jobs using SQLAlchemy

Extract and move data between BigQuery and relational databases using PyBigQuery: a connector for SQLAlchemy.

Todd Birchard
Data Warehouses
Using Amazon Redshift as your Data Warehouse

Get the most out of Redshift by tuning your cluster's performance and learning how to query your data efficiently.

Todd Birchard
Data Warehouses
Join and Aggregate PySpark DataFrames

Perform SQL-like joins and aggregations on your PySpark DataFrames.

Todd Birchard
Spark
Working with PySpark RDDs

Work with Spark's original data structure API: Resilient Distributed Datasets.

Todd Birchard
Spark
Manage Data Pipelines with Apache Airflow

Use Apache Airflow to build and monitor better data pipelines.

Todd Birchard
Apache
Structured Streaming in PySpark

Become familiar with building a structured stream in PySpark using the Databricks interface.

Todd Birchard
Spark