Effective data management with a lakehouse in Python

On Sale
$9.99

Data scientists in the real world have to manage messy datasets that evolve over time. New data must be added, old data must be removed, and changes to columns must be handled gracefully. Furthermore, many real-world datasets grow from a size that works on a laptop to a size that must run on a server.


This course teaches you how to meet these challenges in a simple and scalable way, using the open-source deltalake package to manage data storage and Polars or Pandas to query the dataset. You will get an introduction to fundamental data engineering concepts that will allow you to spend less time managing data and more time getting value from it, boosting both your pipelines and your career.


The topics covered by the course include:

  • what a lakehouse is and when a lakehouse should be used
  • what a Parquet file is and why it is a useful format for data analytics
  • how to create a lakehouse table from a Pandas or Polars DataFrame
  • how to append or overwrite a lakehouse table with new data
  • how to optimize a lakehouse table with many small files for faster queries
  • how to insert, update or delete rows in a lakehouse table with new data
  • how to query a lakehouse table in the most performant way from Pandas or Polars
  • how to monitor and visualise operations on a lakehouse table
  • how to create and update a partitioned table
  • how to query a partitioned lakehouse table in the most performant way from Pandas or Polars


Each topic is covered in a Jupyter notebook with step-by-step guidance through the key concepts. Each notebook ends with exercises from a real-world dataset to develop your understanding of the key concepts.


Some further points to be aware of when deciding if this course is for you:

  1. The lakehouse approach works for tabular datasets, basically anything that can be stored in a CSV.
  2. All you need for this course is a recent version of Python (Python 3.10+ is supported) and basic experience of using Pandas or Polars. If you can read a CSV with one of these libraries, that should be sufficient. The installation instructions are included with the downloaded materials.
  3. The course content focuses on how to effectively use the deltalake package for a dataset with Pandas and Polars. You can then apply this experience to building a lakehouse system for your own datasets, but designing these systems is not covered.



You will get a ZIP (2MB) file
