Parquet batch ingestion with Python, Postgres and Docker.


Introduction

We'll use Python to ingest the NYC TLC green taxi trip records for September 2019 into a Postgres database. The records are available as a Parquet file at the following URL:
(https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2019-09.parquet)

The ingestion will be done in batches, which allows us to efficiently ingest large files even when the memory of the compute resource being used is limited.
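
The gist of the batching approach is sketched below. This is a minimal illustration rather than the project's exact ingest_data.py: downloading the file is omitted, and the connection string, database name and batch size are assumptions, while the green_taxi_data table name matches the one used later in this post.

    import pyarrow.parquet as pq
    from sqlalchemy import create_engine

    # Placeholder credentials and database name; the real values come from the .env file.
    engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

    parquet_file = pq.ParquetFile("green_tripdata_2019-09.parquet")

    # Stream the file roughly 100,000 rows at a time so memory use stays bounded.
    for i, batch in enumerate(parquet_file.iter_batches(batch_size=100_000)):
        df = batch.to_pandas()
        # The first batch (re)creates the table, subsequent batches append to it.
        df.to_sql("green_taxi_data", engine,
                  if_exists="replace" if i == 0 else "append", index=False)
        print(f"inserted batch {i} ({len(df)} rows)")

Because only one batch is held in memory at a time, the same approach works for files much larger than the available RAM.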

We will also ingest a small CSV file containing taxi zone data for NYC. Finally, we will run SQL queries on the ingested data using pgAdmin.

The GitHub repository of this project can be found at the following URL:
(https://github.com/Aditya-Tiwari-07/parquet_batch_ingestion)


Development Environment Exploration and Setup for GitHub Codespaces.

💡
Codespaces is one of the most stable and well-behaved remote Linux development environments, especially if you love developing in VS Code. It's ridiculously easy to use alongside your GitHub repository, and it comes preinstalled with Docker, Python, Git and many other software packages.

Fork the GitHub repository of the project:
(https://github.com/Aditya-Tiwari-07/parquet_batch_ingestion)

Open a codespace with the forked repository.

Explore the codespace:

  1. In the codespace terminal, check whether Docker, Python and Git are installed by running the docker -v, python --version and git -v commands respectively. Each command should print a version number if the tool is installed.

  2. Run docker image ls and docker ps -a to check whether any images or containers are present in the codespace.

  3. Run which python to see the path to the Python binary used by your current environment.

Explore the files in the project directory /workspaces/parquet_batch_ingestion:

  1. postgres.yaml: Compose file for the Postgres service (a rough sketch of this file appears after this list).

  2. ingest.yaml: Compose file for the ingestion service.

  3. pgadmin.yaml: Compose file for the pgAdmin service.

  4. ingest_data.py: Python script for downloading the Parquet file and ingesting it in batches into the Postgres database.

  5. dockerfile: Dockerfile for dockerizing the ingest_data.py script and creating the taxi_ingest:v1 image.

  6. dev.env: Default environment variable file.

  7. zone_ingest.py: Python script for downloading a small CSV file containing NYC taxi zone data and ingesting it into Postgres.

  8. .gitignore: Contains files and folders to be ignored by git version control.

  9. README.md: Markdown file with a brief description of the project.

  10. LICENSE: The Apache license file for project code.
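
To give an idea of what these compose files contain, postgres.yaml might look roughly like the sketch below. The image tag, service name, bind mount and network match the ones referenced later in this post; the environment variable names are placeholders standing in for whatever dev.env defines.

    services:
      pg-database:
        image: postgres:13
        environment:
          POSTGRES_USER: ${POSTGRES_USER}          # placeholders; actual values come from .env
          POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
          POSTGRES_DB: ${POSTGRES_DB}
        volumes:
          - ./green_taxi_data:/var/lib/postgresql/data   # bind mount persisting the database files
        ports:
          - "5432:5432"
        networks:
          - pg-network

    networks:
      pg-network:
        name: pg-network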

Set up the codespace for the project:

  1. In the terminal, make sure you're in the project directory:

    /workspaces/parquet_batch_ingestion

  2. Run mkdir green_taxi_data && mkdir pgadmin_data to create the directories that will be bind-mounted to persist data for Postgres and pgAdmin.

  3. Run cp dev.env .env to copy the default environment variable file to a .env file. You may change these values as per your requirements.

  4. Run source .env to source the environment variables for the current terminal.

  5. Run pip install sqlalchemy psycopg2 python-dotenv to install the Python libraries needed in the codespace for this project.

  6. Run docker build -t taxi_ingest:v1 . to build the taxi_ingest:v1 Docker image (a sketch of the Dockerfile it builds from is shown after this list). This image needs to be rebuilt every time ingest_data.py changes.
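
For reference, the dockerfile behind this image could look roughly like the following sketch; the base image tag and the exact dependency list are assumptions, not the project's actual file.

    # Hypothetical Dockerfile for the taxi_ingest:v1 image
    FROM python:3.10-slim

    # psycopg2-binary is used here so no build tools are needed inside the slim image
    RUN pip install pandas pyarrow sqlalchemy psycopg2-binary

    WORKDIR /app
    COPY ingest_data.py .

    ENTRYPOINT ["python", "ingest_data.py"]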


Parquet file batch ingestion and querying the data from Postgres.

💡
Docker Compose makes it easy to deploy multi-container applications (usually in a development environment) in an organized way, by configuring the container deployments as services in YAML files.
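
As an illustration of that idea, the pg-ingest service in ingest.yaml might be wired up roughly as follows. The image, service and network names are the ones used in this project; the environment variable names and the assumption that the script reads its settings from environment variables are illustrative only.

    services:
      pg-ingest:
        image: taxi_ingest:v1
        environment:
          # Illustrative variable names; the real ones are defined in dev.env / .env.
          PG_USER: ${POSTGRES_USER}
          PG_PASSWORD: ${POSTGRES_PASSWORD}
          PG_HOST: pg-database        # the Postgres service's name on pg-network
          PG_DB: ${POSTGRES_DB}
          PARQUET_URL: https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2019-09.parquet
        depends_on:
          - pg-database
        networks:
          - pg-network

When the three YAML files are passed together with -f flags, Compose merges them into a single deployment, which is what the steps below rely on.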

  1. Run docker compose -f postgres.yaml -f ingest.yaml -f pgadmin.yaml up:

    1. This pulls the postgres:13 and dpage/pgadmin4 Docker images from the online registry. The taxi_ingest:v1 image is already available locally, so it is not pulled.

    2. Creates the network pg-network and the respective containers of the pg-database, pg-ingest and pgadmin services.

    3. Attaches the network to the three containers.

    4. Starts the three containers.

  2. If the ingestion is successful and the Postgres and pgAdmin services are up, go to the Ports tab of the codespace and open port 8080 (pgAdmin) in the browser.

  3. Log in to pgAdmin, create a server for the Postgres database and check whether the green_taxi_data table has been ingested. If it has, try running some queries on the table (a couple of sample queries are sketched after this list).

  4. Run docker compose -f postgres.yaml -f ingest.yaml -f pgadmin.yaml down to stop and remove the three containers and their network.

  5. Execute docker compose -f postgres.yaml -f pgadmin.yaml up to relaunch the Postgres and pgAdmin containers on pg-network, and check whether the server configuration and the ingested data persist in the local bind mounts pgadmin_data and green_taxi_data respectively.

  6. Execute python zone_ingest.py to ingest the small CSV file of NYC taxi zone data; being small, it is ingested in a single pass rather than in batches.

  7. Check the zones table in the Postgres database using the running pgAdmin service.
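
As a starting point for exploring the data in pgAdmin, here are two illustrative queries. The column names follow the standard TLC green taxi schema and the taxi zone lookup file, and the mixed-case ones are quoted because pandas preserves their case; adjust them if your ingested tables differ.

    -- Trips per day in September 2019
    SELECT CAST(lpep_pickup_datetime AS DATE) AS pickup_day,
           COUNT(*) AS trips
    FROM green_taxi_data
    GROUP BY pickup_day
    ORDER BY pickup_day;

    -- Top 10 pickup zones by trip count, joined with the zones lookup table
    SELECT z."Zone" AS pickup_zone,
           COUNT(*) AS trips
    FROM green_taxi_data t
    JOIN zones z ON t."PULocationID" = z."LocationID"
    GROUP BY pickup_zone
    ORDER BY trips DESC
    LIMIT 10;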

We now have a robust setup for downloading large Parquet files of taxi trip records, ingesting them into Postgres in batches, and analyzing them with SQL queries through pgAdmin.