AMS-Data-Mine

A migration and ETL pipeline to move legacy FileMaker Pro data into PostgreSQL, ingest and consolidate DGdata and EENDR survey CSVs (2010–2015), and maintain automated backups.

Repository Structure

.
├── docker-compose.yml                # Defines Postgres (and optional tunnel) services
├── .env                              # Environment variables (create locally)
├── csv_postgres.py                   # Python script to import CSV files into Postgres
├── clean.py                          # Utility to sanitize CSV column names
├── backup_postgres_single_db.sh      # Backup script for Postgres (docker exec + pg_dump + gzip)
├── sql/                              # SQL scripts for merging survey data
│   ├── merge_dgdata.sql
│   ├── merge_eendr.sql
│   └── merge_surveys.sql
└── README.md                         # This file

Prerequisites

  • Docker & Docker Compose
  • Python 3.7+
  • pip packages:
    pip install pandas psycopg2-binary python-dotenv
    
  • (Optional) Cloudflare Tunnel token for secure exposure

1. Environment Setup

  1. Create a file named .env at the project root with the following:
    POSTGRES_USER=your_pg_username
    POSTGRES_PASSWORD=your_pg_password
    POSTGRES_DB=your_db_name
    # If using cloudflared tunnel:
    TUNNEL_TOKEN=your_cloudflare_tunnel_token
    
  2. Start services:
    docker-compose up -d
    

This brings up:

  • postgres: PostgreSQL 15, port 5432
  • cloudflared (if configured): runs cloudflared tunnel run to expose Postgres through a Cloudflare Tunnel
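
Once the stack is up, you can sanity-check connectivity from Python using the packages listed under Prerequisites. This is only a minimal sketch, assuming the default host/port and the .env variable names shown above:

from dotenv import load_dotenv
import os
import psycopg2

# Read POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB from the local .env file
load_dotenv()

conn = psycopg2.connect(
    host="localhost",           # Postgres is published on localhost:5432 by docker-compose
    port=5432,
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    dbname=os.environ["POSTGRES_DB"],
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])    # e.g. "PostgreSQL 15.x ..."

conn.close()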

2. Migrating FileMaker Pro Data

  1. Export each FileMaker Pro table as a CSV file.
  2. (Optional) Clean column names so they are valid SQL identifiers (see the sketch after this list):
    python3 clean.py path/to/input.csv path/to/output.csv
    
  3. Place your CSV files into the host directory mounted by Docker (default /home/ams/postgres/csv_files/).
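
The exact contents of clean.py are not reproduced in this README; as a rough sketch, sanitizing headers into valid SQL identifiers usually means trimming, lowercasing, replacing non-alphanumeric characters with underscores, and avoiding a leading digit. The function below is illustrative only, not the script's actual code:

import re
import sys
import pandas as pd

def sanitize(name: str) -> str:
    """Turn an arbitrary CSV header into a safe SQL identifier (illustrative rules)."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()
    if not name:
        name = "col"
    if name[0].isdigit():        # identifiers must not start with a digit
        name = "_" + name
    return name

if __name__ == "__main__":
    in_path, out_path = sys.argv[1], sys.argv[2]   # same CLI shape as the clean.py call above
    df = pd.read_csv(in_path)
    df.columns = [sanitize(c) for c in df.columns]
    df.to_csv(out_path, index=False)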

3. Ingesting CSV Data

Run the import script:

python3 csv_postgres.py

What it does (a condensed sketch follows this list):

  1. Reads all .csv files from /home/ams/postgres/csv_files/.
  2. Drops entirely empty columns and maps each remaining column's pandas dtype to INTEGER, FLOAT, or TEXT.
  3. Creates tables named survey_data_<filename> and inserts all rows.
  4. Moves processed CSVs to /home/ams/postgres/csv_files_old/.
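
csv_postgres.py itself is the source of truth; the condensed sketch below only illustrates the flow described above. The dtype-to-type mapping, the survey_data_<filename> naming, and the csv_files / csv_files_old paths follow this README, but the real script may differ in detail:

import os
import shutil
import pandas as pd
import psycopg2
from dotenv import load_dotenv

CSV_DIR = "/home/ams/postgres/csv_files/"
DONE_DIR = "/home/ams/postgres/csv_files_old/"

def pg_type(dtype) -> str:
    # Map pandas dtypes to the PostgreSQL types mentioned above
    if pd.api.types.is_integer_dtype(dtype):
        return "INTEGER"
    if pd.api.types.is_float_dtype(dtype):
        return "FLOAT"
    return "TEXT"

load_dotenv()
conn = psycopg2.connect(
    host="localhost",
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    dbname=os.environ["POSTGRES_DB"],
)
os.makedirs(DONE_DIR, exist_ok=True)

for fname in sorted(f for f in os.listdir(CSV_DIR) if f.endswith(".csv")):
    df = pd.read_csv(os.path.join(CSV_DIR, fname)).dropna(axis=1, how="all")  # drop empty columns
    col_types = {col: pg_type(t) for col, t in df.dtypes.items()}
    df = df.astype(object).where(pd.notnull(df), None)                        # NaN -> SQL NULL
    table = "survey_data_" + os.path.splitext(fname)[0].lower()
    columns_sql = ", ".join(f'"{col}" {t}' for col, t in col_types.items())
    placeholders = ", ".join(["%s"] * len(df.columns))
    with conn, conn.cursor() as cur:
        cur.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns_sql})')
        for row in df.itertuples(index=False, name=None):
            cur.execute(f'INSERT INTO "{table}" VALUES ({placeholders})', row)
    shutil.move(os.path.join(CSV_DIR, fname), os.path.join(DONE_DIR, fname))

conn.close()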

4. Merging Survey Data with SQL

Place and edit your SQL merge scripts in the sql/ directory. Example queries:

  • sql/merge_dgdata.sql

    DROP TABLE IF EXISTS dgdata_merged;
    CREATE TABLE dgdata_merged AS
      SELECT * FROM survey_data_dgdata_2010
      UNION ALL
      SELECT * FROM survey_data_dgdata_2011
      -- ...repeat through 2015...
    ;
    
  • sql/merge_eendr.sql

    DROP TABLE IF EXISTS eendr_merged;
    CREATE TABLE eendr_merged AS
      SELECT * FROM survey_data_eendr_2010
      UNION ALL
      -- ...through 2015...
    ;
    
  • sql/merge_surveys.sql

    DROP TABLE IF EXISTS surveys_final;
    CREATE TABLE surveys_final AS
    SELECT
      survey_id,
      common_field1,
      common_field2,
      d.unique_dg_field,
      e.unique_eendr_field
    FROM dgdata_merged d
    FULL OUTER JOIN eendr_merged e
      USING (survey_id, common_field1, common_field2);
    

USING merges the shared key columns across both tables; fields that exist in only one survey (e.g. unique_dg_field, unique_eendr_field) will be NULL for rows that come from the other survey.

Run any merge script with:

psql -h localhost -U $POSTGRES_USER -d $POSTGRES_DB -f sql/merge_surveys.sql
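
If you prefer to drive the merges from Python rather than psql (for example, from the same environment used by csv_postgres.py), a minimal sketch, assuming the same .env variables as above, is:

import os
import psycopg2
from dotenv import load_dotenv

load_dotenv()
conn = psycopg2.connect(
    host="localhost",
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    dbname=os.environ["POSTGRES_DB"],
)

# Read one merge script and run it; the statements execute in a single transaction
with open("sql/merge_surveys.sql") as f, conn, conn.cursor() as cur:
    cur.execute(f.read())

conn.close()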

5. Automated Backups

Backups are handled by backup_postgres_single_db.sh:

#!/bin/bash
# Load environment variables (POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB)
source .env

# Settings
CONTAINER_NAME=postgres
BACKUP_DIR=/home/ams/postgres/backups
TIMESTAMP=$(date +"%Y%m%d%H%M%S")
BACKUP_FILE="$BACKUP_DIR/${POSTGRES_DB}_backup_$TIMESTAMP.sql"

mkdir -p "$BACKUP_DIR"

# Dump the database from inside the container.
# No -t on docker exec: a pseudo-TTY would add carriage returns to the dump.
docker exec -e PGPASSWORD="$POSTGRES_PASSWORD" "$CONTAINER_NAME" \
  pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" > "$BACKUP_FILE"

gzip "$BACKUP_FILE"

# Optional retention (delete backups older than 7 days):
# find "$BACKUP_DIR" -type f -name "${POSTGRES_DB}_backup_*.sql.gz" -mtime +7 -delete

Schedule daily backups (e.g., at 3AM) via cron:

0 3 * * * /path/to/backup_postgres_single_db.sh
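
The commented find line in the script handles retention from the shell; if you would rather prune old backups from Python, a rough equivalent under the same assumptions (7-day window, default backup directory) is:

import glob
import os
import time

BACKUP_DIR = "/home/ams/postgres/backups"
MAX_AGE_DAYS = 7

cutoff = time.time() - MAX_AGE_DAYS * 86400
for path in glob.glob(os.path.join(BACKUP_DIR, "*_backup_*.sql.gz")):
    if os.path.getmtime(path) < cutoff:     # older than the retention window
        os.remove(path)
        print(f"deleted {path}")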