# AMS-Data-Mine

A migration and ETL pipeline to move legacy FileMaker Pro data into PostgreSQL, ingest and consolidate DGdata and EENDR survey CSVs (2010–2015), and maintain automated backups.

## Repository Structure

```
.
├── docker-compose.yml             # Defines Postgres (and optional tunnel) services
├── .env                           # Environment variables (create locally)
├── csv_postgres.py                # Python script to import CSV files into Postgres
├── clean.py                       # Utility to sanitize CSV column names
├── backup_postgres_single_db.sh   # Backup script for Postgres (docker exec + pg_dump + gzip)
├── sql/                           # SQL scripts for merging survey data
│   ├── merge_dgdata.sql
│   ├── merge_eendr.sql
│   └── merge_surveys.sql
└── README.md                      # This file
```

## Prerequisites

- Docker & Docker Compose
- Python 3.7+
- pip packages:
  ```bash
  pip install pandas psycopg2-binary python-dotenv
  ```
- (Optional) Cloudflare Tunnel token for secure exposure

## 1. Environment Setup

1. Create a file named `.env` at the project root with the following:
   ```ini
   POSTGRES_USER=your_pg_username
   POSTGRES_PASSWORD=your_pg_password
   POSTGRES_DB=your_db_name
   # If using cloudflared tunnel:
   TUNNEL_TOKEN=your_cloudflare_tunnel_token
   ```
2. Start services:
   ```bash
   docker-compose up -d
   ```

This brings up:

- **postgres**: PostgreSQL 15, exposed on port 5432
- **cloudflared** (if configured): runs `tunnel run` to expose Postgres

## 2. Migrating FileMaker Pro Data

1. Export each FileMaker Pro table as a CSV file.
2. (Optional) Clean column names into valid SQL identifiers:
   ```bash
   python3 clean.py path/to/input.csv path/to/output.csv
   ```
3. Place your CSV files into the host directory mounted by Docker (default `/home/ams/postgres/csv_files/`).

## 3. Ingesting CSV Data

Run the import script:

```bash
python3 csv_postgres.py
```

What it does:

1. Reads all `.csv` files from `/home/ams/postgres/csv_files/`.
2. Drops entirely empty columns and maps the pandas dtypes to `INTEGER`, `FLOAT`, or `TEXT` column types.
3. Creates one table per file, named `survey_data_` followed by the file name (e.g. `survey_data_dgdata_2010`), and inserts all rows.
4. Moves processed CSVs to `/home/ams/postgres/csv_files_old/`.

## 4. Merging Survey Data with SQL

Place and edit your SQL merge scripts in the `sql/` directory. Example queries:

- **sql/merge_dgdata.sql**
  ```sql
  DROP TABLE IF EXISTS dgdata_merged;
  CREATE TABLE dgdata_merged AS
  SELECT * FROM survey_data_dgdata_2010
  UNION ALL
  SELECT * FROM survey_data_dgdata_2011
  -- ...repeat through 2015...
  ;
  ```

- **sql/merge_eendr.sql**
  ```sql
  DROP TABLE IF EXISTS eendr_merged;
  CREATE TABLE eendr_merged AS
  SELECT * FROM survey_data_eendr_2010
  UNION ALL
  -- ...through 2015...
  ;
  ```

- **sql/merge_surveys.sql**
  ```sql
  DROP TABLE IF EXISTS surveys_final;
  CREATE TABLE surveys_final AS
  SELECT
      survey_id,
      common_field1,
      common_field2,
      d.unique_dg_field,
      e.unique_eendr_field
  FROM dgdata_merged d
  FULL OUTER JOIN eendr_merged e
      USING (survey_id, common_field1, common_field2);
  ```

With `FULL OUTER JOIN ... USING`, the unqualified join columns are populated from whichever survey has the row; columns not present in one survey will appear as `NULL`.

Run any merge script with:

```bash
psql -h localhost -U $POSTGRES_USER -d $POSTGRES_DB -f sql/merge_surveys.sql
```
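To rebuild all merged tables in one pass, the scripts can also be run in dependency order (the per-survey merges before the final join). A minimal sketch, assuming the file names listed above and that `POSTGRES_USER` and `POSTGRES_DB` are exported in your shell:

```bash
# Run the per-survey merges first, then the final join; stop on the first SQL error.
for f in sql/merge_dgdata.sql sql/merge_eendr.sql sql/merge_surveys.sql; do
  psql -h localhost -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1 -f "$f" || break
done
```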
## 5. Automated Backups

Backups are handled by `backup_postgres_single_db.sh`:

```bash
#!/bin/bash
# Load POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB from .env
source .env

# Settings
CONTAINER_NAME=postgres
BACKUP_DIR=/home/ams/postgres/backups
TIMESTAMP=$(date +"%Y%m%d%H%M%S")
BACKUP_FILE=$BACKUP_DIR/${POSTGRES_DB}_backup_$TIMESTAMP.sql

mkdir -p "$BACKUP_DIR"

# No -t flag: a pseudo-TTY would add carriage returns to the redirected dump
docker exec -e PGPASSWORD="$POSTGRES_PASSWORD" "$CONTAINER_NAME" \
    pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" > "$BACKUP_FILE"

gzip "$BACKUP_FILE"

# Optional retention: delete compressed dumps older than 7 days
# find $BACKUP_DIR -type f -name "${POSTGRES_DB}_backup_*.sql.gz" -mtime +7 -delete
```

Schedule daily backups (e.g., at 3 AM) via cron. Because the script sources `.env` with a relative path, change into the project directory first:

```cron
0 3 * * * cd /path/to/AMS-Data-Mine && ./backup_postgres_single_db.sh
```
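To restore one of these dumps, the compressed file can be streamed back into `psql` inside the running container. A minimal sketch, assuming `.env` has been sourced in your shell; the file name is only an example of what the script produces:

```bash
# Decompress the dump and pipe it into psql inside the postgres container.
gunzip -c /home/ams/postgres/backups/your_db_name_backup_20250101030000.sql.gz \
  | docker exec -i -e PGPASSWORD="$POSTGRES_PASSWORD" postgres \
      psql -U "$POSTGRES_USER" -d "$POSTGRES_DB"
```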