Update README.md

ahauck176 2025-04-21 16:10:40 +00:00
parent df04ef3527
commit d1397b626e

# AMS-Data-Mine
A migration and ETL pipeline to move legacy FileMaker Pro data into PostgreSQL, ingest and consolidate DGdata and EENDR survey CSVs (2010–2015), and maintain automated backups.
## Repository Structure
```
.
├── docker-compose.yml # Defines Postgres (and optional tunnel) services
├── .env # Environment variables (create locally)
├── csv_postgres.py # Python script to import CSV files into Postgres
├── clean.py # Utility to sanitize CSV column names
├── backup_postgres_single_db.sh # Backup script for Postgres (docker exec + pg_dump + gzip)
├── sql/ # SQL scripts for merging survey data
│ ├── merge_dgdata.sql
│ ├── merge_eendr.sql
│ └── merge_surveys.sql
└── README.md # This file
```
## Prerequisites
- Docker & Docker Compose
- Python 3.7+
- pip packages:
```bash
pip install pandas psycopg2-binary python-dotenv
```
- (Optional) Cloudflare Tunnel token for secure exposure
## 1. Environment Setup
1. Create a file named `.env` at the project root with the following:
```ini
POSTGRES_USER=your_pg_username
POSTGRES_PASSWORD=your_pg_password
POSTGRES_DB=your_db_name
# If using cloudflared tunnel:
TUNNEL_TOKEN=your_cloudflare_tunnel_token
```
2. Start services:
```bash
docker-compose up -d
```
This brings up:
- **postgres**: PostgreSQL 15, port 5432
- **cloudflared** (if configured): runs `tunnel run` to expose Postgres
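
For reference, these are the two services the Compose file defines. A minimal sketch is shown below; the image tags, volume paths, and container mount points are assumptions, and the `docker-compose.yml` in this repository is authoritative:
```yaml
# Hypothetical sketch of docker-compose.yml -- check the file in this repo for the real definition.
services:
  postgres:
    image: postgres:15
    restart: unless-stopped
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - /home/ams/postgres/data:/var/lib/postgresql/data   # assumed data directory
      - /home/ams/postgres/csv_files:/csv_files            # assumed CSV drop directory (see section 2)

  cloudflared:
    image: cloudflare/cloudflared:latest
    command: tunnel run
    environment:
      TUNNEL_TOKEN: ${TUNNEL_TOKEN}
    depends_on:
      - postgres
```
Docker Compose reads the `.env` file in the project root, so the `${...}` placeholders resolve to the values you set in step 1.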
## 2. Migrating FileMaker Pro Data
1. Export each FileMaker Pro table as a CSV file.
2. (Optional) Clean column names to valid SQL identifiers (a sketch of the idea follows this list):
```bash
python3 clean.py path/to/input.csv path/to/output.csv
```
3. Place your CSV files into the host directory mounted by Docker (default `/home/ams/postgres/csv_files/`).
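
The exact rules live in `clean.py`; as an illustration only (the lower-casing and underscore conventions below are assumptions, not a copy of the script), the sanitization amounts to something like:
```python
# Hypothetical sketch of column-name sanitization (clean.py's actual rules may differ).
import re
import sys

import pandas as pd


def sanitize(name: str) -> str:
    """Turn an arbitrary FileMaker column label into a safe SQL identifier."""
    name = name.strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)   # collapse spaces/punctuation to underscores
    name = name.strip("_") or "column"
    if name[0].isdigit():                     # identifiers should not start with a digit
        name = f"col_{name}"
    return name


if __name__ == "__main__":
    in_path, out_path = sys.argv[1], sys.argv[2]
    df = pd.read_csv(in_path)
    df.columns = [sanitize(c) for c in df.columns]
    df.to_csv(out_path, index=False)
```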
## 3. Ingesting CSV Data
Run the import script:
```bash
python3 csv_postgres.py
```
What it does:
1. Reads all `.csv` files from `/home/ams/postgres/csv_files/`.
2. Drops entirely empty columns and maps each remaining column's pandas dtype to `INTEGER`, `FLOAT`, or `TEXT`.
3. Creates tables named `survey_data_<filename>` and inserts all rows.
4. Moves processed CSVs to `/home/ams/postgres/csv_files_old/`.
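
Condensed, the flow above looks roughly like the following sketch. Paths and the type map mirror this README, but the real `csv_postgres.py` is authoritative and may differ in details:
```python
# Hypothetical condensed sketch of the csv_postgres.py flow described above.
import os
import shutil

import pandas as pd
import psycopg2
from dotenv import load_dotenv

load_dotenv()
CSV_DIR = "/home/ams/postgres/csv_files"
DONE_DIR = "/home/ams/postgres/csv_files_old"
PG_TYPES = {"int64": "INTEGER", "float64": "FLOAT"}   # everything else becomes TEXT

conn = psycopg2.connect(
    host="localhost",
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    dbname=os.environ["POSTGRES_DB"],
)

for fname in os.listdir(CSV_DIR):
    if not fname.endswith(".csv"):
        continue
    df = pd.read_csv(os.path.join(CSV_DIR, fname)).dropna(axis=1, how="all")  # drop empty columns
    table = f"survey_data_{os.path.splitext(fname)[0].lower()}"
    cols = ", ".join(f'"{c}" {PG_TYPES.get(str(t), "TEXT")}' for c, t in df.dtypes.items())
    records = [
        tuple(None if pd.isna(v) else v for v in row)              # NaN -> SQL NULL
        for row in df.astype(object).itertuples(index=False, name=None)
    ]
    placeholders = ", ".join(["%s"] * len(df.columns))
    with conn.cursor() as cur:
        cur.execute(f'CREATE TABLE IF NOT EXISTS {table} ({cols})')
        cur.executemany(f'INSERT INTO {table} VALUES ({placeholders})', records)
    conn.commit()
    shutil.move(os.path.join(CSV_DIR, fname), os.path.join(DONE_DIR, fname))

conn.close()
```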
## 4. Merging Survey Data with SQL
Place and edit your SQL merge scripts in the `sql/` directory. Example queries:
- **sql/merge_dgdata.sql**
```sql
DROP TABLE IF EXISTS dgdata_merged;
CREATE TABLE dgdata_merged AS
SELECT * FROM survey_data_dgdata_2010
UNION ALL
SELECT * FROM survey_data_dgdata_2011
-- ...repeat through 2015...
;
```
- **sql/merge_eendr.sql**
```sql
DROP TABLE IF EXISTS eendr_merged;
CREATE TABLE eendr_merged AS
SELECT * FROM survey_data_eendr_2010
UNION ALL
-- ...through 2015...
;
```
- **sql/merge_surveys.sql**
```sql
DROP TABLE IF EXISTS surveys_final;
CREATE TABLE surveys_final AS
SELECT
  survey_id,
  common_field1,
  common_field2,
  d.unique_dg_field,
  e.unique_eendr_field
FROM dgdata_merged d
FULL OUTER JOIN eendr_merged e
  USING (survey_id, common_field1, common_field2);
```
With `FULL OUTER JOIN ... USING`, the join columns are merged, so they stay populated for rows found in only one survey; any other column not present in one survey will appear as `NULL`.
Run any merge script with:
```bash
psql -h localhost -U $POSTGRES_USER -d $POSTGRES_DB -f sql/merge_surveys.sql
```
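Note that `merge_surveys.sql` joins the tables built by the other two scripts, so run them in that order, for example:
```bash
for f in sql/merge_dgdata.sql sql/merge_eendr.sql sql/merge_surveys.sql; do
  psql -h localhost -U "$POSTGRES_USER" -d "$POSTGRES_DB" -f "$f"
done
```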
## 5. Automated Backups
Backups are handled by `backup_postgres_single_db.sh`:
```bash
#!/bin/bash
# Load POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB from .env
source .env

# Settings
CONTAINER_NAME=postgres
BACKUP_DIR=/home/ams/postgres/backups
TIMESTAMP=$(date +"%Y%m%d%H%M%S")
BACKUP_FILE="$BACKUP_DIR/${POSTGRES_DB}_backup_$TIMESTAMP.sql"

mkdir -p "$BACKUP_DIR"

# Dump the database from inside the container to a file on the host.
# Note: no -t on docker exec -- a pseudo-TTY can corrupt output redirected to a file.
docker exec -e PGPASSWORD="$POSTGRES_PASSWORD" "$CONTAINER_NAME" \
  pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" > "$BACKUP_FILE"

gzip "$BACKUP_FILE"

# Optional retention: delete compressed backups older than 7 days
# find "$BACKUP_DIR" -type f -name "${POSTGRES_DB}_backup_*.sql.gz" -mtime +7 -delete
```
Schedule daily backups (e.g., at 3AM) via cron:
```cron
0 3 * * * /path/to/backup_postgres_single_db.sh
```
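To restore one of these backups into the running container, something like the following should work (a sketch: substitute the actual archive name, and it assumes the container is still named `postgres`):
```bash
# Stream a compressed dump back into the database inside the container.
gunzip -c /home/ams/postgres/backups/<db>_backup_<timestamp>.sql.gz \
  | docker exec -i -e PGPASSWORD="$POSTGRES_PASSWORD" postgres \
      psql -U "$POSTGRES_USER" -d "$POSTGRES_DB"
```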