# AMS-Data-Mine

A migration and ETL pipeline to move legacy FileMaker Pro data into PostgreSQL, ingest and consolidate DGdata and EENDR survey CSVs (2010–2015), and maintain automated backups.

## Repository Structure

```
.
├── docker-compose.yml             # Defines Postgres (and optional tunnel) services
├── .env                           # Environment variables (create locally)
├── csv_postgres.py                # Python script to import CSV files into Postgres
├── clean.py                       # Utility to sanitize CSV column names
├── backup_postgres_single_db.sh   # Backup script for Postgres (docker exec + pg_dump + gzip)
├── sql/                           # SQL scripts for merging survey data
│   ├── merge_dgdata.sql
│   ├── merge_eendr.sql
│   └── merge_surveys.sql
└── README.md                      # This file
```

## Prerequisites

- Docker & Docker Compose
- Python 3.7+
- pip packages:

  ```bash
  pip install pandas psycopg2-binary python-dotenv
  ```

- (Optional) Cloudflare Tunnel token for secure exposure

## 1. Environment Setup

1. Create a file named `.env` at the project root with the following:

   ```ini
   POSTGRES_USER=your_pg_username
   POSTGRES_PASSWORD=your_pg_password
   POSTGRES_DB=your_db_name
   # If using cloudflared tunnel:
   TUNNEL_TOKEN=your_cloudflare_tunnel_token
   ```

2. Start services:

   ```bash
   docker-compose up -d
   ```

This brings up:

- **postgres**: PostgreSQL 15, port 5432
- **cloudflared** (if configured): runs `tunnel run` to expose Postgres
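
Once the services are up, you can sanity-check the connection from Python using the packages listed under Prerequisites. This is only a minimal sketch (the file name `check_connection.py` is hypothetical, not part of this repo); it assumes Postgres is reachable on `localhost:5432` with the credentials from `.env`:

```python
# check_connection.py -- hypothetical helper, not part of this repo.
# Loads credentials from .env and confirms the Postgres container is reachable.
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB from .env

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    dbname=os.environ["POSTGRES_DB"],
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # e.g. "PostgreSQL 15.x ..."
conn.close()
```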
## 2. Migrating FileMaker Pro Data

1. Export each FileMaker Pro table as a CSV file.
2. (Optional) Clean column names to valid SQL identifiers (see the sketch after this list):

   ```bash
   python3 clean.py path/to/input.csv path/to/output.csv
   ```

3. Place your CSV files into the host directory mounted by Docker (default `/home/ams/postgres/csv_files/`).
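
The sanitization logic lives in `clean.py`; the sketch below only illustrates the general idea (lower-case names, replace anything non-alphanumeric with underscores, avoid a leading digit) and may differ from the actual script:

```python
# Hypothetical sketch of a column-name sanitizer in the spirit of clean.py.
import re
import sys

import pandas as pd


def sanitize(name: str) -> str:
    """Turn an arbitrary FileMaker column name into a valid SQL identifier."""
    name = re.sub(r"[^0-9a-z]+", "_", name.strip().lower())  # non-alphanumerics -> _
    name = re.sub(r"_+", "_", name).strip("_")                # collapse/trim underscores
    return f"col_{name}" if not name or name[0].isdigit() else name


if __name__ == "__main__":
    src, dst = sys.argv[1], sys.argv[2]   # same CLI shape as the command above
    df = pd.read_csv(src)
    df.columns = [sanitize(c) for c in df.columns]
    df.to_csv(dst, index=False)
```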
## 3. Ingesting CSV Data

Run the import script:

```bash
python3 csv_postgres.py
```

What it does:

1. Reads all `.csv` files from `/home/ams/postgres/csv_files/`.
2. Drops entirely empty columns and maps each remaining column's pandas dtype to `INTEGER`, `FLOAT`, or `TEXT`.
3. Creates tables named `survey_data_<filename>` and inserts all rows.
4. Moves processed CSVs to `/home/ams/postgres/csv_files_old/`.
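
For reference, a condensed sketch of that loop is shown below. It is an illustration of the steps above, not the script itself; it assumes the connection settings come from `.env`, that Postgres is reachable on `localhost` (as in section 1), and that re-creating an existing table is acceptable:

```python
# Hypothetical sketch of the import loop described above; csv_postgres.py may differ.
import os
import shutil
from pathlib import Path

import pandas as pd
import psycopg2
from dotenv import load_dotenv

CSV_DIR = Path("/home/ams/postgres/csv_files")
DONE_DIR = Path("/home/ams/postgres/csv_files_old")


def pg_type(dtype) -> str:
    """Map a pandas dtype to INTEGER, FLOAT, or TEXT."""
    if pd.api.types.is_integer_dtype(dtype):
        return "INTEGER"
    if pd.api.types.is_float_dtype(dtype):
        return "FLOAT"
    return "TEXT"


load_dotenv()
conn = psycopg2.connect(
    host="localhost",
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    dbname=os.environ["POSTGRES_DB"],
)

for csv_path in sorted(CSV_DIR.glob("*.csv")):
    df = pd.read_csv(csv_path).dropna(axis=1, how="all")       # steps 1-2: read, drop empty columns
    table = f"survey_data_{csv_path.stem.lower()}"               # step 3: one table per file
    cols = ", ".join(f'"{c}" {pg_type(t)}' for c, t in df.dtypes.items())
    placeholders = ", ".join(["%s"] * len(df.columns))
    rows = [
        tuple(None if pd.isna(v) else (v.item() if hasattr(v, "item") else v) for v in rec)
        for rec in df.itertuples(index=False, name=None)
    ]
    with conn, conn.cursor() as cur:                             # commits on success
        cur.execute(f'DROP TABLE IF EXISTS "{table}"')           # assumption: re-runs start fresh
        cur.execute(f'CREATE TABLE "{table}" ({cols})')
        cur.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    DONE_DIR.mkdir(exist_ok=True)
    shutil.move(str(csv_path), str(DONE_DIR / csv_path.name))    # step 4: archive processed file

conn.close()
```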

## 4. Merging Survey Data with SQL

Place and edit your SQL merge scripts in the `sql/` directory. Example queries:

- **sql/merge_dgdata.sql**

  ```sql
  DROP TABLE IF EXISTS dgdata_merged;
  CREATE TABLE dgdata_merged AS
  SELECT * FROM survey_data_dgdata_2010
  UNION ALL
  SELECT * FROM survey_data_dgdata_2011
  -- ...repeat through 2015...
  ;
  ```

- **sql/merge_eendr.sql**

  ```sql
  DROP TABLE IF EXISTS eendr_merged;
  CREATE TABLE eendr_merged AS
  SELECT * FROM survey_data_eendr_2010
  UNION ALL
  -- ...through 2015...
  ;
  ```

- **sql/merge_surveys.sql**

  ```sql
  DROP TABLE IF EXISTS surveys_final;
  CREATE TABLE surveys_final AS
  SELECT
      survey_id,       -- with FULL OUTER JOIN ... USING, the unqualified join columns
      common_field1,   -- already coalesce both sides, so values from either survey survive
      common_field2,
      d.unique_dg_field,
      e.unique_eendr_field
  FROM dgdata_merged d
  FULL OUTER JOIN eendr_merged e
      USING (survey_id, common_field1, common_field2);
  ```

Columns that exist in only one survey appear as `NULL` for rows that came from the other.

Run any merge script with (the variables from `.env` must be exported in your shell):

```bash
psql -h localhost -U "$POSTGRES_USER" -d "$POSTGRES_DB" -f sql/merge_surveys.sql
```

## 5. Automated Backups

Backups are handled by `backup_postgres_single_db.sh`:

```bash
#!/bin/bash
# Load POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB from .env
source .env

# Settings
CONTAINER_NAME=postgres
BACKUP_DIR=/home/ams/postgres/backups
TIMESTAMP=$(date +"%Y%m%d%H%M%S")
BACKUP_FILE="$BACKUP_DIR/${POSTGRES_DB}_backup_$TIMESTAMP.sql"

mkdir -p "$BACKUP_DIR"

# No -t flag: a pseudo-TTY would add carriage returns to the redirected dump
docker exec -e PGPASSWORD="$POSTGRES_PASSWORD" "$CONTAINER_NAME" \
  pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" > "$BACKUP_FILE"

gzip "$BACKUP_FILE"

# Optional retention (delete gzipped backups older than 7 days):
# find "$BACKUP_DIR" -type f -name "${POSTGRES_DB}_backup_*.sql.gz" -mtime +7 -delete
```

Schedule daily backups (e.g., at 3 AM) via cron:

```cron
0 3 * * * /path/to/backup_postgres_single_db.sh
```