# AMS-Data-Mine
A migration and ETL pipeline to move legacy FileMaker Pro data into PostgreSQL, ingest and consolidate DGdata and EENDR survey CSVs (2010–2015), and maintain automated backups.
## Repository Structure
```
.
├── docker-compose.yml # Defines Postgres (and optional tunnel) services
├── .env # Environment variables (create locally)
├── csv_postgres.py # Python script to import CSV files into Postgres
├── clean.py # Utility to sanitize CSV column names
├── backup_postgres_single_db.sh # Backup script for Postgres (docker exec + pg_dump + gzip)
├── sql/ # SQL scripts for merging survey data
│ ├── merge_dgdata.sql
│ ├── merge_eendr.sql
│ └── merge_surveys.sql
└── README.md # This file
```
## Prerequisites
- Docker & Docker Compose
- Python 3.7+
- pip packages:
```bash
pip install pandas psycopg2-binary python-dotenv
```
- (Optional) Cloudflare Tunnel token for secure exposure
## 1. Environment Setup
1. Create a file named `.env` at the project root with the following:
```ini
POSTGRES_USER=your_pg_username
POSTGRES_PASSWORD=your_pg_password
POSTGRES_DB=your_db_name
# If using cloudflared tunnel:
TUNNEL_TOKEN=your_cloudflare_tunnel_token
```
2. Start services:
```bash
docker-compose up -d
```
This brings up:
- **postgres**: PostgreSQL 15, port 5432
- **cloudflared** (if configured): runs `tunnel run` to expose Postgres
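For reference, a minimal `docker-compose.yml` consistent with the setup described above might look like the sketch below. The service names, image tags, and volume paths (other than the documented CSV directory) are assumptions; the file shipped in the repository is authoritative.
```yaml
services:
  postgres:
    image: postgres:15
    container_name: postgres          # matches the name used by the backup script
    env_file: .env                    # POSTGRES_USER / POSTGRES_PASSWORD / POSTGRES_DB
    ports:
      - "5432:5432"
    volumes:
      - /home/ams/postgres/data:/var/lib/postgresql/data   # assumed data directory
      - /home/ams/postgres/csv_files:/csv_files            # assumed mount point for CSV drops

  cloudflared:                        # optional: omit if no tunnel is needed
    image: cloudflare/cloudflared:latest
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=${TUNNEL_TOKEN}
    depends_on:
      - postgres
```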
## 2. Migrating FileMaker Pro Data
1. Export each FileMaker Pro table as a CSV file.
2. (Optional) Clean column names into valid SQL identifiers with `clean.py` (a sketch of this utility follows the list):
```bash
python3 clean.py path/to/input.csv path/to/output.csv
```
3. Place your CSV files into the host directory mounted by Docker (default `/home/ams/postgres/csv_files/`).
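The repository's `clean.py` contains the authoritative logic; as a rough sketch of the kind of sanitization involved (lowercasing and replacing characters that are not valid in SQL identifiers), something like this would do — the exact rules in the real script may differ:
```python
#!/usr/bin/env python3
"""Sketch of a CSV column-name sanitizer (assumed behavior of clean.py)."""
import re
import sys

import pandas as pd


def sanitize(name: str) -> str:
    """Lowercase, collapse non-alphanumerics to '_', and trim the edges."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()
    # SQL identifiers cannot start with a digit
    return f"col_{cleaned}" if cleaned and cleaned[0].isdigit() else cleaned or "unnamed"


if __name__ == "__main__":
    input_csv, output_csv = sys.argv[1], sys.argv[2]
    df = pd.read_csv(input_csv)
    df.columns = [sanitize(c) for c in df.columns]
    df.to_csv(output_csv, index=False)
```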
## 3. Ingesting CSV Data
Run the import script:
```bash
python3 csv_postgres.py
```
What it does:
1. Reads all `.csv` files from `/home/ams/postgres/csv_files/`.
2. Drops entirely empty columns and converts DataFrame types to `INTEGER`, `FLOAT`, or `TEXT`.
3. Creates tables named `survey_data_<filename>` and inserts all rows.
4. Moves processed CSVs to `/home/ams/postgres/csv_files_old/`.
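For orientation, here is a rough sketch of that flow. It mirrors the steps above but is not the actual `csv_postgres.py`; the real script's type mapping, connection settings, and error handling may differ (the sketch assumes the Postgres port is published on `localhost`):
```python
#!/usr/bin/env python3
"""Rough sketch of the CSV import flow (assumed; see csv_postgres.py for the real logic)."""
import os
import shutil

import pandas as pd
import psycopg2
from dotenv import load_dotenv

CSV_DIR = "/home/ams/postgres/csv_files/"
DONE_DIR = "/home/ams/postgres/csv_files_old/"


def sql_type(dtype) -> str:
    """Map a pandas dtype onto the three SQL types the importer uses."""
    if pd.api.types.is_integer_dtype(dtype):
        return "INTEGER"
    if pd.api.types.is_float_dtype(dtype):
        return "FLOAT"
    return "TEXT"


def to_native(value):
    """Convert NaN to NULL and numpy scalars to plain Python values for psycopg2."""
    if pd.isna(value):
        return None
    return value.item() if hasattr(value, "item") else value


load_dotenv()
conn = psycopg2.connect(
    host="localhost",
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    dbname=os.environ["POSTGRES_DB"],
)

os.makedirs(DONE_DIR, exist_ok=True)
with conn, conn.cursor() as cur:
    for filename in sorted(os.listdir(CSV_DIR)):
        if not filename.endswith(".csv"):
            continue
        df = pd.read_csv(os.path.join(CSV_DIR, filename))
        df = df.dropna(axis=1, how="all")  # drop entirely empty columns
        table = f"survey_data_{os.path.splitext(filename)[0].lower()}"

        columns = ", ".join(f'"{col}" {sql_type(dt)}' for col, dt in df.dtypes.items())
        cur.execute(f'DROP TABLE IF EXISTS "{table}"')
        cur.execute(f'CREATE TABLE "{table}" ({columns})')

        placeholders = ", ".join(["%s"] * len(df.columns))
        rows = [tuple(to_native(v) for v in row)
                for row in df.itertuples(index=False, name=None)]
        cur.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)

        shutil.move(os.path.join(CSV_DIR, filename), os.path.join(DONE_DIR, filename))
conn.close()
```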
## 4. Merging Survey Data with SQL
Example queries for merging tables:
- **sql/merge_dgdata.sql**
```sql
DROP TABLE IF EXISTS dgdata_merged;
CREATE TABLE dgdata_merged AS
SELECT * FROM survey_data_dgdata_2010
UNION ALL
SELECT * FROM survey_data_dgdata_2011
-- ...repeat through 2015...
;
```
- **sql/merge_eendr.sql**
```sql
DROP TABLE IF EXISTS eendr_merged;
CREATE TABLE eendr_merged AS
SELECT * FROM survey_data_eendr_2010
UNION ALL
-- ...through 2015...
;
```
- **sql/merge_surveys.sql**
```sql
DROP TABLE IF EXISTS surveys_final;
CREATE TABLE surveys_final AS
SELECT
  survey_id,       -- USING exposes the join columns unqualified, coalesced across both tables
  common_field1,
  common_field2,
  d.unique_dg_field,
  e.unique_eendr_field
FROM dgdata_merged d
FULL OUTER JOIN eendr_merged e
USING (survey_id, common_field1, common_field2);
```
Columns not present in one survey will appear as `NULL`.
Run any merge script with:
```bash
psql -h localhost -U $POSTGRES_USER -d $POSTGRES_DB -f sql/merge_surveys.sql
```
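If the `psql` client is not installed on the host, the same scripts can be fed to the client inside the container instead (this assumes the container is named `postgres`, as in the backup script below):
```bash
docker exec -i -e PGPASSWORD=$POSTGRES_PASSWORD postgres \
  psql -U $POSTGRES_USER -d $POSTGRES_DB < sql/merge_surveys.sql
```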
## 5. Automated Backups
Backups are handled by `backup_postgres_single_db.sh`:
```bash
#!/bin/bash
# Load POSTGRES_USER / POSTGRES_PASSWORD / POSTGRES_DB from the project .env file
source .env

# Settings
CONTAINER_NAME=postgres
BACKUP_DIR=/home/ams/postgres/backups
TIMESTAMP=$(date +"%Y%m%d%H%M%S")
BACKUP_FILE="$BACKUP_DIR/${POSTGRES_DB}_backup_$TIMESTAMP.sql"

mkdir -p "$BACKUP_DIR"

# Run pg_dump inside the container (no -t: a TTY would corrupt the dump with CR line endings)
docker exec -e PGPASSWORD="$POSTGRES_PASSWORD" "$CONTAINER_NAME" \
  pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" > "$BACKUP_FILE"

gzip "$BACKUP_FILE"

# Optional retention: delete compressed backups older than 7 days
# find "$BACKUP_DIR" -type f -name "${POSTGRES_DB}_backup_*.sql.gz" -mtime +7 -delete
```
Schedule daily backups (e.g., at 3AM) via cron:
```cron
0 3 * * * /path/to/backup_postgres_single_db.sh
```
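To restore from one of these backups, something along the following lines should work; the file name is only an example, and the target database must already exist:
```bash
gunzip -c /home/ams/postgres/backups/your_db_name_backup_20250101030000.sql.gz | \
  docker exec -i -e PGPASSWORD=$POSTGRES_PASSWORD postgres \
    psql -U $POSTGRES_USER -d $POSTGRES_DB
```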