# AMS-Data-Mine
A migration and ETL pipeline to move legacy FileMaker Pro data into PostgreSQL, ingest and consolidate DGdata and EENDR survey CSVs (2010–2015), and maintain automated backups.
## Repository Structure
```
.
├── docker-compose.yml             # Defines Postgres (and optional tunnel) services
├── .env                           # Environment variables (create locally)
├── csv_postgres.py                # Python script to import CSV files into Postgres
├── clean.py                       # Utility to sanitize CSV column names
├── backup_postgres_single_db.sh   # Backup script for Postgres (docker exec + pg_dump + gzip)
├── sql/                           # SQL scripts for merging survey data
│   ├── merge_dgdata.sql
│   ├── merge_eendr.sql
│   └── merge_surveys.sql
└── README.md                      # This file
```
## Prerequisites
- Docker & Docker Compose
- Python 3.7+
- pip packages:

  ```bash
  pip install pandas psycopg2-binary python-dotenv
  ```

- (Optional) Cloudflare Tunnel token for secure exposure
## 1. Environment Setup
1. Create a file named `.env` at the project root with the following:

   ```ini
   POSTGRES_USER=your_pg_username
   POSTGRES_PASSWORD=your_pg_password
   POSTGRES_DB=your_db_name
   # If using cloudflared tunnel:
   TUNNEL_TOKEN=your_cloudflare_tunnel_token
   ```

2. Start services:

   ```bash
   docker-compose up -d
   ```
This brings up:
- **postgres**: PostgreSQL 15, port 5432
- **cloudflared** (if configured): runs `tunnel run` to expose Postgres
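
Once the stack is up, you can sanity-check connectivity from Python using the same `.env` credentials. This is a minimal sketch; `localhost:5432` assumes the default port mapping listed above:

```python
# Quick connectivity check; host/port assume the default compose mapping.
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads POSTGRES_USER / POSTGRES_PASSWORD / POSTGRES_DB from .env

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    dbname=os.environ["POSTGRES_DB"],
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])  # e.g. "PostgreSQL 15.x ..."
conn.close()
```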
## 2. Migrating FileMaker Pro Data
1. Export each FileMaker Pro table as a CSV file.
2. (Optional) Clean column names to valid SQL identifiers (a sketch of such a sanitizer follows this list):

   ```bash
   python3 clean.py path/to/input.csv path/to/output.csv
   ```

3. Place your CSV files into the host directory mounted by Docker (default `/home/ams/postgres/csv_files/`).
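
`clean.py` itself is the authoritative sanitizer; the sketch below only illustrates what such a script typically does (lowercasing, collapsing non-alphanumerics to underscores, guarding leading digits). The exact rules here are assumptions:

```python
# Hypothetical sketch of a clean.py-style column sanitizer;
# the repository's actual script is authoritative.
import re
import sys

import pandas as pd


def sanitize(name: str) -> str:
    """Make a CSV header safe to use as a SQL identifier."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()
    # SQL identifiers must not start with a digit
    return f"col_{name}" if name and name[0].isdigit() else (name or "col")


if __name__ == "__main__":
    src, dst = sys.argv[1], sys.argv[2]
    df = pd.read_csv(src)
    df.columns = [sanitize(c) for c in df.columns]
    df.to_csv(dst, index=False)
```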
## 3. Ingesting CSV Data
Run the import script:
```bash
python3 csv_postgres.py
```
What it does:
1. Reads all `.csv` files from `/home/ams/postgres/csv_files/`.
2. Drops entirely empty columns and maps each column's pandas dtype to `INTEGER`, `FLOAT`, or `TEXT`.
3. Creates tables named `survey_data_<filename>` and inserts all rows.
4. Moves processed CSVs to `/home/ams/postgres/csv_files_old/`.
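
For orientation, here is a condensed sketch of that flow. The actual `csv_postgres.py` is authoritative; the connection details, type mapping, and insert strategy below are assumptions:

```python
# Hypothetical sketch of the csv_postgres.py flow; see the real script
# for the authoritative behavior.
import os
import shutil

import pandas as pd
import psycopg2
from dotenv import load_dotenv

load_dotenv()  # POSTGRES_USER / POSTGRES_PASSWORD / POSTGRES_DB

CSV_DIR = "/home/ams/postgres/csv_files"
DONE_DIR = "/home/ams/postgres/csv_files_old"


def pg_type(dtype) -> str:
    """Map a pandas dtype to a Postgres column type (assumed mapping)."""
    if pd.api.types.is_integer_dtype(dtype):
        return "INTEGER"
    if pd.api.types.is_float_dtype(dtype):
        return "FLOAT"
    return "TEXT"


os.makedirs(DONE_DIR, exist_ok=True)
conn = psycopg2.connect(
    host="localhost",
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    dbname=os.environ["POSTGRES_DB"],
)

with conn, conn.cursor() as cur:
    for fname in sorted(os.listdir(CSV_DIR)):
        if not fname.endswith(".csv"):
            continue
        df = pd.read_csv(os.path.join(CSV_DIR, fname))
        df = df.dropna(axis=1, how="all")  # drop entirely empty columns
        table = f"survey_data_{os.path.splitext(fname)[0]}"
        cols = ", ".join(f'"{c}" {pg_type(t)}' for c, t in df.dtypes.items())
        cur.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        marks = ", ".join(["%s"] * len(df.columns))
        for row in df.itertuples(index=False, name=None):
            # NaN -> NULL; unwrap numpy scalars so psycopg2 can adapt them
            values = tuple(
                None if pd.isna(v) else (v.item() if hasattr(v, "item") else v)
                for v in row
            )
            cur.execute(f'INSERT INTO "{table}" VALUES ({marks})', values)
        shutil.move(os.path.join(CSV_DIR, fname), os.path.join(DONE_DIR, fname))

conn.close()
```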
## 4. Merging Survey Data with SQL
Example queries for merging tables:
- **sql/merge_dgdata.sql**

  ```sql
  DROP TABLE IF EXISTS dgdata_merged;
  CREATE TABLE dgdata_merged AS
  SELECT * FROM survey_data_dgdata_2010
  UNION ALL
  SELECT * FROM survey_data_dgdata_2011
  -- ...repeat through 2015...
  ;
  ```
- **sql/merge_eendr.sql**

  ```sql
  DROP TABLE IF EXISTS eendr_merged;
  CREATE TABLE eendr_merged AS
  SELECT * FROM survey_data_eendr_2010
  UNION ALL
  -- ...through 2015...
  ;
  ```
- **sql/merge_surveys.sql**

  ```sql
  DROP TABLE IF EXISTS surveys_final;
  CREATE TABLE surveys_final AS
  SELECT
      -- unqualified USING columns are coalesced across both inputs,
      -- so these stay populated even for rows found in only one survey
      survey_id,
      common_field1,
      common_field2,
      d.unique_dg_field,
      e.unique_eendr_field
  FROM dgdata_merged d
  FULL OUTER JOIN eendr_merged e
      USING (survey_id, common_field1, common_field2);
  ```
Columns that exist in only one survey will be `NULL` for rows coming from the other.

Export the `.env` variables in your shell, then run any merge script with:
```bash
psql -h localhost -U $POSTGRES_USER -d $POSTGRES_DB -f sql/merge_surveys.sql
```
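
The same scripts can also be run from Python via psycopg2. This is a sketch of an equivalent alternative to the psql call above, assuming the `.env` file and repo-relative `sql/` paths:

```python
# Execute a merge script through psycopg2; equivalent to the psql call above.
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(
    host="localhost",
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    dbname=os.environ["POSTGRES_DB"],
)
with conn, conn.cursor() as cur:
    # psycopg2 accepts the whole multi-statement file in one execute()
    with open("sql/merge_surveys.sql") as f:
        cur.execute(f.read())
conn.close()
```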
## 5. Automated Backups
Backups are handled by `backup_postgres_single_db.sh`:
```bash
#!/bin/bash
set -euo pipefail

# Run from the script's directory so `source .env` also works under cron
cd "$(dirname "$0")"

# Load POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB
source .env

# Settings
CONTAINER_NAME=postgres
BACKUP_DIR=/home/ams/postgres/backups
TIMESTAMP=$(date +"%Y%m%d%H%M%S")
BACKUP_FILE="$BACKUP_DIR/${POSTGRES_DB}_backup_$TIMESTAMP.sql"

mkdir -p "$BACKUP_DIR"

# No -t flag: a pseudo-TTY would inject carriage returns into the dump
docker exec -e PGPASSWORD="$POSTGRES_PASSWORD" "$CONTAINER_NAME" \
  pg_dump -U "$POSTGRES_USER" "$POSTGRES_DB" > "$BACKUP_FILE"

gzip "$BACKUP_FILE"

# Optional retention: delete gzipped backups older than 7 days
# find "$BACKUP_DIR" -type f -name "${POSTGRES_DB}_backup_*.sql.gz" -mtime +7 -delete
```
Schedule daily backups (e.g., at 3 AM) via cron:
```cron
0 3 * * * /path/to/backup_postgres_single_db.sh
```