Building the Stars: A PostgreSQL-Powered Database with NLP and Interactive UI
- Alek Norris
- Dec 17, 2024
- 4 min read

Project Story
As I approach the final days of my academic journey, I've been given an incredible opportunity to work on a year-long project with fellow graduating seniors, sponsored by our school. We're collaborating with a former member of the POSYDON project, which focuses on the study of binary stars, their interactions, and how they reflect our known physical laws. Our project is far from complete, and this story will continue to evolve as we progress through the year.
The Problem
Astrophysicists studying binary stars use a program called POSYDON to simulate binary star interactions using advanced machine learning techniques. POSYDON generates a large amount of data (terabytes), which is extremely challenging to search through, considering it's in an .H5 format (think of it as a semi-structured CSV file). Our team is developing a database solution with an intuitive and easy-to-use front end, enhanced with NLP (Natural Language Processing), to enable new and efficient ways to sort and find the relevant information.
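To make the .H5 format concrete, here is a minimal sketch of how an ingestion script might walk such a file using Python's h5py library (which the backend plans to use). The group and dataset names ("binary_history", "star_1_mass", "orbital_period") are hypothetical placeholders, not POSYDON's actual schema.

```python
import h5py

# Build a tiny in-memory .H5 file standing in for a POSYDON output
# (driver="core" with backing_store=False keeps it off disk).
f = h5py.File("example.h5", "w", driver="core", backing_store=False)
grp = f.create_group("binary_history")
grp.create_dataset("star_1_mass", data=[1.4, 1.3, 1.2])
grp.create_dataset("orbital_period", data=[10.0, 9.5, 9.1])

# Walk the hierarchy the way an ingestion script would: collect the
# full path of every dataset, skipping the groups that contain them.
datasets = []
f.visit(lambda name: datasets.append(name)
        if isinstance(f[name], h5py.Dataset) else None)

# Datasets are addressed by path, much like columns in a nested CSV.
masses = f["binary_history/star_1_mass"][:]
f.close()
```

The hierarchical group/dataset layout is what makes .H5 "semi-structured": unlike a flat CSV, the schema has to be discovered by walking the tree before the data can be mapped to relational tables.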
Deliverables
Web-Based Application: A user-friendly frontend UI for querying and displaying results.
Relational Database: A database for storing parsed POSYDON data.
Data Ingestion Tool: Scripts and utilities for parsing .H5 and CSV files into the database.
Documentation: Instructions for using the system and guidelines for future development.
Our Approach
To address the challenges of working with POSYDON's large datasets, our team is designing and implementing a solution that bridges the gap between complex data formats and user accessibility. Our goal is to create a program that not only supports the talented scientists involved in this project but also engages students and educators interested in studying and researching the stars. While the program is designed specifically for binary star data, it will be adaptable to work with other datasets in the future.
Project Overview and Technology Stack
To bring our vision to life, we've outlined a structured plan that combines robust tools and modern technologies commonly used across the industry in web application development, automation, and database building.

Areas of Concern
For this design, our team identified four major areas of concern that need to be addressed. For additional detail, please refer to Figure 2.
UI: The frontend, everything users will see and actively interact with.
Database: The database itself.
SQL Querying System: The system that helps users generate and format queries, leverages the NLP model, and ensures each query is valid before it is sent to the database.
Data Parsing: Where the .H5 and CSV files are parsed, cleaned, normalized, and uploaded into the database in a way that reflects their intended structure.
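The normalization step in the data-parsing area can be sketched in a few lines: turning a column-oriented record (as it might come out of an .H5 group) into flat per-step rows ready for a relational table. The field names here are hypothetical examples, not the project's actual schema.

```python
def flatten_history(system_id, history):
    """Turn a {column: [values...]} record into one row dict per
    time step, tagged with the system it belongs to."""
    columns = list(history)
    n_steps = len(history[columns[0]])
    rows = []
    for step in range(n_steps):
        row = {"system_id": system_id, "step": step}
        for col in columns:
            row[col] = history[col][step]
        rows.append(row)
    return rows

# Two time steps of a hypothetical binary system, flattened to rows.
history = {"star_1_mass": [1.4, 1.3], "orbital_period": [10.0, 9.5]}
rows = flatten_history(42, history)
```

Each output row maps directly onto an INSERT statement, which is what lets terabytes of hierarchical simulation output become queryable relational data.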
Interconnections
To accomplish these tasks, we broke the project into three pieces: Frontend, Backend, and Database.
Frontend: Handles all user interaction: taking in user input, displaying results, and giving users easy access to the data.
Selected Technologies: React + TypeScript
Backend: Handles the handoff from the frontend to the database: parsing, querying, updating the database, and integrating the NLP model.
Selected Technologies: Python and OpenAI
Database: This is where the data is actually stored, sorted, and can be retrieved.
Selected Technologies: PostgreSQL

Chosen Tech Stack
Frontend: React + TypeScript
Chosen for its modern, component-based architecture and strong typing with TypeScript, ensuring a scalable and maintainable UI for data visualization and user interactions.
Backend: Python
Chosen because Python offers robust support for data processing and integration with tools like h5py and NLP libraries, making it ideal for parsing, querying, and managing large datasets.
Database: PostgreSQL
PostgreSQL provides a reliable, scalable relational database with strong support for complex queries and structured data, ensuring efficient storage and retrieval of POSYDON simulation data.
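As one illustration of the kind of relational layout the parsed data could take, here is a hypothetical two-table schema: one row per binary system, one row per simulation time step. It is sketched with Python's built-in sqlite3 so it runs without a server; the project itself targets PostgreSQL, and all table and column names here are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per simulated binary system.
conn.execute("""
    CREATE TABLE binary_system (
        system_id   INTEGER PRIMARY KEY,
        metallicity REAL
    )""")

# One row per time step of a system's evolution history.
conn.execute("""
    CREATE TABLE history_step (
        system_id      INTEGER REFERENCES binary_system(system_id),
        step           INTEGER,
        star_1_mass    REAL,
        orbital_period REAL,
        PRIMARY KEY (system_id, step)
    )""")

conn.execute("INSERT INTO binary_system VALUES (1, 0.02)")
conn.executemany(
    "INSERT INTO history_step VALUES (?, ?, ?, ?)",
    [(1, 0, 1.4, 10.0), (1, 1, 1.3, 9.5)])

# Example query: star 1's mass at a given step of a given system.
row = conn.execute(
    "SELECT star_1_mass FROM history_step "
    "WHERE system_id = 1 AND step = 1").fetchone()
```

Keyed per-step rows like this are what enable the complex filtering and aggregation queries that are impractical against raw .H5 files.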
NLP/LLM model: OpenAI ChatGPT
OpenAI's models enable natural language processing to allow users to query the database in plain English, enhancing accessibility for researchers and students without SQL expertise.
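A hedged sketch of how that NLP-to-SQL step might look: build a schema-aware prompt for an OpenAI chat model, then validate that whatever comes back is a read-only statement before it ever reaches the database. The schema string, model name, and validation rule are illustrative assumptions, not the team's finalized design; the actual API call is shown in comments since it needs a funded API key.

```python
# Hypothetical schema summary handed to the model as context.
SCHEMA = "history_step(system_id, step, star_1_mass, orbital_period)"

def build_prompt(question):
    """Wrap a plain-English question in a schema-aware instruction."""
    return (f"Given the table {SCHEMA}, write one PostgreSQL SELECT "
            f"statement answering: {question}")

def is_read_only(sql):
    """Guardrail: accept only a single SELECT statement, since model
    output is untrusted input from the database's point of view."""
    stripped = sql.strip().rstrip(";")
    return stripped.upper().startswith("SELECT") and ";" not in stripped

# How the call might look with the openai package (not executed here):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_prompt(question)}],
# )
# sql = resp.choices[0].message.content

prompt = build_prompt("What is the final mass of star 1 in system 42?")
safe = is_read_only("SELECT star_1_mass FROM history_step WHERE system_id = 42")
unsafe = is_read_only("DROP TABLE history_step")
```

Validating the generated SQL before execution matters here: the model is effectively an untrusted query author, so the backend, not the model, must enforce what is allowed to run.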
Current Progress
Data Parsing: Successfully parsed initial .H5 files into a test database.
Frontend Prototype: A basic UI for user interactions has been implemented using React. It can query a localhost PostgreSQL database, save queries for future use, and display the returned data.
Backend Development: Early integration of NLP queries for generating SQL statements. Initial Python scripts using OpenAI's API have already been created; however, we are still awaiting funding from the university to begin using and testing the NLP design.
Database Setup: PostgreSQL is up and running with normalized sample data provided on the POSYDON project's main page.
Figure 4: Current Progress
Upcoming Milestones
Finalize data parsing scripts to handle large .H5 datasets efficiently.
Implement advanced NLP features for natural language queries.
Refine the frontend UI for better user experience and data visualization.
Conduct system testing to ensure smooth interactions between the frontend, backend, and database.
Testing and Plans for Expansion
Testing: Focuses on verifying query accuracy, data integrity, and user interactions. For these tests we have selected industry-standard technologies, listed below:
Backend
Unit testing: Tox
Integration Testing: Tox
System Testing: Tox
Frontend
Unit Testing: Jest
Integration Testing: Cypress
System Testing: Cypress
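To give a flavor of the backend side of this plan: Tox orchestrates test environments, while the tests it runs can be plain pytest-style functions. Below is a minimal sketch of such a unit test; the helper under test (a clamp on user-supplied row limits) is a hypothetical example, not code from the project.

```python
def format_limit_clause(limit):
    """Clamp a user-supplied row limit to a safe range before it is
    appended to a query, so one request can't pull the whole table."""
    return f"LIMIT {max(1, min(int(limit), 1000))}"

def test_limit_is_clamped():
    # Normal values pass through unchanged.
    assert format_limit_clause(5) == "LIMIT 5"
    # Out-of-range values are clamped to the allowed bounds.
    assert format_limit_clause(0) == "LIMIT 1"
    assert format_limit_clause(10_000) == "LIMIT 1000"

test_limit_is_clamped()  # pytest would collect and run this automatically
```

Keeping helpers like this pure (no database connection required) is what makes them cheap to cover at the unit level, leaving Tox's integration environments for tests that need a live PostgreSQL instance.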
Future Plans: Since this is a multi-semester project, we’re prioritizing a modular design to make future additions (e.g., security, scalability, and full deployment) straightforward.
For more information on the POSYDON Project please feel free to visit: https://posydon.org