Created by Nick Davila with help from the awesome people at GEVIP
Explore the secondary research for this project»
One of GEVIP’s (Galaxy Evolution Vertically Integrated Project) themes is working with HETDEX (Hobby-Eberly Telescope Dark Energy Experiment). HETDEX is an unbiased spectroscopic survey using the 10 m Hobby-Eberly Telescope (HET) and its VIRUS integral-field unit (IFU) spectrograph. HETDEX is in the process of discovering distant galaxies on the basis of their strong Lyman-α emission. In some GEVIP projects, we use these Lyman-α emitting galaxies with the goal of understanding how the Milky Way galaxy was formed. To obtain usable data, we must classify the Lyman-α emitting galaxies within large data sets that contain many different kinds of astronomical objects. Traditionally, we classify by dividing astronomical objects into groups based on their visual appearance. However, astronomical data are getting larger and more complex, so we are turning to machine learning algorithms that can adapt to increasingly large data sets. This project therefore aims to train a Random Forest Classifier to classify astronomical spectra and differentiate between noise spectra and high-redshift galaxy spectra.
To maximize our discovery space, we need to push our detections to low signal-to-noise (very noisy data), so we need a robust way to differentiate between true astrophysical objects and noise features in the data catalog. Historically, ML algorithms have struggled to differentiate real spectra from noise spectra, and the motivation for this project was to implement an algorithm that solves this problem specifically. This will allow more high-redshift sources to be studied, which will help us learn more about the epoch of reionization in the universe.
We used the HETDEX internal data release 3.0.1 (HDR3) and the HETDEX API: https://github.com/HETDEX/hetdex_api
Import the following libraries:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns  # statistical data visualization
from sklearn.model_selection import train_test_split  # split data into training and test sets
from sklearn.model_selection import cross_val_score   # cross-validation scoring
from sklearn.ensemble import RandomForestClassifier   # the classifier we train
from sklearn import metrics                           # evaluation metrics
# All of these packages can be installed with pip
The next prerequisite step is importing data. We use an internal detections catalog for our high-redshift galaxies (sources that were visually classified and vetted by many people). For the noise data, I took the HDR3 catalog (specifically, the photometry) and extracted sources from regions of the sky with no detections within 200 arcseconds.
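In practice the catalogs are accessed through the HETDEX API, but as a minimal sketch, assuming the two samples have already been exported to CSV files (the file names below are hypothetical), loading them with pandas could look like:

import pandas as pd

# Hypothetical file names: the real samples come from the internal
# detections catalog (vetted high-z galaxies) and from HDR3 photometry
# regions with no detections within 200 arcseconds (noise sources).
galaxies = pd.read_csv('hdr3_highz_galaxies.csv')
noise = pd.read_csv('hdr3_noise_sources.csv')

print(galaxies.shape, noise.shape)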
- Create a sample of high-redshift galaxies and noise sources. We create a data set of 20,000 sources: 10,000 high-redshift galaxies and 10,000 noise sources.
- For binary classification, the data must be labeled. We chose '1' to mean a high-redshift galaxy and '0' to mean a noise source (a sketch of these steps follows below).
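Putting these steps together, a minimal sketch of building the balanced, labeled data set and training the Random Forest could look like the following. The galaxies and noise DataFrames from the sketch above, the feature columns, and the hyperparameters are all assumptions; the real project extracts its features from the HETDEX spectra.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Draw a balanced sample: 10,000 high-redshift galaxies and 10,000 noise sources
# (assumes both DataFrames have at least 10,000 rows and share the same feature columns).
n_per_class = 10_000
gal_sample = galaxies.sample(n=n_per_class, random_state=42).assign(label=1)   # 1 = high-z galaxy
noise_sample = noise.sample(n=n_per_class, random_state=42).assign(label=0)    # 0 = noise source
data = pd.concat([gal_sample, noise_sample], ignore_index=True)

# 'label' is the target; every other column is treated as a feature here.
feature_columns = [c for c in data.columns if c != 'label']
X = data[feature_columns].values
y = data['label'].values

# Hold out 20% of the sources for testing, keeping the two classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Train the Random Forest Classifier and evaluate it.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Test accuracy:', metrics.accuracy_score(y_test, y_pred))
print('5-fold cross-validation scores:', cross_val_score(clf, X_train, y_train, cv=5))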
If you have a suggestion that would make this better, please fork the repo and create a pull request, or simply email me!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Nick Davila - ndavila@utexas.edu
- Very special thanks to Oscar A. Chavez Ortiz for guiding me throughout the entire project.
- Thank you to Gene Leung and Steven Finkelstein for their expert advice along the way.
- Thank you to all my peers in GEVIP for the tips and inspiration.
Distributed under the MIT License.
