This repository contains the work done in our Open Weight Large Language Models (LLM) Safety Lab. Our main goal is to test open-weight models and tools for text classification, summarization, extraction, and question answering, particularly on academic papers and research documents. We also retrieve research papers from Semantic Scholar and other sources.

├──.env
├──.git
├──.gitignore
├──amazon_comprehend
├──Building_AI_Papers_Repository_by_Industry
│   ├──Download_arXiV
│   └──Retrieve_ExternPaperID_arXiv
├──datasets
│   ├──03_30 Scopus Papers Database.csv
│   ├──df_with_overlap_and_id.csv
│   ├──exceeding_tokens.csv
│   ├──fewshot_industry.csv
│   ├──groq_mixtral_answers.csv
│   ├──HF_dataset_4_11_cleaned.csv
│   ├──llm_training_data_fta.csv
│   ├──llm_training_data_full_text.csv
│   ├──mx_answers_by_abstract.csv
│   ├──overlap_df_full_text.csv
│   ├──ra_scopus_llm_full_text.csv
│   ├──readme.md
│   ├──Scopus_Database 4_11_2024.csv
│   ├──scopus_full_text.csv
│   ├──scopus_rag_answers.csv
│   ├──scopus_rag_with_json.csv
│   └──semantic_man_verified_papers_full_text.csv
├──DB_Create_Script.sql
├──DB_ERD.png
├──hf_models.csv
├──hf_models.txt
├──hub_models.py
├──Insights.sql
├──JSON
├──LICENSE
├──llm_application_papers_data.sql
├──llm_ERD.pgerd
├──main.py
├──paper_insights.sql
├──pics
│   ├──DL_percentiles.csv
│   ├──models_by_task.png
│   ├──number of popular models per task.png
│   └──tasks performed by popular models.png
├──playground.ipynb
├──RAG_mixtral
│   └──mixtral-milvus-rag.ipynb
├──README.md
├──Retrieving_Papers_Semantic_Scholar
├──Retrieving_ResearchPapers_SemanticScholar
│   ├──Amazon_Comprehend
│   ├──Step 2. Bulkdownload_SemanticScholars_Paperswithover1000
│   ├──Step 3. RetrieveInformation_BulkDownload
│   ├──Step 4. Create master dataset
│   └──Step1. SearchResultYield_isbelow_1000
├──Scopus_Papers_Cleanup.sql
├──tags.csv
├──testing
│   ├──ollama.py
│   └──test.ipynb
├──text_inferencing
│   ├──inferencing-phi-2.ipynb
│   ├──inferencing_mistral-7B.ipynb
│   ├──llama_8b_groq.ipynb
│   ├──llm_training_data_full_text.json
│   ├──mixtral_8x7b_groq.ipynb
│   ├──mixtral_answers_by_abstract.ipynb
│   └──semantic_man_verified_full_text.json
├──text_extraction
│   ├──pdf_text_extraction.ipynb
│   └──scopus_text_extraction.ipynb
├──text_summarization
│   ├──text_summarization_using_bert.ipynb
│   └──text_summarization_using_phi-2.ipynb
├──text_sum_llm.py
├──Updated Google Scholar Code
└──utils
    ├──add_data_milvus.py
    ├──config_serverless.ini
    ├──GroqApi.py
    ├──OllamaApi.py
    ├──main.py
    ├──PDFExtractor.py
    ├──rag-tester.py
    ├──RAG_Milvus.py
    ├──TextSummarizer.py
    └──JSONExtractor.py
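
A listing like the one above tends to drift as files are added, so it may help to regenerate it from the working copy rather than edit it by hand. The following is a minimal sketch using only the Python standard library; `render_tree` and its `skip` parameter are hypothetical names introduced for illustration and are not part of this repository, and the case-insensitive sort order is an assumption.

```python
from pathlib import Path


def render_tree(
    root: Path,
    prefix: str = "",
    skip: frozenset[str] = frozenset({".git"}),
) -> list[str]:
    """Return tree-style lines for every entry below `root`.

    `skip` holds entry names to leave out entirely (an assumption:
    version-control internals are usually not worth listing).
    """
    # Case-insensitive sort, so "README.md" lands between "pics" and "utils".
    entries = sorted(
        (p for p in root.iterdir() if p.name not in skip),
        key=lambda p: p.name.lower(),
    )
    lines = []
    for index, entry in enumerate(entries):
        is_last = index == len(entries) - 1
        connector = "└──" if is_last else "├──"
        lines.append(f"{prefix}{connector}{entry.name}")
        if entry.is_dir():
            # Keep drawing the vertical rule only while siblings remain below.
            child_prefix = prefix + ("    " if is_last else "│   ")
            lines.extend(render_tree(entry, child_prefix, skip))
    return lines
```

Calling `print("\n".join(render_tree(Path("."))))` from the repository root would print the current layout; entries can be excluded by passing a different `skip` set.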