Hi there! 👋 I'm Firoz Shaik

Welcome to my GitHub profile! I'm a Computer Science graduate student at the University of Illinois at Chicago (UIC) with a keen interest in Data Mining, NLP, and Data Science. I'm currently doing my thesis research in NLP under the guidance of Prof. Cornelia Caragea. My research centers on Natural Language Understanding (NLU) and Natural Language Inference (NLI) on scientific texts, leveraging Large Language Models (LLMs). I work extensively with deep learning techniques, transfer learning, and domain adaptation to improve the understanding of complex scientific language.

Before joining UIC, I spent a year at Infosys and over a year and a half at Tata Consultancy Services, where I specialized as an Azure Data Engineer. I worked extensively with Azure Data Factory, building and optimizing scalable data ingestion pipelines and automating processes for improved efficiency. I also gained hands-on experience with CI/CD implementation through Azure DevOps and tackled complex SQL development tasks for managing data transformations and improving data quality.

Here's a glimpse of what I'm up to:

🎓 Education

Master of Science in Computer Science, University of Illinois at Chicago (Aug'23 – May'25)
Bachelor of Technology in Computer Science and Engineering, R.V.R & J.C College of Engineering, Andhra Pradesh, India (Jul'16 – Sep'20)

💼 Work Experience

University of Illinois Chicago, USA (Aug'24 – Present)

Developed 'MISMATCHED', a novel 2,700-pair out-of-domain (OOD) scientific NLI benchmark (Psychology, Engineering, Public Health). The work, accepted to ACL Findings 2025, established that fine-tuned small language models (SLMs; SciBERT at 78.17% Macro F1) significantly outperform even SOTA prompted LLMs such as GPT-4o (~63%).
Led comprehensive evaluations on MISMATCHED (benchmarking diverse models including GPT-4o and Gemini 1.5 Pro, and validating dataset integrity) and devised an implicit relations technique that boosted SciBERT by 1.5% Macro F1, underscoring SLM superiority in OOD scientific NLI.
Currently investigating methods to integrate graph structures (knowledge graphs) encoded through Graph Neural Networks into Large Language Models for enhanced NLP performance; also exploring model explainability, robustness, and calibration to improve deep learning reliability and interpretability.
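
Since the results above are reported in Macro F1, here is a minimal sketch of how that metric is computed: per-class F1 scores averaged with equal weight, so rare classes count as much as frequent ones. The labels below are toy data invented for illustration, not MISMATCHED examples, and the class names are placeholders.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (assumption: invented for this example only).
gold = ["entailment", "contrasting", "neutral", "reasoning", "neutral", "entailment"]
pred = ["entailment", "neutral", "neutral", "reasoning", "neutral", "contrasting"]
score = macro_f1(gold, pred)
print(f"Macro F1: {score:.4f}")
```

Accuracy on this toy set is 4/6, but Macro F1 is lower because the class that is never predicted correctly contributes an F1 of zero to the average.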

Infosys Limited, Bengaluru, India (Jul'22 – Jul'23)

Improved performance of Azure Databricks notebooks by 20% for 5 TB/day analytics tasks by tuning Apache Spark (executor memory, parallelism) and Python code; utilized ADLS Gen2 with Delta Lake for robust and efficient data processing.
Wrote advanced SQL scripts in Azure Synapse Analytics that boosted BI reporting efficiency by 30%, achieved through query streamlining and index/partitioning optimization; also created interactive Azure Power BI dashboards to aid business decision-making.
Built and maintained robust ETL pipelines using Azure Data Factory and Azure Databricks, handling automated data ingestion from 10+ sources; this work included ensuring data quality via validation processes, improving orchestration reliability by 15%, and managing deployments with Azure DevOps CI/CD.

Tata Consultancy Services, Bengaluru, India (Feb'21 – Jun'22)

Set up automated 24x7 data processing using Azure Data Factory for Prudential UK (M&G plc), improving operational productivity by 25% by eliminating manual tasks and cutting operational costs by 15% through auto-pause for non-production Azure resources.
Focused on enhancing data quality by building Azure Data Factory pipelines to remediate technical debt in critical financial data systems, emphasizing robust data cleansing and validation; also wrote and maintained complex SQL stored procedures and UDFs for high data integrity.
Implemented CI/CD pipelines with Azure DevOps for Azure Data Factory, which cut deployment cycles by 33% and reduced manual errors; also designed scalable metadata-driven ingestion frameworks in ADF, improving pipeline reusability and accelerating development across numerous financial data sources.
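
The metadata-driven ingestion idea mentioned above can be sketched in plain Python: one generic routine reads a list of source descriptions and dispatches on format, so adding a source means adding a metadata row rather than a new pipeline. This is only an illustration of the pattern, not the ADF implementation; the `SOURCES` entries, the `READERS` registry, and the in-memory payloads are all hypothetical.

```python
import csv
import io
import json

# Hypothetical metadata: each entry describes one source. In a real setup this
# would live in a control table or config file and point at actual storage.
SOURCES = [
    {"name": "policies", "format": "csv", "payload": "id,amount\n1,100\n2,250\n"},
    {"name": "claims", "format": "json", "payload": '[{"id": 7, "amount": 40}]'},
]

def read_csv(payload):
    """Parse CSV text into a list of dict rows (values stay strings)."""
    return list(csv.DictReader(io.StringIO(payload)))

def read_json(payload):
    """Parse a JSON array into a list of dict rows."""
    return json.loads(payload)

# One reader per supported format; supporting a new format means one new entry.
READERS = {"csv": read_csv, "json": read_json}

def ingest(sources):
    """Run the same generic steps for every source described in the metadata."""
    loaded = {}
    for src in sources:
        reader = READERS[src["format"]]
        loaded[src["name"]] = reader(src["payload"])
    return loaded

tables = ingest(SOURCES)
print({name: len(rows) for name, rows in tables.items()})
```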

🔧 Skills

Programming Languages: Python | SQL | Scala
Machine Learning & NLP: PyTorch | TensorFlow | scikit-learn | Hugging Face Transformers | NLTK | spaCy | LangChain | LlamaIndex
Data Science & Visualization: NumPy | Pandas | Matplotlib | Seaborn | Plotly | SciPy | Tableau | Power BI | Streamlit
Big Data & Cloud: Apache Spark (PySpark, Spark SQL) | Databricks | Apache Kafka | AWS (S3, EC2, Glue, Lambda, SageMaker, EMR) | Azure (ADLS Gen2, AzureML, Azure Functions)
Data Warehousing: Snowflake | Amazon Redshift | Azure Synapse Analytics
Data Engineering & MLOps: Azure Data Factory | Apache Airflow | Git | Docker | CI/CD Pipelines (Azure DevOps, GitHub Actions) | MLflow | Terraform
Databases: PostgreSQL | MySQL | MongoDB

🚀 Academic Projects

Optimizing Query Processing Algorithms for Enhanced Database Performance (Mar’24 – May’24)

Implemented a hash-based join algorithm to evaluate join queries between two relations, utilizing hash maps for rapid tuple access and ensuring accurate joins by matching tuples based on common attributes.
Implemented a line join algorithm using a simplified Yannakakis approach to process up to 10-line joins in optimal O(N + OUT) time, ensuring consistency and efficiency by structuring data processing through a systematic join tree.
Designed a sequential join algorithm to handle multiple relations linearly, enhancing query processing by applying hash joins incrementally across relational pairs, which streamlined the join process and reduced computational overhead.
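
The hash join and sequential join described above can be sketched as follows. This is a minimal illustration under assumptions, not the project's code: relations are lists of dicts, each join uses a single shared attribute, and the names `hash_join` and `sequential_join` are chosen here for clarity.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Join two lists of dict tuples on a shared attribute using a hash map."""
    # Build phase: index the smaller relation by its join-key value.
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    index = defaultdict(list)
    for row in build:
        index[row[key]].append(row)
    # Probe phase: look up each tuple of the other relation in the index.
    out = []
    for row in probe:
        for match in index.get(row[key], []):
            out.append({**match, **row})
    return out

def sequential_join(relations, keys):
    """Fold hash_join over R1, R2, ..., Rn; keys[i] links the running result to relations[i + 1]."""
    result = relations[0]
    for rel, key in zip(relations[1:], keys):
        result = hash_join(result, rel, key)
    return result

# Toy relations forming a line join R1(a, b) - R2(b, c) - R3(c, d).
R1 = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
R2 = [{"b": 10, "c": 100}, {"b": 10, "c": 101}, {"b": 30, "c": 300}]
R3 = [{"c": 100, "d": "x"}, {"c": 101, "d": "y"}]
joined = sequential_join([R1, R2, R3], ["b", "c"])
print(joined)
```

This plain pairwise version can produce intermediate results larger than the final output. A Yannakakis-style approach avoids that by first removing dangling tuples with semi-join passes along the join tree, which is what makes the O(N + OUT) bound possible.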

Text Summarization with Attention-based Deep Recurrent Neural Networks - Statistical NLP (Feb’24 – May’24)

Implemented bidirectional LSTM/GRU encoder-decoder with Bahdanau attention on 230,000+ WikiHow articles. Achieved 18.3% improvement in ROUGE-L scores over baseline seq2seq models.
Developed comparative analysis framework evaluating attention weights between LSTM/GRU architectures. Demonstrated LSTM's superior performance with 12.4 points lower perplexity than GRU variants.
Engineered custom loss function combining cross-entropy and coverage penalty to reduce repetition. Implemented beam search decoding achieving ROUGE-1 (0.42), ROUGE-2 (0.19), ROUGE-L (0.38) scores.
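
To show how a cross-entropy plus coverage-penalty loss discourages repetition, here is a small sketch. It assumes the coverage term takes the form used in pointer-generator summarizers (the sum over source positions of the minimum of the current attention and the accumulated past attention); the project's exact formulation and weighting are not stated, so `lam` and the toy numbers are assumptions.

```python
import math

def step_loss(target_prob, attention, coverage, lam=1.0):
    """One decoder step: negative log-likelihood of the gold token plus a coverage penalty."""
    nll = -math.log(target_prob)
    # Penalise attention mass that lands on source positions already attended to.
    cov_penalty = sum(min(a, c) for a, c in zip(attention, coverage))
    return nll + lam * cov_penalty

def sequence_loss(target_probs, attentions, lam=1.0):
    """Average step loss, with coverage accumulated as the sum of past attention."""
    coverage = [0.0] * len(attentions[0])
    total = 0.0
    for p, attn in zip(target_probs, attentions):
        total += step_loss(p, attn, coverage, lam)
        coverage = [c + a for c, a in zip(coverage, attn)]
    return total / len(target_probs)

# Toy numbers (assumption: invented for illustration).
probs = [0.5, 0.5]
repeat = [[1.0, 0.0], [1.0, 0.0]]  # attends to the same source token twice
spread = [[1.0, 0.0], [0.0, 1.0]]  # moves on to a new source token
loss_repeat = sequence_loss(probs, repeat)
loss_spread = sequence_loss(probs, spread)
print(loss_repeat > loss_spread)
```

With identical token probabilities, the decoder that re-attends to the same source position pays a higher loss, which is the pressure that reduces repeated phrases in the summary.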

Chatbot Development – Natural Language Processing (Aug'23 – Dec'23)

Developed an advanced chatbot using Python and machine learning models for the sentiment and stylistic analysis of user inputs. Integrated sklearn's Gaussian Naive Bayes, Logistic Regression, SVM, and MLP Classifier models for text classification.
Leveraged NLP techniques including TF-IDF vectorization and Word2Vec embeddings to enhance model accuracy in interpreting user sentiment and stylistic attributes from text.
Implemented feature extraction functions to analyze linguistic patterns and psycholinguistic correlates, applying nltk for tokenization and POS tagging.
Designed a multi-state dialogue system for interactive user engagement, enabling dynamic transitions between conversation states based on user responses.
Utilized pickle for model serialization and efficient data handling, ensuring robust performance and scalability of the chatbot system.
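
A multi-state dialogue system of the kind described above is essentially a finite-state machine whose transitions depend on the user's reply. The sketch below is a minimal illustration; the state names and transition rules are hypothetical, since the original chatbot's states are not listed here.

```python
# Hypothetical states and rules: each state maps the user's reply to the next state.
TRANSITIONS = {
    "greet": lambda text: "ask_mood" if text.strip() else "greet",
    "ask_mood": lambda text: "analyze" if len(text.split()) >= 3 else "ask_mood",
    "analyze": lambda text: "greet" if text.lower().startswith("y") else "done",
}

def run_dialogue(user_inputs, start="greet"):
    """Feed scripted inputs through the state machine and return the visited states."""
    state, visited = start, [start]
    for text in user_inputs:
        if state == "done":
            break
        state = TRANSITIONS[state](text)
        visited.append(state)
    return visited

# The too-short reply "fine" keeps the bot in the same state until it gets enough text.
path = run_dialogue(["Sam", "fine", "it was a long day", "no"])
print(path)
```

In a full chatbot, the "analyze" step is where the sentiment and style classifiers would run on the collected text before the next transition is chosen.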

Sentiment Classification for 2012 U.S. Presidential Election Tweets – Data Mining & Text Mining (Aug'23 – Dec'23)

Developed a sentiment classification pipeline to analyze tweets about 2012 U.S. Presidential candidates Barack Obama and Mitt Romney, providing insights into public opinion during the election.
Conducted a comprehensive comparison of machine learning algorithms (SVM, Random Forest, KNN, Logistic Regression) and pre-trained models (VADER, twitter-roberta-base-sentiment, bertweet-base-sentiment-analysis) to determine the most effective method for accurate sentiment analysis.
Enhanced model accuracy by 10% through transfer learning techniques, optimizing the performance and reliability of sentiment predictions.

📄 Publications

Shaik, F., Sadat, M., Gautam, N., et al. "A MISMATCHED Benchmark for Scientific Natural Language Inference." Findings of the Association for Computational Linguistics (ACL), 2025.

🌱 Currently Learning

I'm currently diving deeper into Data Mining, Text Mining, and Natural Language Processing to further sharpen my skills and stay updated with the latest advancements in the field.

👀 I’m Interested In

I'm interested in exploring opportunities to apply my expertise in AI, Big Data, and Cloud technologies to solve real-world problems and contribute to innovative projects.

📫 How to Reach Me

Feel free to reach out to me via email at fshaik8@uic.edu or connect with me on LinkedIn!

😄 Pronouns

He/Him

⚡ Fun Fact

I enjoy experimenting with new recipes in my free time and consider cooking a creative outlet!
