Hi! I'm Sumit.


I'm a Computer Science graduate of Pennsylvania State University with a Master of Science degree, and I have worked in deep learning, mainly in the NLP/NLG domain. For the past few years, I have been motivated by the following question: What can data-driven approaches achieve across the multidisciplinary domains of artificial intelligence? This question led to independent lines of work applying deep learning methods to research problems in healthcare analytics, text processing, and social network analysis. I also enjoy collaborating with researchers from different backgrounds on interdisciplinary research such as question answering, text summarization, and health and social media analysis. Ideas and suggestions of any kind are welcome.

My research papers have appeared at NeurIPS MLPH 2020, ICDAR 2020, CODS-COMAD 2021, ICON 2020, and in Computational and Mathematical Organization Theory (Springer).

When I'm not working, I like playing soccer and listening to music. I am currently looking for a full-time role in the field of Machine Learning and Natural Language Processing.

What I Do


    4th semester
  • Data Driven Design
  • DBMS (CMPSC 431W)

NLP and Data Science

I am generally interested in deep learning, and I am particularly excited about text summarization and text generation in NLP. My work has been applied to learning dynamical systems in sustainability, social media analysis, low-resource languages, and building robust learning models.


Papers presented at the 34th NeurIPS (2020), the ACM 8th CODS & 26th COMAD (2021), the CMU IdeaS conference, and the 17th ICON (2020).

  • Black Lives Matter Twitter dataset:

Awards & Achievement

  • Selected for the “InSc Young Researcher Award”
  • Received NeurIPS 2020 grant support for MLPH workshop



  • Python
  • TensorFlow
  • PyTorch
  • NLTK


  • Git + GitHub
  • Terminal
  • VS Code
  • LaTeX
  • GCP / Google Colab


  • NLP
  • Data Science
  • Deep Learning
  • SOTA Models


Thanks to the open source learning community for providing valuable resources

Data Science Intern

University of Montpellier, LIRMM (CNRS), France, Advisor: Prof. Vincent Berry

Developed an inference program for biological data sets that lets users input their data for visualization. Scripted file format transformations using data mining methodology and worked with new types of databases. Implemented various programs interfacing with the Galaxy web platform. Wrote inference analysis programs that summarize the program's findings with statistics and graphics to predict the likelihood of ancestors in a phylogenetic tree.

May 2019 - July 2019

Undergraduate Research Assistant

UC Berkeley, Advisor: Prof. Kurt Keutzer

Worked on word segmentation for Sanskrit, a low-resource language. Collected and pre-processed sentence-pair data in SLP1/IAST format to create embedding vectors.

May 2020 - Dec 2020

Undergraduate Research Assistant

Carnegie Mellon University, Advisor: Prof. Kathleen M. Carley

Worked on a fine-grained classification and inference task for misinformation in tweets that spread during the COVID-19 pandemic. Crawled the web and collected misinformation data from Twitter using the Tweepy library. Developed and fine-tuned several deep learning language models for the misinformation prediction task. Published the work as first author in a special issue on disinformation of the Springer journal “Computational and Mathematical Organization Theory”.

June 2020 - Dec 2020

Project Research Intern

Samsung R&D

Collaborated on intent detection within a social media application as part of the on-device AI Natural Language Processing (NLP) team. Achieved high-accuracy classification of content for various action-based events using resource-efficient models, facilitating deployment on mobile devices for enhanced user experience.

Jan 2021 - May 2021

AI Intern


Engineered a T5-large DeepSpeed model for extended abstractive summarization, resulting in a 70% reduction in manual labor and a 56% increase in summarization speed. Fine-tuned the T5-large SAMSum DeepSpeed model on proprietary data for the downstream task of contextual summarization. Improved pipeline efficiency by segmenting longer input sequences into smaller chunks using spaCy, preserving context and timestamps throughout the summarization process.
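
As a rough illustration of the chunking step described above, here is a minimal sketch. It is not the pipeline I used: a plain regular expression stands in for spaCy's sentence segmentation, whitespace-separated words stand in for model tokens, and `chunk_sentences` and `max_words` are hypothetical names chosen for this example.

```python
import re

def chunk_sentences(text, max_words=50):
    """Greedily pack whole sentences into chunks of at most max_words words.

    Assumptions: a regex split stands in for spaCy sentence segmentation,
    and words stand in for model tokens. A single sentence longer than
    max_words is kept whole in its own chunk.
    """
    # Split after ., ! or ? followed by whitespace (a crude sentence splitter).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping sentence boundaries intact means each chunk can be summarized on its own without cutting a thought in half; the word budget would be replaced by the model's real token limit in practice.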

June 2021 - August 2021

Graduate Research Assistant


Improved the accuracy of negation claim identification by 20%, using features extracted with spaCy from 28,515 domain-specific research papers. Contributed to citation intent classification by applying transformer-based pre-trained models.

June 2022 - August 2022

Summer ML Associate

Penn State University

Extracted features from a pool of 400 papers available on medRxiv and bioRxiv by implementing a feature extraction pipeline. Built machine learning (ML) and deep neural network (DNN) models, complemented by exploratory data analysis (EDA), to predict the likelihood of a paper being published or having significant impact.

June 2022 - August 2022

Graduate Research Assistant


Established a Vision-and-Language Navigation task, empowering agents to acquire navigational skills in a visual environment through training on 22,000 multilingual instructions. Engineered a graph-based visualization system for tracking and analyzing barter trades within a customized Minecraft environment. Published research findings in IOS Press HHAI 2023, contributing to the dissemination of valuable insights in the field.

August 2022 - May 2023

Lead Machine Learning Engineer

Rock Analytics

Led the development and deployment of an AI platform for portfolio managers, ESG investors, operational managers, and market risk managers, integrating AI models such as LLMs to support decision-making and streamline business operations. Used Databricks clusters to extract data from S3, PySpark for large-scale data transformation and manipulation, and LLMs for feature extraction. Employed web crawlers to collect ESG data, integrated with S3 for data handling and processing. Achieved a 62 percent optimization of the SEC EDGAR web crawler and indexer, which processes 576,479 files. Implemented retrieval and indexing mechanisms for semantic search (Solr/Lucene) over S3 objects containing both images and text.

October 2023 - Present
View my resume

Some of my work & contributions


Automated Medical Assistance

We aimed to mitigate the disparity in access to telehealth among different racial groups. By designing and experimenting with transformer-based models such as BERT, BART, and GPT-2, we generated doctor responses through question answering. The paper “Automated Medical Assistance Attention Based Consultation System” was accepted at the NeurIPS MLPH workshop 2020.

  • TensorFlow
  • Hugging Face Transformers
  • Python Reddit API Wrapper
  • NLTK

Misinformation in COVID-19 tweets

We aimed to develop a classifier for fine-grained inference on COVID-19 tweets. Building a benchmark system for misinformation mitigation and experimenting with approaches ranging from classical machine learning to the latest state-of-the-art deep learning models gave me deeper insight into natural language processing. I presented the work “A Fine-Grained Analysis of Misinformation in COVID-19 Tweets” at the IdeaS conference, and it was accepted in a special issue of the Springer journal “Computational and Mathematical Organization Theory”.

  • TensorFlow
  • Hugging Face Transformers
  • Python Tweepy API Wrapper
  • NLTK

Biomedical Network Link Prediction using Neural Network Graph Embedding

In this paper, we study graph embedding learning for automatically learning low-dimensional node representations on biomedical networks. We use different neural graph embedding methods to analyze three major biomedical link prediction tasks: drug-disease association (DDA) prediction, drug-drug interaction (DDI) classification, and protein-protein interaction (PPI) classification. We observe that graph embedding methods achieve promising results without using any biological features. The paper was accepted at CODS-COMAD 21 in the Young Researchers Symposium track.

  • OpenNE
  • Numpy
  • Scikit
  • Graph Embedding
  • Variational Autoencoder
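
To give a feel for how node embeddings can drive link prediction, here is a minimal sketch. It assumes one common scoring scheme (a sigmoid over the dot product of two node embeddings), which is not necessarily the setup used in the paper; the node names, the embedding values, and the helpers `link_score` and `rank_candidates` are all made up for illustration.

```python
import math

def link_score(u, v):
    """Sigmoid of the dot product of two node embeddings (an assumed scoring rule)."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 / (1.0 + math.exp(-dot))

# Toy 2-dimensional embeddings (made-up values, not results from the paper).
embeddings = {
    "drug_A": [1.0, 0.5],
    "disease_X": [0.9, 0.6],
    "disease_Y": [-1.0, -0.4],
}

def rank_candidates(source, candidates, emb):
    """Sort candidate nodes by predicted link score, highest first."""
    return sorted(
        candidates,
        key=lambda c: link_score(emb[source], emb[c]),
        reverse=True,
    )
```

In this toy setup, nodes whose embeddings point in similar directions receive higher scores, so ranking candidates for `drug_A` puts `disease_X` ahead of `disease_Y`. No biological features enter the score, only the learned vectors.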


In this paper, we design a character-level pretrained language model for extracting support phrases from tweets based on the sentiment label. We also propose a character-level ensemble model that blends the Pre-trained Contextual Embeddings (PCE) models RoBERTa, BERT, and ALBERT with the neural network models RNN, CNN, and WaveNet at different stages of the model. The paper was accepted at the 17th International Conference on Natural Language Processing [ICON 2020].

  • TensorFlow
  • Transformer
  • Tweepy
  • NLTK
Google Scholar

Get In Touch

Feel free to send me a message
