How Effectively Do LLMs Extract Feature-Sentiment Pairs from App Reviews?
[Motivation] Automatic analysis of user reviews to understand user sentiments toward app functionality (i.e., app features) helps align development efforts with user expectations and demands. Recent advances in Large Language Models (LLMs) such as ChatGPT have shown impressive performance on many new tasks without updating the model's parameters, i.e., using only zero or a few labeled examples, but their capabilities for feature-level sentiment analysis of app reviews remain unexplored. [Problem] The goal of our study is to explore the capabilities of LLMs to perform feature-level sentiment analysis of user reviews. [Method] This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and different variants of Llama-2 chat, against previous approaches for extracting app features and associated sentiments in 0-shot, 1-shot, and 5-shot scenarios. [Result] The results indicate that GPT-4 outperforms the rule-based SAFE by 17% in F1 score for extracting app features in the 0-shot scenario, and 5-shot prompting improves this by a further 6%. However, the fine-tuned RE-BERT still exceeds GPT-4 by 6% in F1 score. For predicting positive and neutral sentiments, GPT-4 achieves F1 scores of 76% and 45% in the 0-shot setting, which improve by 7% and 23%, respectively, in the 5-shot setting. [Contribution] Our study provides a thorough evaluation of both proprietary and open-source LLMs, offering an objective assessment of their performance in extracting feature-sentiment pairs from app reviews.
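
To make the few-shot setup concrete, the sketch below shows the general shape of such a pipeline: an instruction, k labeled review/annotation examples, and a target review, with the model's output parsed into feature-sentiment pairs. The instruction wording, the example reviews, and the "feature: sentiment" output format are illustrative assumptions, not the actual prompts used in the study.

```python
# A minimal sketch (not the study's actual prompts) of k-shot
# feature-sentiment extraction with a chat LLM. All prompt wording and
# the line-based output format are hypothetical.

SYSTEM_INSTRUCTION = (
    "Extract the app features mentioned in the user review and the sentiment "
    "(positive, neutral, or negative) expressed toward each feature. "
    "Answer with one 'feature: sentiment' pair per line."
)

# Labeled examples for the k-shot setting (here k = 1; the study also uses k = 0 and k = 5).
FEW_SHOT_EXAMPLES = [
    (
        "Love the dark mode, but push notifications keep failing.",
        "dark mode: positive\npush notifications: negative",
    ),
]


def build_messages(review: str) -> list[dict]:
    """Assemble a chat-style prompt: instruction, k labeled examples, target review."""
    messages = [{"role": "system", "content": SYSTEM_INSTRUCTION}]
    for example_review, labeled_pairs in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example_review})
        messages.append({"role": "assistant", "content": labeled_pairs})
    messages.append({"role": "user", "content": review})
    return messages


def parse_pairs(model_output: str) -> list[tuple[str, str]]:
    """Parse 'feature: sentiment' lines into (feature, sentiment) tuples."""
    pairs = []
    for line in model_output.splitlines():
        feature, sep, sentiment = line.partition(":")
        if sep:
            pairs.append((feature.strip(), sentiment.strip().lower()))
    return pairs


if __name__ == "__main__":
    messages = build_messages("The offline maps are great, but the battery drain is awful.")
    # `messages` would be sent to a chat model (e.g., GPT-4 or Llama-2 chat);
    # here a hand-written response stands in to show the expected output shape.
    print(parse_pairs("offline maps: positive\nbattery drain: negative"))
```

Extracted pairs can then be matched against gold annotations (e.g., with exact or partial feature matching) to compute the precision, recall, and F1 scores reported above.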