ChatGPT and the Automation of Data Analysis Tasks

by Sidney Yang (CEO of HEARTCOUNT)

Let us adapt a line from the poet Kim Soo-young, “The new frontier of poetry is a place where poetry is not needed,” into “The new frontier of data analysis is a place where data analysis is not needed.”

As a language model, ChatGPT builds a vast map of humanity’s accumulated thoughts, knowledge, patterns, and structures of linguistic expression. Its “excellent linguistic ability” makes us reflect on the essence of human language and on the changes that this new frontier of AI technology will bring to our lives.

Talking about the changes ChatGPT will bring to specific areas of work is a reckless act, like describing the entire plot of a feature-length film after watching only the teaser trailer.

Accepting that this article may prove to be both overly cautious and inaccurate, unable to grasp the future clearly, we will discuss the possibilities and limitations of ChatGPT in data analysis, especially in decision-making tasks that use data.


Automation of Data Analysis Tasks

Even before ChatGPT was introduced, there were attempts, and some successes, at automating various tasks in data analysis.

In the data analysis process of “question, data access, analysis, report” shown below, software companies that develop data analysis tools have worked to automate three areas:

  1. Question: automatically detect sudden changes in management indicators and notify users
  2. Analysis: automatically discover patterns that can optimize indicators
  3. Report: summarize visualization charts in text

Incidentally, HEARTCOUNT, developed by the company I work for, is an automated data analysis tool that helps automate these three areas.
Question → Data Accessibility → Analysis (Finding Insights) → Report Process
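
As a rough illustration of the first area (automatically detecting sudden changes in an indicator), here is a minimal sketch that flags values far from a trailing-window average. The function name, the window size, the threshold, and the sample numbers are all assumptions made for this example; it is not how HEARTCOUNT or any particular product detects changes.

```python
from statistics import mean, stdev


def detect_sudden_changes(values, window=7, threshold=3.0):
    """Return indices whose value deviates from the mean of the previous
    `window` values by more than `threshold` standard deviations."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a sudden change.
            if values[i] != mu:
                alerts.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            alerts.append(i)
    return alerts


# Hypothetical daily values of a business metric, with one sharp drop.
daily_sales = [100, 102, 98, 101, 99, 103, 100, 101, 60, 100]
print(detect_sudden_changes(daily_sales))
```

A real tool would also have to handle seasonality and trends, which this sketch ignores.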

The following sections discuss whether the automation features of data analysis tools will become obsolete with the emergence of ChatGPT, or whether people, data tools, and ChatGPT will instead form a cooperative, mutually augmenting relationship in using data.

Question: ChatGPT, no need for you here.

  • In data analysis for decision-making, “asking questions” is rarely the hard part. The questions that data is expected to answer are usually self-evident (Why did it go up? Why did it go down?) and concern a company’s business metrics (KPIs).
  • Of course, if you want to explore data without a specific analysis purpose, ChatGPT can offer guidance on common analysis topics and methods. However, if the goal is to solve problems with data rather than to study data analysis, I think ChatGPT’s contribution in the question area is not significant.
  • The following figure shows ChatGPT’s response when given the columns (variable names) of a dataset and asked which questions to ask.
Ahem, ahem….

Data Access/Retrieval: Can you use ChatGPT instead of writing SQL?

  • ChatGPT is already known to write SQL, the language used to extract data from databases, at a satisfactory level, and fast-moving companies such as the one shown below are already using it in their products. (Prompt example of translating natural language to SQL)
Image from Hyperquery
  • However, it seems difficult to rely on ChatGPT to understand very complex schemas (hundreds of interconnected tables) and generate SQL automatically. For reference, the following table shows the share of database administrator tasks predicted to be exposed to augmentation or displacement by ChatGPT (and by the applications built on it) in OpenAI’s latest paper, “GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models.” The beta value (the risk of being affected) indicates that about 50% of those tasks can be affected.
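
To make the idea of a natural-language-to-SQL prompt concrete, here is a minimal sketch that assembles such a prompt from a schema description and a question. The function name, the prompt wording, and the two-table schema are assumptions for illustration only; the sketch builds the text and does not call any model.

```python
def build_sql_prompt(schema: dict[str, list[str]], question: str) -> str:
    """Assemble a text prompt asking a language model to write one SQL query.

    `schema` maps table names to their column names (hypothetical format).
    """
    lines = [
        "Translate the question into a single SQL query.",
        "Use only the tables and columns listed below.",
        "",
        "Schema:",
    ]
    for table, columns in schema.items():
        lines.append(f"- {table}({', '.join(columns)})")
    lines += ["", f"Question: {question}", "SQL:"]
    return "\n".join(lines)


# Hypothetical two-table schema, used only for this example.
schema = {
    "orders": ["order_id", "customer_id", "order_date", "amount"],
    "customers": ["customer_id", "region"],
}
prompt = build_sql_prompt(schema, "What were total sales by region in April 2023?")
print(prompt)
```

With hundreds of interconnected tables, the schema section alone becomes very long, which is one practical reason why fully automatic SQL generation gets harder on complex schemas.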

Analysis: ChatGPT, no thanks.

  • When asked to analyze data (as of April 9, 2023), GPT-4 did not accurately perform basic aggregation and statistical tasks such as averages, totals, correlations, and drill-downs (total sales by product), even on very small datasets of fewer than a few dozen records. As shown in the figure below, the actual correlation coefficient is 0.42, but it gives a wrong answer of 0.12.
  • As stated in ChatGPT’s response below, ChatGPT itself recognizes this type of calculation error as an inherent limitation of LLMs. That said, it should be easy to resolve in the future by giving ChatGPT a calculator through plugins or similar methods.
  • Even so, serious data analysis seems unlikely to happen inside ChatGPT itself. Instead, it will probably evolve so that ChatGPT’s help is obtained through APIs from within specialized data analysis tools such as Excel and R.
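
For contrast, the calculations mentioned above are deterministic when done in code, which is roughly what a "calculator" plugin or a data tool would provide. The sketch below computes a total, an average, a drill-down by product, and a Pearson correlation coefficient. The rows are made-up sample data, not the dataset used in the GPT-4 test, and the helper names are assumptions.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def total_by(rows, key, value):
    """Drill-down: sum `value` for each distinct `key` (e.g. sales by product)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Hypothetical records, for illustration only.
rows = [
    {"product": "Chairs", "discount": 0.10, "sales": 120.0},
    {"product": "Tables", "discount": 0.00, "sales": 200.0},
    {"product": "Chairs", "discount": 0.20, "sales": 90.0},
    {"product": "Tables", "discount": 0.05, "sales": 180.0},
]
sales = [r["sales"] for r in rows]
discounts = [r["discount"] for r in rows]

print("total:", sum(sales))
print("average:", sum(sales) / len(sales))
print("by product:", total_by(rows, "product", "sales"))
print("correlation:", pearson(discounts, sales))
```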

Analysis: What if ChatGPT is combined with data tools?

  • Although the details have not yet been revealed, the promotional material for Microsoft 365 Copilot says that Copilot in Excel lets you analyze trends in your data and create professional-looking data visualizations in seconds. (Link for reference)
“With Copilot in Excel, you can analyze trends and create professional-looking data visualizations in seconds.”
  • If you ask it to “analyze this quarter’s business results and summarize three trends,” it shows the results in natural language, as on the right side of the picture below.
  • It probably works as follows: (1) map “business results” to the “sales” column, (2) create a pivot table in Excel by categorical variables (customers, products), and (3) use a chart/table interpretation model, trained on data tables paired with text summaries, to summarize the results in text.
  • However, when the goal is to improve indicators with data, I am skeptical about how useful it is to display a seemingly arbitrary selection of machine-inferred patterns without any knowledge of the context in which the data was collected.
  • Instead, I think a more convenient approach for users of data tools is an algorithm that calculates in advance, for all variables, the major factors contributing to a change in an indicator, ranked by absolute contribution or by contribution relative to the usual level (Surprise Factor), as HEARTCOUNT’s “Metrics Change Explainer” function does (we prepared everything because we don’t know what you’ll like…).
  • ‘Explainer’ feature of HEARTCOUNT
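
The idea of ranking the factors behind a metric change can be sketched as follows. This is only an illustration under stated assumptions, not HEARTCOUNT's actual algorithm: the function name and sample numbers are hypothetical, and the "surprise" value here is a simple stand-in defined as the gap between a category's actual change and the change it would have shown had it moved at the overall rate.

```python
from collections import defaultdict


def rank_contributions(prev_rows, curr_rows, dimension, metric):
    """Rank categories of `dimension` by their absolute contribution to the
    change in `metric` between two periods."""
    prev = defaultdict(float)
    curr = defaultdict(float)
    for row in prev_rows:
        prev[row[dimension]] += row[metric]
    for row in curr_rows:
        curr[row[dimension]] += row[metric]

    total_prev = sum(prev.values())
    total_curr = sum(curr.values())
    overall_rate = (total_curr - total_prev) / total_prev if total_prev else 0.0

    ranked = []
    for category in set(prev) | set(curr):
        change = curr[category] - prev[category]
        # Assumed stand-in for "surprise": actual change minus the change
        # expected if this category had followed the overall rate.
        expected = prev[category] * overall_rate
        ranked.append(
            {"category": category, "change": change, "surprise": change - expected}
        )
    ranked.sort(key=lambda item: abs(item["change"]), reverse=True)
    return ranked


# Hypothetical sales by sub-category for two periods.
previous = [
    {"sub_category": "Chairs", "sales": 150.0},
    {"sub_category": "Tables", "sales": 200.0},
    {"sub_category": "Sofas", "sales": 250.0},
]
current = [
    {"sub_category": "Chairs", "sales": 75.0},
    {"sub_category": "Tables", "sales": 190.0},
    {"sub_category": "Sofas", "sales": 275.0},
]
for item in rank_contributions(previous, current, "sub_category", "sales"):
    print(item)
```

Running the same function over every categorical variable ahead of time is one way to have all factors prepared before the user asks.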

Insight Reporting/Sharing

  • Data reporting is the process of quickly discovering useful information related to questions from data (the results of analytical work) and including it in a report with one’s own opinions. If we categorize the process more schematically, it can be divided into “quantitative fact confirmation for various hypotheses” and “knowledge production through interpretation of discovered facts.”
  • Although nothing is certain, it seems fairly safe to say that the latter, “knowledge production through interpretation of discovered facts,” will remain the responsibility of humans for the near future.
  • In the end, the area where ChatGPT can provide practical help with data reporting is probably in writing more polished sentences based on quantitative facts discovered by humans or specialized data tools. For example, insights on the factors contributing to a decrease in sales, given as text, would be expressed as shown in the image below.
  • “In April 2023,” the “Total Sales” increased by KRW 10 million (12%) compared to the same month of the previous year (April 2022), from KRW 60 million to KRW 70 million. When looking at the decrease factor at the “Sub-Category” level, the “Chairs” had the largest change from KRW 15 million to KRW 7.5 million (50% decrease), followed by “Tables”…
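
As a minimal sketch of this division of labor, the code below turns quantitative facts into a draft sentence, computing the numbers itself so that a language model would only need to polish the wording. The function name, the sentence template, and the figures are hypothetical and are not taken from the example above.

```python
def describe_change(metric, period, prev_value, curr_value, unit="KRW"):
    """Draft one sentence describing how `metric` changed between two periods."""
    change = curr_value - prev_value
    if change == 0:
        return f'In {period}, "{metric}" was unchanged at {unit} {curr_value:,}.'
    direction = "increased" if change > 0 else "decreased"
    sentence = f'In {period}, "{metric}" {direction} by {unit} {abs(change):,}'
    if prev_value != 0:
        # The percentage is computed here, not left to the language model.
        pct = abs(change) / abs(prev_value) * 100
        sentence += f" ({pct:.1f}%)"
    sentence += f", from {unit} {prev_value:,} to {unit} {curr_value:,}."
    return sentence


# Hypothetical figures, for illustration only.
draft = describe_change("Total Sales", "April 2023", 200_000, 150_000)
print(draft)
```

Because the direction and percentage are derived from the same two values, the draft cannot contradict itself before it reaches the model.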

If a conversation feels like it is only skimming the surface, with each side seizing on the other’s words, stop and look into how the other party thinks instead. Once you understand the inner workings of ChatGPT, which is designed to predict the next word based on probabilistic relationships between words, and realize why it gives inaccurate answers to questions that require mathematical operations, you may feel embarrassed about having argued with it.

In many cases, data analysis deals with the most important and complex problems a company faces. Machines will not be able to give decision makers empathetic and trustworthy answers to such questions in the near future. We are constantly thinking about and experimenting with how people, data tools, and ChatGPT (LLMs) can collaborate and augment one another in discovering value from data, and we hope to introduce the results to the world this fall.

🙌
If you want to keep up with related news or talk to professionals who have similar concerns about various data-related topics, please join our data community, DDMA (Don't Data Me Alone).