Chatbot for Extracting Information from Company Documents

To build an Ai chatBot Website to automate the process of extracting information from publicly available documents of US-listed companies, such as annual reports, quarterly reports, ESG reports, and PR/News articles. And the chatbot that can provide instant access to this information upon user request.

Challenges:

PDF documents can vary widely in structure and formatting, making it challenging to accurately extract text, images, charts, and graphs.

Extracting specific information such as Diagrams, company performance metrics, and ESG ratings requires advanced data extraction techniques to ensure accuracy.

Processing a large volume of company documents and ensuring the system can increase workload

Integrating various technologies into a coordinated system requires careful planning and implementation to ensure compatibility and efficiency.

My Approach:

Here we need to understand How the pdf parsing works and how the dat can be displayed to our user in a maximum user friendly format

So starting from How to upload the documents:

  • Provide a web interface to upload PDF documents containing annual reports, quarterly reports, ESG reports, and PR/News articles of publicly listed companies.

  • The extracted data will be stored in a database in images,text,extracted data format organized by document type, company, and publication date.

  • A chatbot interface will be developed using the ChatGPT extension. The chatbot will be trained to understand user queries and retrieve information from the database.

  • Users can interact with the chatbot through the interface. They can ask questions or request information about companies, and the chatbot will give answers from the databse.

  • We can provide a feedback mechanism for further improvement.

Initial Research/ Plan

  • How can we process large volume of company documents,and how to categories these data

  • How can we organise these data according to company name,publication date,Document

  • Which techstack can we use( for pdf parsing,API,Chatbot type,Database to store) 

Wireframes

Solution(Assumptions )

  • For PDF Parsing we can Use PyMuPDF( Python library for data extraction)and OpenCV to parse the uploaded PDF documents and extract text, images, charts, and graphs. Store this extracted data in a database.

  • First, we need to have a database set up and configured, Consider Mongodb as our DB, and can insert the extracted data according to their format

  • Text data can be stored as a string

  • For images ,diagrams,and tables it will store as binary format

  • Develop an API that allows the chatbot to query the database for information from user. The API should accept queries from the chatbot, retrieve the relevant data from the database.

  • When a user interact with chatbot eg:asking for company’s company's financial performance in 2023 chatbot will create a query and retrieve data from database using the API.

  • Develop a user interface (chat interface) that allows users to interact with the ChatGPT extension. Users can ask questions about companies.

More by Harisanker

View profile