Llama | ChatGPT as OCR Vision document AI
Author: bryanai
https://www.youtube.com/watch?v=XiepVnkDKmc
Here is a summary of the "Llama | ChatGPT as OCR Vision Document AI" video:
1. **Introduction to OCR and Vision AI Models**: The speaker introduces using Llama and ChatGPT as OCR (Optical Character Recognition) and vision AI for document processing, building on earlier videos in the series that covered setting up Llama on Ubuntu and Windows.
2. **Document Processing Use Case**: The demonstration covers automating document reading and analysis for tasks like expense tracking and invoice processing, using screenshots or scanned images.
3. **Demo of OCR Application**: The speaker shows how to use Llama or ChatGPT to read and analyze content from a screenshot (e.g., stock information) and parse numeric data accurately.
4. **Prompt Engineering**: Various prompts are used to dynamically query extracted data, such as asking for the "average volume" from stock data, showing the importance of tailored prompts for accurate AI responses.
5. **JSON Output for Data Storage**: Both Llama and ChatGPT can output structured data (e.g., JSON) for easy integration into databases, making it practical for real-time business automation.
6. **Cost-Effective Custom OCR Solutions**: The speaker emphasizes building inexpensive, in-house OCR solutions using open-source tools (e.g., PyTesseract) instead of expensive, proprietary Vision AI solutions.
7. **Script Automation Workflow**: The video outlines a workflow that triggers scripts when files are uploaded, using the open-source libraries Pillow and Selenium to capture a web page screenshot and PyTesseract to extract its text.
8. **Dynamic Prompt Testing with Llama and ChatGPT**: The speaker explains how to modify prompts and compare outputs across different models, underscoring prompt engineering’s role in refining AI results.
9. **Live Demo of Model Accuracy**: The demonstration reveals the models’ ability to handle data in complex document formats accurately, including creating tables and extracting specific values.
10. **Resource Access**: All code, resources, and previous tutorials are available on the speaker's GitHub, with guidance to further automate business tasks using AI and OCR models.
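The file-triggered workflow in items 6 and 7 can be sketched roughly as follows. This is a minimal illustration, not the speaker's code: `wait_for_file`, `ocr_image`, and `process_when_ready` are hypothetical helper names, while the 10-second polling interval and the `apple.png` file name follow the transcript. The only third-party call used is `pytesseract.image_to_string`, which needs Pillow and the Tesseract binary installed.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def wait_for_file(folder: Path, name: str, interval: float = 10.0,
                  timeout: Optional[float] = None) -> Optional[Path]:
    """Poll `folder` until `name` exists; return its path, or None on timeout."""
    deadline = None if timeout is None else time.monotonic() + timeout
    target = folder / name
    while True:
        if target.is_file():
            return target
        if deadline is not None and time.monotonic() >= deadline:
            return None
        time.sleep(interval)  # the video's script checks every 10 seconds


def ocr_image(path: Path) -> str:
    """Read the text in an image with pytesseract."""
    # Imported lazily so the polling helper works without the OCR dependencies.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(path))


def process_when_ready(folder: Path, name: str,
                       handler: Callable[[Path], str] = ocr_image,
                       interval: float = 10.0,
                       timeout: Optional[float] = None) -> Optional[str]:
    """Wait for the file, then pass it to `handler` (OCR by default)."""
    path = wait_for_file(folder, name, interval=interval, timeout=timeout)
    return None if path is None else handler(path)
```

Calling `process_when_ready(Path("downloads"), "apple.png")` would block until the screenshot appears and then return its OCR text; the `downloads` folder name is an assumption. Passing a different `handler` keeps the polling logic independent of the OCR step.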
Transcript:
(00:01) [Music] Hello friends, welcome to the Python RPA automation series. In my very first video I covered how to download and use Llama models and model weights on an Ubuntu machine. In the second video I covered the related documentation and how to do the same thing in a Windows environment, and I also showed a use case: how you can use this model to monitor employee attendance and expenses. In this video I'm going to show you another very interesting use case I've been working on: using language models as OCR and vision AI for my documents. So let me browse to my GitHub repository.
(00:36) I want to call out that all the links are included in the video description below. Before we jump into the details of the implementation, let me show a quick demo of this application and give you an overview of what you are going to see, so that you can decide whether to watch more or skip this video. All right, our overall objective is to build an inexpensive OCR and vision AI on some real-life, production-like data. These are the steps we are going to follow.
(01:07) Let's assume that one of your employees submits an expense sheet and uploads a receipt, or you receive a vendor invoice, or it could be any document. You want to read the content of the document and fully automate the whole thing. So as soon as you receive that document or image, you want to call a script and read the content from that image. Here I'm going to show you a fun example: suppose you want to read it from a screenshot, and it could be anything.
(01:31) Just for fun, I'm reading the content of the Apple stock page today, and I'll show you how to build that. It could be a screenshot, it could be any document. As you can see, this is a very complex document with a lot of information embedded in it. Please pay attention: this is a screenshot of a web page, so all sorts of stock numbers are out there. One of the reasons I'm using it is that, with so many numbers, I want to test how Llama or ChatGPT is going to figure out the exact numbers.
(01:56) Once we read the content of that image or web page, we are going to build a dynamic prompt and pass it to your preferred language model, for example Llama or ChatGPT. You will be surprised to see that both of them were able to answer very accurately. For example, among all of those numbers I asked one question: "Respond in one word: the average volume of the stock in this text."
(02:23) I was actually amazed to see that it was able to pick out that exact number very accurately, for example 70 million or something like that; that was the average volume of the stock. I also said, "Can you prepare some kind of a table from that?" and it was able to break down all of those numbers and create a very nice key-value, JSON kind of text, which you can obviously store in a database because the JSON values are easy to read. That's what we are going to achieve in this video today. Enough with the demo; let's go ahead to our code.
(02:54) All right, now that I've got your attention and you are still watching, let me formally introduce myself. My name is Amish Shukla, and I train neural networks on finance, supply chain, and healthcare data to predict useful patterns. Most of the work you see in my GitHub repository is the result of my effort to predict supply chain shortages, especially for healthcare during the pandemic. In this video we are going to build an inexpensive OCR. I keep using the word "inexpensive"; I don't want to use the word "cheap" because cheap sounds too cheap.
(03:20) Now you might wonder: why should I develop and work on another OCR vision AI solution when there are thousands and thousands of options available in the market? Most of the offerings you see in the market are actually wrappers built around the open-source OCR package, and in this video I'm going to use the same OCR package to build my own OCR vision AI library. You may also say that a lot of large organizations offer vision AI as well.
(03:47) Personally, I find those very expensive, and they are not actually trained on my real data. The use case I'm showing you was done without any fine-tuning on an existing knowledge base. If you tune those models on your knowledge base, that means on your documents, I bet you are going to see results which are 10 to 20 times, or maybe 100 times, better than these vision AIs.
(04:19) Once you are using your in-house knowledge base to train these models, you can use them for a number of different use cases. For example, you can use it as a document classifier, for a digital private signature, or for scanning confidential information like PHI, private health or personal data, in your documents. There is a lot of secured information in contracts and expenses. Those are your organization's documents, and you don't want to throw them on the internet. That's why you can download the Llama model and train it on your in-house knowledge base.
(04:45) You are going to see better results, which are like a thousand times better than using any external vision AI. So let me do a code walkthrough. Assume that you want to automate the entire business operation, so as soon as a file is received you want to take an action on it. In this step, for example, somebody does an SFTP, or a file is uploaded by any means; say one of your employees drops a file to your SFTP location.
(05:17) This is the Linux code you can use to put a file into a folder, and then you act as soon as the file is received, or as soon as you have taken a screenshot. As you can see, in this code I'm using the Pillow library and Selenium; both are open-source packages, absolutely free of cost. Let me show an example. This is the URL; again, you can change the URL as per your business requirement. Here I'm just taking a screenshot of the Apple page.
(05:45) If you want to go through the details of what this code does, I have covered the entire thing in the Python automation scripts in previous videos. Please go through that video, where I discussed the entire source code line by line and how I built it; the code you are seeing today is just a copy of the code I used earlier. Again, what this code does: it takes a URL, goes to that website, takes a screenshot, and saves that screenshot to a local PNG file.
(06:14) So let me execute this. As you can see on this screen, you are not seeing the full picture here, but in the background it takes a picture of the entire web page and saves it to a file called apple.png. Now you want to write another script: as soon as the file is dropped, you want to trigger your automation script. Again, this is code I have discussed in my previous videos.
(06:47) That video covers how to write code which runs as a cron job, or which detects as soon as a file is uploaded to a folder, and how to execute a subsequent script based on that. Please go through that video and it will help you with the entire code walkthrough; I'm just copying that code here.
(07:11) To recap what that code does: as soon as the apple.png file is created in the download folder, I want to call another script. As you can see, this script checks that folder every 10 seconds. All right, now I want to read the text from the images. I have already covered this in previous videos too; that's the reason I created that whole RPA automation series. I've covered all of these steps line by line, and these are mini snippets of the Python code. So go to my RPA repository one more time, and there you will find the detailed walkthrough of what this code does.
(07:42) Long story short, it uses pytesseract, an OCR library or package. You pass in an image, and it reads the content of the image into text. As you can see, I defined a function here, and I'm passing that apple.png file to this function.
(08:03) Let me call this function. It takes the screenshot we created, reads that image, and captures the content of that file. All right, the next step: now that we have read the text content from that image or document, it's time to build the prompt. Whatever we have captured in that text variable, that is, the content of the image, we are going to use to build a dynamic prompt. Here I am keeping the prompt very simple.
(08:36) I just want to ask one question, or maybe I want an explanation, so I'm building two different prompts. I'm saying, "Define in one word: what is the average volume of the stock in this text?" Then, as I covered in the previous video, I define two different functions: one calls ChatGPT and the other calls Llama. Please go through those videos; I covered all of these details there so that I can build upon them.
(09:05) These functions, as you can see, are very simple. For Llama, again from the previous video: if you download the Llama model, you will see a file called example_text_completion, and in it there is a variable called prompt. All I am doing is replacing that with the content I just read. So I'm going to call that function, but update this prompt variable with the content I just read from that file.
(09:35) There are two types of prompts I am testing with. Please be creative and create your own prompts. This is entirely about prompt engineering: the more historical or statistical information you pass in your prompts, and the more creative you are, the better the results are going to be. You know your data better, so you can ask more relevant questions. Here I'm asking one simple thing.
(10:01) I'm saying, "Respond in one word: what is the average volume?" or what is the average price, today's price, the previous day's price, or the closing price of that stock in this text. So please play around with this. When I asked this question, the answers were very accurate in both results. Let me show a live demo on Llama, and I will show you ChatGPT in a minute. Let me define this function. Okay, now let me run this.
(10:31) For the first prompt, let me print that. The first prompt asks one question: what is the average volume of the Apple stock in that entire text? It scans through the entire text and finds the value: the average volume was 70 million. Now, similarly, let me change the prompt to the prompt text; here I'm calling Llama and asking it to print the details.
(11:02) If I execute this function and print the second result, it prints all the details we just received from the Llama function. Amazingly, it is already putting the data in a JSON, tabular kind of format which I can use directly. Let me show you a demo in ChatGPT, because its web interface is easier to see. As you can see, you can test it out: I passed the entire data and said, "Respond in one word."
(11:34) ChatGPT was actually more accurate, but again, it depends on your machine configuration and what kind of models you are using. All right, that's all I wanted to cover in this video. I hope you liked it. If you have any questions, please feel free to open an issue on my GitHub repository and I'll be happy to help you out. Please subscribe to my channel, and thanks for watching. Thank you.
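The prompt-building and JSON steps described in the transcript could look roughly like this. It is a sketch under assumptions, not the speaker's functions: `build_prompt` and `extract_json` are hypothetical names, placing the question after the OCR text is a guess about layout, and the call to Llama or ChatGPT is deliberately left out because the source does not show it.

```python
import json
from typing import Any, Optional


def build_prompt(ocr_text: str, question: str) -> str:
    """Combine the OCR text with a question so the model answers from that text."""
    # Assumption: OCR text first, then the question, separated by a blank line.
    return f"{ocr_text.strip()}\n\n{question.strip()}"


def extract_json(reply: str) -> Optional[Any]:
    """Return the first JSON object embedded in a model reply, or None."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(reply):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores any trailing prose after it.
            obj, _ = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            continue  # not valid JSON at this brace; try the next one
        return obj
    return None
```

A model reply often wraps the JSON in extra sentences, which is why `extract_json` scans for the first parsable object instead of calling `json.loads` on the whole reply. The parsed dictionary can then be written to a database, as the speaker suggests.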