Cloud Service Center

Automated data extraction using Logic Apps

In the digital dunes of our inboxes, documents drift in like grains of sand. We store, open, and go through them, extracting the data spice to other systems. This article will be your guide on how to automate this process, teaching you to mine data from PDFs and deposit it into Excel Online

 

While the ideal scenario would be to initiate this process immediately upon receiving an email, it requires Exchange Online. For the purposes of this tutorial, we’ll begin with the assumption that the file is already saved in OneDrive.

 

Please note that this is not a generic solution, this is a specific solution and it will work for invoices from the same supplier with the same format. Also note that the RegEx expressions would need adjustments for covering amounts bigger than 999.99 CHFs.

 

Getting Started

 

  1. Azure Subscription: If you haven’t subscribed to Azure yet, now’s the perfect time.
  2. Encodian Flowr API Key: This is essential for processing the PDF OCR. For a free Encodian Flowr account, register at Encodian Flowr. This grants you 50 free OCR actions per month. Alternatively, you can opt for base64.ai, which offers 100 free document processing actions monthly. For this tutorial, we’ll use Encodian and use JavaScript and RegEx.
  3. Resource Group: In Azure, our resource groups adhere to the naming convention RSGR_[DEPARTMENT/INSTITUTE]_[PROJECT]. For this project, I’ll name it RSGR_ETHZ_IDPPF_AUTOMATION_LOGIC_APPS. Additionally, I’ll tag it with CostCenter, which in my case has the value 28057.

 

 

Because our Logic App will need to run Javascript, we need to create an Integration Account to execute this code. Don’t worry, you don’t need to know Javascript, we will use ChatGPT, it knows how to create perfect code for this use case.

 

Make sure you take the Free Tier, this will enable you to run your flow thousands of times for free. Once your logic apps runs millions of times per month, you will need to upgrade your Integration Account, but until then you can enjoy it for free.

 

 

 

 

Keep in mind that every time this automation is executed, you’re generating savings for ETH. Even if you invest a few CHFs to run the automation millions of times, the productivity gains for ETH would amount to several million. So, don’t get worry about the cost and go with the flow (pun intended).

 

 

Let’s dive into constructing our Logic App. To begin, I’ll establish a new Logic App and assign it a name. Given that this Logic App is tailored for automating data extraction from a Software One invoice, I shall name it Docs-Ethz-SoftwareOne-Extraction.

 

 

When selecting a plan, opt for ‘Consumption’. This pricing model offers the first 4,000 actions at no cost, with a nominal fee of $0.000025 for each subsequent action.

 

 

 

Now let’s create the steps of the Logic App in the Logic Apps Designer.

 

 

For the trigger, lets use “When a new file is created on OneDrive”, for the scheduling let it run once per day. Choose the folder in OneDrive which you
want to track, for example /Invoices

 

 

Then we will use Encodian to extract the text  layer of the PDF. You will need your API key to use this service

 

 

Setting Up the Trigger:

  1. For the initiation trigger, select “When a new file is created on OneDrive”.
  2. Schedule the automation to run once daily.
  3. Specify the OneDrive folder
    you wish to monitor. For instance, you can choose /Invoices.

 

Encodian:

 

Next, we’ll employ Encodian to extract the text layer from the PDF. You will need your API key to use this service

 

 

 

 

Give a name to this action without spaces, for example GetPDFTextLayer instead of the
default name.

 

 

 

For easier debugging, consider storing the result in a variable. Use the “Initialize variable” action and assign the extracted Text Layer to it. Note: While step can help  debugging, we won’t be using this variable anywhere else.

 

 

 

 

Testing Your Progress

Now it’s a first good moment to save and test the work that you have so far.

 

 

  1. Click “Save”.
  2. Select “Run Trigger” located at the top of the screen.
  3. To initiate the Logic App, copy a PDF invoice into your /Invoices folder on OneDrive. If you’re using the OneDrive laptop client, allow up to 30 seconds for the Logic App to activate, as the file first needs to be uploaded to the cloud.
 

 

 

 

 

 

We will add the Javascript inline code. First we need to link the Logic App to the Integration Account.

 

 

 

 

 

 

That was easy. Now your Logic App is infused with the ability to run Javascript code. So let’s go back to the Logic Apps designer to continue.

 

 

The OCR Text Variable contains all the text of the PDF which we can inquire using RegEx and Javascript.

 

 

Go back to the designer and add an Inline code operation:

 

 

 

This step is very important and it took me quite some time to figure out the exact way to access the result of the previous action, which has the path workflowContext.actions.GetPDFTextLayer.outputs.body.TextLayer.

 

Knowing this, modify the default code to:

 

 

const text = workflowContext.actions.GetPDFTextLayer.outputs.body.TextLayer;

 

 

Result:

 

 

 

If you want to test it again you won’t need to copy the file to OneDrive again, you can click on the previous execution and resubmit it, this will save you time.

 

 

 

 

 

Now you can start working with Javascript, RegEx and ChatGPT. Copy the result text layer value from with the OCRText variable or Javascript execution outputs to ChatGPT.

 

 

The prompt can be “I want to create logic app inline code using RegEx to extract information from the following text: ” followed by your text.

 

ChatGPT will analyse the text and come up with a few suggestions. Next, tell ChatGPT what you want to extract, for example the prompt could be “I want to extract the total amount in CHFs, the date and the invoice number”. You might have to go through a couple of iterations with ChatGPT to get what you want.

 

 

Along with a usual explanation, it will produce code like this:

 

 

  

 

 

Copy this code after the first line and paste into the Javascript code, it will look like this:

 

 

 

Now, let’s run it, seriously, no fear! Paraphrasing Frank Herbert’s Dune, Fear is the automation killer.

 

 

 

After running it, you should now have data to work with:

 

 

 

 

For using it, you will want to parse it with a JSON data Operation and define the schema.
You can use the result of the previous step to generate the schema

 

 

 

In the next step, you can use the data for example to create data for an Excel Table

 

 

 

 

 

After running it, my result Excel table looks like:

 

 

 

If you want different formats for date or currency, you can set that directly in the Excel table and Logic Apps will not be affected.

 

 

 

And we are done for today! Just like the sands of Arrakis shift Dune, so too have we navigated the dunes of automation with Logic Apps. Remember, in the world of Logic Apps, the one who controls the flow of data, controls the universe. Until next time, may your logic app always run as smoothly as a sandworm through the desert!