
AzBlobChecker Deployment and Usage Guide
You were provided access to AzBlobChecker to scan your Azure Storage account. The scanning process
will examine each of your objects, looking for signs that a file might be impacted by the issue and
might need to be reviewed.
This document is targeted at the technical operations team that will be performing the scan. Some prior
knowledge of Azure, PowerShell, and the Azure CLI is expected.
Contents
• Getting started
• Deploying the application
  o Sample Script to deploy to Azure Container Instances (recommended)
  o Sample Script to deploy to an AKS Cluster
  o Environment Variable Definitions for Sourcerer and Checker
• Monitoring & Metrics
• Auditing the Application
  o To load the results of Blob Inventory into Azure Data Explorer
  o To load the checker Log Table into Azure Data Explorer
  o Sample ADX Queries
• Other sourcing Algorithms
• Troubleshooting ACI containers that do not seem to be doing anything
• How many sourcerers can I run at the same time?
• Restarting the sourcing process
• How many checkers can I run at the same time?
• Authentication with SharedKey
Getting started
Prerequisites
We have designed AzBlobChecker to work with the largest possible number of storage accounts. Please
use the following matrix to identify any known storage account features that AzBlobChecker will not
work with. If you are using these features, please contact your Microsoft team.
Account feature                                        Supported (Y/N)
Blob (Hot tier)                                        Y
Blob (Cool tier)                                       Y
Blob (Archive tier)                                    N – Archive objects will be queued for checking, but the checker will not check them.
Blob encrypted with Microsoft Managed Key              Y
Blob encrypted with Customer Managed Key               Y
Blob encrypted with Customer Provided Key              N
Storage account has SAS enabled                        Y
Storage account has Shared Key disabled                N
Storage account requires Managed Identity              N
Storage account requires Service Principal Identity    N
Storage account in Public Cloud                        Y
Classic (v1) storage accounts                          Y
Storage accounts GPv2                                  Y
(feature name not preserved)                           Y – requires modification of deployment script, not all features supported
Versioning                                             Y – only the current version will be checked
Snapshots                                              Y – only the current snapshot will be checked
Soft delete                                            Y – only non-deleted files will be checked
Components of AzBlobChecker
There are eight major components involved in running AzBlobChecker:
Figure 1: AzBlobChecker architecture and components
1. The application takes in a Target Storage account; this is the account you want to scan.
2. The results of the application run are written to the Ops Storage account. This account contains
storage tables and queues used to keep track of and log the progress of the application.
a. Note: The deployment scripts will automate the process of creating this account, and when
the application is run, it will automatically populate any required assets (queues, tables,
blobs) into this account.
3. An Application Insights instance for monitoring progress while the application is running. This gives
you near real-time insights into the progress of the application.
a. Note: The deployment scripts will automate the process of creating this resource.
4. An Azure Workbook that summarizes the key metrics you need to watch while AzBlobChecker is
running.
5. The sourcerer is a .NET core application deployed as a Docker container that will iterate over each of
the objects in your target storage account and place a message in a queue to have the object
scanned. You can add/remove instances of the sourcerer at any time during the sourcing process to
speed up/slow down the sourcing process.
6. The checker is a .NET core application deployed as a Docker container that scans each object
identified by the sourcerer. You can add/remove instances of the checker at any time during the
checking process to speed up/slow down the checking process.
7. The watchdog is a .NET core application deployed as a Docker container; it monitors the queue
depth and reports the status to Application Insights. You will only need one instance of this running.
8. The Azure Container Registry where Microsoft publishes the Docker containers for you to
download. Credentials to download assets from this registry will be provided by your Microsoft
contact.
When you deploy AzBlobChecker, the deployment script will deploy the Ops resource group and all the
resources inside of it. This will contain a Storage account, Application Insights instance, Azure
Workbook, an instance of the watchdog, and instances of the sourcerer and checker for each Target
Storage account you want to scan. The resource group will be named “rg-bc-<target storage
account name>”. We recommend one deployment per target account so that the load on the Ops Storage
account and the Application Insights instance is more manageable, and you can add/remove instances of
the sourcerer/checker based on the unique requirements of each Target Storage account.
Note: the above diagram and the process outlined in this doc assumes you are
deploying the Docker containers to Azure Container Instances, however you can
deploy these Docker containers to whatever platform makes the most sense for your
environment, for example you might choose to deploy to the Azure Kubernetes
Service.
The account to be scanned is called the “target account” for the purposes of this document.
AzBlobChecker (the tool) should ONLY be given READ access to the target account, as it does not need to
modify any data. The tool keeps track of everything it does in an operations storage account and
Application Insights.
AzBlobChecker is designed to work against a single target account. If you need to scan multiple
accounts, you can run multiple deployments of AzBlobChecker in parallel, up to your subscription limits.
Scanning a storage account requires 2 steps:
1. Listing all the files in the storage account
2. Checking each file
Each of these steps is completed by a different Docker container, allowing you to add/remove running
instances of each container, reducing the time it takes to scan the storage account and optimizing the
number of instances needed while keeping below the limits of the source storage account.
Listing all the files in the storage account
The first step is building a list of all the files that need to be scanned. We call this process “sourcing”,
and it is done by the “sourcerer”. The output of the sourcerer is an Azure Storage Queue, named “<your
account name>-online” for Hot/Cool objects in the target storage account and “<your account name>-offline”
for any Archive objects in the target storage account.
Note: The queues will only be created if objects of that type are found. Checking offline objects is not
supported by the tool at this time.
What does the sourcerer use to track the scanning process?
The sourcerer will also create an Azure Storage Table and Queue named “<your account name>”. This
storage table will provide a durable log of the entire scanning process. Both the queue and the table will
contain one message/record per object found in the target storage account.
How does the sourcerer work?
First each container in the account is listed using the List Containers API and a message is inserted into
the <your account name>-checkpoint queue.
For each container message in the <your account name>-checkpoint queue, the ListBlob API is called to
list all the contents of the container. This call is made using the optional delimiter attribute set to “/”.
This allows searching for any 'folders' in the blob names.
If a folder is found a new message is added to the <your account name>-checkpoint queue telling the
system to scan that folder.
Scanning in this manner breaks the task of listing all the contents of the account into smaller tasks by
container & folder. These smaller tasks can be distributed to many instances of the sourcerer running in
parallel.
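To see what this hierarchical listing looks like in practice, here is a minimal sketch using the Azure CLI's delimiter support (an illustration of the listing pattern only, not the tool's actual code; the account, container, and SAS values are placeholders):

# Lists the blobs and "folders" (BlobPrefix entries) at the top level of one
# container; each BlobPrefix is what the sourcerer would turn into a new
# checkpoint-queue message.
az storage blob list --account-name "<target-account>" --container-name "<container>" `
  --delimiter "/" --sas-token "<read-list-sas>"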
How does the sourcerer ensure each object is only sourced one time?
To prevent a target account from being sourced multiple times by accident, a blob lease is taken on a
blob in the operations storage account. The blob is named with your account name and is placed in a
container called “leaderelector”.
How do I know when the sourcerer is done listing all of the files in my target storage account?
The sourcerer is done when the objects per second sourced drops to zero and the sourcing queue depth
drops to zero. See the Azure Workbook deployed into your ops account to monitor these metrics.
Checking each file
Now that we have a list of all the objects in the target account, we can talk about checking these
objects.
• The checker pulls a message from the <your account name>-online queue and starts the
checking process on that object.
• If the object is found to have the characteristics of those that need further review, it is flagged
by placing a message in the <your account name>-online-review queue and the <your account
name>-review table.
• In the <your account name> table record, each object's status is updated so that you know that
the object was scanned.
• The checkers are done when the objects per second checked drops to zero and the checking
queue depth drops to zero. See the Azure Workbook deployed into your ops account to monitor
these metrics.
The entries in the review queue and table allow you to review the issue and take the appropriate action.
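If you want to spot-check flagged objects while a run is in progress, here is a minimal sketch using the Azure CLI (names are placeholders; peeking leaves the messages in the queue):

# Peek (without dequeuing) the first few flagged objects.
az storage message peek --queue-name "<target-account>-online-review" `
  --connection-string "<ops-connection-string>" --num-messages 5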
Deploying the application
The application is shipped using Docker containers. You can run these containers on whatever platform
makes the most sense for your organization. We have provided sample deployment scripts to ease your
deployment. The scripts contain the details of each deployment step; however, this can change based
on how you deploy. This document includes important configuration and monitoring details that are the
same regardless of how you deploy.
Sample Script to deploy to Azure Container Instances (recommended)
To run the script, you must:
• Have access to the subscription where the target storage account is located and permission to
generate a read/list SAS for the account. The deployment script can generate this SAS for you if
you have permissions to the account. If you do not have permissions to the account to generate
the SAS, work with an administrator of the storage account to create a read/list SAS with the
required permissions (see Settings for account SAS for more detail on the required permissions).
• Have access to create the Ops resource group with a storage account, Application Insights
instance, workbook, and Azure Container Instances within your Azure subscription.
• PowerShell version 7.1 or better. Run “$PSVersionTable” to check.
o Instructions on how to install PowerShell here
• Azure CLI version 2.29 or better. Run “az --version” to check.
• Note: The PowerShell version of the Azure Cloud Shell meets the technical requirements for both
PowerShell and the Azure CLI. This allows you to run the deployment scripts from your browser, if
you have the appropriate credentials.
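As a quick sanity check before running the script, here is a small sketch that verifies both prerequisites (versions per the requirements above):

# Verify PowerShell 7.1 or later.
if ($PSVersionTable.PSVersion.Major -lt 7 -or ($PSVersionTable.PSVersion.Major -eq 7 -and $PSVersionTable.PSVersion.Minor -lt 1)) {
    Write-Warning "PowerShell 7.1 or later is required"
}
# Verify Azure CLI 2.29 or later ('az version' emits JSON with an 'azure-cli' key).
$azVersion = [version](az version --query '"azure-cli"' -o tsv)
if ($azVersion -lt [version]"2.29") { Write-Warning "Azure CLI 2.29 or later is required" }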
Step 1: Download the script
Execute this command to download the script:
Invoke-WebRequest "https://aka.ms/AAebeuk" -OutFile aci-deploy.ps1
Step 2: Run the script in Interactive Mode
To run the script in Interactive Mode where it will prompt you for all the configuration information,
execute this command:
.\aci-deploy.ps1
• Note: When you run the script for the first time you will be prompted to provide
o Information about the Azure Container Registry where you will get AzBlobChecker from.
These will be provided by your Microsoft Contact.
o # of required sourcerers – We suggest starting with 2 sourcerers
o Details about the target storage account
o Details about where you want the ops resources
Step 3: Monitor the application
After your initial deployment the script will provide you a link to your Azure Workbook dashboard where
you can monitor the application. Pull up this dashboard and let the application run for at least 15
minutes. It takes this long for the telemetry to start flowing into the dashboard.
Step 4: Change the number of running sourcerers or checkers
After the script is run for the first time, it will provide you a parameterized way to re-run the script
without needing to walk through the Interactive Mode again. This makes it easy to re-run the script
when you need to add/remove running instances of the sourcerers or checkers.
You can add instances if you have excess capacity on your target storage account/ops queues or remove
instances if you need to free up resources for other users of your target storage account. You can also
set the number of instances to zero to completely stop them if you need to pause or if that step in the
process has been completed.
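For example, scaling a run might look like the following sketch. The parameter names below are hypothetical placeholders, not the script's documented interface; use the exact parameterized command line that the script prints after its first run:

# NOTE: -SourcererCount and -CheckerCount are hypothetical names used for
# illustration; substitute the parameters from your generated command line.
.\aci-deploy.ps1 -SourcererCount 2 -CheckerCount 5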
The script will also generate a log file, providing you a record of each run of the script you make.
A full deployment to ACI will consist of a resource group with the following:
• An ops storage account
• An Application Insights instance
• An Azure Workbook
• One instance of the watchdog
• The requested instances of the sourcerer and checker
Figure 2: Resources created in default deployment
Sample Script to deploy to an AKS Cluster
Download
Environment Variable Definitions for Sourcerer and Checker
Note: The below settings assume a default deployment configuration. You can select other options using
the additional environment variables described in the appendix.

Monitoring
• ApplicationInsights__ConnectionString – The connection string for the Application Insights instance the container should log to. Note: This is the recommended approach, working in all regions.

Authentication with Shared Access Signature
• AppSettings__TargetAuthenticationMode – SharedAccessSignature
• AppSettings__TargetStorageAccountEndpoint – Typically “blob.core.windows.net” for target storage accounts in public Azure regions
• AppSettings__TargetStorageAccountName – The name of your target storage account, without the endpoint (i.e., myaccount)
• AppSettings__TargetStorageAccountSas – An account SAS (more info). See Settings for account SAS for details on how to create the SAS token with the required permissions.

Operations account info
• AppSettings__OpsConnectionString – The connection string for the Ops storage account. The Ops storage account is used for any blobs, tables and/or queues the application needs to create.

Sourcerer configuration
• AppSettings__SourcingStrategy – Queue
• AppSettings__LastModifiedOnValidationThreshold – Files older than this date will not be checked. This value should be “2019-10-31”
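For reference, here is a minimal sketch of how these settings might be supplied when creating an ACI container by hand (the image name, registry credentials, and resource names are placeholders; the provided deployment script handles all of this for you):

az container create `
  --resource-group "rg-bc-<target-account>" `
  --name "checker-1" `
  --image "<registry>.azurecr.io/<checker-image>:<tag>" `
  --registry-username "<registry-user>" `
  --registry-password "<registry-password>" `
  --environment-variables `
    AppSettings__TargetAuthenticationMode=SharedAccessSignature `
    AppSettings__TargetStorageAccountEndpoint=blob.core.windows.net `
    AppSettings__TargetStorageAccountName="<target-account>" `
    AppSettings__SourcingStrategy=Queue `
  --secure-environment-variables `
    AppSettings__TargetStorageAccountSas="<account-sas>" `
    AppSettings__OpsConnectionString="<ops-connection-string>" `
    ApplicationInsights__ConnectionString="<app-insights-connection-string>"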
Settings for account SAS
Use the following options to create the account SAS, and select the SAS token value for the
AppSettings__TargetStorageAccountSas parameter:
Figure 3: Target account SAS settings
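If you prefer the CLI to the portal, here is a sketch of generating a read/list account SAS consistent with the requirements above (the expiry is a placeholder; Figure 3 shows the exact option set):

# --services b: blob service; --resource-types sco: service, container, and
# object; --permissions rl: read and list.
az storage account generate-sas --account-name "<target-account>" `
  --services b --resource-types sco --permissions rl --expiry "<expiry-utc>"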
Monitoring & Metrics To keep track of progress, it is essential to understand how the application stack is performing. To
always track application status, AzBlobChecker utilizes Metrics reported through Application Insights.
Each minute, several metrics are being captured and sent as individual time series, enriched with many
object-level dimensions allowing a very flexible way to slice output.
Four main metrics are currently captured:
• Objects listed is emitted for each object listed from the target storage account, through any of
the provided listing methods, and offers the possibility to query for the total number of objects
and the related total size present in the account.
• Objects to check is emitted for each object that needs to be checked and complies with the pre-
validation and filters provided to the tool. This metric offers the possibility to query for the total
number of objects to be checked and the related size of those objects present in the account.
• Objects checked is emitted for each object that was run through the checking logic, regardless of
the result of that check. Combined with the “Objects to check” metric, this offers an easy
way to evaluate progress and checking performance.
• Objects to review is emitted for each object that was run through the checking logic and
requires additional validation. This is the most important metric to leverage, as it can be used to
evaluate how many objects need manual intervention.
Note: Application Insights metrics are lossy (i.e., they might not record every transaction). They are used to
monitor progress of the sourcerers and checkers in a manner that is available to all Azure customers.
Because metrics in Application Insights are lossy, it is possible that aggregates such as “Objects
checked” will not match the number of objects in a scanned account. This is expected behavior. Please
see Auditing the Application to query the detailed logs.
Note: The Application Insights metrics that we are emitting from the application are not real time;
typically, metrics arrive within 5 minutes.
Each metric is enriched with queryable properties:
• Timestamp, with a granularity of 1 minute
• valueCount, which summarizes the total number of objects represented in a given time window
• valueSum, which summarizes the total object size represented in a given time window
• valueMin, which summarizes the minimum object size represented in a given time window
• valueMax, which summarizes the maximum object size represented in a given time window
And these (custom) dimensions:
• Storage Account indicating to which target storage account the metric belongs
• Storage Class indicating whether the metric is about objects in an Online or Offline tier
• Access Tier indicating whether the metric is about objects in the Hot, Cool or Archive tier
• Blob Type indicating whether the metric is about an Append Blob, Page Blob or Block Blob
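If you want to query these metrics outside the workbook, here is a minimal sketch using the Azure CLI's application-insights extension (the app ID is a placeholder, and the metric names are assumed to match the display names above):

$query = @"
customMetrics
| where name in ('Objects listed', 'Objects to check', 'Objects checked', 'Objects to review')
| summarize Objects = sum(valueCount), Bytes = sum(valueSum) by name, bin(timestamp, 5m)
"@
az monitor app-insights query --app "<app-insights-app-id>" --analytics-query $query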
Key Metrics to keep an eye on
While the application is running, you want to keep an eye on the following key metrics. If you deployed
using the provided ACI Deployment script, then all of these queries have been deployed into an Azure
Workbook for you.
Current Status and Estimated Completion
This query provides an estimate of when the job will complete. When the workbook is deployed, the
deployment script queries Azure Monitor to find out the current number of objects and bytes in your
target storage account. When the query is run, it pulls the number of files and bytes currently processed
and does a simple linear estimate based on when the process started. Since it is a linear estimate, it
doesn’t account for when you scale up/down the sourcerers/checkers or if you have to pause the run for
whatever reason.
The query:
Note: you need to fill in the first three parameters if you are manually running this query. If you
used the deployment script these values will be populated for you when the Azure Workbook is
created.
The results
• Name – the name of the metric that you are looking at; see above for a description of
what each one means.
• Storage Account – the target storage account name
• Objects – Total number of objects that have been processed by that metric
• Duration – Time delta between the arrival of the first and last telemetry records for that metric
• Objects Per Second – Number of Objects processed divided by duration for that metric
• Object Size Per second – Number of bytes processed divided by duration for that metric
• Object Size – Total number of bytes that have been processed by that metric
• Total ETA Objects – Assuming you continue to process at the same speed how long will it take in
total to complete the job, estimated by number of objects.
• ETA Objects - Assuming you continue to process at the same speed WHEN will the job complete,
estimated by number of objects.
• Total ETA Bytes – Assuming you continue to process at the same speed how long will it take in
total to complete the job, estimated by number of bytes.
• ETA Bytes - Assuming you continue to process at the same speed WHEN will the job complete,
estimated by number of bytes.
• Pct Complete Objects – Total number of objects processed divided by total number of objects in
the storage account.
• Pct Complete Object Size – Total number of bytes processed divided by the total number of
bytes in the storage account.
• Start – Date/Time when the first telemetry item for that metric arrived.
• Stop - Date/Time when the last telemetry item for that metric arrived. Good for measuring the
lag between now and the data you are looking at from Application Insights. Also good for
knowing when a task (i.e., sourcing or checking) is complete; if you are no longer getting
telemetry for 15 minutes or so, it is probably done.
Note: We calculate the above on both objects and bytes. Object measures are good for estimating
sourcerer performance/completion (i.e. “objects listed” and “objects to check”). Bytes or Size
measures are good for estimating checker performance/completion (i.e. “objects checked”).
Objects processed per second
This query provides you a time chart of the number of objects processed. This allows you to see the
impact of adding/removing sourcerers/checkers over time.
The query:
Note: you need to fill in the leading parameters if you are manually running this query. If you used
the deployment script, these values will be populated for you when the Azure Workbook is
created.
The results
Here you can see the blue lines (“objects to check” & “objects listed”) created by the sourcerer
telemetry, and the orange line (“objects checked”) over time. Hovering over a point on the chart gives
you the average number of objects processed per second, using a 5-minute window.
Target Account Egress
This query provides you a view of the amount of data egressing out of the target account. This is
inclusive of any egress created by AzBlobChecker and any other users of this storage account. If you pull
data too fast the egress will exceed the account limit and all users can get throttled. Typically you want
to keep egress below 375 GiB per min. If you have excess egress capacity and want to speed up the
checking process you can add additional checkers. If you are using too much egress you can remove
checkers to reduce the load on the target account.
The results
Hover over the chart to see the amount of egress per min.
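You can also pull this metric from the CLI; here is a minimal sketch (names are placeholders) using the storage account's "Egress" metric:

# Resolve the target account's resource ID, then pull per-minute egress totals.
$id = az storage account show --name "<target-account>" --resource-group "<rg>" --query id -o tsv
az monitor metrics list --resource $id --metric "Egress" --interval PT1M --aggregation Total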
Ops Account Queue Transactions
This query provides you a view of the number of queue transactions hitting the ops account. Both the
sourcerer and checker contribute to this number; however, typically sourcing too fast is the key driver of
this metric. To avoid throttling, you want to keep this metric below 120,000 transactions per minute.
The results
Hover over the chart to see the number of transactions per min.
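The same approach works from the CLI for queue transactions on the ops account; note that the queue metrics live on the queue service sub-resource (a sketch, names are placeholders):

$id = az storage account show --name "<ops-account>" --resource-group "<rg>" --query id -o tsv
az monitor metrics list --resource "$id/queueServices/default" `
  --metric "Transactions" --interval PT1M --aggregation Total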
Sourcing Queue Depth
This query provides you a view of the total number of messages that are in the sourcing queue (i.e.
<target storage account name>-checkpoint queue). This is the number of folders that the sourcerer has
found but has not looked in yet. This number will grow and shrink while the sourcerer is running based
on how the data is structured in the target storage account. When this number is 0 for 15 minutes and
you are no longer seeing any objects processed, then it is safe to shut down your sourcerers.
The query
The results
Hover over the chart to see the number of messages in the queue at that point in time.
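You can also check the queue depth directly; here is a minimal sketch using the Az.Storage module (names are placeholders; ApproximateMessageCount is an estimate and may require a recent Az.Storage version):

# Connect to the ops account and read the approximate depth of the sourcing queue.
$ctx = New-AzStorageContext -ConnectionString "<ops-connection-string>"
$queue = Get-AzStorageQueue -Name "<target-account>-checkpoint" -Context $ctx
$queue.ApproximateMessageCount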
Checking Queue Depth
This query provides you a view of the total number of messages that are in the checking queues (i.e.
<target storage account name>-online and <target storage account name>-offline queue). This is the
number of files that the checker needs to process. This number will grow and shrink while the sourcerer
is running based on how the data is structured in the target storage account. When the online queue is 0
for 15 minutes and you are no longer seeing any objects sourced, then it is safe to shut down your
checkers.
The results
Hover over the chart to see the number of messages in the queue at that point in time.
Auditing the Application
The Application Insights metrics above are good to get a quick, high-level overview of the progress of
the application. However, the application keeps detailed logs in an Azure Storage Table named with the
target storage account name (in the Ops storage account) detailing what it does to ensure that every file
gets checked.
You can query this table directly if you want to look at the status of a handful of files. However, if you
want to do more complex aggregations/summaries, the recommended practice is to copy this table into
Azure Data Explorer.
Note: We recommend waiting until after the checking process is completed to load the data into ADX, as
the loading process will impact the performance of the checking process.
You can also load the results of Blob Inventory into Azure Data Explorer. This allows you to quickly identify any
files that might have been missed during sourcing.
To load the results of Blob Inventory into Azure Data Explorer
Note: Blob Inventory is not enabled by default on storage accounts, more info here.
This can be done using the “one-click”/lightspeed ingestion tool in Azure Data Explorer. Review the
general process of using lightspeed here. Specific directions for our use case are below.
1. Create a cluster and database, instructions here.
2. Open the “Query window” for your database.
3. Right-click on the database and select “Ingest new data” (this will bring up the one-click wizard).
4. The destination cluster name and database name should be populated. Select “create new
table” and give it a name (i.e., “BI”). Then, press “next”.
5. Source Type should be “from blob container”.
6. Ingestion Type should be “historical data”.
7. Select source – this should point to your target storage account. Use either a SAS URL or the GUI
to select the proper account/container.
8. Under File Filters, set the folder path (i.e., “2021/09/30/12-55-33/inventoryrule”) and the File
extension (i.e., “.csv”).
9. Select any one of your CSVs as the schema defining file and press “next”.
10. On the schema step, check the data types to ensure that they are correct. “Content length” is a
long, “LastAccessTime” is a datetime, “AccessTierChangeTime” is a datetime and ensure that
“ignore first record” is selected.
11. Press “Next”. This will generate a LightIngest command. Follow the onscreen instructions to
download the LightIngest tool and run the generated command, which loads the data into ADX.
The generated command begins LightIngest.exe "https://ingest-… (the remainder is specific to your cluster).
To load the checker Log Table into Azure Data Explorer
1. Create a cluster and database if you have not done so already. The instructions are here.
2. In your cluster, create a table to store the data. Execute the query below to create the table.
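A sketch of the table-creation command; the column list below is reconstructed from fragments of this guide and appears truncated after Size:int64, so confirm the full schema with your Microsoft contact before ingesting:

.create table checker (Timestamp:datetime, AccessTier:string, Account:string, Checksum:string, Container:string, CreatedOn:datetime, Endpoint:string, Findings:string, IsValid:bool, IsValidated:string, LastModifiedOn:datetime, Name:string, NeedsValidation:bool, Path:string, RunId:string, Size:int64, …)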
3. Grant Access
a. NOTE: we are using a service principal, but you can use the configuration best suited for
your environment.
b. Create a Service Principal, instructions here.
c. Give your Service Principal “Database Ingestor” permissions in your DB, instructions
here.
4. Create an instance of Azure Data Factory, instructions here (just this section not the whole doc).
5. Once in Azure Data Factory Studio, on the “Home” tab, select “Ingest”.
6. Select “Built-in copy” and “run once now”. Press “next”.
7. The “Source” is Azure table storage. Create a new connection to your ops account (if needed)
and select the table where the name matches the target storage account name.
8. Press “Next”, and “Next” again when prompted.
9. Now, you will be asked for the target. Select “ADX” and create a new connection (if needed)
using the service principal we created earlier.
10. Next, select the target table that we created earlier (i.e., “checker”). Select “Skip column
Mapping” and press “Next”. Select “Next” again when prompted.
11. You should now be on the settings screen. Here you can leave all the defaults and press “Next”.
12. Then, select “Next” on the summary screen.
13. After some validation and deploying, press “monitor” to watch the progress.
Note: In sample runs, it took approximately 2 hours and 15 minutes to ingest 24 million rows from table
storage into ADX.
Sample ADX Queries
With everything loaded, you can now quickly query ADX for any information you need.
Files that need to be reviewed
Note: This list is also in a separate table (“<your account name>-review”). You do not need to load
everything into ADX if this is the only data you are seeking.
A sketch of a summary query over the checker table, splitting each blob name on “.” to group by
extension (reconstructed from surviving fragments; adjust column names to your schema):

checker
| extend Ext = tostring(split(Name, ".")[-1])
| summarize count() by Ext, Type, NeedsValidation, IsValidated, IsValid, AccessTier

Review the total objects in each container from the blob inventory table and the checker log table, and
compare the object counts. A sketch of one way to express this comparison (the join shape is an
assumption; CheckerCount comes from the surviving fragments):

let checkerCounts = checker
| summarize CheckerCount = count() by Container;
blobInventory
| summarize InventoryCount = count() by Container
| join kind=leftouter checkerCounts on Container
Other sourcing Algorithms
• Sequential - Good for very small accounts (under 1 TiB or 100,000 files). While this algorithm is
simple, it can only be executed on one thread, making it inefficient for use on accounts with
many objects.
o First, each container in the account is listed using the List Containers API.
o Next, the ListBlob API is used to list all the contents of the container.
o Set the container environment variable AppSettings__SourcingStrategy to Sequential
• BlobInventory - good for the largest accounts where Blob Inventory is available, and scales
based on the number of CSVs generated by Blob Inventory.
o Set the container environment variables:
AppSettings__SourcingStrategy to BlobInventory
…the inventory manifest to use
• NOTE: for this sourcing strategy we recommend the following queue
configuration: … = 1
o First, you need to enable Blob Inventory on your account. Choose the following options:
Rule Name - whatever you want
Container - whatever you want - NOTE: this is the container that the inventory
will be placed in. I named my container “inventory”.
Object to inventory – Blob
Blob types - select all 3, Block blobs, Page blobs, and Append blobs
Blob subtypes - don't select any. Scanning blob version and snapshots is not
supported in this version of the tool.
Blob inventory fields - select everything EXCEPT metadata
Inventory frequency - select “Daily”
Prefix match - leave blank
o You will now need to wait around 24 hrs, depending on the size of your account, to give
Blob Inventory a chance to run. Once it is complete, look in the container you told it to
put your inventories in. You should see a folder for the year, month, day, time, and rule
name that the inventory ran containing the Blob Inventory results.
The first thing to look for is a “<rule name>-manifest.checksum” file. This will be
the last thing that blob inventory writes. If you don't see it then the blob
inventory is still processing.
Next, look for a “<rulename>-manifest.json” file. This contains a summary of the
inventory process including total number of objects/bytes found and a list of all
the CSVs generated.
The number of CSVs generated depends on the amount of data in your account.
o The sourcerer will start by downloading the inventory manifest. For each inventory CSV file, it will
put a message in the “<your account name>-checkpoint” queue.
• Each sourcerer instance will process one CSV at a time and will keep track of its progress reading
each CSV in the “<your account name>csvrowcheckpoint” table.
• NOTE: You can turn off blob inventory after it has generated the necessary files the first time.
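Blob Inventory can also be enabled from the CLI; here is a minimal sketch, assuming a policy file that encodes the options above (daily CSV inventory of all three blob types, no subtypes):

# inventory-policy.json should encode the rule described above.
az storage account blob-inventory-policy create `
  --account-name "<target-account>" --resource-group "<rg>" `
  --policy "@inventory-policy.json"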
Troubleshooting ACI containers that do not seem to be doing anything
If you have deployed via ACI, the containers have deployed, and after a few minutes you do not see any logs
appearing in Application Insights:
o Check the restart count; a number greater than zero for a container that has just started is
typically a sign that the container is misconfigured.
Figure 4: Checking ACI restart count
o You can view the logs from within a running ACI instance by navigating to the “Logs” tab; here
you should see the configuration that the running container emitted.
Figure 5: Viewing ACI logs
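The same checks can be done from the CLI (the container group name depends on your deployment):

# Restart count for the first container in the group.
az container show --resource-group "rg-bc-<target-account>" --name "<container-group>" `
  --query "containers[0].instanceView.restartCount"
# Recent logs from the container.
az container logs --resource-group "rg-bc-<target-account>" --name "<container-group>"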
How many sourcerers can I run at the same time?
• Sourcing and checking can run on the target account at the same time.
• By default, most Azure subscriptions are limited to 100 ACI instances per region, however this
varies by subscription type. If you are scanning multiple accounts in the same region in parallel,
you might need to prioritize to ensure you stay under your subscription limit.
• For accounts with larger files (e.g., over 4 MiB), sourcing should be many times faster than
checking. So, we recommend running fewer sourcerers than checkers.
• We recommend starting with a few sourcerer instances (e.g., 2), monitoring their impact on your
account and then adding/removing instances.
• To determine if you have the right number of sourcerers:
o First, look at the queue transaction counts on the ops storage account. You want to
keep this number under 120,000 transactions per minute per queue. NOTE: You can
create an Azure monitor chart with a 1-minute time grain to simplify monitoring this.
Figure 6: Queue transaction counts in Azure Monitor metrics
o Next, look at the number of messages in the <your account name>-online queue. If this
number is going up, you are sourcing faster than you are checking. NOTE: This number
will stop going up once all the items in your target storage account have been listed. At
that time, you can turn off/delete all your running sourcerers.
Figure 7: Browsing queue messages in Storage browser (preview)
Restarting the sourcing process
To restart the sourcerer process from the beginning you will need to:
• Stop all running instances of the sourcerer
o See the deployment script for a sample based on your deployment model
• Break the lease on the blob
o In the Azure Portal, find your storage account; in the left menu select Data Storage, then
Containers, and select the leaderelector container
Figure 8: Browsing blob containers in the Azure portal
o You should see a blob with your account name and a “Lease state” of “leased”
Figure 9: Viewing leased blobs
o Select the 3 dots and then “Break Lease”
Figure 10: Breaking a blob lease
• Clear out the contents of:
o The <your account name>-checkpoint queue
o The <your account name>-online and <your account name>-offline queues
o The <your account name>csvrowcheckpoint table (if you are using the blob inventory
sourcerer)
• The process is the same to clear out the above queues
o Select “Data Storage” then “Queues” then the name of the queue you want to clear
Figure 11: Selecting a queue to clear in the Azure portal
o Press the “Clear queue” button
Figure 12: Location of the Clear queue button
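The same reset can be scripted; here is a minimal sketch using the Azure CLI against the ops account (names follow the conventions described above; the connection string is a placeholder):

$cs = "<ops-connection-string>"
# Break the leader-election lease.
az storage blob lease break --container-name "leaderelector" `
  --blob-name "<target-account>" --connection-string $cs
# Clear the sourcing and checking queues.
az storage message clear --queue-name "<target-account>-checkpoint" --connection-string $cs
az storage message clear --queue-name "<target-account>-online" --connection-string $cs
az storage message clear --queue-name "<target-account>-offline" --connection-string $cs
# Only if you used the BlobInventory sourcerer:
az storage table delete --name "<target-account>csvrowcheckpoint" --connection-string $cs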
How many checkers can I run at the same time?
• Sourcing and checking can run on the target account at the same time.
• By default, most Azure subscriptions are limited to 100 ACI instances per region, however this
varies by subscription type. If you are scanning multiple accounts in the same region in parallel,
you might need to prioritize to ensure you stay under your subscription limit.
• For accounts with larger files (e.g., greater than 4 MiB), sourcing should be many times faster
than checking. So, the recommended best practice is running more checkers than sourcerers.
• We recommend starting with a few checkers (e.g., 5), monitoring their impact on your account
and then adding/removing instances.
• The checking process is typically limited by the egress limits for the target storage account. It is
very important to keep an eye on the amount of data being pulled from the target account to
ensure enough room for any other users of the data in the account.
• By default, v2 storage accounts have a 50 Gbps egress limit. You can pull the sum of blob egress
summarized by minute from Azure Monitor. (You are looking to stay under 375 GiB per min.)
Figure 13: Viewing account egress in Azure Monitor metrics
If your account has lots of very small files, you will likely not hit the egress limit on your target
storage account. You should also monitor the queue limits on your ops storage account (see the
sourcerer section above for details).
Authentication with SharedKey
• To authenticate with a SharedKey, set the following environment variables on the containers:
o AppSettings__TargetAuthenticationMode - SharedKey
o AppSettings__TargetStorageAccountEndpoint - typically “blob.core.windows.net” for
target storage accounts in public Azure regions
o AppSettings__TargetStorageAccountName - the name of your storage account, without
the endpoint (i.e. myaccount)
o AppSettings__TargetStorageAccountKey (assumed setting name, following the pattern above) -
the shared key for your storage account - more info
• AppSettings__MaximumQueueRetrievalBackOffDurationInSeconds - time to sleep before polling
the queue again to look for new messages. Default Value=60
• AppSettings__MaximumNumberOfConcurrentMessageHandlers - the number of parallel
message handlers to run. The more message handlers used, the more resources your container
needs and, typically, the more messages the container can process. Default Value=32
• AppSettings__MaximumNumberOfRetriesOnFailure - this variable represents how many times
a failed operation will be retried.