Recently I needed to index documents stored in Azure Blob Storage. I also wanted to attach some extra information (metadata) to those documents. Because I needed rich text in the metadata, I couldn't use blob storage metadata directly. See here why.
So I had to use two different data sources: one for the documents and another for the metadata. I chose Azure Blob Storage and Azure Table Storage. This is the full diagram of the final solution:
The indexers are responsible for updating the index with the contents of the two data sources. One field is especially important: in my case it's called UniqueIdentifier, and it is marked with the key property. This is the field that uniquely identifies each document in the Azure Search index.
It is also the field that correlates the items coming from one data source (documents from blob storage) with the items coming from the other (records from table storage). Because both indexers target the same index and produce the same key for related items, their fields end up in a single search document.
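To make the correlation concrete, here is a minimal sketch of the idea in Python. It is only a toy model of "two sources merged by key", not Azure Search code; the function name `merge_by_key` and the sample values are made up for illustration.

```python
from typing import Dict, Iterable, Mapping


def merge_by_key(index: Dict[str, dict], docs: Iterable[Mapping],
                 key_field: str = "UniqueIdentifier") -> Dict[str, dict]:
    """Merge each incoming document into the entry that shares its key value."""
    for doc in docs:
        key = doc[key_field]
        # Create the entry the first time a key is seen; afterwards only
        # add or overwrite the fields that the incoming document carries.
        index.setdefault(key, {}).update(doc)
    return index


# Hypothetical sample data, not taken from the real solution.
blob_docs = [{"UniqueIdentifier": "doc-001", "content": "text extracted from the blob"}]
table_docs = [{"UniqueIdentifier": "doc-001", "ClientCode": "C42", "Year": 2017}]

index: Dict[str, dict] = {}
merge_by_key(index, blob_docs)   # what the documents indexer contributes
merge_by_key(index, table_docs)  # what the metadata indexer contributes
print(index["doc-001"])
```

After both passes there is still a single entry for `doc-001`, now carrying the fields from both sources.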
Every document inserted in blob storage has a custom metadata property, also named UniqueIdentifier, whose value matches a table storage record holding the corresponding metadata.
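using Microsoft.Azure;                     // CloudConfigurationManager
using Microsoft.WindowsAzure.Storage;      // CloudStorageAccount
using Microsoft.WindowsAzure.Storage.Blob; // Blob storage types

// Retrieve storage account from connection string.
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
    CloudConfigurationManager.GetSetting("StorageConnectionString"));

// Create the blob client.
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();

// Retrieve a reference to a container.
CloudBlobContainer container = blobClient.GetContainerReference("mycontainer");

// Create the container if it doesn't already exist.
container.CreateIfNotExists();

// Get a reference to an uploaded document ("mydocument.pdf" is a placeholder name).
CloudBlockBlob blob = container.GetBlockBlobReference("mydocument.pdf");

// Add the metadata to the blob itself (not the container) and persist it.
// The blob must already exist for SetMetadata to succeed.
blob.Metadata["UniqueIdentifier"] = "bla bla bla";
blob.SetMetadata();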
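The table storage records store these values in the RowKey, and the metadata indexer has a field mapping from the source (RowKey) to the destination (UniqueIdentifier). You can check this mapping later in the indexer definition. On the blob side no explicit mapping is needed, because the blob indexer maps custom metadata properties to index fields with the same name.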
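The pairing between a blob and its table record can be sketched as follows. This is an illustration under assumptions, not the code used in the original solution: the helper name `new_document_records`, the use of a random UUID as the identifier, and the sample values are all hypothetical. The PartitionKey holds the client code because that is what the metadata indexer maps to ClientCode.

```python
import uuid
from typing import Dict, Tuple


def new_document_records(client_code: str,
                         metadata: Dict[str, object]) -> Tuple[Dict[str, str], Dict[str, object]]:
    """Build the blob metadata and the table entity that share one identifier."""
    # A hex UUID contains only letters and digits, so it is safe both as a
    # table RowKey and as a search document key (assumption: any stable,
    # unique value made of such characters would work equally well).
    unique_id = uuid.uuid4().hex
    blob_metadata = {"UniqueIdentifier": unique_id}
    table_entity = {"PartitionKey": client_code, "RowKey": unique_id, **metadata}
    return blob_metadata, table_entity


blob_metadata, table_entity = new_document_records(
    "C42", {"Opportunity": "Website redesign", "Year": 2017})
print(blob_metadata["UniqueIdentifier"] == table_entity["RowKey"])  # True
```

The blob metadata would then be set on the uploaded blob and the entity inserted into the table, using whichever storage client you prefer.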
First of all, I created the index:
(Note the UniqueIdentifier field, which has the key property set to true.)
{
  "@odata.context": "https://something.search.windows.net/$metadata#indexes/$entity",
  "@odata.etag": "\"0x8D477848BDEFA50\"",
  "name": "full-index",
  "fields": [
    { "name": "ClientCode", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": true, "facetable": true, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "UniqueIdentifier", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": false, "facetable": false, "key": true, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "ETag", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "Timestamp", "type": "Edm.DateTimeOffset", "searchable": false, "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "Key", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "Opportunity", "type": "Edm.String", "searchable": true, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": "pt-Pt.lucene" },
    { "name": "OpportunityCode", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "DocumentExtension", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": false, "facetable": true, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "DocumentType", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": false, "facetable": true, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "ClientName", "type": "Edm.String", "searchable": true, "filterable": true, "retrievable": true, "sortable": true, "facetable": true, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": "pt-Pt.lucene" },
    { "name": "Year", "type": "Edm.Int32", "searchable": false, "filterable": true, "retrievable": true, "sortable": true, "facetable": true, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "Path", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "LocalPath", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "Name", "type": "Edm.String", "searchable": true, "filterable": false, "retrievable": true, "sortable": true, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": "pt-Pt.lucene" },
    { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "retrievable": false, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": "pt-Pt.lucene" },
    { "name": "metadata_storage_content_type", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "metadata_storage_size", "type": "Edm.Int64", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "metadata_storage_last_modified", "type": "Edm.DateTimeOffset", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "metadata_storage_content_md5", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "metadata_storage_name", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "metadata_storage_path", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "metadata_author", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "metadata_content_type", "type": "Edm.String", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "metadata_creation_date", "type": "Edm.DateTimeOffset", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null },
    { "name": "metadata_last_modified", "type": "Edm.DateTimeOffset", "searchable": false, "filterable": false, "retrievable": true, "sortable": false, "facetable": false, "key": false, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null }
  ],
  "scoringProfiles": [],
  "defaultScoringProfile": "",
  "corsOptions": null,
  "suggesters": [
    {
      "name": "temp-suggester",
      "searchMode": "analyzingInfixMatching",
      "sourceFields": [ "ClientCode", "Opportunity", "ClientName", "Name" ]
    }
  ],
  "analyzers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": []
}
After the index was created, I created the container and uploaded my files. This is the corresponding data source:
{
  "@odata.context": "https://something.search.windows.net/$metadata#datasources/$entity",
  "@odata.etag": "\"0x8D526E01DDB35F4\"",
  "name": "opportunities-datasource",
  "description": "",
  "type": "azureblob",
  "subtype": null,
  "credentials": {
    "connectionString": null
  },
  "container": {
    "name": "opportunities",
    "query": null
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": null
}
I also created the table in Table Storage and inserted the metadata records. This is the data source:
{
  "@odata.context": "https://something.search.windows.net/$metadata#datasources/$entity",
  "@odata.etag": "\"0x8D526E01DB70AF2\"",
  "name": "ecmcreatemeta",
  "description": "",
  "type": "azuretable",
  "subtype": null,
  "credentials": {
    "connectionString": null
  },
  "container": {
    "name": "opportunities",
    "query": null
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": null
}
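The two definitions above are shown as the service returns them, which is most likely why the connection string appears as null. As a rough sketch of how such a definition could be sent to the service, the snippet below builds (but does not send) a REST request with Python's standard library. The helper name `build_put_request`, the `api-version` value and the placeholder credentials are assumptions for illustration; check them against the API version you actually use.

```python
import json
import urllib.request


def build_put_request(service: str, collection: str, definition: dict,
                      api_key: str, api_version: str = "2020-06-30") -> urllib.request.Request:
    """Build a create-or-update request for an index, data source or indexer."""
    url = (f"https://{service}.search.windows.net/"
           f"{collection}/{definition['name']}?api-version={api_version}")
    body = json.dumps(definition).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, method="PUT", headers=headers)


# Table data source, with placeholders where the real secrets would go.
table_datasource = {
    "name": "ecmcreatemeta",
    "type": "azuretable",
    "credentials": {"connectionString": "<storage-connection-string>"},
    "container": {"name": "opportunities"},
}

request = build_put_request("something", "datasources", table_datasource, "<admin-api-key>")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

The same helper would work for the index (`indexes`) and the indexers (`indexers`), since only the collection segment of the URL changes.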
And these are the indexers. The documents indexer:
{
  "@odata.context": "https://something.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8D477862D23289E\"",
  "name": "full-indexer-blob",
  "description": "",
  "dataSourceName": "opportunities-datasource",
  "targetIndexName": "full-index",
  "schedule": null,
  "parameters": {
    "batchSize": 10,
    "maxFailedItems": 10,
    "maxFailedItemsPerBatch": 10,
    "base64EncodeKeys": false,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "failOnUnsupportedContentType": false,
      "indexedFileNameExtensions": ".pdf, .docx, .doc, .xlsx, .xls, .pptx, .ppt, .msg, .html, .htm, .xml, .zip, .eml, .txt, .json, .csv"
    }
  },
  "fieldMappings": [],
  "disabled": null
}
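The indexedFileNameExtensions setting limits which blobs the documents indexer picks up. The snippet below is a local approximation of that filter, handy for checking which files in a folder would be indexed before uploading them. The function names are hypothetical, and treating extensions as case-insensitive is an assumption, not something confirmed for the service.

```python
import os
from typing import Set


def parse_extensions(setting: str) -> Set[str]:
    """Split the comma-separated setting into a set of lower-case extensions."""
    return {ext.strip().lower() for ext in setting.split(",") if ext.strip()}


def is_indexed(blob_name: str, setting: str) -> bool:
    """Return True when the blob's extension appears in the setting."""
    _, extension = os.path.splitext(blob_name)
    # Assumption: compare case-insensitively.
    return extension.lower() in parse_extensions(setting)


setting = ".pdf, .docx, .doc, .xlsx"
print(is_indexed("proposal.PDF", setting))  # True
print(is_indexed("logo.png", setting))      # False
```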
And the metadata indexer:
{
  "@odata.context": "https://something.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8D47784D3CE80DF\"",
  "name": "full-indexer-meta",
  "description": "",
  "dataSourceName": "ecmcreatemeta",
  "targetIndexName": "full-index",
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": false,
    "configuration": {}
  },
  "fieldMappings": [
    {
      "sourceFieldName": "PartitionKey",
      "targetFieldName": "ClientCode",
      "mappingFunction": null
    },
    {
      "sourceFieldName": "RowKey",
      "targetFieldName": "UniqueIdentifier",
      "mappingFunction": null
    }
  ],
  "disabled": null
}
Note the RowKey-to-UniqueIdentifier field mapping. That is how Azure Search matches the records from table storage to the documents in the index.
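As a rough illustration of what the two field mappings do to a table record, here is a small Python model that simply renames the mapped fields. It is a simplification written for this post (the function name `apply_field_mappings` and the sample entity are hypothetical) and it ignores mapping functions, which are null in this indexer anyway.

```python
from typing import Dict, List


def apply_field_mappings(source: Dict[str, object],
                         mappings: List[Dict[str, object]]) -> Dict[str, object]:
    """Rename source fields to their target names; unmapped fields keep their name."""
    renames = {m["sourceFieldName"]: m["targetFieldName"] for m in mappings}
    return {renames.get(name, name): value for name, value in source.items()}


# The same mappings as in the metadata indexer definition.
mappings = [
    {"sourceFieldName": "PartitionKey", "targetFieldName": "ClientCode", "mappingFunction": None},
    {"sourceFieldName": "RowKey", "targetFieldName": "UniqueIdentifier", "mappingFunction": None},
]

# Hypothetical table record.
entity = {"PartitionKey": "C42", "RowKey": "doc-001", "Year": 2017}
print(apply_field_mappings(entity, mappings))
```

The record comes out keyed by UniqueIdentifier, which is the value the matching blob carries in its metadata.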