I am trying to query Delta tables from the AWS Glue Catalog on the Databricks SQL engine. They are stored in Delta Lake format, and I have Glue crawlers automating the schemas. The catalog is set up, but when I query a table I get this error:
Incompatible format detected.
A transaction log for Databricks Delta was found at `s3://COMPANY/club/attachment/_delta_log`,
but you are trying to read from `s3://COMPANY/club/attachment` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
SQL warehouse settings:
spark.databricks.hive.metastore.glueCatalog.enabled : true
The crawler, configured with the Delta Lake settings in AWS, generates the following table metadata:
{
  "StorageDescriptor": {
    "cols": {
      "FieldSchema": [
        { "name": "id", "type": "string", "comment": "" },
        { "name": "media", "type": "string", "comment": "" },
        { "name": "media_type", "type": "string", "comment": "" },
        { "name": "title", "type": "string", "comment": "" },
        { "name": "type", "type": "smallint", "comment": "" },
        { "name": "clubmessage_id", "type": "string", "comment": "" }
      ]
    },
    "location": "s3://COMPANY/club/attachment/_symlink_format_manifest",
    "inputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
    "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "compressed": "false",
    "numBuckets": "-1",
    "SerDeInfo": {
      "name": "",
      "serializationLib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
      "parameters": {}
    },
    "bucketCols": [],
    "sortCols": [],
    "parameters": {
      "UPDATED_BY_CRAWLER": "CRAWLER_NAME",
      "CrawlerSchemaSerializerVersion": "1.0",
      "CrawlerSchemaDeserializerVersion": "1.0",
      "classification": "parquet"
    },
    "SkewedInfo": {},
    "storedAsSubDirectories": "false"
  },
  "parameters": {
    "UPDATED_BY_CRAWLER": "CRAWLER_NAME",
    "CrawlerSchemaSerializerVersion": "1.0",
    "CrawlerSchemaDeserializerVersion": "1.0",
    "classification": "parquet"
  }
}
I am facing the same problem. It seems you cannot query Delta tables in Glue using Spark SQL, because the setting
spark.databricks.hive.metastore.glueCatalog.enabled : true
means the table will be treated as a Hive table. You would have to access the tables in S3 directly, losing the advantages of the metadata catalog.
However, you can still read from it by blocking the cluster's access to the _delta_log folder with the following IAM policy:
{
  "Sid": "BlockDeltaLog",
  "Effect": "Deny",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::BUCKET"
  ],
  "Condition": {
    "StringLike": {
      "s3:prefix": [
        "_delta_log/"
      ]
    }
  }
}
After updating the location, I was able to query the Delta table created by the Glue crawler. In your case it needs to be changed from s3://COMPANY/club/attachment/_symlink_format_manifest to s3://COMPANY/club/attachment
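To check what location the metastore currently records before changing it, you can inspect the table (the name my_db.my_table is a placeholder for your own database and table):

```sql
-- The "Location" row should show the /_symlink_format_manifest path before the fix
DESCRIBE TABLE EXTENDED my_db.my_table;
```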
This is because Delta on Spark, unlike Hive and Presto, does not look at _symlink_format_manifest. It only needs to know the root directory.
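Because only the root directory matters, you can also sanity-check that the data is readable by querying the Delta path directly, bypassing the metastore entry entirely (assuming the warehouse has S3 access to the bucket; COMPANY is a placeholder):

```sql
-- Spark SQL can read a Delta table straight from its root path
SELECT * FROM delta.`s3://COMPANY/club/attachment` LIMIT 10;
```

If this query works but the catalog table still fails, the problem is the registered location, not the data.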
The command to update the location in Databricks looks like this:
ALTER table my_db.my_table
SET LOCATION "s3://COMPANY/club/attachment"
Note: for this command to work, the database location must also be set.
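A minimal sketch of setting the database location, assuming my_db should point at the parent prefix (adjust the path to your own layout):

```sql
-- ALTER TABLE ... SET LOCATION may fail until the database itself has a location
ALTER DATABASE my_db SET LOCATION "s3://COMPANY/club";
```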