Asked by: 小点点

Glue Catalog with Delta tables connected to the Databricks SQL engine


I am trying to query Delta tables from the AWS Glue Catalog on the Databricks SQL engine. They are stored in Delta Lake format, and I have Glue crawlers automating the schema. The catalog is set up, but queries fail with:

Incompatible format detected.

A transaction log for Databricks Delta was found at `s3://COMPANY/club/attachment/_delta_log`,
but you are trying to read from `s3://COMPANY/club/attachment` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.

To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
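
As the error message suggests, the data itself is readable when Spark is told to use the Delta format. A quick sanity check, querying the S3 path directly and bypassing the Glue metadata (the path is taken from the error above):

-- Read the Delta table by path rather than through the Hive/Glue catalog
SELECT * FROM delta.`s3://COMPANY/club/attachment` LIMIT 10;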

SQL warehouse settings =

spark.databricks.hive.metastore.glueCatalog.enabled : true

The crawler, configured with the DELTA LAKE setting in AWS, produces the following table metadata:

{
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {
                    "name": "id",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "media",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "media_type",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "title",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "type",
                    "type": "smallint",
                    "comment": ""
                },
                {
                    "name": "clubmessage_id",
                    "type": "string",
                    "comment": ""
                }
            ]
        },
        "location": "s3://COMPANY/club/attachment/_symlink_format_manifest",
        "inputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "compressed": "false",
        "numBuckets": "-1",
        "SerDeInfo": {
            "name": "",
            "serializationLib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            "parameters": {}
        },
        "bucketCols": [],
        "sortCols": [],
        "parameters": {
            "UPDATED_BY_CRAWLER": "CRAWLER_NAME",
            "CrawlerSchemaSerializerVersion": "1.0",
            "CrawlerSchemaDeserializerVersion": "1.0",
            "classification": "parquet"
        },
        "SkewedInfo": {},
        "storedAsSubDirectories": "false"
    },
    "parameters": {
        "UPDATED_BY_CRAWLER": "CRAWLER_NAME",
        "CrawlerSchemaSerializerVersion": "1.0",
        "CrawlerSchemaDeserializerVersion": "1.0",
        "classification": "parquet"
    }
}
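
For reference, you can check how the warehouse interprets this metadata from the SQL side. A small sketch, assuming the crawler registered the table under the hypothetical names my_db.my_table:

-- Lists the columns plus detailed table information such as Location, InputFormat and SerDe
DESCRIBE TABLE EXTENDED my_db.my_table;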

2 answers

Anonymous user

I'm facing the same problem. It seems you can't query Delta tables from Glue using Spark SQL, because the setting

spark.databricks.hive.metastore.glueCatalog.enabled : true

means the table will be treated as a Hive table. You would need to access the tables in S3 directly, losing the advantages of the metadata catalog.

You can, however, read from it by blocking the cluster's access to the _delta_log folder with the IAM policy below. With _delta_log hidden, Spark never detects the transaction log, so the format check from the error above is not triggered and the table is read as an ordinary Hive/Parquet table:

{ "Sid": "BlockDeltaLog", "Effect": "Deny", "Action": "s3:*", "Resource": [ "arn:aws:s3:::BUCKET" ], "Condition": { "StringLike": { "s3:prefix": [ "_delta_log/" ] } } } 

Anonymous user

After updating the location, I was able to query the Delta tables created by the Glue crawler. In your case it needs to change from s3://COMPANY/club/attachment/_symlink_format_manifest to s3://COMPANY/club/attachment

That's because Delta on Spark doesn't look at _symlink_format_manifest the way Hive and Presto do. It only needs to know the root directory.

The command to update the location in Databricks looks like this:

ALTER table my_db.my_table
SET LOCATION "s3://COMPANY/club/attachment"

Note: for this command to work, the database location must also be set.
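
A minimal sketch of satisfying that requirement, assuming my_db and the s3://COMPANY/club prefix are the illustrative names used above:

-- Create the database with an explicit S3 location before altering the table location
CREATE DATABASE IF NOT EXISTS my_db
LOCATION 's3://COMPANY/club';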