Hive-QL Query Tuning

Posted by RussXia on April 26, 2019

EXPLAIN

Official wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

The output of the explain command consists of two parts: STAGE DEPENDENCIES and STAGE PLANS. The ABSTRACT SYNTAX TREE section was removed in 2.1.0, with plans to bring it back as a separate `explain ast` command in 4.0.0. (Related issue: https://issues.apache.org/jira/browse/HIVE-15932)

The following walks through the explain output, using `explain select sum(number) from onecol where number>=4` as the example.
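For reference, here is a minimal setup that reproduces this plan. The table name and the `number INT` column come from the EXPLAIN output below; the sample row values are an assumption.

```sql
-- Single-column table matching the plan output (default.onecol, column `number` of type INT)
CREATE TABLE onecol (number INT);

-- A few sample rows (assumed; any small set of integers works)
INSERT INTO onecol VALUES (3), (4), (5);

-- Show the execution plan without actually running the query
EXPLAIN SELECT SUM(number) FROM onecol WHERE number >= 4;
```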

Stage dependencies: STAGE DEPENDENCIES

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

Two stages are shown here: Stage-1 is the root stage, and Stage-0 depends on Stage-1, so it runs after Stage-1 completes.

Stage execution plans: STAGE PLANS

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: onecol
            Statistics: Num rows: 3 Data size: 3 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (number >= 4) (type: boolean)
              Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(number)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

In this example there is one map-reduce stage (Stage-1), plus a Fetch Operator through which the client retrieves the result.

Some common operators:

  • TableScan: reads the table; common attribute: alias
  • Select Operator: column projection
  • Group By Operator: grouping and aggregation; common attributes: aggregations, mode. When there is no keys attribute, all rows fall into a single group.
  • Reduce Output Operator: emits results to the reduce phase; common attribute: sort order
  • Fetch Operator: client-side result fetch; common attribute: limit

A map-reduce stage has two parts:

  • Map Operator Tree: the mapping from table aliases to map-side operator trees
  • Reduce Operator Tree: processes all rows on the reducer side of the map-reduce job. In this example, Stage-1's Reduce Operator Tree performs the sum aggregation.

EXPLAIN EXTENDED

Adding the extended keyword to explain produces more information, such as file path details. Below is partial output of `explain extended select sum(number) from onecol where number>=4`.

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: onecol
            Statistics: Num rows: 3 Data size: 3 Basic stats: COMPLETE Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: (number >= 4) (type: boolean)
              Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(number)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  null sort order:
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  tag: -1
                  value expressions: _col0 (type: bigint)
                  auto parallelism: false
      Path -> Alias:
        hdfs://localhost:9000/Users/ruzzzz/devtools/apache-hive-2.3.4-bin/warehouse/onecol [onecol]
      Path -> Partition:
        hdfs://localhost:9000/Users/ruzzzz/devtools/apache-hive-2.3.4-bin/warehouse/onecol
          Partition
            base file name: onecol
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
              bucket_count -1
              column.name.delimiter ,
              columns number
              columns.comments
              columns.types int
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://localhost:9000/Users/ruzzzz/devtools/apache-hive-2.3.4-bin/warehouse/onecol
              name default.onecol
              numFiles 3
              numRows 3
              rawDataSize 3
              serialization.ddl struct onecol { i32 number}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 6
              transient_lastDdlTime 1556243893
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
                bucket_count -1
                column.name.delimiter ,
                columns number
                columns.comments
                columns.types int
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://localhost:9000/Users/ruzzzz/devtools/apache-hive-2.3.4-bin/warehouse/onecol
                name default.onecol
                numFiles 3
                numRows 3
                rawDataSize 3
                serialization.ddl struct onecol { i32 number}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                totalSize 6
                transient_lastDdlTime 1556243893
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.onecol
            name: default.onecol
      Truncated Path -> Alias:
        /onecol [onecol]
      Needs Tagging: false
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            GlobalTableId: 0
            directory: hdfs://localhost:9000/tmp/hive/ruzzzz/a1b20a04-24ad-4ea9-8285-3f438e8dd133/hive_2019-04-26_11-17-27_657_3718859854493336655-1/-mr-10001/.hive-staging_hive_2019-04-26_11-17-27_657_3718859854493336655-1/-ext-10002
            NumFilesPerFileSink: 1
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            Stats Publishing Key Prefix: hdfs://localhost:9000/tmp/hive/ruzzzz/a1b20a04-24ad-4ea9-8285-3f438e8dd133/hive_2019-04-26_11-17-27_657_3718859854493336655-1/-mr-10001/.hive-staging_hive_2019-04-26_11-17-27_657_3718859854493336655-1/-ext-10002/
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                properties:
                  columns _col0
                  columns.types bigint
                  escape.delim \
                  hive.serialization.extend.additional.nesting.levels true
                  serialization.escape.crlf true
                  serialization.format 1
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            TotalFiles: 1
            GatherStats: false
            MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

LIMIT

In many cases a LIMIT query still has to execute the entire query and only then return a subset of the results. This is usually wasteful and should be avoided where possible. Hive has a property that, when enabled, lets a LIMIT query sample the source data instead:

<property>
    <name>hive.limit.optimize.enable</name>
    <value>true</value>
</property>

Once hive.limit.optimize.enable is set to true, two related parameters can also be tuned:

set hive.limit.row.max.size=100000;  -- minimum number of rows guaranteed when LIMIT samples a data subset; default 100000
set hive.limit.optimize.limit.file=10;  -- maximum number of files a simple LIMIT query may sample; default 10
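With the optimization enabled, a simple LIMIT query such as the sketch below may read only a sample of the input files instead of scanning the whole table. The table name reuses `onecol` from the earlier example; which rows come back is not deterministic.

```sql
-- Enable input sampling for simple LIMIT queries (session-level form
-- of the hive-site.xml property above)
SET hive.limit.optimize.enable=true;

-- May now scan at most hive.limit.optimize.limit.file files of the input
SELECT number FROM onecol LIMIT 2;
```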

The drawback of this feature is that useful input rows may never be processed. Any query that requires a reduce step (joins, GROUP BY, and most calls to aggregate functions) can therefore return very different results when sampling is on.

Join optimization

Map join

When one side of a join is a small table, the small table can be loaded into memory and the join performed during the map phase over the larger table. Set hive.auto.convert.join to true and Hive will apply this optimization when appropriate. (Since Hive 0.11.0 it defaults to true.)
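A sketch of both ways to trigger a map join; `big_table` and `small_table` are hypothetical names used only for illustration.

```sql
-- Automatic conversion (default since Hive 0.11.0): tables smaller than
-- hive.mapjoin.smalltable.filesize bytes are loaded into map-side memory
SET hive.auto.convert.join=true;

SELECT b.id, s.name
FROM big_table b
JOIN small_table s ON b.id = s.id;

-- Older explicit form: the MAPJOIN hint forces small_table into memory
SELECT /*+ MAPJOIN(s) */ b.id, s.name
FROM big_table b
JOIN small_table s ON b.id = s.id;
```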

todo

Avoid COUNT DISTINCT

Because a COUNT DISTINCT is carried out by a single reduce task, when the data volume is large that one reducer has far too much data to process and the whole job struggles to finish. COUNT DISTINCT is usually replaced by a GROUP BY followed by a COUNT.
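The rewrite looks like the sketch below; the `logs` table and `user_id` column are hypothetical.

```sql
-- Single-reducer version: every row funnels through one reduce task
SELECT COUNT(DISTINCT user_id) FROM logs;

-- Rewrite: GROUP BY deduplicates in parallel across many reducers,
-- then the outer query counts the already-distinct keys
SELECT COUNT(*)
FROM (SELECT user_id FROM logs GROUP BY user_id) t;
```

The outer COUNT(*) still runs in a single reducer, but it only sees the distinct keys, which is typically a tiny fraction of the original input.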