###@Spark分區器(Partitioner) ####HashPartitioner(預設的分區器) HashPartitioner分區原理是對於給定的key,計算其hashCode,並除以分區的個數取餘,如果餘數小於0,則餘數+分區的個數,最後返回的值就是這個key所屬的分區ID,當key為 ...
@Spark分區器(Partitioner)
HashPartitioner(預設的分區器)
HashPartitioner分區原理是對於給定的key,計算其hashCode,並除以分區的個數取餘,如果餘數小於0,則餘數+分區的個數,最後返回的值就是這個key所屬的分區ID,當key為null值是返回0。
源碼在org.apache.spark包下:
origin code:
class HashPartitioner(partitions: Int) extends Partitioner {
require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
def numPartitions: Int = partitions
// 根據鍵的值來判斷在哪一個分區
def getPartition(key: Any): Int = key match {
case null => 0 // 鍵為null始終在0分區
case _ => Utils.nonNegativeMod(key.hashCode, numPartitions) // 鍵不為0,根據鍵的hashCode值和分區數進行計算
}
override def equals(other: Any): Boolean = other match {
case h: HashPartitioner =>
h.numPartitions == numPartitions
case _ =>
false
}
…………
}
// 底層實質:取模運算
def nonNegativeMod(x: Int, mod: Int): Int = {
val rawMod = x % mod
rawMod + (if (rawMod < 0) mod else 0)
}
RangePartitioner
HashPartitioner分區的實現可能會導致數據傾斜,極端情況下會導致某些分區擁有RDD的所有數據。而RangePartitioner分區器則儘量保證各個分區數據均勻,而且分區和分區之間是有序的,也就是說令一個分區中的元素均比另一個分區中的元素小或者大;但是分區內的元素是不能保證順序的。簡單地說就是將一定範圍內的數據映射到一個分區內。
sortByKey底層使用的數據分區器就是RangePartitioner分區器,該分區器的實現方式主要通過兩個步驟實現:
①先從整個RDD中抽取樣本數據,將樣本數據排序,計算出每個分區的最大key值,形成一個Array[key]類型的數組變數rangeBounds;
②判斷key在rangeBounds中所處的範圍,給出該key值在下一個RDD中的分區id下標。該分區器要求RDD中的key類型必須是可排序的。
origin code:
class RangePartitioner[K : Ordering : ClassTag, V](
partitions: Int,
rdd: RDD[_ <: Product2[K, V]],
private var ascending: Boolean = true,
val samplePointsPerPartitionHint: Int = 20)
extends Partitioner {
// A constructor declared in order to maintain backward compatibility for Java, when we add the
// 4th constructor parameter samplePointsPerPartitionHint. See SPARK-22160.
// This is added to make sure from a bytecode point of view, there is still a 3-arg ctor.
def this(partitions: Int, rdd: RDD[_ <: Product2[K, V]], ascending: Boolean) = {
this(partitions, rdd, ascending, samplePointsPerPartitionHint = 20)
}
// We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")
require(samplePointsPerPartitionHint > 0,
s"Sample points per partition must be greater than 0 but found $samplePointsPerPartitionHint")
// 獲取RDD中key類型數據的排序器
private var ordering = implicitly[Ordering[K]]
// An array of upper bounds for the first (partitions - 1) partitions
private var rangeBounds: Array[K] = {
if (partitions <= 1) {
// 如果給定的分區數是一個的情況下,直接返回一個空的集合,表示數據不進行分區
Array.empty
} else {
// This is the sample size we need to have roughly balanced output partitions, capped at 1M.
// Cast to double to avoid overflowing ints or longs
// 給定總的數據抽樣大小,最多1M的數據量(10^6),最少20倍的RDD分區數量,也就是每個RDD分區至少抽取20條數據
val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)
// Assume the input partitions are roughly balanced and over-sample a little bit.
// 計算每個分區抽樣的數據量大小,假設輸入數據每個分區分佈的比較均勻
// 對於超大數據集(分區數量超過5萬的)乘以3會讓數據稍微增大一點,對於分區數低於5萬的數據集,每個分區抽取數據量為60條也不算多
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
// 從RDD中抽取數據,返回值:(總RDD數據量,Array[分區id, 當前分區的數據量, 當前分區抽取的數據])
val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
if (numItems == 0L) {
// 如果總的數據量為0(RDD為空),那麼直接返回一個空的數組
Array.empty
} else {
// If a partition contains much more than the average number of items, we re-sample from it
// to ensure that enough items are collected from that partition.
// 計算總樣本數量和總記錄數的占比,占比最大為1.0
val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
// 保存樣本數據的集合buffer
val candidates = ArrayBuffer.empty[(K, Float)]
// 保存數據分佈不均衡的分區id(數據量超過fraction比率的分區)
val imbalancedPartitions = mutable.Set.empty[Int]
// 計算抽取出來的樣本數據
sketched.foreach { case (idx, n, sample) =>
if (fraction * n > sampleSizePerPartition) {
// 如果fraction乘以當前分區中的數據量大於之前計算的每個分區的抽樣數據大小,那麼表示當前分區抽取的數據太少了,該分區數據分佈不均衡,需要重新抽取
imbalancedPartitions += idx
} else {
// 當前分區不屬於數據分佈不均衡的分區,計算占比權重,並添加到candidates集合中
// The weight is 1 over the sampling probability.
val weight = (n.toDouble / sample.length).toFloat
for (key <- sample) {
candidates += ((key, weight))
}
}
}
// 對數據分佈不均衡的RDD分區,重新進行數據抽樣
if (imbalancedPartitions.nonEmpty) {
// Re-sample imbalanced partitions with the desired sampling probability.
// 獲取數據分佈不均衡的RDD分區,並構成RDD
val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
// 隨機種子
val seed = byteswap32(-rdd.id - 1)
// 利用RDD的sample抽樣函數API進行數據抽樣
val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
val weight = (1.0 / fraction).toFloat
candidates ++= reSampled.map(x => (x, weight))
}
// 將最終的抽樣數據計算出rangeBounds
RangePartitioner.determineBounds(candidates, math.min(partitions, candidates.size))
}
}
}
// 下一個RDD的分區數量是rangeBounds數組中元素數量+1個
def numPartitions: Int = rangeBounds.length + 1
// 二分查找器,內部使用Java中的Arrays提供的二分查找方法
private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]
// 根據RDD的key值返回對應的分區id,從0開始
def getPartition(key: Any): Int = {
// 強制轉換key類型為RDD中原本的數據類型
val k = key.asInstanceOf[K]
var partition = 0
if (rangeBounds.length <= 128) {
// If we have less than 128 partitions naive search
// 如果分區數據小於等於128個,那麼直接本地迴圈尋找當前k所屬的分區下標
while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
partition += 1
}
} else {
// Determine which binary search method to use only once.
// 如果分區數量大於128個,那麼使用二分查找方法尋找對應k所屬的下標
// 但是如果k在rangeBounds中沒有出現,實質上返回的是一個負數(範圍)或者是一個超過rangeBounds大小的數(最後一個分區,比所有的數據都大)
partition = binarySearch(rangeBounds, k)
// binarySearch either returns the match location or -[insertion point]-1
if (partition < 0) {
partition = -partition-1
}
if (partition > rangeBounds.length) {
partition = rangeBounds.length
}
}
// 根據數據排序是升序還是降序進行數據的排列,預設為升序
if (ascending) {
partition
} else {
rangeBounds.length - partition
}
}
override def equals(other: Any): Boolean = other match {
case r: RangePartitioner[_, _] =>
r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
case _ =>
false
}
override def hashCode(): Int = {
val prime = 31
var result = 1
var i = 0
while (i < rangeBounds.length) {
result = prime * result + rangeBounds(i).hashCode
i += 1
}
result = prime * result + ascending.hashCode
result
}
@throws(classOf[IOException])
private def writeObject(out: ObjectOutputStream): Unit = Utils.tryOrIOException {
val sfactory = SparkEnv.get.serializer
sfactory match {
case js: JavaSerializer => out.defaultWriteObject()
case _ =>
out.writeBoolean(ascending)
out.writeObject(ordering)
out.writeObject(binarySearch)
val ser = sfactory.newInstance()
Utils.serializeViaNestedStream(out, ser) { stream =>
stream.writeObject(scala.reflect.classTag[Array[K]])
stream.writeObject(rangeBounds)
}
}
}
@throws(classOf[IOException])
private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
val sfactory = SparkEnv.get.serializer
sfactory match {
case js: JavaSerializer => in.defaultReadObject()
case _ =>
ascending = in.readBoolean()
ordering = in.readObject().asInstanceOf[Ordering[K]]
binarySearch = in.readObject().asInstanceOf[(Array[K], K) => Int]
val ser = sfactory.newInstance()
Utils.deserializeViaNestedStream(in, ser) { ds =>
implicit val classTag = ds.readObject[ClassTag[Array[K]]]()
rangeBounds = ds.readObject[Array[K]]()
}
}
}
}
將一定範圍內的數映射到某一個分區內,在實現中,分界(rangeBounds)演算法用到了水塘抽樣演算法。RangePartitioner的重點在於構建rangeBounds數組對象,主要步驟是:
- 如果分區數量小於2或者RDD中不存在數據的情況下,直接返回一個空的數組,不需要計算range的邊界;如果分區數量大於1的情況下,而且RDD中有數據的情況下,才需要計算數組對象
- 計算總體的數據抽樣大小sampleSize,計算規則是:至少每個分區抽取20個數據或者最多1M的數據量
- 根據sampleSize和分區數量計算每個分區的數據抽樣樣本數量sampleSizePartition
- 調用RangePartitioner的sketch函數進行數據抽樣,計算出每個分區的樣本
- 計算樣本的整體占比以及數據量過多的數據分區,防止數據傾斜
- 對於數據量比較多的RDD分區調用RDD的sample函數API重新進行數據獲取
- 將最終的樣本數據通過RangePartitioner的determineBounds函數進行數據排序分配,計算出rangeBounds
RangePartitioner的sketch函數的作用是對RDD中的數據按照需要的樣本數據量進行數據抽取,主要調用SamplingUtils類的reservoirSampleAndCount方法對每個分區進行數據抽取,抽取後計算出整體所有分區的數據量大小;reserviorSampleAndCount方法的抽取方式是先從迭代器中獲取樣本數量個數據(順序獲取),然後對剩餘的數據進行判斷,替換之前的樣本數據,最終達到數據抽樣的效果。RangePartitioner的determineBounds函數的作用是根據樣本數據記憶權重大小確定數據邊界。
RangePartitioner的determineBounds函數的作用是根據樣本數據記憶權重大小確定數據邊界,源代碼如下:
origin code:
/**
* Determines the bounds for range partitioning from candidates with weights indicating how many
* items each represents. Usually this is 1 over the probability used to sample this candidate.
*
* @param candidates unordered candidates with weights
* @param partitions number of partitions
* @return selected bounds
*/
def determineBounds[K : Ordering : ClassTag](
candidates: ArrayBuffer[(K, Float)],
partitions: Int): Array[K] = {
val ordering = implicitly[Ordering[K]]
// 按照數據進行排序,預設升序排序
val ordered = candidates.sortBy(_._1)
// 獲取總的樣本數據大小
val numCandidates = ordered.size
// 計算總的權重大小
val sumWeights = ordered.map(_._2.toDouble).sum
// 計算步長
val step = sumWeights / partitions
var cumWeight = 0.0
var target = step
val bounds = ArrayBuffer.empty[K]
var i = 0
var j = 0
var previousBound = Option.empty[K]
while ((i < numCandidates) && (j < partitions - 1)) {
// 獲取排序後的第i個數據及權重
val (key, weight) = ordered(i)
// 累計權重
cumWeight += weight
if (cumWeight >= target) {
// Skip duplicate values.
// 權重已經達到一個步長的範圍,計算出一個分區id的值
if (previousBound.isEmpty || ordering.gt(key, previousBound.get)) {// 上一個邊界值為空,或者當前邊界值key數據大於上一個邊界的值,那麼當前key有效,進行計算
// 添加當前key到邊界集合中
bounds += key
// 累計target步長界限
target += step
// 分區數量加1
j += 1
// 上一個邊界的值重置為當前邊界的值
previousBound = Some(key)
}
}
i += 1
}
// 返回結果
bounds.toArray
}
自定義分區器
自定義分區器是需要繼承org.apache.spark.Partitioner類並實現以下三個方法:
- numPartitioner: Int:返回創建出來的分區數
- getPartition(key: Any): Int:返回給定鍵的分區編號(0到numPartitions - 1)
- equals():Java判斷相等性的標準方法。這個方法的實現非常重要,Spark需要用這個方法來檢查你的分區器是否和其他分區器實例相同,這樣Spark才可以判斷兩個RDD的分區方式是否相同
e.g.1
// CustomPartitioner
import org.apache.spark.Partitioner
/**
* @param numPartition 分區數量
*/
class CustomPartitioner(numPartition: Int) extends Partitioner{
// 返回分區的總數
override def numPartitions: Int = numPartition
// 根據傳入的 key 返回分區的索引
override def getPartition(key: Any): Int = {
key.toString.toInt % numPartition
}
}
// CustomPartitionerDemo
import com.work.util.SparkUtil
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
object CustomPartitionerDemo {
def main(args: Array[String]): Unit = {
val sc: SparkContext = SparkUtil.getSparkContext()
println("=================== 原始數據 =====================")
// zipWithIndex 該函數將 RDD 中的元素和這個元素在 RDD 中的 ID(索引號)組合成鍵值對
val data: RDD[(Int, Long)] = sc.parallelize(0 to 10, 1).zipWithIndex()
println(data.collect().toBuffer)
println("=================== 分區和數據組合成 Map =====================")
val func: (Int, Iterator[(Int, Long)]) => Iterator[String] = (index: Int, iter: Iterator[(Int, Long)]) => {
iter.map(x => "[partID:" + index + ", value:" + x + "]")
}
val array: Array[String] = data.mapPartitionsWithIndex(func).collect()
for (i <- array) {
println(i)
}
println("=================== 自定義5個分區和數據組合成 Map =====================")
val rdd1: RDD[(Int, Long)] = data.partitionBy(new CustomPartitioner(5))
val array1: Array[String] = rdd1.mapPartitionsWithIndex(func).collect()
for (i <- array1) {
println(i)
}
}
}
e.g.2
// SubjectPartitioner
import org.apache.spark.Partitioner
import scala.collection.mutable
/**
*
* @param subjects 學科數組
*/
class SubjectPartitioner(subjects: Array[String]) extends Partitioner {
// 創建一個 map 集合用來存儲到分區號和學科
val subject: mutable.HashMap[String, Int] = new mutable.HashMap[String, Int]()
// 定義一個計數器,用來生成自定義分區號
var i = 0
for (s <- subjects) {
// 存儲學科和分區
subject += (s -> i)
// 分區自增
i += 1
}
// 獲取分區數
override def numPartitions: Int = subjects.size
// 獲取分區號(如果傳入 key 不存在,預設將數據存儲到 0 分區)
override def getPartition(key: Any): Int = subject.getOrElse(key.toString, 0)
}
// SubjectPartitionerDemo
import java.net.URL
import com.work.util.SparkUtil
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
object SubjectPartitionerDemo {
def main(args: Array[String]): Unit = {
// 獲取上下文對象
val sc: SparkContext = SparkUtil.getSparkContext()
val tuples: RDD[(String, Int)] = sc.textFile("src/main/data/project.txt").map(line => {
val fields: Array[String] = line.split("\t")
for (i <- fields) {
println(i)
}
// 取出 url
val url: String = fields(1)
(url, 1)
})
// 將相同的 url 進行聚合,得到了各個學科的訪問量
val sumed: RDD[(String, Int)] = tuples.reduceByKey(_ + _).cache()
// 從 url 中取出學科的欄位,數據組成:學科,url,統計數量
val subjectAndUC: RDD[(String, (String, Int))] = sumed.map(tup => {
// 用戶 url
val url: String = tup._1
// 統計的訪問量
val count: Int = tup._2
// 學科
val subject: String = new URL(url).getHost
(subject, (url, count))
})
// 將所有學科取出來
val subjects: Array[String] = subjectAndUC.keys.distinct.collect
// 創建自定義分區器對象
val partitioner: SubjectPartitioner = new SubjectPartitioner(subjects)
// 分區
val partitioned: RDD[(String, (String, Int))] = subjectAndUC.partitionBy(partitioner)
// 取 top3
val result: RDD[(String, (String, Int))] = partitioned.mapPartitions(it => {
val list: List[(String, (String, Int))] = it.toList
val sorted: List[(String, (String, Int))] = list.sortBy(_._2._2).reverse
val top3: List[(String, (String, Int))] = sorted.take(3)
// 因為方法的返回值需要一個 iterator
top3.iterator
})
// 存儲數據
result.saveAsTextFile("src/main/data/out/")
// 釋放資源
sc.stop()
}
}