An Introduction to the SMMU on Linux
In computer system architecture, analogous to the traditional MMU that manages CPU accesses to memory, the IOMMU (Input Output Memory Management Unit) translates the addresses of DMA requests coming from system I/O devices before passing them on to the system interconnect, and manages and restricts the memory access transactions of those devices. The IOMMU maps device-visible virtual addresses (IOVAs) to physical memory addresses. Different hardware architectures have different IOMMU implementations; on the ARM platform the IOMMU is the SMMU (System Memory Management Unit).
The SMMU provides translation services only for memory access transactions coming from system I/O devices, not for transactions going to system I/O devices. Transactions from the system or CPU to a system I/O device are managed by other means, such as the MMU. The figure below shows the SMMU's role in the system.
A memory access transaction from a system I/O device means the device reading or writing memory; a transaction to a system I/O device usually means the CPU accessing registers or memory inside the device that are mapped into the physical memory address space. For a more detailed introduction to the SMMU, see the IOMMU and Arm SMMU introduction and the SMMU Software Guide. For a detailed description of the SMMU's registers, data structures, and behavior, see the Arm System Memory Management Unit Architecture Specification version 3. For specific implementations, refer to the corresponding documents, such as the Arm CoreLink MMU-600 System Memory Management Unit Technical Reference Manual for the MMU-600 and the Arm® CoreLink™ MMU-700 System Memory Management Unit Technical Reference Manual for the MMU-700.
The SMMU distinguishes different system I/O devices by StreamID and related identifiers; when a system I/O device accesses memory through the SMMU, it must carry the StreamID and related information to the SMMU. From the perspective of a system I/O device, a more detailed structure of a system containing an SMMU looks like the figure below:
A system I/O device accesses memory via DMA. After a DMA request is issued, before it reaches the SMMU and the system interconnect it first passes through a device called the DAA (other implementations may use a different device). The DAA performs a first address translation and then sends the memory access request information, including the configured StreamID, into the SMMU for further processing.
On a Linux system, enabling the SMMU for a system I/O device generally involves the following steps:
- Initialization of the SMMU driver. This mainly includes reading the SMMU device node from the dts file, probing the SMMU's hardware features, initializing global resources and data structures such as the command queue, event queue, interrupts, and stream table, and registering the SMMU device with the Linux kernel's IOMMU subsystem.
- Binding of the device to the IOMMU during the process in which the system I/O device is probed, discovered, and bound to its driver for initialization. For devices that access memory through DMA, this is generally done by calling of_dma_configure()/of_dma_configure_id(). This process reads the IOMMU-related fields of the device node definition in the device-tree dts file, for example in arch/arm64/boot/dts/renesas/r8a77961.dtsi:
iommus = <&ipmmu_vc0 19>;
- IOMMU-related configuration in the system I/O device driver. This part usually varies with the specific hardware system. It mainly includes calling dma_coerce_mask_and_coherent()/dma_set_mask_and_coherent() to set the DMA mask and the coherent DMA mask to the same value, and configuring devices such as the DAA mentioned earlier.
- Memory allocation by the system I/O device driver. The driver allocates memory through interfaces such as dma_alloc_coherent(). Besides allocating the memory, this also creates the translation tables through the SMMU driver's operation functions and completes the setup of the SMMU CD and related data structures. Different subsystems in the Linux kernel call different methods to allocate DMA memory, but all of them eventually call dma_alloc_coherent(); only memory allocated this way goes through the SMMU when accessed via DMA.
- Accessing the allocated memory. The address of memory allocated with dma_alloc_coherent() can be handed to the DMA configuration logic of the system I/O device; subsequent DMA accesses to memory by the device then go through the SMMU for address translation.
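As a rough sketch of the last two steps, a platform driver might set its DMA masks and allocate a coherent buffer as below. This is illustrative only, not code from the SMMU driver; the function and size are made up, and when the device sits behind an SMMU, dma_handle is an IOVA that the SMMU translates on each DMA access.

```c
#include <linux/dma-mapping.h>

/* Illustrative sketch: obtain a DMA-able buffer for a device. */
static int example_setup_dma(struct device *dev)
{
	dma_addr_t dma_handle;
	void *cpu_addr;

	/* Declare the device's DMA addressing capability first. */
	if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(48)))
		return -EIO;

	cpu_addr = dma_alloc_coherent(dev, SZ_4K, &dma_handle, GFP_KERNEL);
	if (!cpu_addr)
		return -ENOMEM;

	/* Program dma_handle (the IOVA) into the device's DMA registers,
	 * then start the transfer; the SMMU translates dma_handle to the
	 * buffer's physical address on each device access. */
	return 0;
}
```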
The SMMU performs address translation with the help of several data structures. These mainly include the stream table and its stream table entries (STEs), the context descriptor table and its entries (CDs), and the translation tables and their entries. An STE stores the context information of a stream; each STE is 64 bytes. A CD stores all the settings related to stage 1 translation; each CD is 64 bytes. The translation tables describe the mapping between virtual addresses and physical memory addresses. The stream table can be organized as either a linear stream table or a 2-level stream table. The structure of a linear stream table is shown below:
An example 2-level stream table structure is shown below:
The context descriptor table can take three forms: a single CD, a single-level CD table, or a 2-level CD table. An example structure with a single CD is shown below:
An example single-level CD table structure is shown below:
An example 2-level CD table structure is shown below:
When translating an address, the SMMU locates the stream table from its stream table base register and uses the StreamID to find the STE in the stream table. Then, based on the STE's configuration and the SubstreamID/PASID, it finds the context descriptor table and the corresponding CD. From the information in the CD it finds the translation table, through which the final address translation is completed.
The sources of the Linux kernel's IOMMU subsystem live under drivers/iommu, and the ARM SMMU driver implementation under drivers/iommu/arm/arm-smmu-v3. In the kernel's SMMU driver, the data structures used for address translation are created in the different steps mentioned above:
- The stream table is created during SMMU driver initialization. If it is a linear stream table, every STE in it is configured to bypass the SMMU, i.e. the corresponding stream undergoes no SMMU address translation; if it is a 2-level stream table, the table is filled with invalid L1 stream table descriptors.
- The context descriptor table is created when the device is bound to the IOMMU during device discovery, probing, and driver-binding initialization. With a 2-level stream table, this first creates the second-level stream table, whose STEs are all configured to bypass the SMMU. Creating the context descriptor table likewise distinguishes whether a 2-level context descriptor table is needed. Once the context descriptor table has been created, its address is written into the STE.
- The translation tables are created while the system I/O device driver allocates memory. During this process the SMMU driver's callbacks are invoked to put the translation table address into the CD.
SMMU Data Structures in the Linux Kernel
The Linux kernel's IOMMU subsystem represents an IOMMU hardware instance with the struct iommu_device structure, and describes the operations and capabilities that the instance supports with the struct iommu_ops structure. These two structures are defined (in include/linux/iommu.h) as follows:
/**
* struct iommu_ops - iommu ops and capabilities
* @capable: check capability
* @domain_alloc: allocate iommu domain
* @domain_free: free iommu domain
* @attach_dev: attach device to an iommu domain
* @detach_dev: detach device from an iommu domain
* @map: map a physically contiguous memory region to an iommu domain
* @unmap: unmap a physically contiguous memory region from an iommu domain
* @flush_iotlb_all: Synchronously flush all hardware TLBs for this domain
* @iotlb_sync_map: Sync mappings created recently using @map to the hardware
* @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
* queue
* @iova_to_phys: translate iova to physical address
* @probe_device: Add device to iommu driver handling
* @release_device: Remove device from iommu driver handling
* @probe_finalize: Do final setup work after the device is added to an IOMMU
* group and attached to the groups domain
* @device_group: find iommu group for a particular device
* @domain_get_attr: Query domain attributes
* @domain_set_attr: Change domain attributes
* @support_dirty_log: Check whether domain supports dirty log tracking
* @switch_dirty_log: Perform actions to start|stop dirty log tracking
* @sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap
* @clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap
* @get_resv_regions: Request list of reserved regions for a device
* @put_resv_regions: Free list of reserved regions for a device
* @apply_resv_region: Temporary helper call-back for iova reserved ranges
* @domain_window_enable: Configure and enable a particular window for a domain
* @domain_window_disable: Disable a particular window for a domain
* @of_xlate: add OF master IDs to iommu grouping
* @is_attach_deferred: Check if domain attach should be deferred from iommu
* driver init to device driver init (default no)
* @dev_has/enable/disable_feat: per device entries to check/enable/disable
* iommu specific features.
* @dev_feat_enabled: check enabled feature
* @aux_attach/detach_dev: aux-domain specific attach/detach entries.
* @aux_get_pasid: get the pasid given an aux-domain
* @sva_bind: Bind process address space to device
* @sva_unbind: Unbind process address space from device
* @sva_get_pasid: Get PASID associated to a SVA handle
* @page_response: handle page request response
* @cache_invalidate: invalidate translation caches
* @sva_bind_gpasid: bind guest pasid and mm
* @sva_unbind_gpasid: unbind guest pasid and mm
* @def_domain_type: device default domain type, return value:
* - IOMMU_DOMAIN_IDENTITY: must use an identity domain
* - IOMMU_DOMAIN_DMA: must use a dma domain
* - 0: use the default setting
* @attach_pasid_table: attach a pasid table
* @detach_pasid_table: detach the pasid table
* @pgsize_bitmap: bitmap of all possible supported page sizes
* @owner: Driver module providing these ops
*/
struct iommu_ops {
bool (*capable)(enum iommu_cap);
/* Domain allocation and freeing by the iommu driver */
struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
void (*domain_free)(struct iommu_domain *);
int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
int (*map)(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
size_t size, struct iommu_iotlb_gather *iotlb_gather);
void (*flush_iotlb_all)(struct iommu_domain *domain);
void (*iotlb_sync_map)(struct iommu_domain *domain, unsigned long iova,
size_t size);
void (*iotlb_sync)(struct iommu_domain *domain,
struct iommu_iotlb_gather *iotlb_gather);
phys_addr_t (*iova_to_phys)(struct iommu_domain *domain, dma_addr_t iova);
struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
void (*probe_finalize)(struct device *dev);
struct iommu_group *(*device_group)(struct device *dev);
int (*domain_get_attr)(struct iommu_domain *domain,
enum iommu_attr attr, void *data);
int (*domain_set_attr)(struct iommu_domain *domain,
enum iommu_attr attr, void *data);
/*
* Track dirty log. Note: Don't concurrently call these interfaces with
* other ops that access underlying page table.
*/
bool (*support_dirty_log)(struct iommu_domain *domain);
int (*switch_dirty_log)(struct iommu_domain *domain, bool enable,
unsigned long iova, size_t size, int prot);
int (*sync_dirty_log)(struct iommu_domain *domain,
unsigned long iova, size_t size,
unsigned long *bitmap, unsigned long base_iova,
unsigned long bitmap_pgshift);
int (*clear_dirty_log)(struct iommu_domain *domain,
unsigned long iova, size_t size,
unsigned long *bitmap, unsigned long base_iova,
unsigned long bitmap_pgshift);
/* Request/Free a list of reserved regions for a device */
void (*get_resv_regions)(struct device *dev, struct list_head *list);
void (*put_resv_regions)(struct device *dev, struct list_head *list);
void (*apply_resv_region)(struct device *dev,
struct iommu_domain *domain,
struct iommu_resv_region *region);
/* Window handling functions */
int (*domain_window_enable)(struct iommu_domain *domain, u32 wnd_nr,
phys_addr_t paddr, u64 size, int prot);
void (*domain_window_disable)(struct iommu_domain *domain, u32 wnd_nr);
int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
/* Per device IOMMU features */
bool (*dev_has_feat)(struct device *dev, enum iommu_dev_features f);
bool (*dev_feat_enabled)(struct device *dev, enum iommu_dev_features f);
int (*dev_enable_feat)(struct device *dev, enum iommu_dev_features f);
int (*dev_disable_feat)(struct device *dev, enum iommu_dev_features f);
/* Aux-domain specific attach/detach entries */
int (*aux_attach_dev)(struct iommu_domain *domain, struct device *dev);
void (*aux_detach_dev)(struct iommu_domain *domain, struct device *dev);
int (*aux_get_pasid)(struct iommu_domain *domain, struct device *dev);
struct iommu_sva *(*sva_bind)(struct device *dev, struct mm_struct *mm,
void *drvdata);
void (*sva_unbind)(struct iommu_sva *handle);
u32 (*sva_get_pasid)(struct iommu_sva *handle);
int (*page_response)(struct device *dev,
struct iommu_fault_event *evt,
struct iommu_page_response *msg);
int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
struct iommu_cache_invalidate_info *inv_info);
int (*sva_bind_gpasid)(struct iommu_domain *domain,
struct device *dev, struct iommu_gpasid_bind_data *data);
int (*sva_unbind_gpasid)(struct device *dev, u32 pasid);
int (*attach_pasid_table)(struct iommu_domain *domain,
struct iommu_pasid_table_config *cfg);
void (*detach_pasid_table)(struct iommu_domain *domain);
int (*def_domain_type)(struct device *dev);
int (*dev_get_config)(struct device *dev, int type, void *data);
int (*dev_set_config)(struct device *dev, int type, void *data);
unsigned long pgsize_bitmap;
struct module *owner;
};
/**
* struct iommu_device - IOMMU core representation of one IOMMU hardware
* instance
* @list: Used by the iommu-core to keep a list of registered iommus
* @ops: iommu-ops for talking to this iommu
* @dev: struct device for sysfs handling
*/
struct iommu_device {
struct list_head list;
const struct iommu_ops *ops;
struct fwnode_handle *fwnode;
struct device *dev;
};
The SMMU driver creates instances of the struct iommu_device and struct iommu_ops structures and registers them with the IOMMU subsystem.
The Linux kernel's IOMMU subsystem represents a system I/O device connected to an IOMMU with the struct dev_iommu structure, and describes the IOMMU device that a system I/O device is connected to with struct iommu_fwspec. These structures are defined (in include/linux/iommu.h) as follows:
struct fwnode_handle {
struct fwnode_handle *secondary;
const struct fwnode_operations *ops;
struct device *dev;
};
. . . . . .
/**
* struct dev_iommu - Collection of per-device IOMMU data
*
* @fault_param: IOMMU detected device fault reporting data
* @iopf_param: I/O Page Fault queue and data
* @fwspec: IOMMU fwspec data
* @iommu_dev: IOMMU device this device is linked to
* @priv: IOMMU Driver private data
*
* TODO: migrate other per device data pointers under iommu_dev_data, e.g.
* struct iommu_group *iommu_group;
*/
struct dev_iommu {
struct mutex lock;
struct iommu_fault_param *fault_param;
struct iopf_device_param *iopf_param;
struct iommu_fwspec *fwspec;
struct iommu_device *iommu_dev;
void *priv;
};
. . . . . .
/**
* struct iommu_fwspec - per-device IOMMU instance data
* @ops: ops for this device's IOMMU
* @iommu_fwnode: firmware handle for this device's IOMMU
* @iommu_priv: IOMMU driver private data for this device
* @num_ids: number of associated device IDs
* @ids: IDs which this device may present to the IOMMU
*/
struct iommu_fwspec {
const struct iommu_ops *ops;
struct fwnode_handle *iommu_fwnode;
u32 flags;
unsigned int num_ids;
u32 ids[];
};
In the IOMMU, each domain represents one IOMMU-mapped address space, that is, one page table. A group is logically bound to a domain, meaning all devices in a group live in the same domain. In the Linux kernel's IOMMU subsystem, a domain is represented by the struct iommu_domain structure, defined (in include/linux/iommu.h) as follows:
struct iommu_domain {
unsigned type;
const struct iommu_ops *ops;
unsigned long pgsize_bitmap; /* Bitmap of page sizes in use */
iommu_fault_handler_t handler;
void *handler_token;
struct iommu_domain_geometry geometry;
void *iova_cookie;
struct mutex switch_log_lock;
};
The Linux kernel's IOMMU subsystem represents the group of devices that share a domain with the struct iommu_group structure, and an individual device within a group with the struct group_device structure. These two structures are defined (in drivers/iommu/iommu.c) as follows:
struct iommu_group {
struct kobject kobj;
struct kobject *devices_kobj;
struct list_head devices;
struct mutex mutex;
struct blocking_notifier_head notifier;
void *iommu_data;
void (*iommu_data_release)(void *iommu_data);
char *name;
int id;
struct iommu_domain *default_domain;
struct iommu_domain *domain;
struct list_head entry;
};
struct group_device {
struct list_head list;
struct device *dev;
char *name;
};
Viewed through an object-oriented lens, the ARM SMMUv3 driver can be seen as providing specialized implementations of the struct iommu_device and struct iommu_domain structures: struct arm_smmu_device and struct arm_smmu_domain inherit from struct iommu_device and struct iommu_domain respectively. These two structures are defined (in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h) as follows:
/* An SMMUv3 instance */
struct arm_smmu_device {
struct device *dev;
void __iomem *base;
void __iomem *page1;
#define ARM_SMMU_FEAT_2_LVL_STRTAB (1 << 0)
#define ARM_SMMU_FEAT_2_LVL_CDTAB (1 << 1)
#define ARM_SMMU_FEAT_TT_LE (1 << 2)
#define ARM_SMMU_FEAT_TT_BE (1 << 3)
#define ARM_SMMU_FEAT_PRI (1 << 4)
#define ARM_SMMU_FEAT_ATS (1 << 5)
#define ARM_SMMU_FEAT_SEV (1 << 6)
#define ARM_SMMU_FEAT_MSI (1 << 7)
#define ARM_SMMU_FEAT_COHERENCY (1 << 8)
#define ARM_SMMU_FEAT_TRANS_S1 (1 << 9)
#define ARM_SMMU_FEAT_TRANS_S2 (1 << 10)
#define ARM_SMMU_FEAT_STALLS (1 << 11)
#define ARM_SMMU_FEAT_HYP (1 << 12)
#define ARM_SMMU_FEAT_STALL_FORCE (1 << 13)
#define ARM_SMMU_FEAT_VAX (1 << 14)
#define ARM_SMMU_FEAT_RANGE_INV (1 << 15)
#define ARM_SMMU_FEAT_BTM (1 << 16)
#define ARM_SMMU_FEAT_SVA (1 << 17)
#define ARM_SMMU_FEAT_E2H (1 << 18)
#define ARM_SMMU_FEAT_HA (1 << 19)
#define ARM_SMMU_FEAT_HD (1 << 20)
#define ARM_SMMU_FEAT_BBML1 (1 << 21)
#define ARM_SMMU_FEAT_BBML2 (1 << 22)
#define ARM_SMMU_FEAT_ECMDQ (1 << 23)
#define ARM_SMMU_FEAT_MPAM (1 << 24)
u32 features;
#define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
#define ARM_SMMU_OPT_PAGE0_REGS_ONLY (1 << 1)
#define ARM_SMMU_OPT_MSIPOLL (1 << 2)
u32 options;
union {
u32 nr_ecmdq;
u32 ecmdq_enabled;
};
struct arm_smmu_ecmdq *__percpu *ecmdq;
struct arm_smmu_cmdq cmdq;
struct arm_smmu_evtq evtq;
struct arm_smmu_priq priq;
int gerr_irq;
int combined_irq;
unsigned long ias; /* IPA */
unsigned long oas; /* PA */
unsigned long pgsize_bitmap;
#define ARM_SMMU_MAX_ASIDS (1 << 16)
unsigned int asid_bits;
#define ARM_SMMU_MAX_VMIDS (1 << 16)
unsigned int vmid_bits;
DECLARE_BITMAP(vmid_map, ARM_SMMU_MAX_VMIDS);
unsigned int ssid_bits;
unsigned int sid_bits;
struct arm_smmu_strtab_cfg strtab_cfg;
/* IOMMU core code handle */
struct iommu_device iommu;
struct rb_root streams;
struct mutex streams_mutex;
unsigned int mpam_partid_max;
unsigned int mpam_pmg_max;
bool bypass;
};
. . . . . .
struct arm_smmu_domain {
struct arm_smmu_device *smmu;
struct mutex init_mutex; /* Protects smmu pointer */
struct io_pgtable_ops *pgtbl_ops;
bool stall_enabled;
bool non_strict;
atomic_t nr_ats_masters;
enum arm_smmu_domain_stage stage;
union {
struct arm_smmu_s1_cfg s1_cfg;
struct arm_smmu_s2_cfg s2_cfg;
};
struct iommu_domain domain;
/* Unused in aux domains */
struct list_head devices;
spinlock_t devices_lock;
struct list_head mmu_notifiers;
/* Auxiliary domain stuff */
struct arm_smmu_domain *parent;
ioasid_t ssid;
unsigned long aux_nr_devs;
};
The ARM SMMUv3 driver describes the SMMU-private data of a system I/O device connected to the SMMU with the struct arm_smmu_master structure, defined (in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h) as follows:
struct arm_smmu_stream {
u32 id;
struct arm_smmu_master *master;
struct rb_node node;
};
/* SMMU private data for each master */
struct arm_smmu_master {
struct arm_smmu_device *smmu;
struct device *dev;
struct arm_smmu_domain *domain;
struct list_head domain_head;
struct arm_smmu_stream *streams;
unsigned int num_streams;
bool ats_enabled;
bool stall_enabled;
bool pri_supported;
bool prg_resp_needs_ssid;
bool sva_enabled;
bool iopf_enabled;
bool auxd_enabled;
struct list_head bonds;
unsigned int ssid_bits;
};
Viewed the same way, the struct arm_smmu_master structure can be thought of as inheriting from the struct dev_iommu structure.
The SMMU data structures in the Linux kernel relate to each other roughly as follows:
Nearly all of the data structures above contain a pointer to a struct device object, and struct device in turn contains pointers to several key IOMMU objects. The struct device object is the mediator between the various parts; the subsystems involved mostly find the operations or data they need through it. The IOMMU-related fields of struct device are mainly the following:
struct device {
#ifdef CONFIG_DMA_OPS
const struct dma_map_ops *dma_ops;
#endif
. . . . . .
#ifdef CONFIG_DMA_DECLARE_COHERENT
struct dma_coherent_mem *dma_mem; /* internal for coherent mem
override */
#endif
. . . . . .
struct iommu_group *iommu_group;
struct dev_iommu *iommu;
. . . . . .
#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
bool dma_coherent:1;
#endif
#ifdef CONFIG_DMA_OPS_BYPASS
bool dma_ops_bypass : 1;
#endif
};
Besides these IOMMU subsystem data structures, the lower-level SMMU driver implementation defines many hardware-specific data structures, such as:
- command queue entry struct arm_smmu_cmdq_ent
- command queue struct arm_smmu_cmdq
- extended command queue struct arm_smmu_ecmdq
- event queue struct arm_smmu_evtq
- PRI queue struct arm_smmu_priq
- L1 stream table descriptor of a 2-level stream table struct arm_smmu_strtab_l1_desc
- context descriptor struct arm_smmu_ctx_desc
- L1 table descriptor of a 2-level context descriptor table struct arm_smmu_l1_ctx_desc
- context descriptor configuration struct arm_smmu_ctx_desc_cfg
- stage 1 translation configuration struct arm_smmu_s1_cfg
- stage 2 translation configuration struct arm_smmu_s2_cfg
- stream table configuration struct arm_smmu_strtab_cfg
These hardware-specific data structures correspond almost strictly one-to-one to the data structures described in ARM's official hardware documentation, the SMMU Software Guide and the Arm System Memory Management Unit Architecture Specification version 3.
SMMU-related operations and processes, and accesses to the SMMU, are implemented on top of the data structures above. The layered structure of this implementation looks roughly like the figure below:
The discovery, probing, and driver-binding initialization of system I/O devices, as well as the device drivers themselves, usually call interfaces provided by the platform device subsystem and the DMA subsystem, such as the platform device subsystem's of_dma_configure()/of_dma_configure_id() and the DMA subsystem's dma_alloc_coherent(); these functions are implemented with the help of lower-level modules.
Initialization of the SMMUv3 Device Driver
Early in Linux kernel boot, IOMMU initialization runs. This mainly executes the iommu_init() function, which creates and adds the iommu_groups kset. The function is defined (in drivers/iommu/iommu.c) as follows:
static int __init iommu_init(void)
{
iommu_group_kset = kset_create_and_add("iommu_groups",
NULL, kernel_kobj);
BUG_ON(!iommu_group_kset);
iommu_debugfs_setup();
return 0;
}
core_initcall(iommu_init);
At boot, the Linux kernel accepts command-line parameters that configure the IOMMU, including iommu.passthrough to configure the default domain type, iommu.strict to configure DMA setup, and iommu.prq_timeout to configure the timeout for the page response of a pending page request. Early in boot the IOMMU subsystem is initialized; if the IOMMU was not configured through the kernel command line, the default domain type is set. The relevant code (in drivers/iommu/iommu.c) is as follows:
static unsigned int iommu_def_domain_type __read_mostly;
static bool iommu_dma_strict __read_mostly;
static u32 iommu_cmd_line __read_mostly;
/*
* Timeout to wait for page response of a pending page request. This is
* intended as a basic safety net in case a pending page request is not
* responded for an exceptionally long time. Device may also implement
* its own protection mechanism against this exception.
* Units are in jiffies with a range between 1 - 100 seconds equivalent.
* Default to 10 seconds.
* Setting 0 means no timeout tracking.
*/
#define IOMMU_PAGE_RESPONSE_MAX_TIMEOUT (HZ * 100)
#define IOMMU_PAGE_RESPONSE_DEF_TIMEOUT (HZ * 10)
static unsigned long prq_timeout = IOMMU_PAGE_RESPONSE_DEF_TIMEOUT;
. . . . . .
#define IOMMU_CMD_LINE_DMA_API BIT(0)
static void iommu_set_cmd_line_dma_api(void)
{
iommu_cmd_line |= IOMMU_CMD_LINE_DMA_API;
}
static bool iommu_cmd_line_dma_api(void)
{
return !!(iommu_cmd_line & IOMMU_CMD_LINE_DMA_API);
}
. . . . . .
/*
* Use a function instead of an array here because the domain-type is a
* bit-field, so an array would waste memory.
*/
static const char *iommu_domain_type_str(unsigned int t)
{
switch (t) {
case IOMMU_DOMAIN_BLOCKED:
return "Blocked";
case IOMMU_DOMAIN_IDENTITY:
return "Passthrough";
case IOMMU_DOMAIN_UNMANAGED:
return "Unmanaged";
case IOMMU_DOMAIN_DMA:
return "Translated";
default:
return "Unknown";
}
}
static int __init iommu_subsys_init(void)
{
bool cmd_line = iommu_cmd_line_dma_api();
if (!cmd_line) {
if (IS_ENABLED(CONFIG_IOMMU_DEFAULT_PASSTHROUGH))
iommu_set_default_passthrough(false);
else
iommu_set_default_translated(false);
if (iommu_default_passthrough() && mem_encrypt_active()) {
pr_info("Memory encryption detected - Disabling default IOMMU Passthrough\n");
iommu_set_default_translated(false);
}
}
pr_info("Default domain type: %s %s\n",
iommu_domain_type_str(iommu_def_domain_type),
cmd_line ? "(set via kernel command line)" : "");
return 0;
}
subsys_initcall(iommu_subsys_init);
. . . . . .
static int __init iommu_set_def_domain_type(char *str)
{
bool pt;
int ret;
ret = kstrtobool(str, &pt);
if (ret)
return ret;
if (pt)
iommu_set_default_passthrough(true);
else
iommu_set_default_translated(true);
return 0;
}
early_param("iommu.passthrough", iommu_set_def_domain_type);
static int __init iommu_dma_setup(char *str)
{
return kstrtobool(str, &iommu_dma_strict);
}
early_param("iommu.strict", iommu_dma_setup);
static int __init iommu_set_prq_timeout(char *str)
{
int ret;
unsigned long timeout;
if (!str)
return -EINVAL;
ret = kstrtoul(str, 10, &timeout);
if (ret)
return ret;
timeout = timeout * HZ;
if (timeout > IOMMU_PAGE_RESPONSE_MAX_TIMEOUT)
return -EINVAL;
prq_timeout = timeout;
return 0;
}
early_param("iommu.prq_timeout", iommu_set_prq_timeout);
. . . . . .
void iommu_set_default_passthrough(bool cmd_line)
{
if (cmd_line)
iommu_set_cmd_line_dma_api();
iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY;
}
void iommu_set_default_translated(bool cmd_line)
{
if (cmd_line)
iommu_set_cmd_line_dma_api();
iommu_def_domain_type = IOMMU_DOMAIN_DMA;
}
Functions registered with core_initcall run earlier than those registered with subsys_initcall.
After the IOMMU subsystem has been initialized, it is the SMMU device driver's turn. SMMUv3 is itself a platform device; its hardware information, including the register mapping address range, interrupt numbers, and other resources it uses, is described in the device-tree dts/dtsi files. An example SMMUv3 device node in a device tree (from arch/arm64/boot/dts/arm/fvp-base-revc.dts) looks like this:
smmu: iommu@2b400000 {
compatible = "arm,smmu-v3";
reg = <0x0 0x2b400000 0x0 0x100000>;
interrupts = <GIC_SPI 74 IRQ_TYPE_EDGE_RISING>,
<GIC_SPI 79 IRQ_TYPE_EDGE_RISING>,
<GIC_SPI 75 IRQ_TYPE_EDGE_RISING>,
<GIC_SPI 77 IRQ_TYPE_EDGE_RISING>;
interrupt-names = "eventq", "gerror", "priq", "cmdq-sync";
dma-coherent;
#iommu-cells = <1>;
msi-parent = <&its 0x10000>;
};
The entry point for loading the SMMUv3 device driver is the arm_smmu_device_probe() function, defined (in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c) as follows:
static struct arm_smmu_option_prop arm_smmu_options[] = {
{ ARM_SMMU_OPT_SKIP_PREFETCH, "hisilicon,broken-prefetch-cmd" },
{ ARM_SMMU_OPT_PAGE0_REGS_ONLY, "cavium,cn9900-broken-page1-regspace"},
{ 0, NULL},
};
static void parse_driver_options(struct arm_smmu_device *smmu)
{
int i = 0;
do {
if (of_property_read_bool(smmu->dev->of_node,
arm_smmu_options[i].prop)) {
smmu->options |= arm_smmu_options[i].opt;
dev_notice(smmu->dev, "option %s\n",
arm_smmu_options[i].prop);
}
} while (arm_smmu_options[++i].opt);
}
. . . . . .
static int arm_smmu_device_dt_probe(struct platform_device *pdev,
struct arm_smmu_device *smmu)
{
struct device *dev = &pdev->dev;
u32 cells;
int ret = -EINVAL;
if (of_property_read_u32(dev->of_node, "#iommu-cells", &cells))
dev_err(dev, "missing #iommu-cells property\n");
else if (cells != 1)
dev_err(dev, "invalid #iommu-cells value (%d)\n", cells);
else
ret = 0;
parse_driver_options(smmu);
if (of_dma_is_coherent(dev->of_node))
smmu->features |= ARM_SMMU_FEAT_COHERENCY;
return ret;
}
static unsigned long arm_smmu_resource_size(struct arm_smmu_device *smmu)
{
if (smmu->options & ARM_SMMU_OPT_PAGE0_REGS_ONLY)
return SZ_64K;
else
return SZ_128K;
}
. . . . . .
static void __iomem *arm_smmu_ioremap(struct device *dev, resource_size_t start,
resource_size_t size)
{
struct resource res = DEFINE_RES_MEM(start, size);
return devm_ioremap_resource(dev, &res);
}
. . . . . .
static int arm_smmu_device_probe(struct platform_device *pdev)
{
int irq, ret;
struct resource *res;
resource_size_t ioaddr;
struct arm_smmu_device *smmu;
struct device *dev = &pdev->dev;
smmu = devm_kzalloc(dev, sizeof(*smmu), GFP_KERNEL);
if (!smmu) {
dev_err(dev, "failed to allocate arm_smmu_device\n");
return -ENOMEM;
}
smmu->dev = dev;
if (dev->of_node) {
ret = arm_smmu_device_dt_probe(pdev, smmu);
} else {
ret = arm_smmu_device_acpi_probe(pdev, smmu);
if (ret == -ENODEV)
return ret;
}
/* Set bypass mode according to firmware probing result */
smmu->bypass = !!ret;
/* Base address */
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!res)
return -EINVAL;
if (resource_size(res) < arm_smmu_resource_size(smmu)) {
dev_err(dev, "MMIO region too small (%pr)\n", res);
return -EINVAL;
}
ioaddr = res->start;
/*
* Don't map the IMPLEMENTATION DEFINED regions, since they may contain
* the PMCG registers which are reserved by the PMU driver.
*/
smmu->base = arm_smmu_ioremap(dev, ioaddr, ARM_SMMU_REG_SZ);
if (IS_ERR(smmu->base))
return PTR_ERR(smmu->base);
if (arm_smmu_resource_size(smmu) > SZ_64K) {
smmu->page1 = arm_smmu_ioremap(dev, ioaddr + SZ_64K,
ARM_SMMU_REG_SZ);
if (IS_ERR(smmu->page1))
return PTR_ERR(smmu->page1);
} else {
smmu->page1 = smmu->base;
}
/* Interrupt lines */
irq = platform_get_irq_byname_optional(pdev, "combined");
if (irq > 0)
smmu->combined_irq = irq;
else {
irq = platform_get_irq_byname_optional(pdev, "eventq");
if (irq > 0)
smmu->evtq.q.irq = irq;
irq = platform_get_irq_byname_optional(pdev, "priq");
if (irq > 0)
smmu->priq.q.irq = irq;
irq = platform_get_irq_byname_optional(pdev, "gerror");
if (irq > 0)
smmu->gerr_irq = irq;
}
/* Probe the h/w */
ret = arm_smmu_device_hw_probe(smmu);
if (ret)
return ret;
/* Initialise in-memory data structures */
ret = arm_smmu_init_structures(smmu);
if (ret)
return ret;
/* Record our private device structure */
platform_set_drvdata(pdev, smmu);
/* Reset the device */
ret = arm_smmu_device_reset(smmu, false);
if (ret)
return ret;
/* And we're up. Go go go! */
ret = iommu_device_sysfs_add(&smmu->iommu, dev, NULL,
"smmu3.%pa", &ioaddr);
if (ret)
return ret;
iommu_device_set_ops(&smmu->iommu, &arm_smmu_ops);
iommu_device_set_fwnode(&smmu->iommu, dev->fwnode);
ret = iommu_device_register(&smmu->iommu);
if (ret) {
dev_err(dev, "Failed to register iommu\n");
return ret;
}
return arm_smmu_set_bus_ops(&arm_smmu_ops);
}
The arm_smmu_device_probe() function mainly does the following:
- Allocates the struct arm_smmu_device object, which describes the SMMUv3 device within the IOMMU subsystem.
- Obtains the information contained in the SMMUv3 device node of the device-tree dts/dtsi file and the resources it references, mainly:
  - information about the SMMUv3 device, such as #iommu-cells, whose value must be 1; options, such as whether only register page 0 exists; and whether the SMMU supports coherency, indicated mainly by the device node's dma-coherent property;
  - the register mapping of the SMMUv3 device; arm_smmu_device_probe() checks, according to the options, that the size of the register mapping range matches expectations, and remaps the SMMUv3 device's registers;
  - the interrupt resources the SMMUv3 device references, including those for the command queue, the event queue, and global errors.
- Probes the hardware features of the SMMUv3 device. Following the fields of the SMMU_IDR0, SMMU_IDR1, SMMU_IDR3, and SMMU_IDR5 registers defined in the Arm System Memory Management Unit Architecture Specification version 3, it determines the features the actual SMMUv3 hardware supports. (The other read-only information registers have little to do with the hardware's feature set: SMMU_IDR2 describes features implemented for the Non-secure programming interface, SMMU_IDR4 is an implementation-defined register, SMMU_IIDR contains information about the SMMU's implementation and implementer along with the implementation-defined supported architecture versions, and SMMU_AIDR contains the SMMU architecture version the implementation conforms to.) This is done mainly by calling arm_smmu_device_hw_probe().
- Initializes the in-memory data structures, mainly the queues and the stream table. The queues include the command queue, event queue, and PRI queue. Stream table initialization has two cases: for a linear stream table, every STE is configured to bypass the SMMU; for a 2-level stream table, the table is filled with invalid L1 stream table descriptors. This is done mainly by calling arm_smmu_init_structures().
- Records the struct arm_smmu_device object in the private field of the struct platform_device object.
- Resets the SMMUv3 device. This includes resetting the hardware through SMMU_CR0 and related registers and setting the stream table base registers, as well as setting up interrupts, i.e. requesting interrupts from the system and registering interrupt handlers. Initializing the data structures builds them in memory; resetting the SMMUv3 device writes their base addresses and the various configurations into the corresponding device registers. This is done mainly by calling arm_smmu_device_reset().
- Registers the SMMUv3 device with the IOMMU subsystem. This sets the struct iommu_ops and struct fwnode_handle for the struct iommu_device and registers the struct iommu_device object with the IOMMU subsystem; the struct fwnode_handle is used to match the SMMUv3 device with system I/O devices. This is done mainly by calling iommu_device_register().
- Sets the struct iommu_ops for the various bus types. The load order of the SMMUv3 driver and of the system I/O devices that use the IOMMU may be indeterminate; normally the SMMUv3 driver loads first and the I/O devices that use the IOMMU load later, but this step handles the case where such a device loads before the SMMUv3 driver. This is done mainly by calling arm_smmu_set_bus_ops().
Probing the SMMUv3 Hardware Features
The arm_smmu_device_probe() function calls arm_smmu_device_hw_probe() to probe the hardware features of the SMMUv3 device; the latter is defined (in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c) as follows:
static int arm_smmu_ecmdq_probe(struct arm_smmu_device *smmu)
{
int ret, cpu;
u32 i, nump, numq, gap;
u32 reg, shift_increment;
u64 addr, smmu_dma_base;
void __iomem *cp_regs, *cp_base;
/* IDR6 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR6);
smmu_reg_dump(smmu);
nump = 1 << FIELD_GET(IDR6_LOG2NUMP, reg);
numq = 1 << FIELD_GET(IDR6_LOG2NUMQ, reg);
smmu->nr_ecmdq = nump * numq;
gap = ECMDQ_CP_RRESET_SIZE >> FIELD_GET(IDR6_LOG2NUMQ, reg);
smmu_dma_base = (vmalloc_to_pfn(smmu->base) << PAGE_SHIFT);
cp_regs = ioremap(smmu_dma_base + ARM_SMMU_ECMDQ_CP_BASE, PAGE_SIZE);
if (!cp_regs)
return -ENOMEM;
for (i = 0; i < nump; i++) {
u64 val, pre_addr;
val = readq_relaxed(cp_regs + 32 * i);
if (!(val & ECMDQ_CP_PRESET)) {
iounmap(cp_regs);
dev_err(smmu->dev, "ecmdq control page %u is memory mode\n", i);
return -EFAULT;
}
if (i && ((val & ECMDQ_CP_ADDR) != (pre_addr + ECMDQ_CP_RRESET_SIZE))) {
iounmap(cp_regs);
dev_err(smmu->dev, "ecmdq_cp memory region is not contiguous\n");
return -EFAULT;
}
pre_addr = val & ECMDQ_CP_ADDR;
}
addr = readl_relaxed(cp_regs) & ECMDQ_CP_ADDR;
iounmap(cp_regs);
cp_base = devm_ioremap(smmu->dev, smmu_dma_base + addr, ECMDQ_CP_RRESET_SIZE * nump);
if (!cp_base)
return -ENOMEM;
smmu->ecmdq = devm_alloc_percpu(smmu->dev, struct arm_smmu_ecmdq *);
if (!smmu->ecmdq)
return -ENOMEM;
ret = arm_smmu_ecmdq_layout(smmu);
if (ret)
return ret;
shift_increment = order_base_2(num_possible_cpus() / smmu->nr_ecmdq);
addr = 0;
for_each_possible_cpu(cpu) {
struct arm_smmu_ecmdq *ecmdq;
struct arm_smmu_queue *q;
ecmdq = *per_cpu_ptr(smmu->ecmdq, cpu);
q = &ecmdq->cmdq.q;
/*
* The boot option "maxcpus=" can limit the number of online
* CPUs. The CPUs that are not selected are not showed in
* cpumask_of_node(node), their 'ecmdq' may be NULL.
*
* (q->ecmdq_prod & ECMDQ_PROD_EN) indicates that the ECMDQ is
* shared by multiple cores and has been initialized.
*/
if (!ecmdq || (q->ecmdq_prod & ECMDQ_PROD_EN))
continue;
ecmdq->base = cp_base + addr;
q->llq.max_n_shift = ECMDQ_MAX_SZ_SHIFT + shift_increment;
ret = arm_smmu_init_one_queue(smmu, q, ecmdq->base, ARM_SMMU_ECMDQ_PROD,
ARM_SMMU_ECMDQ_CONS, CMDQ_ENT_DWORDS, "ecmdq");
if (ret)
return ret;
q->ecmdq_prod = ECMDQ_PROD_EN;
rwlock_init(&q->ecmdq_lock);
ret = arm_smmu_ecmdq_init(&ecmdq->cmdq);
if (ret) {
dev_err(smmu->dev, "ecmdq[%d] init failed\n", i);
return ret;
}
addr += gap;
}
return 0;
}
static void arm_smmu_get_httu(struct arm_smmu_device *smmu, u32 reg)
{
u32 fw_features = smmu->features & (ARM_SMMU_FEAT_HA | ARM_SMMU_FEAT_HD);
u32 features = 0;
switch (FIELD_GET(IDR0_HTTU, reg)) {
case IDR0_HTTU_ACCESS_DIRTY:
features |= ARM_SMMU_FEAT_HD;
fallthrough;
case IDR0_HTTU_ACCESS:
features |= ARM_SMMU_FEAT_HA;
}
if (smmu->dev->of_node)
smmu->features |= features;
else if (features != fw_features)
/* ACPI IORT sets the HTTU bits */
dev_warn(smmu->dev,
"IDR0.HTTU overridden by FW configuration (0x%x)\n",
fw_features);
}
static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
{
u32 reg;
bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
bool vhe = cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN);
/* IDR0 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
/* 2-level structures */
if (FIELD_GET(IDR0_ST_LVL, reg) == IDR0_ST_LVL_2LVL)
smmu->features |= ARM_SMMU_FEAT_2_LVL_STRTAB;
if (reg & IDR0_CD2L)
smmu->features |= ARM_SMMU_FEAT_2_LVL_CDTAB;
/*
* Translation table endianness.
* We currently require the same endianness as the CPU, but this
* could be changed later by adding a new IO_PGTABLE_QUIRK.
*/
switch (FIELD_GET(IDR0_TTENDIAN, reg)) {
case IDR0_TTENDIAN_MIXED:
smmu->features |= ARM_SMMU_FEAT_TT_LE | ARM_SMMU_FEAT_TT_BE;
break;
#ifdef __BIG_ENDIAN
case IDR0_TTENDIAN_BE:
smmu->features |= ARM_SMMU_FEAT_TT_BE;
break;
#else
case IDR0_TTENDIAN_LE:
smmu->features |= ARM_SMMU_FEAT_TT_LE;
break;
#endif
default:
dev_err(smmu->dev, "unknown/unsupported TT endianness!\n");
return -ENXIO;
}
/* Boolean feature flags */
if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
smmu->features |= ARM_SMMU_FEAT_PRI;
if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
smmu->features |= ARM_SMMU_FEAT_ATS;
if (reg & IDR0_SEV)
smmu->features |= ARM_SMMU_FEAT_SEV;
if (reg & IDR0_MSI) {
smmu->features |= ARM_SMMU_FEAT_MSI;
if (coherent && !disable_msipolling)
smmu->options |= ARM_SMMU_OPT_MSIPOLL;
}
if (reg & IDR0_HYP) {
smmu->features |= ARM_SMMU_FEAT_HYP;
if (vhe)
smmu->features |= ARM_SMMU_FEAT_E2H;
}
arm_smmu_get_httu(smmu, reg);
/*
* If the CPU is using VHE, but the SMMU doesn't support it, the SMMU
* will create TLB entries for NH-EL1 world and will miss the
* broadcasted TLB invalidations that target EL2-E2H world. Don't enable
* BTM in that case.
*/
if (reg & IDR0_BTM && (!vhe || reg & IDR0_HYP))
smmu->features |= ARM_SMMU_FEAT_BTM;
/*
* The coherency feature as set by FW is used in preference to the ID
* register, but warn on mismatch.
*/
if (!!(reg & IDR0_COHACC) != coherent)
dev_warn(smmu->dev, "IDR0.COHACC overridden by FW configuration (%s)\n",
coherent ? "true" : "false");
switch (FIELD_GET(IDR0_STALL_MODEL, reg)) {
case IDR0_STALL_MODEL_FORCE:
smmu->features |= ARM_SMMU_FEAT_STALL_FORCE;
fallthrough;
case IDR0_STALL_MODEL_STALL:
smmu->features |= ARM_SMMU_FEAT_STALLS;
}
if (reg & IDR0_S1P)
smmu->features |= ARM_SMMU_FEAT_TRANS_S1;
if (reg & IDR0_S2P)
smmu->features |= ARM_SMMU_FEAT_TRANS_S2;
if (!(reg & (IDR0_S1P | IDR0_S2P))) {
dev_err(smmu->dev, "no translation support!\n");
return -ENXIO;
}
/* We only support the AArch64 table format at present */
switch (FIELD_GET(IDR0_TTF, reg)) {
case IDR0_TTF_AARCH32_64:
smmu->ias = 40;
fallthrough;
case IDR0_TTF_AARCH64:
break;
default:
dev_err(smmu->dev, "AArch64 table format not supported!\n");
return -ENXIO;
}
/* ASID/VMID sizes */
smmu->asid_bits = reg & IDR0_ASID16 ? 16 : 8;
smmu->vmid_bits = reg & IDR0_VMID16 ? 16 : 8;
/* IDR1 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR1);
if (reg & (IDR1_TABLES_PRESET | IDR1_QUEUES_PRESET | IDR1_REL)) {
dev_err(smmu->dev, "embedded implementation not supported\n");
return -ENXIO;
}
if (reg & IDR1_ECMDQ)
smmu->features |= ARM_SMMU_FEAT_ECMDQ;
/* Queue sizes, capped to ensure natural alignment */
smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT,
FIELD_GET(IDR1_CMDQS, reg));
if (smmu->cmdq.q.llq.max_n_shift <= ilog2(CMDQ_BATCH_ENTRIES)) {
/*
* We don't support splitting up batches, so one batch of
* commands plus an extra sync needs to fit inside the command
* queue. There's also no way we can handle the weird alignment
* restrictions on the base pointer for a unit-length queue.
*/
dev_err(smmu->dev, "command queue size <= %d entries not supported\n",
CMDQ_BATCH_ENTRIES);
return -ENXIO;
}
smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
FIELD_GET(IDR1_EVTQS, reg));
smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
FIELD_GET(IDR1_PRIQS, reg));
/* SID/SSID sizes */
smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
smmu->sid_bits = FIELD_GET(IDR1_SIDSIZE, reg);
/*
* If the SMMU supports fewer bits than would fill a single L2 stream
* table, use a linear table instead.
*/
if (smmu->sid_bits <= STRTAB_SPLIT)
smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB;
/* IDR3 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
switch (FIELD_GET(IDR3_BBML, reg)) {
case IDR3_BBML0:
break;
case IDR3_BBML1:
smmu->features |= ARM_SMMU_FEAT_BBML1;
break;
case IDR3_BBML2:
smmu->features |= ARM_SMMU_FEAT_BBML2;
break;
default:
dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
return -ENXIO;
}
if (FIELD_GET(IDR3_RIL, reg))
smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
if (reg & IDR3_MPAM) {
reg = readl_relaxed(smmu->base + ARM_SMMU_MPAMIDR);
smmu->mpam_partid_max = FIELD_GET(MPAMIDR_PARTID_MAX, reg);
smmu->mpam_pmg_max = FIELD_GET(MPAMIDR_PMG_MAX, reg);
if (smmu->mpam_partid_max || smmu->mpam_pmg_max)
smmu->features |= ARM_SMMU_FEAT_MPAM;
}
/* IDR5 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);
/* Maximum number of outstanding stalls */
smmu->evtq.max_stalls = FIELD_GET(IDR5_STALL_MAX, reg);
/* Page sizes */
if (reg & IDR5_GRAN64K)
smmu->pgsize_bitmap |= SZ_64K | SZ_512M;
if (reg & IDR5_GRAN16K)
smmu->pgsize_bitmap |= SZ_16K | SZ_32M;
if (reg & IDR5_GRAN4K)
smmu->pgsize_bitmap |= SZ_4K | SZ_2M | SZ_1G;
/* Input address size */
if (FIELD_GET(IDR5_VAX, reg) == IDR5_VAX_52_BIT)
smmu->features |= ARM_SMMU_FEAT_VAX;
/* Output address size */
switch (FIELD_GET(IDR5_OAS, reg)) {
case IDR5_OAS_32_BIT:
smmu->oas = 32;
break;
case IDR5_OAS_36_BIT:
smmu->oas = 36;
break;
case IDR5_OAS_40_BIT:
smmu->oas = 40;
break;
case IDR5_OAS_42_BIT:
smmu->oas = 42;
break;
case IDR5_OAS_44_BIT:
smmu->oas = 44;
break;
case IDR5_OAS_52_BIT:
smmu->oas = 52;
smmu->pgsize_bitmap |= 1ULL << 42; /* 4TB */
break;
default:
dev_info(smmu->dev,
"unknown output address size. Truncating to 48-bit\n");
fallthrough;
case IDR5_OAS_48_BIT:
smmu->oas = 48;
}
if (arm_smmu_ops.pgsize_bitmap == -1UL)
arm_smmu_ops.pgsize_bitmap = smmu->pgsize_bitmap;
else
arm_smmu_ops.pgsize_bitmap |= smmu->pgsize_bitmap;
/* Set the DMA mask for our table walker */
if (dma_set_mask_and_coherent(smmu->dev, DMA_BIT_MASK(smmu->oas)))
dev_warn(smmu->dev,
"failed to set DMA mask for table walker\n");
smmu->ias = max(smmu->ias, smmu->oas);
if (arm_smmu_sva_supported(smmu))
smmu->features |= ARM_SMMU_FEAT_SVA;
dev_info(smmu->dev, "ias %lu-bit, oas %lu-bit (features 0x%08x)\n",
smmu->ias, smmu->oas, smmu->features);
if (smmu->features & ARM_SMMU_FEAT_ECMDQ) {
int err;
err = arm_smmu_ecmdq_probe(smmu);
if (err) {
dev_err(smmu->dev, "suppress ecmdq feature, errno=%d\n", err);
smmu->ecmdq_enabled = 0;
}
}
return 0;
}
In the struct arm_smmu_device structure, the SMMUv3 driver describes the supported hardware features with a 32-bit value, using one bit per feature. The arm_smmu_device_hw_probe() function discovers these hardware features by reading the SMMU's ID registers.
From the SMMU_IDR0 register:
- whether a 2-level stream table is supported
- whether 2-level context descriptor (CD) tables are supported
- the supported translation table endianness
- whether PRI (Page Request Interface) is supported
- whether ATS (Address Translation Services) is supported
- whether SEV is supported
- whether MSI is supported
- whether HYP (hypervisor/EL2) support is present
- the HTTU (hardware translation table update) capability
- whether BTM (broadcast TLB maintenance) is supported
- whether COHACC (coherent access) is supported
- the stall model
- whether stage 1 translation is supported
- whether stage 2 translation is supported
- the IAS (input address size)
- the number of ASID bits
- the number of VMID bits
From the SMMU_IDR1 register (some fields, such as ATTR_TYPE_OVR and ATTR_PERMS_OVR, are ignored):
- whether the stream table base address and stream table configuration are preset (fixed)
- whether the command queue, event queue and PRI queue base addresses are preset
- when the bases are preset, whether the base registers hold absolute or relative addresses; the SMMUv3 driver requires that neither the stream table base and configuration nor the queue bases be preset
- whether the extended command queue (ECMDQ) is supported
- the sizes of the command queue, event queue and PRI queue
- the StreamID (SID) size
- the SubstreamID (SSID) size
From the SMMU_IDR3 register (some fields are ignored):
- the supported BBM (break-before-make) level
- whether RIL (range invalidation) is supported
- whether MPAM is supported; when it is, the MPAM registers are also read for further details
From the SMMU_IDR5 register:
- the maximum number of outstanding stalled transactions supported by the SMMU and the system
- the supported page sizes
- VAX (virtual address extension), i.e. the supported virtual address size
- the OAS (output address size)
In addition, arm_smmu_device_hw_probe() probes whether SVA is supported, and when the extended command queue was detected earlier, it also reads the ARM_SMMU_IDR6 register to discover the ECMDQ capabilities.
arm_smmu_device_hw_probe() interprets each register field as defined in the Arm System Memory Management Unit Architecture Specification, version 3.
Initializing the data structures
The data structures are initialized mainly by calling the arm_smmu_init_structures() function, which is defined (in the drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c file) as follows:
/* Stream table manipulation functions */
static void
arm_smmu_write_strtab_l1_desc(__le64 *dst, struct arm_smmu_strtab_l1_desc *desc)
{
u64 val = 0;
val |= FIELD_PREP(STRTAB_L1_DESC_SPAN, desc->span);
val |= desc->l2ptr_dma & STRTAB_L1_DESC_L2PTR_MASK;
/* See comment in arm_smmu_write_ctx_desc() */
WRITE_ONCE(*dst, cpu_to_le64(val));
}
static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
{
struct arm_smmu_cmdq_ent cmd = {
.opcode = CMDQ_OP_CFGI_STE,
.cfgi = {
.sid = sid,
.leaf = true,
},
};
arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
}
static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
__le64 *dst)
{
/*
* This is hideously complicated, but we only really care about
* three cases at the moment:
*
* 1. Invalid (all zero) -> bypass/fault (init)
* 2. Bypass/fault -> translation/bypass (attach)
* 3. Translation/bypass -> bypass/fault (detach)
*
* Given that we can't update the STE atomically and the SMMU
* doesn't read the thing in a defined order, that leaves us
* with the following maintenance requirements:
*
* 1. Update Config, return (init time STEs aren't live)
* 2. Write everything apart from dword 0, sync, write dword 0, sync
* 3. Update Config, sync
*/
u64 val = le64_to_cpu(dst[0]);
bool ste_live = false;
struct arm_smmu_device *smmu = NULL;
struct arm_smmu_s1_cfg *s1_cfg = NULL;
struct arm_smmu_s2_cfg *s2_cfg = NULL;
struct arm_smmu_domain *smmu_domain = NULL;
struct arm_smmu_cmdq_ent prefetch_cmd = {
.opcode = CMDQ_OP_PREFETCH_CFG,
.prefetch = {
.sid = sid,
},
};
if (master) {
smmu_domain = master->domain;
smmu = master->smmu;
}
if (smmu_domain) {
switch (smmu_domain->stage) {
case ARM_SMMU_DOMAIN_S1:
s1_cfg = &smmu_domain->s1_cfg;
break;
case ARM_SMMU_DOMAIN_S2:
case ARM_SMMU_DOMAIN_NESTED:
s2_cfg = &smmu_domain->s2_cfg;
break;
default:
break;
}
}
if (val & STRTAB_STE_0_V) {
switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
case STRTAB_STE_0_CFG_BYPASS:
break;
case STRTAB_STE_0_CFG_S1_TRANS:
case STRTAB_STE_0_CFG_S2_TRANS:
ste_live = true;
break;
case STRTAB_STE_0_CFG_ABORT:
BUG_ON(!disable_bypass);
break;
default:
BUG(); /* STE corruption */
}
}
/* Nuke the existing STE_0 value, as we're going to rewrite it */
val = STRTAB_STE_0_V;
/* Bypass/fault */
if (!smmu_domain || !(s1_cfg || s2_cfg)) {
if (!smmu_domain && disable_bypass)
val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
else
val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
dst[0] = cpu_to_le64(val);
dst[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
STRTAB_STE_1_SHCFG_INCOMING));
dst[2] = 0; /* Nuke the VMID */
/*
* The SMMU can perform negative caching, so we must sync
* the STE regardless of whether the old value was live.
*/
if (smmu)
arm_smmu_sync_ste_for_sid(smmu, sid);
return;
}
if (s1_cfg) {
u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
BUG_ON(ste_live);
dst[1] = cpu_to_le64(
FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) |
FIELD_PREP(STRTAB_STE_1_STRW, strw));
if (master->prg_resp_needs_ssid)
dst[1] |= cpu_to_le64(STRTAB_STE_1_PPAR);
if (smmu->features & ARM_SMMU_FEAT_STALLS &&
!master->stall_enabled)
dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
val |= (s1_cfg->cdcfg.cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
FIELD_PREP(STRTAB_STE_0_S1CDMAX, s1_cfg->s1cdmax) |
FIELD_PREP(STRTAB_STE_0_S1FMT, s1_cfg->s1fmt);
}
if (s2_cfg) {
BUG_ON(ste_live);
dst[2] = cpu_to_le64(
FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
#ifdef __BIG_ENDIAN
STRTAB_STE_2_S2ENDI |
#endif
STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
STRTAB_STE_2_S2R);
dst[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
}
if (master->ats_enabled)
dst[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
STRTAB_STE_1_EATS_TRANS));
pr_info("arm_smmu_write_strtab_ent[%d], val[0]=0x%llx, val[1]=0x%llx, val[2]=0x%llx, val[3]=0x%llx\n",
sid, val, dst[1], dst[2], dst[3]);
arm_smmu_sync_ste_for_sid(smmu, sid);
/* See comment in arm_smmu_write_ctx_desc() */
WRITE_ONCE(dst[0], cpu_to_le64(val));
arm_smmu_sync_ste_for_sid(smmu, sid);
/* It's likely that we'll want to use the new STE soon */
if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
}
static void arm_smmu_init_bypass_stes(__le64 *strtab, unsigned int nent)
{
unsigned int i;
for (i = 0; i < nent; ++i) {
arm_smmu_write_strtab_ent(NULL, -1, strtab);
strtab += STRTAB_STE_DWORDS;
}
}
. . . . . .
/* Probing and initialisation functions */
static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
struct arm_smmu_queue *q,
void __iomem *page,
unsigned long prod_off,
unsigned long cons_off,
size_t dwords, const char *name)
{
size_t qsz;
do {
qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
GFP_KERNEL);
if (q->base || qsz < PAGE_SIZE)
break;
q->llq.max_n_shift--;
} while (1);
if (!q->base) {
dev_err(smmu->dev,
"failed to allocate queue (0x%zx bytes) for %s\n",
qsz, name);
return -ENOMEM;
}
if (!WARN_ON(q->base_dma & (qsz - 1))) {
dev_info(smmu->dev, "allocated %u entries for %s\n",
1 << q->llq.max_n_shift, name);
}
q->prod_reg = page + prod_off;
q->cons_reg = page + cons_off;
q->ent_dwords = dwords;
q->q_base = Q_BASE_RWA;
q->q_base |= q->base_dma & Q_BASE_ADDR_MASK;
q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
q->llq.prod = q->llq.cons = 0;
return 0;
}
static void arm_smmu_cmdq_free_bitmap(void *data)
{
unsigned long *bitmap = data;
bitmap_free(bitmap);
}
static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
{
int ret = 0;
struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
atomic_long_t *bitmap;
cmdq->shared = 1;
atomic_set(&cmdq->owner_prod, 0);
atomic_set(&cmdq->lock, 0);
bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
if (!bitmap) {
dev_err(smmu->dev, "failed to allocate cmdq bitmap\n");
ret = -ENOMEM;
} else {
cmdq->valid_map = bitmap;
devm_add_action(smmu->dev, arm_smmu_cmdq_free_bitmap, bitmap);
}
return ret;
}
static int arm_smmu_ecmdq_init(struct arm_smmu_cmdq *cmdq)
{
unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
atomic_set(&cmdq->owner_prod, 0);
atomic_set(&cmdq->lock, 0);
cmdq->valid_map = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
if (!cmdq->valid_map)
return -ENOMEM;
return 0;
}
static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
{
int ret;
/* cmdq */
ret = arm_smmu_init_one_queue(smmu, &smmu->cmdq.q, smmu->base,
ARM_SMMU_CMDQ_PROD, ARM_SMMU_CMDQ_CONS,
CMDQ_ENT_DWORDS, "cmdq");
if (ret)
return ret;
ret = arm_smmu_cmdq_init(smmu);
if (ret)
return ret;
/* evtq */
ret = arm_smmu_init_one_queue(smmu, &smmu->evtq.q, smmu->page1,
ARM_SMMU_EVTQ_PROD, ARM_SMMU_EVTQ_CONS,
EVTQ_ENT_DWORDS, "evtq");
if (ret)
return ret;
if ((smmu->features & ARM_SMMU_FEAT_SVA) &&
(smmu->features & ARM_SMMU_FEAT_STALLS)) {
smmu->evtq.iopf = iopf_queue_alloc(dev_name(smmu->dev));
if (!smmu->evtq.iopf)
return -ENOMEM;
}
/* priq */
if (!(smmu->features & ARM_SMMU_FEAT_PRI))
return 0;
if (smmu->features & ARM_SMMU_FEAT_SVA) {
smmu->priq.iopf = iopf_queue_alloc(dev_name(smmu->dev));
if (!smmu->priq.iopf)
return -ENOMEM;
}
init_waitqueue_head(&smmu->priq.wq);
smmu->priq.batch = 0;
return arm_smmu_init_one_queue(smmu, &smmu->priq.q, smmu->page1,
ARM_SMMU_PRIQ_PROD, ARM_SMMU_PRIQ_CONS,
PRIQ_ENT_DWORDS, "priq");
}
static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
{
unsigned int i;
struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
size_t size = sizeof(*cfg->l1_desc) * cfg->num_l1_ents;
void *strtab = smmu->strtab_cfg.strtab;
cfg->l1_desc = devm_kzalloc(smmu->dev, size, GFP_KERNEL);
if (!cfg->l1_desc) {
dev_err(smmu->dev, "failed to allocate l1 stream table desc\n");
return -ENOMEM;
}
for (i = 0; i < cfg->num_l1_ents; ++i) {
arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
strtab += STRTAB_L1_DESC_DWORDS << 3;
}
return 0;
}
#ifdef CONFIG_SMMU_BYPASS_DEV
static void arm_smmu_install_bypass_ste_for_dev(struct arm_smmu_device *smmu,
u32 sid)
{
u64 val;
__le64 *step = arm_smmu_get_step_for_sid(smmu, sid);
if (!step)
return;
val = STRTAB_STE_0_V;
val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
step[0] = cpu_to_le64(val);
step[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
STRTAB_STE_1_SHCFG_INCOMING));
step[2] = 0;
}
static int arm_smmu_prepare_init_l2_strtab(struct device *dev, void *data)
{
u32 sid;
int ret;
struct pci_dev *pdev;
struct arm_smmu_device *smmu = (struct arm_smmu_device *)data;
if (!arm_smmu_device_domain_type(dev))
return 0;
pdev = to_pci_dev(dev);
sid = PCI_DEVID(pdev->bus->number, pdev->devfn);
if (!arm_smmu_sid_in_range(smmu, sid))
return -ERANGE;
ret = arm_smmu_init_l2_strtab(smmu, sid);
if (ret)
return ret;
arm_smmu_install_bypass_ste_for_dev(smmu, sid);
return 0;
}
#endif
static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
{
void *strtab;
u64 reg;
u32 size, l1size;
struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
#ifdef CONFIG_SMMU_BYPASS_DEV
int ret;
#endif