[AEWS] Deploying EKS Spot Instances and Kubeflow

By HanHoRnag | May 06, 2023 | 20 minutes

KANS kubeflow cloud AWS eksctl eks

AWS EKS Workshop Study (AEWS) is a hands-on study group that works through the EKS Workshop.
It is led by Gasida of CloudNet@ and is based on the publicly available AWS EKS Workshop.

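eksctl, the tool for building and managing EKS clusters, offers a wide range of configuration options. They are well documented in the official docs; in this post I pick a few options that I found interesting and share what I tested.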

Creating AWS IAM policies for EKS addon extensions
Using Spot instances as EKS nodes
Building kubeflow infrastructure on Spot instances

Creating AWS IAM policies for EKS addon extensions

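An EKS addon is a Kubernetes cluster component provided by Amazon EKS: an extension that handles tasks such as cluster management, networking, and load balancing. Using addons makes it easier to manage versions and updates of these extensions. By default, the networking addons are installed together with EKS. They are the following.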

Addon name	Description
CoreDNS	Handles DNS queries inside the cluster.
Kube-proxy	Handles network requests related to Kubernetes Services.
VPC CNI	The Amazon VPC CNI plugin, which manages pod networking in the Kubernetes cluster.

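Additional addons for load balancing, networking, and so on can be installed after EKS is set up. The important point is that installing them requires the corresponding IAM policies: these extensions call AWS services, so they need IAM permissions for those services. eksctl does not install these addons for you, but it can create the IAM policies. The policies are attached to the nodes through eksctl, and the available options are listed below.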

nodeGroups:
  - name: ng-1
    instanceType: m5.xlarge
    desiredCapacity: 1
    iam:
      # attach addon policies
      withAddonPolicies:
        imageBuilder: true  # Image builder: tooling that builds and manages custom container images
        autoScaler: true  # Auto scaler: automatically adjusts the number of nodes in the cluster
        externalDNS: true  # External DNS: manages external DNS records for Kubernetes Services and Ingress resources
        certManager: true  # cert-manager: automatically issues and manages TLS certificates in the cluster
        appMesh: true  # App Mesh: a service mesh for managing and monitoring communication between microservices
        appMeshPreview: true  # App Mesh Preview: preview version that gives early access to App Mesh beta features
        ebs: true  # Amazon EBS CSI driver: lets the cluster use Amazon EBS volumes
        fsx: true  # Amazon FSx CSI driver: lets the cluster use Amazon FSx file systems
        efs: true  # Amazon EFS CSI driver: lets the cluster use Amazon EFS file systems
        awsLoadBalancerController: true  # AWS Load Balancer Controller: integrates AWS load balancers with Kubernetes Services
        xRay: true  # AWS X-Ray: service for analyzing and debugging performance issues in distributed applications
        cloudWatch: true  # Amazon CloudWatch: monitoring and observability for AWS resources and applications

The policies created through eksctl can be checked in the AWS console.

Using Spot instances as EKS nodes

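Spot instances are an Amazon EC2 purchasing option that offers AWS's unused compute capacity at a discount. They can cost up to 90% less than on-demand instances, but AWS can interrupt them when spare capacity runs short. For this reason they are used for variable workloads or for workloads that are not time-sensitive, such as data analytics and batch jobs.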

It may sound complicated, but the cost savings that Spot instances bring are enormous. The savings work as follows.

<a href="https://www.youtube.com/watch?v=ugDrxMqSj-E&t=426s">https://www.youtube.com/watch?v=ugDrxMqSj-E&t=426s</a>


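As the figure shows, the user sets the price they are willing to pay for an instance, and AWS supplies instances only from capacity that is currently unused. Depending on the price offered, savings of up to 90% are possible; however, because the amount of unused capacity keeps changing, instances can be interrupted, and interruptions are said to typically occur at around 95%.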

Many AWS services can use Spot instances. This post covers using Spot instances with EKS.

eksctl lets you configure Spot instances for worker nodes. However, the options differ depending on whether the worker nodes belong to a managed or an unmanaged node group, as the following config file shows.

# spot-ng.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-eks
  region: ap-northeast-2

nodeGroups: # unmanaged node group
- name: spot-1
  minSize: 0
  maxSize: 2
  instancesDistribution:
    maxPrice: 0.017
    instanceTypes: ["t3.small", "t3.medium"] # At least one instance type should be specified
    onDemandBaseCapacity: 0
    onDemandPercentageAboveBaseCapacity: 50
    spotInstancePools: 2

managedNodeGroups: # managed node group
- name: spot-m1
  instanceTypes: ["c3.large","c4.large","c5.large","c5d.large","c5n.large","c5a.large"]
  spot: true
  desiredCapacity: 1

# If no instance type is set, m5.large is used.
- name: spot-m2
  spot: true
  desiredCapacity: 1

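Managed node groups enable Spot with spot: true, while unmanaged node groups do so through instancesDistribution. The options differ considerably because AWS takes care of the settings for managed node groups, whereas for unmanaged node groups the user has to specify pricing and allocation policy in detail.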

Unmanaged node groups are used in many setups, so it helps to understand how Spot is configured for them. I wrote the following to work through the examples in the official documentation.

# Node group using 50% Spot and 50% on-demand instances
nodeGroups:
  - name: ng-1
    minSize: 2
    maxSize: 5
    instancesDistribution:
      maxPrice: 0.017
      instanceTypes: ["t3.small", "t3.medium"] # At least one instance type should be specified
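      onDemandBaseCapacity: 0 # minimum number of on-demand instances that are always provisioned
      onDemandPercentageAboveBaseCapacity: 50 # percentage of on-demand instances for capacity above the base
      spotInstancePools: 2 # number of Spot pools (instance type and availability zone combinations) to use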

# GPU instances can also be used as Spot instances.
nodeGroups:
  - name: ng-gpu
    instanceType: mixed
    desiredCapacity: 1
    instancesDistribution:
      instanceTypes:
        - p2.xlarge
        - p2.8xlarge
        - p2.16xlarge
      maxPrice: 0.50

# Allocation using the capacity-optimized strategy
nodeGroups:
  - name: ng-capacity-optimized
    minSize: 2
    maxSize: 5
    instancesDistribution:
      maxPrice: 0.017
      instanceTypes: ["t3.small", "t3.medium"] # At least one instance type should be specified
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 50 
      spotAllocationStrategy: "capacity-optimized"

# Allocation using the capacity-optimized-prioritized strategy (instance types listed first get higher priority)
nodeGroups:
  - name: ng-capacity-optimized-prioritized
    minSize: 2
    maxSize: 5
    instancesDistribution:
      maxPrice: 0.017
      instanceTypes: ["t3a.small", "t3.small"] # At least two instance types should be specified
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized-prioritized"
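
To make the interaction between onDemandBaseCapacity and onDemandPercentageAboveBaseCapacity concrete, here is a small Python sketch of the arithmetic. The function name split_capacity is hypothetical, and rounding a fractional on-demand share up is an assumption made for illustration, so the examples only use capacities that divide evenly.

```python
import math


def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand, spot) instance counts for a mixed node group.

    Assumption: a fractional on-demand share is rounded up.
    """
    # The base is filled with on-demand instances first.
    base = min(desired, on_demand_base)
    above_base = desired - base
    # The percentage applies only to the capacity above the base.
    on_demand_above = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


# ng-1 above: base 0, 50% on-demand above the base
print(split_capacity(2, 0, 50))  # (1, 1)
print(split_capacity(4, 0, 50))  # (2, 2)
# ng-capacity-optimized-prioritized above: 0% on-demand, so everything is Spot
print(split_capacity(4, 0, 0))   # (0, 4)
```

With a base of 0 and a percentage of 0, as in the GPU node groups later in this post, every instance in the group is a Spot instance.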

Once the node groups are deployed with the following command, they can be checked in the AWS console.

eksctl create ng --cluster my-eks -f spot-ng.yaml

Although only 30 minutes have passed since deployment, one instance has already run and been interrupted, and another instance has been launched in its place.

The [Savings summary] section shows how much I saved with the instances I configured.

Wow, 72 percent! Let's make good use of it.

Building kubeflow infrastructure on Spot instances

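Taking advantage of the savings from Spot instances, I will build the infrastructure for kubeflow, a machine learning platform for Kubernetes. I chose a machine learning platform specifically because this approach can cut the cost of GPU instances as far as possible and optimize resource usage. The official eksctl examples also include an infrastructure config for kubeflow. I will take that example, change the region and instance types, and build the kubeflow infrastructure from it. The architecture is as follows.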

[Architecture diagram: kubeflow.drawio.png]

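Only the GPU instances (types starting with p) are allocated as Spot instances. Their node groups start at a minimum of 0 so that they are used only when needed, which keeps costs down. GPU instances are quite expensive: in the Seoul region, p2.xlarge costs 1.465 USD per hour and p3.2xlarge costs 4.234 USD per hour. I set the Spot maximum price to roughly half of those prices.
The availability zone is limited to ap-northeast-2a to minimize network latency, which matters for machine learning; given the nature of machine learning workloads, I did not take high availability into account.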
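
The "roughly half" rule can be checked with a few lines of Python. The on-demand prices are the Seoul-region figures quoted above, and the maximum prices are the maxPrice values used in the cluster config later in this post (0.7 and 2.0); the helper name price_ratio is hypothetical.

```python
def price_ratio(max_price, on_demand_price):
    """Fraction of the on-demand price that the Spot max price represents."""
    return max_price / on_demand_price


# (Spot max price, on-demand price) in USD per hour
prices = {
    "p2.xlarge": (0.7, 1.465),
    "p3.2xlarge": (2.0, 4.234),
}

for instance_type, (max_price, on_demand) in prices.items():
    ratio = price_ratio(max_price, on_demand)
    # 1 - ratio is the smallest possible saving, reached only if the
    # Spot price climbs all the way to the configured cap.
    print(f"{instance_type}: max price is {ratio:.0%} of on-demand, "
          f"saving at least {1 - ratio:.0%}")
```

Both caps sit a little under half of the on-demand price, so any GPU node that does run as Spot costs less than half of its on-demand equivalent.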

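The architecture consists of a bastion server and an EKS cluster. The bastion server is an EC2 instance created with CloudFormation, and the EKS cluster is built with eksctl. For the bastion server's CloudFormation code, refer to my GitHub repo and deploy from there. One important point: the AMI is set to Ubuntu, because installing kubeflow requires an Ubuntu environment.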

<a href="https://awslabs.github.io/kubeflow-manifests/docs/deployment/prerequisites/">https://awslabs.github.io/kubeflow-manifests/docs/deployment/prerequisites/</a>


For this reason the bastion server is built on Ubuntu.

Next, I will build the EKS cluster with eksctl. The config YAML file is as follows.

# Cost-Optimized EKS cluster for Kubeflow with spot GPU instances and node scale down to zero
# Built in an effort to reduce training costs of ML workloads.
# Supporting tutorial can be found at the following link: 
# https://blog.gofynd.com/how-we-reduced-our-ml-training-costs-by-78-a33805cb00cf
# This spec creates a cluster on EKS with the following active nodes 
# - 4x c5n.xlarge - Accommodates all pods of Kubeflow
# It also creates the following nodegroups with 0 nodes running unless a pod comes along and requests for the node to get spun up
# - p2.xlarge     -- Max Allowed 10 worker nodes
# - p3.2xlarge    -- Max Allowed 10 worker nodes
# - p3.8xlarge    -- Max Allowed 04 worker nodes

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  # Name of your cluster, change to whatever you find fit.
  # If changed, make sure to change all nodegroup tags from 
  # 'k8s.io/cluster-autoscaler/my-eks-kubeflow: "owned"' --> 'k8s.io/cluster-autoscaler/your-new-name: "owned"'
  name: my-eks-kubeflow
  # choose your region wisely, this will significantly impact the cost incurred
  region: ap-northeast-2
  # Kubernetes version; 1.25 is used here to match the Kubeflow 1.7 release deployed later
  version: '1.25'
  tags:
    # Add more cloud tags if needed for billing
    environment: staging

# Add all possible AZs to ensure nodes can be spun up in any AZ later on. 
# THIS CAN'T BE CHANGED LATER. YOU WILL HAVE TO CREATE A NEW CLUSTER TO ADD NEW AZ SUPPORT.
# This list applies to the whole cluster and isn't specific to nodegroups
vpc:
  id: vpc-04686564a10b92c9c
  cidr: 192.168.0.0/16
  securityGroup: sg-0ea8529af823353e9
  nat:
    gateway: HighlyAvailable

  subnets:
    public: 
      public-2a:
        id: subnet-03eeb6d32aa5397bf
        cidr: 192.168.1.0/24
      public-2c:
        id: subnet-023bc1a3fce0cde07
        cidr: 192.168.2.0/24
    private:
      private-2a:
        id: subnet-02c160be5273d5171
        cidr: 192.168.3.0/24
      private-2c:
        id: subnet-018a370a44f973ac4
        cidr: 192.168.4.0/24

iam: 
  withOIDC: true

nodeGroups:
  - name: ng-1
    desiredCapacity: 4
    minSize: 0
    maxSize: 10
    # Set one nodegroup with 100GB volumes for Kubeflow to get deployed. 
    # Kubeflow requirement states 1-2 Nodes with 100GB volume attached to the node. 
    volumeSize: 100
    volumeType: gp2
    instanceType: c5n.xlarge
    privateNetworking: true 
    ssh:
      publicKeyName: eks-terraform-key
    availabilityZones:
      - ap-northeast-2a
    labels:
      node-class: "worker-node"
    tags:
      # EC2 tags required for cluster-autoscaler auto-discovery
      k8s.io/cluster-autoscaler/node-template/label/lifecycle: OnDemand
      k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/spot: "false"
      k8s.io/cluster-autoscaler/node-template/label/gpu-count: "0"
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-eks-kubeflow: "owned"
    iam:
      withAddonPolicies:
        awsLoadBalancerController: true
        autoScaler: true
        cloudWatch: true
        efs: true
        ebs: true 
        externalDNS: true


  - name: 1-gpu-spot-p2-xlarge
    minSize: 0
    maxSize: 10
    instancesDistribution:
      # set your own max price. AWS spot instance prices no longer cross OnDemand price. 
      # Comment out the field to default to OnDemand as max price. 
      maxPrice: 0.7
      instanceTypes: ["p2.xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized
    labels:
      lifecycle: Ec2Spot
      aws.amazon.com/spot: "true"
      gpu-count: "1"
    # Stick to one AZ for all GPU nodes. 
    # In case of termination, this will prevent volumes from being unavailable 
    # if the new instance got spun up in another AZ.
    privateNetworking: true 
    ssh:
      publicKeyName: eks-terraform-key
    availabilityZones:
      - ap-northeast-2a
    taints:
      - key: spotInstance
        value: "true"
        effect: PreferNoSchedule
    tags:
      k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
      k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/spot: "true"
      k8s.io/cluster-autoscaler/node-template/label/gpu-count: "1"
      k8s.io/cluster-autoscaler/node-template/taint/spotInstance: "true:PreferNoSchedule"
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-eks-kubeflow: "owned"
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
        awsLoadBalancerController: true
        efs: true
        ebs: true 
        externalDNS: true

  - name: 1-gpu-spot-p3-2xlarge
    minSize: 0
    maxSize: 10
    instancesDistribution:
      # set your own max price. AWS spot instance prices no longer cross OnDemand price. 
      # Comment out the field to default to OnDemand as max price. 
      maxPrice: 2.0
      instanceTypes: ["p3.2xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized
    labels:
      lifecycle: Ec2Spot
      aws.amazon.com/spot: "true"
      gpu-count: "1"
    # Stick to one AZ for all GPU nodes. 
    # In case of termination, this will prevent volumes from being unavailable 
    # if the new instance got spun up in another AZ.
    privateNetworking: true 
    ssh:
      publicKeyName: eks-terraform-key
    availabilityZones:
      - ap-northeast-2a
    taints:
      - key: spotInstance
        value: "true"
        effect: PreferNoSchedule
    tags:
      k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
      k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/spot: "true"
      k8s.io/cluster-autoscaler/node-template/label/gpu-count: "1"
      k8s.io/cluster-autoscaler/node-template/taint/spotInstance: "true:PreferNoSchedule"
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-eks-kubeflow: "owned"
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
        awsLoadBalancerController: true
        efs: true
        ebs: true 
        externalDNS: true

  - name: 4-gpu-spot-p3-8xlarge
    minSize: 0
    maxSize: 4
    instancesDistribution:
      # set your own max price. AWS spot instance prices no longer cross OnDemand price. 
      # Comment out the field to default to OnDemand as max price. 
      maxPrice: 4.4
      instanceTypes: ["p3.8xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized
    labels:
      lifecycle: Ec2Spot
      aws.amazon.com/spot: "true"
      gpu-count: "4"
    # Stick to one AZ for all GPU nodes. 
    # In case of termination, this will prevent volumes from being unavailable 
    # if the new instance got spun up in another AZ.
    privateNetworking: true 
    ssh:
      publicKeyName: eks-terraform-key
    availabilityZones:
      - ap-northeast-2a
    taints:
      - key: spotInstance
        value: "true"
        effect: PreferNoSchedule
    tags:
      k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
      k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/spot: "true"
      k8s.io/cluster-autoscaler/node-template/label/gpu-count: "4"
      k8s.io/cluster-autoscaler/node-template/taint/spotInstance: "true:PreferNoSchedule"
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-eks-kubeflow: "owned"
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
        awsLoadBalancerController: true
        efs: true
        ebs: true 
        externalDNS: true

addons:
- name: vpc-cni # pinned to a specific addon version
  version: v1.12.6-eksbuild.1
  attachPolicyARNs:
    - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
- name: kube-proxy
  version: latest # auto discovers the latest available
- name: coredns
  version: latest # v1.9.3-eksbuild.2

Note that efs: true is included in withAddonPolicies. I added it because using shared storage for machine learning datasets can give better performance for model training and inference.
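
The cluster-autoscaler tags in the config repeat the same pattern for every node group, and the comment under metadata warns that they must be updated if the cluster name changes. As a rough sketch, the tag set for the 1-gpu-spot-p2-xlarge group could be generated like this; autoscaler_tags is a hypothetical helper for illustration, not part of eksctl.

```python
def autoscaler_tags(cluster_name, labels, taints):
    """Build the cluster-autoscaler tags for one node group.

    labels: dict of node label -> value
    taints: list of (key, value, effect) tuples
    """
    prefix = "k8s.io/cluster-autoscaler"
    tags = {}
    # One node-template tag per label the scaled-up node will carry.
    for key, value in labels.items():
        tags[f"{prefix}/node-template/label/{key}"] = value
    # Taints are encoded as "<value>:<effect>".
    for key, value, effect in taints:
        tags[f"{prefix}/node-template/taint/{key}"] = f"{value}:{effect}"
    # Auto-discovery tags; the second one embeds the cluster name.
    tags[f"{prefix}/enabled"] = "true"
    tags[f"{prefix}/{cluster_name}"] = "owned"
    return tags


tags = autoscaler_tags(
    "my-eks-kubeflow",
    {"lifecycle": "Ec2Spot", "aws.amazon.com/spot": "true", "gpu-count": "1"},
    [("spotInstance", "true", "PreferNoSchedule")],
)
for name, value in tags.items():
    print(f'{name}: "{value}"')
```

Only the last tag depends on the cluster name, which is why renaming the cluster means touching every node group.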

Let's build the EKS cluster with eksctl.

eksctl create cluster -f kubeflow-infra.yaml

It takes about 20 minutes.

root@hanhorang:/home/ubuntu/blog-share/aews-eksctl/example# kubectl get pods -A 
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   aws-node-6xqcv                         1/1     Running   0          94s
kube-system   aws-node-dtm6v                         1/1     Running   0          94s
kube-system   aws-node-kn5fj                         1/1     Running   0          94s
kube-system   aws-node-s7grj                         1/1     Running   0          94s
kube-system   coredns-595d647554-f7576               1/1     Running   0          27s
kube-system   coredns-595d647554-jzvlg               1/1     Running   0          27s
kube-system   kube-proxy-r9csf                       1/1     Running   0          94s
kube-system   kube-proxy-thglh                       1/1     Running   0          94s
kube-system   kube-proxy-txzr4                       1/1     Running   0          94s
kube-system   kube-proxy-zltl2                       1/1     Running   0          94s
kube-system   nvidia-device-plugin-daemonset-4p24p   1/1     Running   0          73s
kube-system   nvidia-device-plugin-daemonset-67275   1/1     Running   0          59s
kube-system   nvidia-device-plugin-daemonset-kmh95   1/1     Running   0          61s
kube-system   nvidia-device-plugin-daemonset-kzhtv   1/1     Running   0          72s

The nvidia-device-plugin pods should run only on GPU nodes, so I will edit the DaemonSet. Use the following command to edit it.

kubectl edit daemonset/nvidia-device-plugin-daemonset -n kube-system

spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        name: nvidia-device-plugin-ds
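    spec:
      affinity:    # added: schedule only on nodes labeled as GPU nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu-count    # label set on the GPU node groups
                    operator: In
                    values: ["1", "4"]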
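
A nodeSelector only matches a node when every listed label is present with exactly the same value, whereas a node affinity In expression accepts any value from a list. The Python sketch below mimics those two rules using the gpu-count labels from the cluster config; the function names are hypothetical and this is a simplified model, not the scheduler's actual code.

```python
def matches_node_selector(node_labels, selector):
    """nodeSelector semantics: every key must exist with an equal value."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def matches_in_expression(node_labels, key, values):
    """Node affinity 'In' semantics: the label value must be one of the values."""
    return node_labels.get(key) in values


cpu_node = {"node-class": "worker-node"}
gpu_node = {"lifecycle": "Ec2Spot", "gpu-count": "1"}

print(matches_node_selector(gpu_node, {"gpu-count": "1"}))       # True
print(matches_node_selector(gpu_node, {"gpu-count": ""}))        # False: "" != "1"
print(matches_in_expression(gpu_node, "gpu-count", ["1", "4"]))  # True
print(matches_in_expression(cpu_node, "gpu-count", ["1", "4"]))  # False
```

An empty selector value therefore keeps the pods off the CPU nodes but would not match a GPU node labeled gpu-count: "1" either, which is worth keeping in mind once GPU nodes scale up.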

After the change, the pod errors are gone, as the following output shows.

(terraform-eks@my-eks-kubeflow:N/A) [root@myeks-host example]# kubectl edit daemonset/nvidia-device-plugin-daemonset -n kube-system
daemonset.apps/nvidia-device-plugin-daemonset edited
(terraform-eks@my-eks-kubeflow:N/A) [root@hanhorang example]# kubectl get pods -A 
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE
kube-system   aws-node-9x5kp             1/1     Running   0          10m
kube-system   aws-node-bf4lw             1/1     Running   0          10m
kube-system   aws-node-v7gs6             1/1     Running   0          10m
kube-system   aws-node-wv9qj             1/1     Running   0          10m
kube-system   aws-node-xjwss             1/1     Running   0          10m
kube-system   coredns-595d647554-nlfv2   1/1     Running   0          10m
kube-system   coredns-595d647554-zx92r   1/1     Running   0          10m
kube-system   kube-proxy-24wxv           1/1     Running   0          10m
kube-system   kube-proxy-4dh4j           1/1     Running   0          10m
kube-system   kube-proxy-qp9xk           1/1     Running   0          10m
kube-system   kube-proxy-xcfqz           1/1     Running   0          10m
kube-system   kube-proxy-zvn2j           1/1     Running   0          10m

Deploying kubeflow

Now I will deploy kubeflow on the EKS cluster built in the previous steps. Before deploying, let's briefly look at what kubeflow is and what its architecture looks like.

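According to the official documentation, kubeflow is an open source ML platform. The word platform matters here: rather than building yet another machine learning service, kubeflow combines existing open source machine learning services and offers them as a platform that simplifies machine learning workflows.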

<a href="https://www.kubeflow.org/">https://www.kubeflow.org/</a>


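The architecture shows which open source machine learning tools are used. The way to understand it is that kubeflow runs on Kubernetes, whether on a cloud provider or locally, and combines various machine learning services and addon services into a workflow.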

<a href="https://www.kubeflow.org/docs/started/architecture/">https://www.kubeflow.org/docs/started/architecture/</a>


There are too many machine learning components to go through in detail, so I will look at the major groups.

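ML tools: software libraries and frameworks that support machine learning workflows such as data preprocessing, model training, evaluation, optimization, and deployment.
Kubeflow applications and scaffolding: the basic skeleton and tools provided by the Kubeflow platform, which help users build and manage machine learning workflows easily. Besides open source machine learning tools, this group includes open source projects such as istio, prometheus, and argo, which are combined to provide the dashboard, the service mesh, and pipelines.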

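Before deploying kubeflow, you need to check versions and install the required addons. For the kubeflow versions supported by each EKS version as of May 2023, refer to the following figure. My EKS version is 1.25, so I will install kubeflow 1.7.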

Next, install the required packages and addons. The packages are easy to install with the commands from the official documentation.

export KUBEFLOW_RELEASE_VERSION=v1.7.0
export AWS_RELEASE_VERSION=v1.7.0-aws-b1.0.0
git clone https://github.com/awslabs/kubeflow-manifests.git && cd kubeflow-manifests
git checkout ${AWS_RELEASE_VERSION}
git clone --branch ${KUBEFLOW_RELEASE_VERSION} https://github.com/kubeflow/manifests.git upstream

# Install the required tools
make install-tools

An error occurs while the required Python packages are being installed. The following commands resolve it.

pip install --ignore-installed PyYAML==5.3.1
pip3 install testresources
python3.8 -m pip install -r tests/e2e/requirements.txt

After the packages are installed, the EBS CSI driver, an EKS addon, needs to be installed.

I installed the EBS CSI driver by following the official AWS documentation. The process consists of creating the IAM policy and role for managing EBS volumes and then deploying the driver.

# Check the OIDC issuer
aws eks describe-cluster --name my-eks-kubeflow --query "cluster.identity.oidc.issuer" --output text

Note the region and the OIDC ID from the output. They need to be filled into the IAM trust policy below.
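
Instead of copying the two values by hand, they can be pulled out of the issuer URL programmatically. This is a minimal Python sketch that assumes the issuer has the form https://oidc.eks.<region>.amazonaws.com/id/<ID>, which is the form that appears in the trust policy below; parse_oidc_issuer is a hypothetical helper name.

```python
from urllib.parse import urlparse


def parse_oidc_issuer(issuer_url):
    """Split an EKS OIDC issuer URL into (region, oidc_id).

    Assumes the form https://oidc.eks.<region>.amazonaws.com/id/<ID>.
    """
    parsed = urlparse(issuer_url)
    # Host looks like: oidc.eks.<region>.amazonaws.com
    host_parts = parsed.netloc.split(".")
    region = host_parts[2]
    # Path looks like: /id/<ID>
    oidc_id = parsed.path.rsplit("/", 1)[-1]
    return region, oidc_id


issuer = "https://oidc.eks.ap-northeast-2.amazonaws.com/id/D378D41514C8714C26A69DF6ECC0A999"
region, oidc_id = parse_oidc_issuer(issuer)
print(region)
print(oidc_id)
```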

# vi aws-ebs-csi-driver-trust-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Federated": "arn:aws:iam::955963799952:oidc-provider/oidc.eks.ap-northeast-2.amazonaws.com/id/D378D41514C8714C26A69DF6ECC0A999"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
          "StringEquals": {
            "oidc.eks.ap-northeast-2.amazonaws.com/id/D378D41514C8714C26A69DF6ECC0A999:aud": "sts.amazonaws.com",
            "oidc.eks.ap-northeast-2.amazonaws.com/id/D378D41514C8714C26A69DF6ECC0A999:sub": "system:serviceaccount:kube-system:ebs-csi-controller-sa"
          }
        }
      }
    ]
  }
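
The trust policy repeats the OIDC provider string three times, so it is easy to mistype when editing by hand. As an illustration, the same document can be generated from its inputs with Python; the function name and its default arguments are assumptions for this sketch, and the values passed in are the ones from the policy above.

```python
import json


def ebs_csi_trust_policy(account_id, region, oidc_id,
                         namespace="kube-system",
                         service_account="ebs-csi-controller-sa"):
    """Return the trust policy document for the EBS CSI driver role as a dict."""
    provider = f"oidc.eks.{region}.amazonaws.com/id/{oidc_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{provider}:aud": "sts.amazonaws.com",
                        # Only this service account may assume the role.
                        f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    }
                },
            }
        ],
    }


policy = ebs_csi_trust_policy("955963799952", "ap-northeast-2",
                              "D378D41514C8714C26A69DF6ECC0A999")
print(json.dumps(policy, indent=2))
```

Redirecting the printed JSON into aws-ebs-csi-driver-trust-policy.json gives a file equivalent to the one shown above.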

# Create the role
aws iam create-role \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --assume-role-policy-document file://"aws-ebs-csi-driver-trust-policy.json"

# Attach the policy
aws iam attach-role-policy \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --role-name AmazonEKS_EBS_CSI_DriverRole

Once the policy is attached, the role has to be linked to a Kubernetes service account so that it can be used inside the cluster.

# Create the service account
kubectl create sa ebs-csi-controller-sa -n kube-system
# Annotate the service account with the role
kubectl annotate serviceaccount ebs-csi-controller-sa \
    -n kube-system \
    eks.amazonaws.com/role-arn=arn:aws:iam::955963799952:role/AmazonEKS_EBS_CSI_DriverRole

With the link in place, deploy the EBS driver. I used the AWS CLI to do this.

aws eks create-addon --cluster-name my-eks-kubeflow --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::955963799952:role/AmazonEKS_EBS_CSI_DriverRole

root@hanhorang:/home/ubuntu/blog-share/aews-eksctl/example# kubectl get pods -A 
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   aws-node-6xqcv                         1/1     Running   0          7m46s
kube-system   aws-node-dtm6v                         1/1     Running   0          7m46s
kube-system   aws-node-kn5fj                         1/1     Running   0          7m46s
kube-system   aws-node-s7grj                         1/1     Running   0          7m46s
kube-system   coredns-595d647554-f7576               1/1     Running   0          6m39s
kube-system   coredns-595d647554-jzvlg               1/1     Running   0          6m39s
kube-system   ebs-csi-controller-b576f46c5-2c5sk     5/6     Running   0          16s
kube-system   ebs-csi-controller-b576f46c5-ffwnk     5/6     Running   0          16s
kube-system   ebs-csi-node-6tpm6                     3/3     Running   0          16s
kube-system   ebs-csi-node-djrc4                     3/3     Running   0          16s
kube-system   ebs-csi-node-qtngl                     3/3     Running   0          16s
kube-system   ebs-csi-node-thbfc                     3/3     Running   0          16s
kube-system   kube-proxy-r9csf                       1/1     Running   0          7m46s
kube-system   kube-proxy-thglh                       1/1     Running   0          7m46s
kube-system   kube-proxy-txzr4                       1/1     Running   0          7m46s
kube-system   kube-proxy-zltl2                       1/1     Running   0          7m46s
kube-system   nvidia-device-plugin-daemonset-4p24p   1/1     Running   0          7m25s
kube-system   nvidia-device-plugin-daemonset-67275   1/1     Running   0          7m11s
kube-system   nvidia-device-plugin-daemonset-kmh95   1/1     Running   0          7m13s
kube-system   nvidia-device-plugin-daemonset-kzhtv   1/1     Running   0          7m24

After deploying the storage addon, a default StorageClass for PVCs needs to be designated. Create a StorageClass and set it as the default.

# ebs-sc.yaml  
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer

# Deploy the StorageClass
kubectl apply -f ebs-sc.yaml
# Unset gp2 as the default class
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
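
The patch is there so that only one StorageClass stays marked as default. The Python sketch below models that with plain dictionaries; it assumes gp2 starts out annotated as the default (which is what the patch command implies), and default_storage_classes is a hypothetical helper rather than a Kubernetes client call.

```python
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(storage_classes):
    """Return the names of all StorageClasses annotated as default."""
    return [
        sc["metadata"]["name"]
        for sc in storage_classes
        if sc["metadata"].get("annotations", {}).get(DEFAULT_ANNOTATION) == "true"
    ]


# Assumed state right after applying ebs-sc.yaml: two defaults at once.
gp2 = {"metadata": {"name": "gp2", "annotations": {DEFAULT_ANNOTATION: "true"}}}
ebs_sc = {"metadata": {"name": "ebs-sc", "annotations": {DEFAULT_ANNOTATION: "true"}}}
print(default_storage_classes([gp2, ebs_sc]))  # ['gp2', 'ebs-sc']

# Equivalent of the kubectl patch: flip gp2's annotation to "false".
gp2["metadata"]["annotations"][DEFAULT_ANNOTATION] = "false"
print(default_storage_classes([gp2, ebs_sc]))  # ['ebs-sc']
```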

Finally, let's deploy kubeflow. The deployment uses the make targets in the repository cloned earlier.

export CLUSTER_NAME=my-eks-kubeflow 
export CLUSTER_REGION=ap-northeast-2

# Install command
make deploy-kubeflow INSTALLATION_OPTION=kustomize DEPLOYMENT_OPTION=vanilla

...
All istio pods are running!
==========Installing dex==========
Release "dex" does not exist. Installing it now.
NAME: dex
LAST DEPLOYED: Tue May  2 22:47:51 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Waiting for dex pods to be ready ...
running command: kubectl wait --for=condition=ready pod -l 'app in (dex)' --timeout=240s -n auth
pod/dex-56d9748f89-99ggv condition met
All dex pods are running!
==========Installing oidc-authservice==========
Release "oidc-authservice" does not exist. Installing it now.
NAME: oidc-authservice
LAST DEPLOYED: Tue May  2 22:48:01 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Waiting for oidc-authservice pods to be ready ...
running command: kubectl wait --for=condition=ready pod -l 'app in (authservice)' --timeout=240s -n istio-system
error: timed out waiting for the condition on pods/authservice-0
Waiting for oidc-authservice pods to be ready ...
running command: kubectl wait --for=condition=ready pod -l 'app in (authservice)' --timeout=240s -n istio-system
error: timed out waiting for the condition on pods/authservice-0
Waiting for oidc-authservice pods to be ready ...
running command: kubectl wait --for=condition=ready pod -l 'app in (authservice)' --timeout=240s -n istio-system

The script installs the components one after another. It takes about 5 minutes.

root@hanhorang:/home/ubuntu/blog-share/aews-eksctl/kubeflow-manifests# kubectl get pods -A 
NAMESPACE                   NAME                                                     READY   STATUS    RESTARTS        AGE
ack-system                  ack-sagemaker-controller-5667d978b-xhnmn                 0/1     Error     4 (56s ago)     105s
auth                        dex-56d9748f89-nk54k                                     1/1     Running   0               34m
cert-manager                cert-manager-74d949c895-25bt6                            1/1     Running   0               35m
cert-manager                cert-manager-cainjector-d9bc5979d-m6kt8                  1/1     Running   0               35m
cert-manager                cert-manager-webhook-84b7ddd796-m6dmp                    1/1     Running   0               35m
istio-system                authservice-0                                            1/1     Running   0               34m
istio-system                cluster-local-gateway-6955b67f54-tlhp9                   1/1     Running   0               8m16s
istio-system                istio-ingressgateway-67f7b5f88d-n6whr                    1/1     Running   0               34m
istio-system                istiod-56f7cf9bd6-455ht                                  1/1     Running   0               34m
knative-eventing            eventing-controller-c6f5fd6cd-mzfzd                      1/1     Running   0               7m45s
knative-eventing            eventing-webhook-79cd6767-pt9dt                          1/1     Running   0               7m45s
knative-serving             activator-67849589d6-m7wlq                               2/2     Running   0               8m6s
knative-serving             autoscaler-6dbcdd95c7-b5tdc                              2/2     Running   0               8m6s
knative-serving             controller-b9b8855b8-ggzrg                               2/2     Running   0               8m6s
knative-serving             domain-mapping-75cc6d667f-vc5hz                          2/2     Running   0               8m5s
knative-serving             domainmapping-webhook-6dfb78c944-s4d5t                   2/2     Running   0               8m5s
knative-serving             net-istio-controller-5fcd96d76f-pqvnt                    2/2     Running   0               8m5s
knative-serving             net-istio-webhook-7ff9fdf999-48d9c                       2/2     Running   0               8m5s
knative-serving             webhook-69cc5b9849-tbn9r                                 2/2     Running   0               8m5s
kube-system                 aws-node-6xqcv                                           1/1     Running   0               127m
kube-system                 aws-node-dtm6v                                           1/1     Running   0               127m
kube-system                 aws-node-kn5fj                                           1/1     Running   0               127m
kube-system                 aws-node-s7grj                                           1/1     Running   0               127m
kube-system                 coredns-595d647554-f7576                                 1/1     Running   0               125m
kube-system                 coredns-595d647554-jzvlg                                 1/1     Running   0               125m
kube-system                 ebs-csi-controller-b576f46c5-76czd                       6/6     Running   0               15m
kube-system                 ebs-csi-controller-b576f46c5-svcbt                       6/6     Running   0               15m
kube-system                 ebs-csi-node-57dnr                                       3/3     Running   0               15m
kube-system                 ebs-csi-node-dcn5z                                       3/3     Running   0               15m
kube-system                 ebs-csi-node-hbsfg                                       3/3     Running   0               15m
kube-system                 ebs-csi-node-qhhsm                                       3/3     Running   0               15m
kube-system                 kube-proxy-r9csf                                         1/1     Running   0               127m
kube-system                 kube-proxy-thglh                                         1/1     Running   0               127m
kube-system                 kube-proxy-txzr4                                         1/1     Running   0               127m
kube-system                 kube-proxy-zltl2                                         1/1     Running   0               127m
kube-system                 nvidia-device-plugin-daemonset-4p24p                     1/1     Running   0               126m
kube-system                 nvidia-device-plugin-daemonset-67275                     1/1     Running   0               126m
kube-system                 nvidia-device-plugin-daemonset-kmh95                     1/1     Running   0               126m
kube-system                 nvidia-device-plugin-daemonset-kzhtv                     1/1     Running   0               126m
kubeflow-user-example-com   ml-pipeline-ui-artifact-6cb7b9f6fd-jggk2                 2/2     Running   0               110s
kubeflow-user-example-com   ml-pipeline-visualizationserver-7b5889796d-trjjd         2/2     Running   0               110s
kubeflow                    admission-webhook-deployment-6db8bdbb45-7zfzq            1/1     Running   0               5m4s
kubeflow                    cache-server-76cb8f97f9-wqstf                            2/2     Running   0               6m25s
kubeflow                    centraldashboard-655c7d894c-vmv5r                        2/2     Running   0               6m40s
kubeflow                    jupyter-web-app-deployment-76fbf48ff6-j7tkk              2/2     Running   0               4m55s
kubeflow                    katib-controller-8bb4fdf4f-46zsh                         1/1     Running   0               3m6s
kubeflow                    katib-db-manager-f8dc7f465-4z2ch                         1/1     Running   0               3m6s
kubeflow                    katib-mysql-db6dc68c-xj6qr                               1/1     Running   0               3m6s
kubeflow                    katib-ui-7859bc4c67-khc44                                2/2     Running   1 (2m59s ago)   3m6s
kubeflow                    kserve-controller-manager-85b6b6c47d-qxxjp               2/2     Running   0               7m5s
kubeflow                    kserve-models-web-app-99849d9f7-d67hg                    2/2     Running   0               6m51s
kubeflow                    kubeflow-pipelines-profile-controller-59ccbd47b9-9k9s4   1/1     Running   0               6m25s
kubeflow                    metacontroller-0                                         1/1     Running   0               6m23s
kubeflow                    metadata-envoy-deployment-5b6c575b98-rphl6               1/1     Running   0               6m24s
kubeflow                    metadata-grpc-deployment-784b8b5fb4-mmfql                2/2     Running   2 (5m47s ago)   6m24s
kubeflow                    metadata-writer-5899c74595-55kls                         2/2     Running   0               6m24s
kubeflow                    minio-65dff76b66-7g4q8                                   2/2     Running   0               6m24s
kubeflow                    ml-pipeline-cff8bdfff-glgnb                              2/2     Running   0               6m24s
kubeflow                    ml-pipeline-persistenceagent-798dbf666f-kwjhf            2/2     Running   0               6m24s
kubeflow                    ml-pipeline-scheduledworkflow-859ff9cf7b-fkskm           2/2     Running   0               6m24s
kubeflow                    ml-pipeline-ui-6d69549787-v2vl6                          2/2     Running   0               6m23s
kubeflow                    ml-pipeline-viewer-crd-56f7cfd7d9-phhjn                  2/2     Running   1               6m23s
kubeflow                    ml-pipeline-visualizationserver-64447ffc76-zllv2         2/2     Running   0               6m23s
kubeflow                    mysql-c999c6c8-2vhjq                                     2/2     Running   0               6m23s
kubeflow                    notebook-controller-deployment-84c9bfdf76-r5dc9          2/2     Running   1 (4m33s ago)   4m41s
kubeflow                    profiles-deployment-786df9d89d-mwlsj                     3/3     Running   1 (2m4s ago)    2m15s
kubeflow                    tensorboard-controller-deployment-6664b8866f-r6jtv       3/3     Running   1 (2m31s ago)   2m39s
kubeflow                    tensorboards-web-app-deployment-5cb4666798-55j68         2/2     Running   0               2m53s
kubeflow                    training-operator-7589458f95-zvrk9                       1/1     Running   0               3m56s
kubeflow                    volumes-web-app-deployment-59cf57d887-r9gt8              2/2     Running   0               4m17s
kubeflow                    workflow-controller-6547f784cd-mzzmw                     2/2     Running   1 (6m16s ago)   6m23s
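
A listing this long is hard to scan by eye, so a small filter can help. The sketch below is not part of the original deployment steps. It assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...), and `not_ready_pods` is a hypothetical helper name. It prints every pod whose READY count is incomplete or whose STATUS is not `Running`; pods from finished Jobs (STATUS `Completed`) would also be listed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# "namespace/name" for every pod that is not fully ready and Running.
not_ready_pods() {
  awk 'NF >= 4 && $1 != "NAMESPACE" {
    split($3, ready, "/")                       # READY column, e.g. "2/2"
    if (ready[1] != ready[2] || $4 != "Running")
      print $1 "/" $2
  }'
}

# Usage against a live cluster:
#   kubectl get pods -A | not_ready_pods
```

An empty result means every listed pod is ready.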

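🧐 Troubleshooting during deployment

In my case, I ran into a problem with the oidc-authservice component.

There were no events to go on, so it took me several days to find the cause. It turned out to be a PVC permission problem: the OIDC ID in the IAM role for the EBS CSI Driver had been entered incorrectly.

After I corrected the OIDC ID in the AWS console, everything worked normally.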
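
The same mismatch can be checked from the command line before opening the console. The sketch below is an illustration, not the procedure used in this post: `oidc_id_from_issuer` and `check_trust_policy` are hypothetical helper names, `CLUSTER_NAME` and `ROLE_NAME` are placeholders, and the check is a plain substring match on the trust policy text.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing the cluster's OIDC ID with the
# OIDC provider referenced in an IAM role's trust policy.

# Print the last path segment of an OIDC issuer URL (the OIDC ID).
oidc_id_from_issuer() {
  local issuer="${1%/}"          # drop a trailing slash, if any
  printf '%s\n' "${issuer##*/}"  # keep everything after the final "/"
}

# Print "match" if the trust policy text contains "/id/<OIDC ID>",
# otherwise print "mismatch". This is a simple substring check.
check_trust_policy() {
  local oidc_id="$1" policy_json="$2"
  case "${policy_json}" in
    *"/id/${oidc_id}"*) echo "match" ;;
    *)                  echo "mismatch" ;;
  esac
}

# Usage against a live cluster (CLUSTER_NAME and ROLE_NAME are placeholders):
#   issuer="$(aws eks describe-cluster --name "$CLUSTER_NAME" \
#     --query 'cluster.identity.oidc.issuer' --output text)"
#   policy="$(aws iam get-role --role-name "$ROLE_NAME" \
#     --query 'Role.AssumeRolePolicyDocument' --output json)"
#   check_trust_policy "$(oidc_id_from_issuer "$issuer")" "$policy"
```

A `mismatch` result suggests the role's trust policy points at a different OIDC provider than the one the cluster actually uses.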

A taste of Kubeflow

Once the Kubeflow deployment is complete, you can reach the dashboard with the following command.

kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8080:80

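Once connected, enter the ID and password on the Dex login screen. According to the official documentation, the default ID and password are user@example.com and 12341234.

Next, let's create a notebook server to use as the development environment. Open [Notebooks] in the left menu and configure the environment there.

🧐 Troubleshooting notebook creation

In my case, creating a notebook produced the error below. According to a GitHub issue, the cause is a security error triggered by accessing the Jupyter web app over plain HTTP.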

[403] Could not find CSRF cookie XSRF-TOKEN in the request. 
http://3.38.94.212:8080/jupyter/api/namespaces/kubeflow-user-example-com/notebooks

To work around it, I edited the deployed Jupyter web app (jupyter-web-app-deployment) as follows.

kubectl edit deploy/jupyter-web-app-deployment -n kubeflow
---
...
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: jupyter-web-app
        kustomize.component: jupyter-web-app
    spec:
      containers:
      - env:
        - name: APP_PREFIX
          value: /jupyter
        - name: UI
          value: default
        - name: USERID_HEADER
          value: kubeflow-userid
        - name: USERID_PREFIX
        - name: APP_SECURE_COOKIES
          value: "false" # changed from true to false
        image: docker.io/kubeflownotebookswg/jupyter-web-app:v1.7.0
        imagePullPolicy: IfNotPresent
        name: jupyter-web-app
        ports:
        - containerPort: 5000
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
        - mountPath: /src/apps/default/static/assets/logos
          name: logos-volume
...
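
The same change can also be applied without opening an editor. The sketch below assumes the namespace and Deployment name shown above; `disable_secure_cookies` is a hypothetical wrapper name. It relies on `kubectl set env` to update the variable and `kubectl rollout status` to wait for the new pod. Turning secure cookies off is only appropriate when the dashboard is reached over plain HTTP, as in this test setup.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: set APP_SECURE_COOKIES=false on a Deployment and
# wait for the rollout to finish.
# Arguments (defaults mirror the names used in this post):
#   $1 namespace, $2 deployment name
disable_secure_cookies() {
  local namespace="${1:-kubeflow}"
  local deployment="${2:-jupyter-web-app-deployment}"

  kubectl set env "deployment/${deployment}" -n "${namespace}" APP_SECURE_COOKIES=false
  kubectl rollout status "deployment/${deployment}" -n "${namespace}"
}

# Usage:
#   disable_secure_cookies
```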

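After the change, restart the port forwarding and try again; the notebook server is now created successfully.

Open the notebook server and run a quick test!

Wrapping up

In this post, we set up the infrastructure for Kubeflow and deployed Kubeflow itself. Next time, I will take a deeper look at the infrastructure side of Kubeflow features such as parameter optimization and GPU allocation.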