Using Snowplow will reduce your analytics costs. This is the first article, with detailed instructions on how to set up the whole pipeline that transfers events from a mobile application to a Redshift database. The next article will look more closely at how to build a dashboard that displays the collected data.
As an introduction, Tristan Handy's article "Startup Founder's Guide to Analytics" is excellent; a translation is available on Habr: https://habr.com/ru/post/346326/
The author recommends Snowplow for analytics:
«Migrate from your existing analytics and event tracking systems to Snowplow Analytics. Snowplow does everything the paid tools do, but it is open source. You can either host it yourself (and simply pay for your EC2 instances) or pay Snowplow or Fivetran to host the event collector. If you do not make the switch at this stage, you will miss out on collecting much more detailed data and set yourself up for truly enormous Segment, Heap or Mixpanel bills in the near future. Beyond this stage, the paid tools can easily charge you 10,000 USD per month.»
, . Simo Ahava snowplow,
, snowplow
snowplow
, 2 .
:
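- A Linux / Unix command line (for example, Terminal on Mac OS X).
- Git, to clone the Snowplow repository.
- An Amazon Web Services account (12 months of free tier).
- A domain name (for example, snowplow.denjoy.ru) whose DNS records you can edit.
- The Android Snowplow Tracker.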
?
, :
:
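- An event collector, Clojure Collector.
- The collector runs on AWS Elastic Beanstalk, with the domain routed through AWS Route 53.
- Collector logs are stored in AWS S3.
- An ETL (extract, transform, load) process runs on AWS EMR and writes its output to S3.
- The processed data is loaded into AWS Redshift.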
, .
Step 0: AWS IAM user
- AWS
. ,
.
AMR
, , Amazon Web Services IAM (Identity and Access
Management) , .
(IAM)
, , :
IAM.
IAM Snowplow.
Services IAM .
- Select «Groups».
- Click «Create new group».
- Name the group snowplow-setup and click «Next step».
- On the «Attach Policy» screen, click «Next step».
- Click «Create Group».
Go to «Policies».
- Click «Create Policy».
- Open the JSON tab and paste the following:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "acm:*",
        "autoscaling:*",
        "aws-marketplace:ViewSubscriptions",
        "aws-marketplace:Subscribe",
        "aws-marketplace:Unsubscribe",
        "cloudformation:*",
        "cloudfront:*",
        "cloudwatch:*",
        "ec2:*",
        "elasticbeanstalk:*",
        "elasticloadbalancing:*",
        "elasticmapreduce:*",
        "es:*",
        "iam:*",
        "rds:*",
        "redshift:*",
        "s3:*",
        "sns:*"
      ],
      "Resource": "*"
    }
  ]
}
- Click «Review policy».
- Name the policy snowplow-setup-policy-infrastructure.
- Click «Create Policy».
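Before pasting a policy into the IAM console, it can be worth checking locally that the JSON parses and grants the actions the later steps rely on. The sketch below is only an illustration: the function name `check_policy` and the deliberately shortened action list are assumptions made for this example and are not part of the Snowplow setup.

```python
import json

# Shortened copy of the policy above; "iam:*" is left out on purpose
# so that the check has something to report.
POLICY_TEXT = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["elasticbeanstalk:*", "elasticmapreduce:*", "redshift:*", "s3:*"],
      "Resource": "*"
    }
  ]
}
"""


def check_policy(text, required_actions):
    """Parse a policy document and return the required actions it does not allow."""
    policy = json.loads(text)  # raises ValueError on malformed JSON
    allowed = set()
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):  # a single action may be given as a string
            actions = [actions]
        allowed.update(actions)
    return sorted(set(required_actions) - allowed)


missing = check_policy(POLICY_TEXT, ["s3:*", "redshift:*", "iam:*"])
print("missing actions:", missing)
```

The check compares action strings literally, so it does not expand wildcards; it is meant to catch copy-paste mistakes, not to evaluate IAM semantics.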
Go to «Groups» and open the «snowplow-setup» group.
- On the Permissions tab, click «Attach Policy».
- Select snowplow-setup-policy-infrastructure and click «Attach Policy».
Go to «Users» and click «Add user».
- Name the user snowplow-setup.
- Select «Programmatic access».
- Click «Next: Permissions».
- Choose «Add user to group», select snowplow-setup and click «Next: Tags».
- Click «Next: Review».
- Click «Create user».
, , – . CSV, «Download .csv».
, , . , , .
, 0 !
- AWS.
- IAM-
snowplow-setup
.
Step 1: Clojure Collector
- DNS
Clojure Collector — , web-endpoint, . -, Apache Tomcat, AWS Elastic Beanstalk. Clojure Collector Tomcat AWS S3, , Clojure Collector, .
Clojure Collector
, , WAR Clojure Collector.
. clojure-collector-1.X.X-standalone.war
.
, Elastic Beanstalk.
AWS Services Elastic Beanstalk.
, AWS, Snowplow, . , . .
Elastic Beanstalk
- Click «Create Application».
- Enter an application name (for example, Snowplow Clojure Collector).
- For Platform, select Tomcat: Tomcat 8.5 with Java 8 running on 64bit Amazon Linux.
- Under Application Code, choose «Upload your code» and upload the WAR file.
- Click «Create application».
- ,
Clojure Collector , . , Applications cookie sp
. , .
! Clojure Collector.
.
S3
Tomcat S3 – . -, HTTP-, , .
S3, Elastic Beanstalk. Elastic Beanstalk AWS.
- .
- «Edit» «Software Configuration».
- «S3 log storage» «Rotate logs».
, , S3 ETL.
«Apply», .
, Elastic Beanstalk - auto-scalable, .
- «Configuration» .
- «Capacity» «Edit».
- «Environment Type» , «Load balanced», , .
, .
Elastic Beanstalk SSL
.
- In AWS Services, open «Route 53».
- Click «Create hosted zone».
- In Domain Name, enter your domain, for example snowplow.denjoy.ru. Keep the type «Public Hosted Zone» and click «Create hosted zone».
- . NS. .
- , NS , cloudflare.
- 4 NS- . CloudFlare:
, NS- snowplow.denjoy.ru
, NS AWS. .
-, , https://dnschecker.org/.
, , Route 53, . , Route 53 Elastic Beanstalk. , URL- snowplow.denjoy.ru
, DNS AWS, - Clojure Collector. !
- Click «Create Record».
- Select «Simple Routing».
- Click «Define simple record».
- In the window that opens, leave the Record name field empty. In the Value / Route traffic to field, select «Alias to Elastic Beanstalk environment», then select the region in the next field. In the Record type field, select the A record type, and click the «Define simple record» button at the bottom of the window.
- Once the window has closed, click «Create records».
If you now open http://snowplow.denjoy.ru/i in a browser, you should see the same pixel as when opening the Clojure Collector page. Domain routing works!
But we are not done yet.
Configuring HTTPS for Clojure Collector
() SSL- AWS Load Balancer. , Route 53, . SSL
- In AWS Services, open Certificate Manager. Under «Provision certificates», click «Get started».
- Select «Request a public certificate».
- Enter the domain name, for example snowplow.denjoy.ru, and click «Next».
- Select «DNS validation».
- Tags
- «Review» «Confirm and request»
- . , AWS , «Create record in Route 53»
- «Create»
Create . «Continue» . 30 , !
Load Balancer HTTPS
- Elastic Beanstalk, «Configuration». !
- «Load balancer» «Edit»
- «Listeners» «Add listener»
- Port 443, «Add».
- «Apply»
!
Snowplow Clojure Collector (, ).
, , .
— . Route 53, .
- Clojure Collector, Elastic Beanstalk.
- , Amazon Route 53.
- SSL- .
- Tomcat S3. S3 .
Step 2:
Android Tracker . Tracker Demo, , , «Ok» .
, https://snowplow.denjoy.ru, HTTPS «Start». .
.
Clojure Collector, Elastic Beanstalk, Tomcat S3. , S3
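In S3, open the elasticbeanstalk-region-id bucket and go to resources/environments/logs/publish/(some ID)/(some ID). The first ID is the environment ID, for example e-ab12cd23ef, and the second is the instance ID, for example i-1234567890. The folder contains gzip files. Files named like _var_log_tomcat8_rotated_localhost_access_log.txt123456789.gz are the ones the ETL process will use.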
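To make the folder convention concrete, here is a small sketch that splits such an S3 key into its parts. It assumes the key layout resources/environments/logs/publish/(environment ID)/(instance ID)/(file), matching the path used later in the sample config.yml; the names `parse_log_key` and `is_rotated_access_log` are invented for this example.

```python
import re

# Assumed key layout:
# resources/environments/logs/publish/<environment id>/<instance id>/<file>.gz
KEY_PATTERN = re.compile(
    r"^resources/environments/logs/publish/"
    r"(?P<environment>e-[^/]+)/(?P<instance>i-[^/]+)/(?P<filename>[^/]+\.gz)$"
)


def parse_log_key(key):
    """Return environment id, instance id and file name, or None if the key does not match."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    return match.groupdict()


def is_rotated_access_log(filename):
    """True for rotated Tomcat access logs, the files the ETL step reads."""
    return "_rotated_localhost_access_log" in filename


example_key = (
    "resources/environments/logs/publish/e-ab12cd23ef/i-1234567890/"
    "_var_log_tomcat8_rotated_localhost_access_log.txt123456789.gz"
)
parts = parse_log_key(example_key)
print(parts["environment"], parts["instance"], is_rotated_access_log(parts["filename"]))
```

A helper like this is handy when listing the bucket to confirm that rotated logs are actually arriving for the expected environment.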
, . HTTP- 200
. , , Clojure Collector . . :
, JSON .
Step 3: ETL
- Clojure Collector.
- IAM, 0 .
.
, , AWS Elastic MapReduce (EMR).
- Tomcat.
- , IP-.
- , schema JSON.
- , , Amazon Redshift.
. , ETL S3-. , , . Tomcat , , .
Java- EmrEtlRunner . ETL Amazon Elastic MapReduce. , EmrEtlRunner . , , , 60 .
EmrEtlRunner
ETL — Unix, . , , snowplow_emr_rXX
, XX — . snowplow_emr_r117_biskupin.zip
.
- ZIP-
snowplow-emr-etl-runner
. . - Snowplow Github , SQL, .
- , ,
snowplow-emr-etl-runner
, :
git clone https://github.com/snowplow/snowplow.git
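- The working folder now holds both
snowplow-emr-etl-runner
and the snowplow folder.
- Create a config folder with a targets folder inside it.
- Copy the following files: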
- snowplow/3-enrich/emr-etl-runner/config/config.yml.sample to config/config.yml
- snowplow/3-enrich/config/iglu_resolver.json to config/iglu_resolver.json
- snowplow/4-storage/config/targets/redshift.json to config/targets/redshift.json
The folder should now look like this:
|-- snowplow-emr-etl-runner
|-- snowplow
|   |-- -SNOWPLOW GIT REPO HERE-
|-- config
|   |-- iglu_resolver.json
|   |-- config.yml
|   |-- targets
|   |   |-- redshift.json
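The copy steps above can also be scripted. The following is a minimal sketch, assuming the repository was cloned into a snowplow folder inside the working folder and that the file names match the tree above; `prepare_layout` is a hypothetical helper, not something shipped with Snowplow.

```python
import shutil
import tempfile
from pathlib import Path

# (source inside the cloned repository, destination inside the working folder)
FILES_TO_COPY = [
    ("snowplow/3-enrich/emr-etl-runner/config/config.yml.sample", "config/config.yml"),
    ("snowplow/3-enrich/config/iglu_resolver.json", "config/iglu_resolver.json"),
    ("snowplow/4-storage/config/targets/redshift.json", "config/targets/redshift.json"),
]


def prepare_layout(root):
    """Create config/ and config/targets/, then copy the sample files that exist.

    Returns the source paths that were not found, so nothing is skipped silently.
    """
    root = Path(root)
    (root / "config" / "targets").mkdir(parents=True, exist_ok=True)
    not_found = []
    for source, destination in FILES_TO_COPY:
        source_path = root / source
        if source_path.is_file():
            shutil.copyfile(source_path, root / destination)
        else:
            not_found.append(source)
    return not_found


# Demonstration in a throwaway directory; pass the real working folder instead.
with tempfile.TemporaryDirectory() as demo_root:
    print("not found:", prepare_layout(demo_root))
```

Returning the missing paths instead of raising keeps the helper usable when the repository layout differs between Snowplow releases.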
EC2
Amazon EC2. ETL Amazon, Amazon EC2. ETL , , .
- AWS Services EC2. «Key Pairs» .
- , , . .
- , , «Create key pair».
- .
denjoy-snowplow
. - pem
- , , <key pair name>.pem .
S3
Amazon S3. ETL.
:
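:raw:in is the bucket whose name starts with elasticbeanstalk-, where the logs of the Clojure Collector running on Elastic Beanstalk are stored.
:processing is the intermediate processing folder.
:archive needs three folders: :raw (raw logs), :enriched (enriched data) and :shredded (shredded data).
:enriched needs two folders: :good (data that was enriched successfully) and :bad (data that failed enrichment).
:shredded needs two folders: :good (data that was shredded successfully) and :bad (data that failed shredding).
:log is the bucket for the logs produced by the ETL process.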
, S3, Services AWS S3.
:raw:in
, elasticbeanstalk-
.
, « » ETL.
«Create bucket» , denjoy-snowplow-data
. S3, snowplow. «Next» , , , «Create bucket».
, . :
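Click «Create folder» and create three folders:
archive
shredded
enriched
Inside archive, create three more folders:
raw
enriched
shredded
Inside enriched and inside shredded, create two folders each:
good
bad
When you are done, the structure should look like this: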
|-- elasticbeanstalk-region-id
|-- denjoy-snowplow-data
|   |-- archive
|   |   |-- raw
|   |   |-- enriched
|   |   |-- shredded
|   |-- enriched
|   |   |-- good
|   |   |-- bad
|   |-- shredded
|   |   |-- good
|   |   |-- bad
S3 denjoy-snowplow-log
. , ETL.
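The folder structure above is small enough to write down as data, which makes it easy to print the folder keys to create and the s3:// URIs that config.yml will need later. This is a sketch under the assumption that the data bucket is called denjoy-snowplow-data as in the text; the helper names are made up for illustration.

```python
# Folder layout described above for the data bucket.
DATA_BUCKET_LAYOUT = {
    "archive": ["raw", "enriched", "shredded"],
    "enriched": ["good", "bad"],
    "shredded": ["good", "bad"],
}


def folder_keys(layout):
    """Return S3-style folder keys (with a trailing slash) for a nested layout."""
    keys = []
    for parent, children in layout.items():
        keys.append(f"{parent}/")
        for child in children:
            keys.append(f"{parent}/{child}/")
    return keys


def bucket_uris(bucket, layout):
    """Return full s3:// URIs for the leaf folders, in the form config.yml expects."""
    return [
        f"s3://{bucket}/{parent}/{child}"
        for parent, children in layout.items()
        for child in children
    ]


for key in folder_keys(DATA_BUCKET_LAYOUT):
    print(key)
for uri in bucket_uris("denjoy-snowplow-data", DATA_BUCKET_LAYOUT):
    print(uri)
```

Keeping the layout in one place reduces the chance of a typo such as a misspelled folder name ending up in only one of the two places.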
EmrEtlRunner
EmrEtlRunner. config.yml
, snowplow config/
. :
-
snowplow-setup
, 0. , AWS IAM.
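- Install the AWS command-line client. It can be installed with Python/pip or, on Mac OS X, with Homebrew; with Homebrew, run brew install awscli.
Once awscli is installed, run aws configure and enter the values it asks for, using your own region, for example eu-west-1.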
$ aws configure
AWS Access Key ID: <enter your IAM user Access Key ID here>
AWS Secret Access Key: <enter your IAM user Secret Access Key here>
Default region name: <enter the region name, e.g. eu-west-1 here>
Default output format: <just press enter>
After aws configure, run
aws emr create-default-roles
. - EmrEtlRunner, EC2.
EmrEtlRunner!
EmrEtlRunner
EmrEtlRunner — snowplow-emr-etl-runner
.
EmrEtlRunner . . . , 13, rdb_load. . .
EmrEtlRunner config.yml
, config
. , , , .
aws:
  access_key_id: AKIAIBAWU2NAYME55123
  secret_access_key: iEmruXM7dSbOemQy63FhRjzhSboisP5TcJlj9123
  s3:
    region: eu-west-1
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets:
      log: s3://simoahava-snowplow-log
      raw:
        in:
          - s3://elasticbeanstalk-eu-west-1-375284143851/resources/environments/logs/publish/e-f4pdn8dtsg
        processing: s3://simoahava-snowplow-data/processing
        archive: s3://simoahava-snowplow-data/archive/raw
      enriched:
        good: s3://simoahava-snowplow-data/enriched/good
        bad: s3://simoahava-snowplow-data/enriched/bad
        errors:
        archive: s3://simoahava-snowplow-data/archive/enriched
      shredded:
        good: s3://simoahava-snowplow-data/shredded/good
        bad: s3://simoahava-snowplow-data/shredded/bad
        errors:
        archive: s3://simoahava-snowplow-data/archive/shredded
  emr:
    ami_version: 5.9.0
    region: eu-west-1
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    placement:
    ec2_subnet_id: subnet-d6e91a9e
    ec2_key_name: simoahava
    bootstrap: []
    software:
      hbase:
      lingual:
    jobflow:
      job_name: Snowplow ETL
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:
        volume_size: 100
        volume_type: "gp2"
        volume_iops: 400
        ebs_optimized: false
      task_instance_count: 0
      task_instance_type: m1.medium
      task_instance_bid: 0.015
    bootstrap_failure_tries: 3
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:
collectors:
  format: clj-tomcat
enrich:
  versions:
    spark_enrich: 1.12.0
  continue_on_unexpected_error: false
  output_compression: NONE
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.0
    hadoop_elasticsearch: 0.1.0
monitoring:
  tags: {}
  logging:
    level: DEBUG
, , , . -. , , .
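:aws:access_key_id | Access key ID of the IAM user.
:aws:secret_access_key | Secret access key of the IAM user.
:aws:s3:region | Region of the S3 buckets.
:aws:s3:buckets:log | S3 bucket for the ETL logs.
:aws:s3:buckets:raw:in | Folder that holds the Tomcat logs.
:aws:s3:buckets:raw:processing | The processing folder of the data bucket.
:aws:s3:buckets:raw:archive | The archive/raw folder of the data bucket.
:aws:s3:buckets:enriched:good | The enriched/good folder of the data bucket.
:aws:s3:buckets:enriched:bad | The enriched/bad folder of the data bucket.
:aws:s3:buckets:enriched:errors | Left empty in the sample.
:aws:s3:buckets:enriched:archive | The archive/enriched folder of the data bucket.
:aws:s3:buckets:shredded:good | The shredded/good folder of the data bucket.
:aws:s3:buckets:shredded:bad | The shredded/bad folder of the data bucket.
:aws:s3:buckets:shredded:errors | Left empty in the sample.
:aws:s3:buckets:shredded:archive | The archive/shredded folder of the data bucket.
:aws:emr:region | Region in which the EC2 instances run.
:aws:emr:placement | Left empty in the sample.
:aws:emr:ec2_subnet_id | ID of the subnet used by the EC2 instances.
:aws:emr:ec2_key_name | Name of the EC2 key pair.
:collectors:format | clj-tomcat.
:monitoring:snowplow | Absent from the sample (the :method, :app_id and :collector keys).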
.
-, :aws:s3:buckets:raw:in
. . , . , .
:aws:emr:ec2_subnet_id
, Services AWS EC2. «Instances», . «subnet» aws:emr:ec2_subnet_id
.
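Because a wrong bucket path only shows up once the EMR job has already started, a quick local check of the buckets section can save time. The sketch below works on an already parsed dictionary (reading the YAML file itself would need a third-party parser, which is left out here). It assumes, as in the sample configuration, that :raw:in stops at the environment folder (e-...); the function names are hypothetical.

```python
def iter_leaves(section, path=""):
    """Yield (path, value) for every non-empty leaf of a nested dict or list."""
    for name, value in section.items():
        key = f"{path}:{name}"
        if isinstance(value, dict):
            yield from iter_leaves(value, key)
        elif isinstance(value, list):
            for item in value:
                yield key, item
        elif value is not None:
            yield key, value


def check_buckets(buckets):
    """Return a list of human-readable problems found in the buckets section."""
    problems = []
    for key, uri in iter_leaves(buckets):
        if not str(uri).startswith("s3://"):
            problems.append(f"{key}: {uri!r} is not an s3:// URI")
    # Assumption: as in the sample, raw:in ends with the environment folder (e-...).
    for uri in buckets.get("raw", {}).get("in", []):
        last_segment = uri.rstrip("/").rsplit("/", 1)[-1]
        if not last_segment.startswith("e-"):
            problems.append(f":raw:in: {uri!r} does not end with an environment folder")
    return problems


sample_buckets = {
    "assets": "s3://snowplow-hosted-assets",
    "jsonpath_assets": None,
    "log": "s3://simoahava-snowplow-log",
    "raw": {
        "in": [
            "s3://elasticbeanstalk-eu-west-1-375284143851/"
            "resources/environments/logs/publish/e-f4pdn8dtsg"
        ],
        "processing": "s3://simoahava-snowplow-data/processing",
        "archive": "s3://simoahava-snowplow-data/archive/raw",
    },
}

print(check_buckets(sample_buckets))
```

Empty values such as jsonpath_assets are skipped rather than reported, since the sample leaves several keys blank on purpose.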
, .
, , , snowplow-emr-etl-runner
.
./snowplow-emr-etl-runner run -c config/config.yml -r config/iglu_resolver.json
Invalid InstanceProfile: EMR_EC2_DefaultRole.
ETL S3. .
ETL, AWS Redshift, !
snowplow-emr-etl-runner
.- S3-.
- ETL S3.
Step 4: Redshift
- ETL .
- S3-.
- GUI SQL-. Table Plus, , . .
Redshift. Redshift — , AWS. , , Tomcat. SQL . , SQL, Codecademy, SQL!
:
- Redshift.
- .
- EmrEtlRunner Redshift.
, , EmrEtlRunner, . SQL- ( ) Snowplow: .
AWS Amazon Redshift.
, ( , ). «Launch Cluster».
. snowplow-cluster
. . snowplow
.
Node type dc2.large
, Cluster type Single Node 1 .
- (5439).
-. , , . - — .
-.
, «Create cluster».
.
. Redshift.
, , , .
«Clusters» , .
«Properties» «Network and security» VPC security groups ( sg-c3f5c687
).
EC2.
.
«Inbound rules» , TCP- 5439
0.0.0.0/0
. , TCP- ( ).
, .
. Amazon Redshift . .
SQL. Table Plus. «Create new connection» :
- Driver: Amazon Redshift (com.amazon.redshift.jdbc.Driver)
- Host: endpoint
- User: awsuser
- Password: master_password
- Database: snowplow
-, .
:
«Connect», .
SELECT current_database();
«Run current», , . :
– !
-, , Android Tracker. .sql , DDL, .
.sql , Snowplow:
- snowplow/4-storage/redshift-storage/sql/atomic-def.sql
- snowplow/4-storage/redshift-storage/sql/manifest-def.sql
atomic-def.sql
Table Plus. atomic
atomic.events
.
manifest-def.sql
. .
DDL . , ETL , .
.sql :
- https://github.com/snowplow/iglu-central/tree/master/sql/com.snowplowanalytics.mobile
- https://github.com/snowplow/iglu-central/tree/master/sql/com.snowplowanalytics.snowplow
, SQL- , :
SELECT * FROM pg_tables WHERE schemaname='atomic';
:
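storageloader, the user that the ETL process uses to load data.
power_user, a user with full access to the database.
read_only, a user who can only read the data.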
SQL-. ($password
) , + .
CREATE USER storageloader PASSWORD '$password';
GRANT USAGE ON SCHEMA atomic TO storageloader;
GRANT INSERT ON ALL TABLES IN SCHEMA atomic TO storageloader;

CREATE USER read_only PASSWORD '$password';
GRANT USAGE ON SCHEMA atomic TO read_only;
GRANT SELECT ON ALL TABLES IN SCHEMA atomic TO read_only;
CREATE SCHEMA scratchpad;
GRANT ALL ON SCHEMA scratchpad TO read_only;

CREATE USER power_user PASSWORD '$password';
GRANT ALL ON DATABASE snowplow TO power_user;
GRANT ALL ON SCHEMA atomic TO power_user;
GRANT ALL ON ALL TABLES IN SCHEMA atomic TO power_user;
, 12 .
, , atomic
storageLoader
, .
, :
SELECT 'ALTER TABLE atomic.' || tablename ||' OWNER TO storageloader;' FROM pg_tables WHERE schemaname='atomic' AND NOT tableowner='storageloader';
:
ALTER TABLE atomic.* OWNER TO storageloader;
.
,
SELECT * FROM pg_tables WHERE schemaname='atomic' AND tableowner='storageloader';
.
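The ownership change can also be prepared outside the SQL client by generating one statement per table. The sketch below is an illustration only: `owner_statements` is a made-up helper, and the table names events and manifest are used as examples of what the atomic schema may contain.

```python
import re

# Only plain lowercase identifiers are accepted, so that a table name can
# never smuggle extra SQL into the generated statements.
IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def owner_statements(tables, owner="storageloader", schema="atomic"):
    """Build one ALTER TABLE ... OWNER TO statement per table name."""
    for name in (owner, schema, *tables):
        if not IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return [f"ALTER TABLE {schema}.{table} OWNER TO {owner};" for table in tables]


for statement in owner_statements(["events", "manifest"]):
    print(statement)
```

The table list would normally come from the pg_tables query shown above; the generated statements are then pasted into the SQL client and run as the cluster's master user.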
, EmrEtlRunner ETL, storageloader
- S3 Redshift.
IAM-
EmrEtlRunner Redshift RDB Loader ( ). , IAM-, Redshift S3-.
- , AWS Services IAM.
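- Open Roles and click «Create role».
- Under «Select type of trusted entity», choose AWS service and then Redshift. Under «Select your use case», choose «Redshift - Customizable» and click «Next: Permissions».
- Select the AmazonS3ReadOnlyAccess policy and click «Next: Tags».
- Click «Next: Review».
- Name the role, for example RedshiftS3Access, and click «Create role».
- Open the RedshiftS3Access role and copy its Role ARN; it is needed later.
- In AWS Services, open Amazon Redshift.
- Select the Snowplow cluster and click «Manage IAM roles».
- Under «Available IAM roles», select the new role, click «Add IAM role» and then «Done».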
Redshift
, 3, config/
targets/
redshift.json
.
redshift.json
, :
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "ADD HERE",
    "database": "ADD HERE",
    "port": 5439,
    "sslMode": "DISABLE",
    "username": "ADD HERE",
    "password": "ADD HERE",
    "roleArn": "ADD HERE",
    "schema": "atomic",
    "maxError": 1,
    "compRows": 20000,
    "sshTunnel": null,
    "purpose": "ENRICHED_EVENTS"
  }
}
, :
host: the endpoint URL of the Redshift cluster
database: the database name
username: storageloader
password: the password of the storageloader user
roleArn: the ARN of the IAM role created above.
-.
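A forgotten "ADD HERE" placeholder is an easy mistake to make in this file. The sketch below fills a copy of the template and lists any fields that still hold the placeholder. The host name, password and role ARN in the example call are invented values, and `fill_target` and `unfilled_fields` are hypothetical helpers.

```python
import copy
import json

# Template fields from redshift.json; only the "data" part matters here.
TEMPLATE = {
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
    "data": {
        "name": "AWS Redshift enriched events storage",
        "host": "ADD HERE",
        "database": "ADD HERE",
        "port": 5439,
        "sslMode": "DISABLE",
        "username": "ADD HERE",
        "password": "ADD HERE",
        "roleArn": "ADD HERE",
        "schema": "atomic",
        "maxError": 1,
        "compRows": 20000,
        "sshTunnel": None,
        "purpose": "ENRICHED_EVENTS",
    },
}


def fill_target(template, host, database, username, password, role_arn):
    """Return a copy of the template with the connection fields filled in."""
    target = copy.deepcopy(template)
    target["data"].update(
        host=host, database=database, username=username,
        password=password, roleArn=role_arn,
    )
    return target


def unfilled_fields(target):
    """Return the names of the fields that still contain the placeholder."""
    return sorted(k for k, v in target["data"].items() if v == "ADD HERE")


filled = fill_target(
    TEMPLATE,
    host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",  # invented
    database="snowplow",
    username="storageloader",
    password="example-password",  # invented
    role_arn="arn:aws:iam::123456789012:role/RedshiftS3Access",  # invented
)
print(unfilled_fields(TEMPLATE))
print(json.dumps(filled["data"], indent=2))
```

Writing the result back with `json.dumps` keeps the file valid JSON, which hand editing does not guarantee.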
EmrEtlRunner
, , EmrEtlRunner,
Redshift.
, ( snowplow-emr-etl-runner
):
./snowplow-emr-etl-runner run -c config/config.yml -r config/iglu_resolver.json -t config/targets
:raw:in
(, Tomcat)
, , Redshift. ,
.
- :
read_only .
, , , , (
), ,
, Snowplow.
- Amazon, , DNS
AWS.
- Clojure Collector — , HTTP- Tomcat
S3-.
- ETL, ,
S3.
- , ETL , ,
AWS Redshift.
, , , - –
, -.
, , , .
Discourse
Snowplow — , , .
!