Comment créer et utiliser des dictionnaires dans ClickHouse



Si vous avez ouvert ce Porte article, alors vous avez certainement déjà traité avec ClickHouse et vous pouvez manquer des détails intéressants sur sa commodité et sa rapidité, et aller droit au but - en fait, comment créer des dictionnaires et les utiliser dans ClickHouse.



Que sont les dictionnaires dans ClickHouse?



keyvalue. ClickHouse, ETL- .

ClickHouse :



  • ClickHouse .
  • TTL – ClickHouse .
  • ClickHouse — XML- DDL-.




: / , HTTP(s) , ..



xml-, dictionaries_config ClickHouse.



, dictionaries_lazy_load.



( ) — .



, , system.dictionaries, :



  • ;
  • ;
  • , RAM .




xml DDL-. , — DDL-.



xml :



<yandex>
    <!-- ,   -->
    <comment>Some comments</comment>

    <!-- ,    -->
    <include_from>/etc/metrika.xml</include_from>

    <dictionary>
        <!--   -->
    </dictionary>

    ...

    <dictionary>
        <!--   -->
    </dictionary>
</yandex>


DDL-, .



:



<dictionary>
    <name>clients</name>
    <surce>
        <clickhouse>
            <host>myHostName</host>
            <port>9000</port>
            <user>admin</user>
            <password>secret_password</password>
            <db>clients</db>
            <table>users</table>
            <where>id<=10</where>
        </clickhouse>
    </surce>
    <lifetime>
        <min>3600</min>
        <max>5400</max>
    </lifetime>
    <layout>
        <flat/>
    </layout>
    <structure>
        <id>user_id</id>
        <attribute>
            <name>username</name>
            <type>string</type>
        </attribute>
        <attribute>
            <name>age</name>
            <type>Int8</type>
        </attribute>
    </structure>
</dictionary>


:



  • name — ;
  • source — ;
  • lifetime — ;
  • layout — . ;
  • structure — . , .


DDL-:



CREATE DICTIONARY dict_users_id (
    id UInt64,
    username String,
    email String,
    status UInt16,
    hash String
)
PRIMARY KEY id
SOURCE(MYSQL(
    port 3306
    user clickhouse
    password secret_password
    replica(host 'mysql1.fevlake.com' priority 1)
    db fevlake_dicts
    table users
))
LAYOUT(HASHED())
LIFETIME(MIN 3600 MAX 5400);




. — :



  • HTTP(s)


— , .





:



<surce>
    <file>
      <path>/opt/dictionaries/clients.csv</path>
      <format>CSV</format>
    </file>
</surce>


:



  • path — .
  • format — . ClickHouse.


DDL-:



SOURCE(FILE(path '/opt/dictionaries/clients.csv' format 'CSV'))
SETTINGS(format_csv_allow_single_quotes = 0)




MySQL .



:



<surce>
    <mysql>
        <port>3306</port>
        <user>clickhouse</user>
        <password>secret_password</password>
        <replica>
            <host>example01-1</host>
            <priority>1</priority>
        </replica>
        <replica>
            <host>example01-2</host>
            <priority>1</priority>
        </replica>
        <db>db_name</db>
        <table>table_name</table>
        <where>id=10</where>
        <invalidate_query>SQL_QUERY</invalidate_query>
    </mysql>
</surce>


  • port — MySQL. <replica>.
  • user — MySQL. <replica>.
  • password — MySQL. <replica>.
  • replica — . .
  • db — .
  • table — .
  • where — .  WHERE  MySQL, , id >= 3 AND id < 10 ( ).
  • invalidate_query — ( ).


DDL-:



SOURCE(MYSQL(
    port 3306
    user clickhouse
    password secret_password
    replica(host 'mysql1.fevlake.com' priority 1)
    db fevlake_dicts
    table users
))




ClickHouse:



  • flat
  • hashed
  • sparse_hashed
  • cache
  • direct
  • range_hashed
  • complex_key_hashed
  • complex_key_cache
  • complex_key_direct
  • ip_trie


3, , — flat, hashed complex_key_hashed. .



Flat



, . UInt64 500 000, ClickHouse .



.



:



<layout>
    <flat/>
</layout>




LAYOUT(FLAT())


Hashed



- . , .



:



<layout>
    <hashed/>
</layout>




LAYOUT(HASHED())


omplex_key_hashed



 .  hashed .



:



<layout>
    <hashed/>
</layout>




LAYOUT(COMPLEX_KEY_HASHED())




 <structure>  , .



XML:



<structure>
    <id>user_id</id>
    <attribute>
        <name>username</name>
        <type>string</type>
    </attribute>
    <attribute>
        <name>age</name>
        <type>Int8</type>
    </attribute>
</structure>


:



  • <id> —  ;
  • <attribute> —  . .




ClickHouse :



  • UInt64.  <id>   PRIMARY KEY.
  • . .  <key>   PRIMARY KEY.




UInt64.



:



<id>
    <name>user_id</name>
</id>




CREATE DICTIONARY (
    user_id UInt64,
    ...
)
PRIMARY KEY user_id
...


  • PRIMARY KEY – .




(tuple) .  layout   complex_key_hashed  complex_key_cache.



 <key>. ,   . :



<key>
    <attribute>
        <name>field1</name>
        <type>String</type>
    </attribute>
    <attrbute>
        <name>field2</name>
        <type>UInt32</type>
    </attribute>
    ...
</key>




CREATE DICTIONARY ( field1 String, field2 String ... )
PRIMARY KEY field1, field2
...




<structure>
    ...
    <attribute>
        <name>Name</name>
        <type>ClickHouseDataType</type>
        <null_value></null_value>
        <expression>rand64()</expression>
        <hierarchical>true</hierarchical>
        <injective>true</injective>
        <is_object_id>true</is_object_id>
    </attribute>
</structure>




CREATE DICTIONARY somename (
    Name ClickHouseDataType DEFAULT '' EXPRESSION rand64() HIERARCHICAL INJECTIVE IS_OBJECT_ID
)


ClickHouse



ClickHouse — IP (v4) .



, : ip String country String. GeoIP2 MaxMind.



MaxMind .mmdb API .



ClickHouse .mmdb, – MaxMind CSV, .



IP , :



  • GeoIP2-Country-Blocks-IPv4.csv – IP ID ;
  • GeoIP2-Country-Locations-en.csv – .


, DDL:



CREATE DICTIONARY dicts.geoip_country_blocks_ipv4 (
    network String DEFAULT '',
    geoname_id UInt64 DEFAULT 0,
    registered_country_geoname_id UInt64 DEFAULT 0,
    represented_country_geoname_id UInt64 DEFAULT 0,
    is_anonymous_proxy UInt8 DEFAULT 0,
    is_satellite_provider UInt8 DEFAULT 0
) 
PRIMARY KEY network
SOURCE(FILE(
    path '/var/lib/clickhouse/user_files/GeoIP2-Country-Blocks-IPv4.csv'
    format 'CSVWithNames'
))
LAYOUT(IP_TRIE())
LIFETIME(300);


geoip_country_blocks_ipv4 :



  • network – IP , .
  • geoname_id – ID .


– CSV.



ClickHouse ID, ip_trie. IP .



geoip_country_locations_en:



CREATE DICTIONARY dicts.geoip_country_locations_en (
    geoname_id UInt64 DEFAULT 0,
    locale_code String DEFAULT '',
    continent_code String DEFAULT '',
    continent_name String DEFAULT '',
    country_iso_code String DEFAULT '',
    country_name String DEFAULT '',
    is_in_european_union UInt8 DEFAULT 0
)
PRIMARY KEY geoname_id
SOURCE(FILE(
    path '/var/lib/clickhouse/user_files/GeoIP2-Country-Locations-en.csv'
    format 'CSVWithNames'
))
LAYOUT(HASHED())
LIFETIME(300);


ID . GeoIP2-Country-Locations-en.csv :



  • geoname_id – ID , , .
  • country_name – .


hashed.



CSV .



, :



CREATE TEMPORARY TABLE user_visits (user_ip String, user_id UUID);
---
INSERT INTO user_visits VALUES
    ('178.248.237.68', generateUUIDv4()),
    ('82.192.95.175', generateUUIDv4());


. – dictGet*:



SELECT 
    dictGetString('dicts.geoip_city_locations_en', 'country_name', users_country_id) AS users_country, 
    uniqs
FROM (
    SELECT 
        dictGetUInt64('dicts.geoip_country_blocks_ipv4', 'geoname_id', tuple(IPv4StringToNum(user_ip))) AS users_country_id, 
        uniq(user_id) AS uniqs
    FROM user_visits
    GROUP BY users_country_id
);


:



┌─users_country─┬─uniqs─┐
│ Russia        │     1 │
│ Netherlands   │     1 │
└───────────────┴───────┘
2 rows in set. Elapsed: 0.003 sec.


:



  1. user_ip , ip_trie-: tuple(IPv4StringToNum(user_ip));
  2. , ID users_country_id: dictGetUInt64('geoip_country_blocks_ipv4', 'geoname_id', ...) as users_country_id;
  3. : uniq(user_id) as uniq_users;
  4. ID , : GROUP BY users_country_id;
  5. , ID , : dictGetString('geoip_city_locations_en', 'country_name', users_country_id) AS users_country.


. GeoIP2 , :)





. , ClickHouse .




All Articles