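Monitor system health

Watchdog monitors the health of vendor services and the VHAL service, and terminates any unhealthy process. When an unhealthy process is terminated, Watchdog dumps the process status to /data/anr, as it does for other Application Not Responding (ANR) dumps. The dump makes debugging easier.

Vendor service health monitoring

Vendor services are monitored on both the native side and the Java side. For a vendor service to be monitored, the service must register a health check process with Watchdog and specify a predefined timeout. Watchdog monitors a registered health check process by pinging it at an interval that depends on the timeout specified at registration. When a pinged process doesn't respond within the timeout, the process is considered unhealthy.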
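
The ping-and-timeout rule described above can be illustrated with a small, self-contained C++ sketch. This is not the Watchdog implementation: `PingRecord` and `isUnhealthy` are hypothetical names introduced only for illustration, and the timeout is whatever value the caller passes in.

```cpp
#include <chrono>
#include <optional>

using Millis = std::chrono::milliseconds;

// Hypothetical record of one ping sent to a registered client.
struct PingRecord {
    Millis sentAt;                      // when the ping was sent
    std::optional<Millis> respondedAt;  // empty if no response was received
};

// Returns true when the client did not answer the ping within `timeout`.
bool isUnhealthy(const PingRecord& ping, Millis now, Millis timeout) {
    if (ping.respondedAt.has_value()) {
        // A response arrived: unhealthy only if it arrived too late.
        return *ping.respondedAt - ping.sentAt > timeout;
    }
    // No response yet: unhealthy only once the timeout has elapsed.
    return now - ping.sentAt > timeout;
}
```

In this sketch, a client that answers after 1 second with a 5-second timeout is healthy, while a client that is still silent after the timeout elapses is reported as unhealthy.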

Native service health monitoring

Specify the Watchdog AIDL makefile

  1. Include carwatchdog_aidl_interface-ndk_platform in shared_libs.

    Android.bp

    cc_binary {
        name: "sample_native_client",
        srcs: [
            "src/*.cpp"
        ],
        shared_libs: [
            "carwatchdog_aidl_interface-ndk_platform",
            "libbinder_ndk",
        ],
        vendor: true,
    }
    

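Add an SELinux policy

  1. To add an SELinux policy, allow the vendor service domain to use binder (the binder_use macro) and add the vendor service domain to the carwatchdog client domain (the carwatchdog_client_domain macro). See the code below for sample_client.te and file_contexts: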

    sample_client.te

    type sample_client, domain;
    type sample_client_exec, exec_type, file_type, vendor_file_type;
    
    carwatchdog_client_domain(sample_client)
    
    init_daemon_domain(sample_client)
    binder_use(sample_client)
    

    file_contexts

    /vendor/bin/sample_native_client  u:object_r:sample_client_exec:s0
    

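Implement a client class by inheriting BnCarWatchdogClient

  1. In checkIfAlive, perform a health check. One option is to post the check to the thread loop handler. If the service is healthy, call ICarWatchdog::tellClientAlive. See the code below for SampleNativeClient.h and SampleNativeClient.cpp: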

    SampleNativeClient.h

    class SampleNativeClient : public BnCarWatchdogClient {
    public:
        ndk::ScopedAStatus checkIfAlive(int32_t sessionId, TimeoutLength
            timeout) override;
        ndk::ScopedAStatus prepareProcessTermination() override;
        void initialize();
    
    private:
        void respondToDaemon();
    private:
        ::android::sp<::android::Looper> mHandlerLooper;
        std::shared_ptr<ICarWatchdog> mWatchdogServer;
        std::shared_ptr<ICarWatchdogClient> mClient;
        int32_t mSessionId;
    };
    

    SampleNativeClient.cpp

    ndk::ScopedAStatus SampleNativeClient::checkIfAlive(int32_t sessionId, TimeoutLength timeout) {
        // Drop any pending check so that only the latest session is answered.
        mHandlerLooper->removeMessages(mMessageHandler,
            WHAT_CHECK_ALIVE);
        mSessionId = sessionId;
        mHandlerLooper->sendMessage(mMessageHandler,
            Message(WHAT_CHECK_ALIVE));
        return ndk::ScopedAStatus::ok();
    }
    // WHAT_CHECK_ALIVE triggers respondToDaemon from thread handler
    void SampleNativeClient::respondToDaemon() {
      // your health checking method here
      ndk::ScopedAStatus status = mWatchdogServer->tellClientAlive(mClient,
            mSessionId);
    }
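
The remove-then-send pattern in checkIfAlive means at most one health check is pending, and the response carries the most recent session ID. The following self-contained sketch models that behavior without Android dependencies. `PendingHealthCheck` is a hypothetical stand-in for the Looper message queue, not part of any Android API.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the Looper message queue used by the sample client.
class PendingHealthCheck {
public:
    // Mirrors checkIfAlive: any queued check is replaced by the new session.
    void post(int32_t sessionId) { mPending = sessionId; }

    // Mirrors the handler thread: take the queued session, if any, to answer it.
    std::optional<int32_t> take() {
        std::optional<int32_t> session = mPending;
        mPending.reset();
        return session;
    }

private:
    std::optional<int32_t> mPending;  // at most one pending session ID
};
```

If two pings arrive before the handler runs, only the second session ID is returned by `take()`, which matches the intent of clearing stale WHAT_CHECK_ALIVE messages before queuing a new one.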
    

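Start a binder thread and register the client

The car watchdog daemon interface name is android.automotive.watchdog.ICarWatchdog/default.

  1. Look up the daemon by this name and call ICarWatchdog::registerClient. See the code below for main.cpp and SampleNativeClient.cpp: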

    main.cpp

    int main(int argc, char** argv) {
        sp<Looper> looper(Looper::prepare(/*opts=*/0));
    
        ABinderProcess_setThreadPoolMaxThreadCount(1);
        ABinderProcess_startThreadPool();
        std::shared_ptr<SampleNativeClient> client =
            ndk::SharedRefBase::make<SampleNativeClient>(looper);
    
        // The client is registered in initialize()
        client->initialize();
        ...
    }
    

    SampleNativeClient.cpp

    void SampleNativeClient::initialize() {
        // Look up the car watchdog daemon by its interface name.
        ndk::SpAIBinder binder(AServiceManager_getService(
            "android.automotive.watchdog.ICarWatchdog/default"));
        std::shared_ptr<ICarWatchdog> server =
            ICarWatchdog::fromBinder(binder);
        mWatchdogServer = server;
        // Wrap this object as an ICarWatchdogClient and register it.
        ndk::SpAIBinder clientBinder = this->asBinder();
        std::shared_ptr<ICarWatchdogClient> client =
            ICarWatchdogClient::fromBinder(clientBinder);
        mClient = client;
        server->registerClient(client, TimeoutLength::TIMEOUT_NORMAL);
    }
    

Java service health monitoring

Implement a client by inheriting CarWatchdogClientCallback

  1. Edit the new file as follows:
    private final CarWatchdogClientCallback mClientCallback = new CarWatchdogClientCallback() {
        @Override
        public boolean onCheckHealthStatus(int sessionId, int timeout) {
            // Your health check logic here
            // Returning true implies the client is healthy
            // If false is returned, the client should call
            // CarWatchdogManager.tellClientAlive after health check is
            // completed
            return true;
        }
    
        @Override
        public void onPrepareProcessTermination() {}
    };
    

Register the client

  1. Call CarWatchdogManager.registerClient():
    private void startClient() {
        CarWatchdogManager manager =
            (CarWatchdogManager) car.getCarManager(
            Car.CAR_WATCHDOG_SERVICE);
        // Choose a proper executor according to your health check method
        ExecutorService executor = Executors.newFixedThreadPool(1);
        manager.registerClient(executor, mClientCallback,
            CarWatchdogManager.TIMEOUT_NORMAL);
    }
    

Unregister the client

  1. Call CarWatchdogManager.unregisterClient() when the service is finished:
    private void finishClient() {
        CarWatchdogManager manager =
            (CarWatchdogManager) car.getCarManager(
            Car.CAR_WATCHDOG_SERVICE);
        manager.unregisterClient(mClientCallback);
    }
    

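VHAL health monitoring

Unlike vendor service health monitoring, Watchdog monitors the VHAL service health by subscribing to the VHAL_HEARTBEAT vehicle property. Watchdog expects the value of this property to be updated once every N seconds. When the heartbeat isn't updated within this timeout, Watchdog terminates the VHAL service.

Note: Watchdog monitors the VHAL service health only when the VHAL service supports the VHAL_HEARTBEAT vehicle property.

The VHAL internal implementation can vary between vendors. Use the following code samples as references.

  1. Register the VHAL_HEARTBEAT vehicle property.

    When starting the VHAL service, register the VHAL_HEARTBEAT vehicle property. In the example below, an unordered_map that maps the property ID to its config holds all supported configs. The config for VHAL_HEARTBEAT is added to the map so that, when VHAL_HEARTBEAT is queried, the corresponding config is returned.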

    void registerVhalHeartbeatProperty() {
            const VehiclePropConfig config = {
                    .prop = toInt(VehicleProperty::VHAL_HEARTBEAT),
                    .access = VehiclePropertyAccess::READ,
                    .changeMode = VehiclePropertyChangeMode::ON_CHANGE,
            };
           // mConfigsById is declared as std::unordered_map<int32_t, VehiclePropConfig>.
           mConfigsById[config.prop] = config;
    }
    
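  2. Update the VHAL_HEARTBEAT vehicle property.

    Based on the VHAL health check frequency (explained in "Define the frequency of the VHAL health check"), update the VHAL_HEARTBEAT vehicle property once every N seconds. One way to do this is to use RecurrentTimer to call an action that checks the VHAL health and updates the VHAL_HEARTBEAT vehicle property within the timeout.

    Shown below is a sample implementation that uses RecurrentTimer: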

    int main(int argc, char** argv) {
            RecurrentTimer recurrentTimer(updateVhalHeartbeat);
            recurrentTimer.registerRecurrentEvent(kHeartBeatIntervalNs,
                                               static_cast<int32_t>(VehicleProperty::VHAL_HEARTBEAT));
            // … Run service …
            recurrentTimer.unregisterRecurrentEvent(
                    static_cast<int32_t>(VehicleProperty::VHAL_HEARTBEAT));
    }
    
    void updateVhalHeartbeat(const std::vector<int32_t>& cookies) {
           for (int32_t property : cookies) {
                  if (property != static_cast<int32_t>(VehicleProperty::VHAL_HEARTBEAT)) {
                         continue;
                  }
    
                  // Perform internal health checking such as retrieving a vehicle property to ensure
                  // the service is responsive.
                  doHealthCheck();
    
                  // Construct the VHAL_HEARTBEAT property with system uptime.
                  VehiclePropValuePool valuePool;
                  VehicleHal::VehiclePropValuePtr propValuePtr = valuePool.obtainInt64(uptimeMillis());
                  propValuePtr->prop = static_cast<int32_t>(VehicleProperty::VHAL_HEARTBEAT);
                  propValuePtr->areaId = 0;
                  propValuePtr->status = VehiclePropertyStatus::AVAILABLE;
                  propValuePtr->timestamp = elapsedRealtimeNano();
    
                  // Propagate the HAL event.
                  onHalEvent(std::move(propValuePtr));
           }
    }
    
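  3. (Optional) Define the frequency of the VHAL health check.

    Watchdog's ro.carwatchdog.vhal_healthcheck.interval read-only product property defines the VHAL health check frequency. The default health check frequency (when this property isn't defined) is three seconds. If three seconds isn't enough for the VHAL service to update the VHAL_HEARTBEAT vehicle property, define the VHAL health check frequency according to the service's responsiveness.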
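
To make the default-interval rule and the heartbeat timeout concrete, here is a minimal, self-contained C++ sketch. It is not Watchdog's code: `resolveHealthCheckInterval` and `isHeartbeatStale` are hypothetical helpers, the property value is assumed to be expressed in whole seconds, and falling back to the default for unparsable or non-positive values is an assumption of this sketch.

```cpp
#include <chrono>
#include <optional>
#include <stdexcept>
#include <string>

using Seconds = std::chrono::seconds;

// Default described in the text: three seconds when the property is undefined.
constexpr Seconds kDefaultVhalHealthCheckInterval{3};

// `rawValue` stands in for the value of
// ro.carwatchdog.vhal_healthcheck.interval; empty means the property is undefined.
Seconds resolveHealthCheckInterval(const std::optional<std::string>& rawValue) {
    if (!rawValue.has_value()) return kDefaultVhalHealthCheckInterval;
    try {
        int parsed = std::stoi(*rawValue);
        // Assumption: non-positive values are ignored in favor of the default.
        return parsed > 0 ? Seconds(parsed) : kDefaultVhalHealthCheckInterval;
    } catch (const std::exception&) {
        // Assumption: unparsable values are ignored in favor of the default.
        return kDefaultVhalHealthCheckInterval;
    }
}

// True when the last heartbeat update is older than the interval.
bool isHeartbeatStale(Seconds lastUpdate, Seconds now, Seconds interval) {
    return now - lastUpdate > interval;
}
```

With the default interval, a heartbeat last updated at t = 10 s is still fresh at t = 13 s and stale at t = 14 s. A slower VHAL service would need a larger interval to avoid being terminated.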

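Debug unhealthy processes terminated by Watchdog

Watchdog dumps the process state and terminates unhealthy processes. When terminating an unhealthy process, Watchdog logs the text carwatchdog killed <process name> (pid:<process id>) to logcat. This log line identifies the terminated process by its process name and process ID.

  1. Search logcat for this text by running the following command: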
    $ adb logcat -s CarServiceHelper | fgrep "carwatchdog killed"
    

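    For example, when the KitchenSink app is a registered Watchdog client and becomes unresponsive to Watchdog pings, Watchdog logs a line such as the following when terminating the registered KitchenSink process.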

    05-01 09:50:19.683   578  5777 W CarServiceHelper: carwatchdog killed com.google.android.car.kitchensink (pid: 5574)
    
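  2. To identify the root cause of the unresponsiveness, use the process dump stored in /data/anr, just as you would for an activity ANR. To retrieve the dump file for the terminated process, use the following commands.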
    $ adb root
    $ adb shell grep -Hn "pid process_pid" /data/anr/*
    

    The following sample output is specific to the KitchenSink app:

    $ adb shell su root grep -Hn "pid 5574" /data/anr/*
    
    /data/anr/anr_2020-05-01-09-50-18-290:3:----- pid 5574 at 2020-05-01 09:50:18 -----
    /data/anr/anr_2020-05-01-09-50-18-290:285:----- Waiting Channels: pid 5574 at 2020-05-01 09:50:18 -----
    

    The dump file for the terminated KitchenSink process is located at /data/anr/anr_2020-05-01-09-50-18-290. Start your analysis with the ANR dump file of the terminated process.
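
When many processes are being triaged, the two steps above can be tied together programmatically. The following self-contained C++ sketch extracts the process name and PID from a "carwatchdog killed" log line and builds the search string for grep. `KilledProcess`, `parseKilledLine`, and `anrGrepPattern` are hypothetical names, and the regular expression assumes the log format shown in the sample line above.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical holder for the fields of a "carwatchdog killed" log line.
struct KilledProcess {
    std::string name;
    int pid;
};

// Extracts the process name and PID from a logcat line, if it matches.
std::optional<KilledProcess> parseKilledLine(const std::string& line) {
    static const std::regex kPattern(
        R"(carwatchdog killed (\S+) \(pid:\s*(\d+)\))");
    std::smatch match;
    if (!std::regex_search(line, match, kPattern)) return std::nullopt;
    return KilledProcess{match[1].str(), std::stoi(match[2].str())};
}

// Builds the search string passed to grep against the files in /data/anr.
std::string anrGrepPattern(const KilledProcess& process) {
    return "pid " + std::to_string(process.pid);
}
```

For the KitchenSink log line shown earlier, this yields the process name com.google.android.car.kitchensink, the PID 5574, and the grep search string "pid 5574".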