Car Watchdog

Use the car watchdog to help debug the VHAL. The car watchdog monitors the health of registered processes and kills those that are unhealthy. For a process to be monitored, it must be registered with the car watchdog. When the car watchdog kills an unhealthy process, it writes the status of the process to /data/anr, as with other Application Not Responding (ANR) dumps. Doing so facilitates debugging.

This article describes how vendor HALs and services can register a process with the car watchdog.

Vendor HAL

Typically, the vendor HAL uses a thread pool for hwbinder. However, the car watchdog client communicates with the car watchdog daemon through binder, which differs from hwbinder. Therefore, the vendor HAL needs a second thread pool for binder.

Specify car watchdog aidl in makefile

  1. Include carwatchdog_aidl_interface-ndk_platform in shared_libs:

    Android.bp:

    cc_defaults {
        name: "vhal_v2_0_defaults",
        shared_libs: [
            "libbinder_ndk",
            "libhidlbase",
            "liblog",
            "libutils",
            "android.hardware.automotive.vehicle@2.0",
            "carwatchdog_aidl_interface-ndk_platform",
        ],
        cflags: [
            "-Wall",
            "-Wextra",
            "-Werror",
        ],
    }
    

Add an SELinux policy

  1. Allow system_server to kill your HAL. If you don't have system_server.te, create one. It is strongly recommended you add an SELinux policy to each device.
  2. Allow the vendor HAL to use binder (binder_use macro) and add the vendor HAL to the carwatchdog client domain (carwatchdog_client_domain macro). See the code below for system_server.te and hal_vehicle_default.te:

    system_server.te

    # Allow system_server to kill vehicle HAL
    allow system_server hal_vehicle_server:process sigkill;
    

    hal_vehicle_default.te

    # Configuration for registering VHAL with car watchdog
    carwatchdog_client_domain(hal_vehicle_default)
    binder_use(hal_vehicle_default)
    

Implement a client class by inheriting BnCarWatchdogClient

  1. In checkIfAlive, perform health checking. For example, post to the thread loop handler. If healthy, call ICarWatchdog::tellClientAlive. See the code below for WatchdogClient.h and WatchdogClient.cpp:

    WatchdogClient.h

    class WatchdogClient : public aidl::android::automotive::watchdog::BnCarWatchdogClient {
      public:
        explicit WatchdogClient(const ::android::sp<::android::Looper>& handlerLooper, VehicleHalManager* vhalManager);
    
        ndk::ScopedAStatus checkIfAlive(int32_t sessionId, aidl::android::automotive::watchdog::TimeoutLength timeout) override;
        ndk::ScopedAStatus prepareProcessTermination() override;
    };

    WatchdogClient.cpp

    ndk::ScopedAStatus WatchdogClient::checkIfAlive(int32_t sessionId, TimeoutLength /*timeout*/) {
        // Implement or call your health check logic here
        return ndk::ScopedAStatus::ok();
    }
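
The checkIfAlive stub above returns without replying to the daemon. As a minimal sketch of the "if healthy, call tellClientAlive" step, the following uses stand-in types so that it compiles outside Android. FakeWatchdog, HealthCheckedClient, and the isHealthy callback are hypothetical names introduced for illustration; real code would call ICarWatchdog::tellClientAlive over binder and return ndk::ScopedAStatus.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Stand-in for the car watchdog daemon (hypothetical; real clients talk to
// ICarWatchdog over binder). It only records which sessions were answered.
struct FakeWatchdog {
    std::vector<int32_t> aliveSessions;
    void tellClientAlive(int32_t sessionId) { aliveSessions.push_back(sessionId); }
};

// Stand-in for a client that answers a ping only when its health check passes.
class HealthCheckedClient {
  public:
    HealthCheckedClient(FakeWatchdog& watchdog, std::function<bool()> isHealthy)
        : mWatchdog(watchdog), mIsHealthy(std::move(isHealthy)) {
        assert(mIsHealthy && "a health check callback is required");
    }

    // Mirrors the shape of checkIfAlive: reply for this session if healthy,
    // otherwise send nothing, so the daemon never hears back for this session.
    void checkIfAlive(int32_t sessionId) {
        if (mIsHealthy()) {
            mWatchdog.tellClientAlive(sessionId);
        }
    }

  private:
    FakeWatchdog& mWatchdog;
    std::function<bool()> mIsHealthy;
};
```

The session ID passed to checkIfAlive is echoed back in the reply, which is how the reply is tied to a specific ping in this sketch.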
    

Start the binder thread and register the client

  1. Create a thread pool for binder communication. If the vendor HAL uses hwbinder for its own purposes, you must create another thread pool for car watchdog binder communication.
  2. Look up the daemon by name and call ICarWatchdog::registerClient. The car watchdog daemon interface name is android.automotive.watchdog.ICarWatchdog/default.
  3. Based on service responsiveness, select one of the following three timeout types supported by the car watchdog, then pass the timeout in the call to ICarWatchdog::registerClient:
    • critical (3s)
    • moderate (5s)
    • normal (10s)
    See the code below for VehicleService.cpp and WatchdogClient.cpp:

    VehicleService.cpp

    int main(int /* argc */, char* /* argv */ []) {
        // Set up thread pool for hwbinder
        configureRpcThreadpool(4, false /* callerWillJoin */);
    
        ALOGI("Registering as service...");
        status_t status = service->registerAsService();
    
        if (status != OK) {
            ALOGE("Unable to register vehicle service (%d)", status);
            return 1;
        }
    
        // Setup a binder thread pool to be a car watchdog client.
        ABinderProcess_setThreadPoolMaxThreadCount(1);
        ABinderProcess_startThreadPool();
        sp<Looper> looper(Looper::prepare(0 /* opts */));
        std::shared_ptr<WatchdogClient> watchdogClient =
                ndk::SharedRefBase::make<WatchdogClient>(looper, service.get());
        // The current health check is done in the main thread, so it falls short of capturing the real
        // situation. Checking through HAL binder thread should be considered.
        if (!watchdogClient->initialize()) {
            ALOGE("Failed to initialize car watchdog client");
            return 1;
        }
        ALOGI("Ready");
        while (true) {
            looper->pollAll(-1 /* timeoutMillis */);
        }
    
        return 1;
    }
    

    WatchdogClient.cpp

    bool WatchdogClient::initialize() {
        ndk::SpAIBinder binder(AServiceManager_getService("android.automotive.watchdog.ICarWatchdog/default"));
        if (binder.get() == nullptr) {
            ALOGE("Failed to get carwatchdog daemon");
            return false;
        }
        std::shared_ptr<ICarWatchdog> server = ICarWatchdog::fromBinder(binder);
        if (server == nullptr) {
            ALOGE("Failed to connect to carwatchdog daemon");
            return false;
        }
        mWatchdogServer = server;
    
        binder = this->asBinder();
        if (binder.get() == nullptr) {
            ALOGE("Failed to get car watchdog client binder object");
            return false;
        }
        std::shared_ptr<ICarWatchdogClient> client = ICarWatchdogClient::fromBinder(binder);
        if (client == nullptr) {
            ALOGE("Failed to get ICarWatchdogClient from binder");
            return false;
        }
        mTestClient = client;
        mWatchdogServer->registerClient(client, TimeoutLength::TIMEOUT_NORMAL);
        ALOGI("Successfully registered the client to car watchdog server");
        return true;
    }
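
The registration code above passes TimeoutLength::TIMEOUT_NORMAL. To make the choice among the three timeout types concrete, here is a small sketch that maps each type to the duration listed earlier and picks the tightest one that still covers a service's slowest expected reply. The enum is a local stand-in, the names TIMEOUT_CRITICAL and TIMEOUT_MODERATE are assumed by analogy with TIMEOUT_NORMAL, and pickTimeout is a hypothetical helper that shows only one possible selection policy.

```cpp
#include <cassert>
#include <chrono>
#include <initializer_list>

// Local stand-in for the AIDL TimeoutLength enum (names other than
// TIMEOUT_NORMAL are assumptions, not taken from the samples above).
enum class TimeoutLength { TIMEOUT_CRITICAL, TIMEOUT_MODERATE, TIMEOUT_NORMAL };

// Durations from the list above: critical 3s, moderate 5s, normal 10s.
std::chrono::seconds timeoutFor(TimeoutLength length) {
    switch (length) {
        case TimeoutLength::TIMEOUT_CRITICAL:
            return std::chrono::seconds(3);
        case TimeoutLength::TIMEOUT_MODERATE:
            return std::chrono::seconds(5);
        case TimeoutLength::TIMEOUT_NORMAL:
            return std::chrono::seconds(10);
    }
    return std::chrono::seconds(10);
}

// Picks the tightest timeout that is strictly longer than the slowest
// expected reply. Falls back to TIMEOUT_NORMAL, the longest supported type,
// when no type is long enough.
TimeoutLength pickTimeout(std::chrono::milliseconds worstCaseReply) {
    assert(worstCaseReply >= std::chrono::milliseconds::zero());
    for (TimeoutLength length : {TimeoutLength::TIMEOUT_CRITICAL,
                                 TimeoutLength::TIMEOUT_MODERATE,
                                 TimeoutLength::TIMEOUT_NORMAL}) {
        if (worstCaseReply < timeoutFor(length)) {
            return length;
        }
    }
    return TimeoutLength::TIMEOUT_NORMAL;
}
```

A service whose health check can take up to four seconds would get TIMEOUT_MODERATE under this policy. A service that cannot reply within ten seconds falls back to TIMEOUT_NORMAL and would likely need a faster health check rather than a longer timeout.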
    

Vendor Services (Native)

Specify the car watchdog aidl in makefile

  1. Include carwatchdog_aidl_interface-ndk_platform in shared_libs.

    Android.bp

    cc_binary {
        name: "sample_native_client",
        srcs: [
            "src/*.cpp"
        ],
        shared_libs: [
            "carwatchdog_aidl_interface-ndk_platform",
            "libbinder_ndk",
        ],
        vendor: true,
    }
    

Add an SELinux policy

  1. To add an SELinux policy, allow the vendor service domain to use binder (binder_use macro) and add the vendor service domain to the carwatchdog client domain (carwatchdog_client_domain macro). See the code below for sample_client.te and file_contexts:

    sample_client.te

    type sample_client, domain;
    type sample_client_exec, exec_type, file_type, vendor_file_type;
    
    carwatchdog_client_domain(sample_client)
    
    init_daemon_domain(sample_client)
    binder_use(sample_client)
    

    file_contexts

    /vendor/bin/sample_native_client  u:object_r:sample_client_exec:s0
    

Implement a client class by inheriting BnCarWatchdogClient

  1. In checkIfAlive, perform a health check. One option is to post to the thread loop handler. If healthy, call ICarWatchdog::tellClientAlive. See the code below for SampleNativeClient.h and SampleNativeClient.cpp:

    SampleNativeClient.h

    class SampleNativeClient : public BnCarWatchdogClient {
    public:
        ndk::ScopedAStatus checkIfAlive(int32_t sessionId, TimeoutLength
            timeout) override;
        ndk::ScopedAStatus prepareProcessTermination() override;
        void initialize();
    
    private:
        void respondToDaemon();
    private:
        ::android::sp<::android::Looper> mHandlerLooper;
        std::shared_ptr<ICarWatchdog> mWatchdogServer;
        std::shared_ptr<ICarWatchdogClient> mClient;
        int32_t mSessionId;
    };
    

    SampleNativeClient.cpp

    ndk::ScopedAStatus SampleNativeClient::checkIfAlive(int32_t sessionId, TimeoutLength timeout) {
        mHandlerLooper->removeMessages(mMessageHandler,
            WHAT_CHECK_ALIVE);
        mSessionId = sessionId;
        mHandlerLooper->sendMessage(mMessageHandler,
            Message(WHAT_CHECK_ALIVE));
        return ndk::ScopedAStatus::ok();
    }
    // WHAT_CHECK_ALIVE triggers respondToDaemon from thread handler
    void SampleNativeClient::respondToDaemon() {
      // your health checking method here
      ndk::ScopedAStatus status = mWatchdogServer->tellClientAlive(mClient,
            mSessionId);
    }
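
The sample above answers the daemon from the handler thread rather than from the binder call itself, so a reply shows that the handler loop is still running. The following single-threaded sketch models that flow without Android dependencies. MessageQueueSketch, ReplyLog, and LoopClientSketch are hypothetical stand-ins for Looper, ICarWatchdog, and the client class; draining the queue by hand plays the role of the handler thread.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Minimal stand-in for a Looper-style message queue (hypothetical).
class MessageQueueSketch {
  public:
    void sendMessage(int what) { mPending.push_back(what); }

    // Drops every pending message of the given type.
    void removeMessages(int what) {
        std::deque<int> kept;
        for (int message : mPending) {
            if (message != what) kept.push_back(message);
        }
        mPending.swap(kept);
    }

    // Pops the oldest message into `what`; returns false if the queue is empty.
    bool pollOnce(int& what) {
        if (mPending.empty()) return false;
        what = mPending.front();
        mPending.pop_front();
        return true;
    }

  private:
    std::deque<int> mPending;
};

constexpr int WHAT_CHECK_ALIVE = 1;

// Records replies in place of ICarWatchdog::tellClientAlive.
struct ReplyLog {
    std::vector<int32_t> sessions;
};

class LoopClientSketch {
  public:
    LoopClientSketch(MessageQueueSketch& queue, ReplyLog& log) : mQueue(queue), mLog(log) {}

    // Models the binder-side call: queue the work and return immediately.
    void checkIfAlive(int32_t sessionId) {
        mQueue.removeMessages(WHAT_CHECK_ALIVE);  // drop a stale, unanswered ping
        mSessionId = sessionId;
        mQueue.sendMessage(WHAT_CHECK_ALIVE);
    }

    // Models the handler thread: replies only when the loop actually runs.
    // Returns true if a message was handled.
    bool handleOneMessage() {
        int what = 0;
        if (!mQueue.pollOnce(what)) return false;
        assert(what == WHAT_CHECK_ALIVE);  // the only message type in this sketch
        mLog.sessions.push_back(mSessionId);
        return true;
    }

  private:
    MessageQueueSketch& mQueue;
    ReplyLog& mLog;
    int32_t mSessionId = -1;
};
```

If handleOneMessage is never called, which models a stuck handler thread, no reply is recorded. That silence is what lets the car watchdog treat the process as unresponsive.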
    

Start a binder thread and register the client

The car watchdog daemon interface name is android.automotive.watchdog.ICarWatchdog/default.

  1. Look up the daemon by name and call ICarWatchdog::registerClient. See the code below for main.cpp and SampleNativeClient.cpp:

    main.cpp

    int main(int argc, char** argv) {
        sp<Looper> looper(Looper::prepare(/*opts=*/0));
    
        ABinderProcess_setThreadPoolMaxThreadCount(1);
        ABinderProcess_startThreadPool();
        std::shared_ptr<SampleNativeClient> client =
            ndk::SharedRefBase::make<SampleNativeClient>(looper);
    
        // The client is registered in initialize()
        client->initialize();
        ...
    }
    

    SampleNativeClient.cpp

    void SampleNativeClient::initialize() {
        ndk::SpAIBinder binder(AServiceManager_getService(
            "android.automotive.watchdog.ICarWatchdog/default"));
        std::shared_ptr<ICarWatchdog> server =
            ICarWatchdog::fromBinder(binder);
        mWatchdogServer = server;
        // Reuse the existing binder variable for the client's own binder object
        binder = this->asBinder();
        std::shared_ptr<ICarWatchdogClient> client =
            ICarWatchdogClient::fromBinder(binder);
        mClient = client;
        server->registerClient(client, TimeoutLength::TIMEOUT_NORMAL);
    }
    

Vendor Services (Android)

Implement a client by inheriting CarWatchdogClientCallback

  1. Edit the new file as follows:
    private final CarWatchdogClientCallback mClientCallback = new CarWatchdogClientCallback() {
        @Override
        public boolean onCheckHealthStatus(int sessionId, int timeout) {
            // Your health check logic here.
            // Returning true implies the client is healthy.
            // If false is returned, the client should call
            // CarWatchdogManager.tellClientAlive after the health check is
            // completed.
            return true;
        }
    
        @Override
        public void onPrepareProcessTermination() {}
    };
    

Register the client

  1. Call CarWatchdogManager.registerClient():
    private void startClient() {
        CarWatchdogManager manager =
            (CarWatchdogManager) car.getCarManager(
            Car.CAR_WATCHDOG_SERVICE);
        // Choose a proper executor according to your health check method
        ExecutorService executor = Executors.newFixedThreadPool(1);
        manager.registerClient(executor, mClientCallback,
            CarWatchdogManager.TIMEOUT_NORMAL);
    }
    

Unregister the client

  1. Call CarWatchdogManager.unregisterClient() when the service is finished:
    private void finishClient() {
        CarWatchdogManager manager =
            (CarWatchdogManager) car.getCarManager(
            Car.CAR_WATCHDOG_SERVICE);
        manager.unregisterClient(mClientCallback);
    }
    

Detect processes terminated by car watchdog

Car watchdog dumps and kills processes (vendor HAL, vendor native services, vendor Android services) that are registered with the car watchdog when they become stuck and unresponsive. To detect such terminations, check logcat. The car watchdog logs a line of the form carwatchdog killed process_name (pid:process_id) when a problematic process is dumped or killed. To capture the relevant logs, run:

$ adb logcat -s CarServiceHelper | fgrep "carwatchdog killed"

For example, if the KitchenSink app (a car watchdog client) becomes stuck, a line such as the following is written to the log:

05-01 09:50:19.683   578  5777 W CarServiceHelper: carwatchdog killed com.google.android.car.kitchensink (pid: 5574)
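
When these log lines are collected by a tool rather than read by hand, the process name and PID can be extracted with a regular expression. The sketch below assumes only the line format shown above. KilledProcess and parseKilledLine are hypothetical names, and the pattern accepts the PID with or without a space after the colon, since both forms appear in this article.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Result of parsing one "carwatchdog killed" log line.
struct KilledProcess {
    std::string name;
    int pid = -1;
};

// Extracts the process name and PID from a "carwatchdog killed" log line.
// Accepts both "(pid:123)" and "(pid: 123)". Returns false if the line does
// not match. Assumes the PID fits in an int.
bool parseKilledLine(const std::string& line, KilledProcess& out) {
    static const std::regex kPattern(
            R"re(carwatchdog killed (\S+) \(pid:\s*(\d+)\))re");
    std::smatch match;
    if (!std::regex_search(line, match, kPattern)) {
        return false;
    }
    assert(match.size() == 3);  // whole match plus two capture groups
    out.name = match[1].str();
    out.pid = std::stoi(match[2].str());
    return true;
}
```

Applied to the KitchenSink line above, this yields the name com.google.android.car.kitchensink and the PID 5574, which is the value to search for in /data/anr.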

To determine why or where the KitchenSink app became stuck, use the process dump stored at /data/anr, just as you would for Activity ANR cases. Replace process_pid in the command below with the PID reported in the log line:

$ adb root
$ adb shell grep -Hn "pid process_pid" /data/anr/*

The following sample output is specific to the KitchenSink app:

$ adb shell su root grep -Hn "pid 5574" /data/anr/*
/data/anr/anr_2020-05-01-09-50-18-290:3:----- pid 5574 at 2020-05-01 09:50:18 -----
/data/anr/anr_2020-05-01-09-50-18-290:285:----- Waiting Channels: pid 5574 at 2020-05-01 09:50:18 -----

Find the dump file (for example, /data/anr/anr_2020-05-01-09-50-18-290 in the example above) and start your analysis.
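
If the grep output is post-processed by a script or tool, the dump file names can be pulled out of it mechanically. This sketch assumes the path:line_number:text layout that grep -Hn produces and assumes that dump paths contain no colon, as in the sample output. dumpFilesFromGrep is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Collects the distinct file paths from `grep -Hn` output lines of the form
// "path:line_number:matched text". Lines without that shape are skipped.
std::set<std::string> dumpFilesFromGrep(const std::vector<std::string>& grepLines) {
    std::set<std::string> files;
    for (const std::string& line : grepLines) {
        const std::size_t firstColon = line.find(':');
        if (firstColon == std::string::npos || firstColon == 0) continue;
        const std::size_t secondColon = line.find(':', firstColon + 1);
        if (secondColon == std::string::npos) continue;
        assert(secondColon > firstColon);
        // The field between the two colons must be a line number.
        const std::string number =
                line.substr(firstColon + 1, secondColon - firstColon - 1);
        if (number.empty() ||
            number.find_first_not_of("0123456789") != std::string::npos) {
            continue;
        }
        files.insert(line.substr(0, firstColon));
    }
    return files;
}
```

For the two sample output lines above, the result is a single entry, /data/anr/anr_2020-05-01-09-50-18-290, because both matches come from the same dump file.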