Monitor system health

Watchdog monitors the health of vendor services and the VHAL service, and terminates any unhealthy process. When an unhealthy process is terminated, the Watchdog dumps the process status to /data/anr as with other Application Not Responding (ANR) dumps. Doing so facilitates the debugging process.

Vendor service health monitoring

Vendor services are monitored on both the native and Java sides. For a vendor service to be monitored, the service must register a health checking process with the Watchdog and specify a predefined timeout. Watchdog monitors the health of a registered health checking process by pinging it at an interval relative to the timeout specified during registration. When a pinged process doesn't respond within the timeout, the process is considered unhealthy.
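The timeout rule above can be illustrated with a small, self-contained sketch. It is not the Watchdog implementation: PingRecord, PingResult, and evaluatePing are hypothetical names chosen for this example, and the sketch only models how a single ping could be classified against the timeout given at registration.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical record of one ping. These names are illustrative only and
// are not part of the car watchdog API.
struct PingRecord {
    Clock::time_point sentAt;       // When the ping was sent.
    bool responded;                 // Whether the client has answered.
    Clock::time_point respondedAt;  // Meaningful only when responded is true.
};

enum class PingResult { kPending, kHealthy, kUnhealthy };

// Classifies a ping against the timeout specified at registration.
PingResult evaluatePing(const PingRecord& ping, Clock::time_point now,
                        std::chrono::milliseconds timeout) {
    const Clock::time_point deadline = ping.sentAt + timeout;
    if (ping.responded) {
        // A response counts only if it arrived by the deadline.
        return ping.respondedAt <= deadline ? PingResult::kHealthy
                                            : PingResult::kUnhealthy;
    }
    // No response yet: still pending until the deadline passes.
    return now < deadline ? PingResult::kPending : PingResult::kUnhealthy;
}
```

In this sketch, a client that never answers stays pending until the deadline and is treated as unhealthy from then on, which mirrors the behavior described above.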

Native service health monitoring

Specify the Watchdog AIDL makefile

  1. Include carwatchdog_aidl_interface-ndk_platform in shared_libs.

    Android.bp

    cc_binary {
        name: "sample_native_client",
        srcs: [
            "src/*.cpp"
        ],
        shared_libs: [
            "carwatchdog_aidl_interface-ndk_platform",
            "libbinder_ndk",
        ],
        vendor: true,
    }

Add an SELinux policy

  1. To add an SELinux policy, allow the vendor service domain to use binder (binder_use macro) and add the vendor service domain to the carwatchdog client domain (carwatchdog_client_domain macro). See the code below for sample_client.te and file_contexts:

    sample_client.te

    type sample_client, domain;
    type sample_client_exec, exec_type, file_type, vendor_file_type;
    
    carwatchdog_client_domain(sample_client)
    
    init_daemon_domain(sample_client)
    binder_use(sample_client)

    file_contexts

    /vendor/bin/sample_native_client  u:object_r:sample_client_exec:s0

Implement a client class by inheriting BnCarWatchdogClient

  1. In checkIfAlive, perform a health check. One option is to post to the thread loop handler. If healthy, call ICarWatchdog::tellClientAlive. See the code below for SampleNativeClient.h and SampleNativeClient.cpp:

    SampleNativeClient.h

    class SampleNativeClient : public BnCarWatchdogClient {
    public:
        ndk::ScopedAStatus checkIfAlive(int32_t sessionId, TimeoutLength
            timeout) override;
        ndk::ScopedAStatus prepareProcessTermination() override;
        void initialize();
    
    private:
        void respondToDaemon();

        ::android::sp<::android::Looper> mHandlerLooper;
        std::shared_ptr<ICarWatchdog> mWatchdogServer;
        std::shared_ptr<ICarWatchdogClient> mClient;
        int32_t mSessionId;
    };

    SampleNativeClient.cpp

    ndk::ScopedAStatus SampleNativeClient::checkIfAlive(int32_t sessionId, TimeoutLength timeout) {
        // Drop any pending check so that only the latest session is answered.
        mHandlerLooper->removeMessages(mMessageHandler,
            WHAT_CHECK_ALIVE);
        mSessionId = sessionId;
        mHandlerLooper->sendMessage(mMessageHandler,
            Message(WHAT_CHECK_ALIVE));
        return ndk::ScopedAStatus::ok();
    }
    // WHAT_CHECK_ALIVE triggers respondToDaemon from thread handler
    void SampleNativeClient::respondToDaemon() {
        // your health checking method here
        ndk::ScopedAStatus status = mWatchdogServer->tellClientAlive(mClient,
            mSessionId);
    }

Start a binder thread and register the client

The car watchdog daemon interface name is android.automotive.watchdog.ICarWatchdog/default.

  1. Search for the daemon with the name and call ICarWatchdog::registerClient. See the code below for main.cpp and SampleNativeClient.cpp:

    main.cpp

    int main(int argc, char** argv) {
        sp<Looper> looper(Looper::prepare(/*opts=*/0));
    
        ABinderProcess_setThreadPoolMaxThreadCount(1);
        ABinderProcess_startThreadPool();
        std::shared_ptr<SampleNativeClient> client =
            ndk::SharedRefBase::make<SampleNativeClient>(looper);
    
        // The client is registered in initialize()
        client->initialize();
        ...
    }

    SampleNativeClient.cpp

    void SampleNativeClient::initialize() {
        // Look up the car watchdog daemon by its interface name.
        ndk::SpAIBinder serverBinder(AServiceManager_getService(
            "android.automotive.watchdog.ICarWatchdog/default"));
        std::shared_ptr<ICarWatchdog> server =
            ICarWatchdog::fromBinder(serverBinder);
        mWatchdogServer = server;
        // Register this object as the health check client.
        ndk::SpAIBinder clientBinder = this->asBinder();
        std::shared_ptr<ICarWatchdogClient> client =
            ICarWatchdogClient::fromBinder(clientBinder);
        mClient = client;
        server->registerClient(client, TimeoutLength::TIMEOUT_NORMAL);
    }

Java service health monitoring

Implement a client by inheriting CarWatchdogClientCallback

  1. Implement the callback as follows:
    private final CarWatchdogClientCallback mClientCallback = new CarWatchdogClientCallback() {
        @Override
        public boolean onCheckHealthStatus(int sessionId, int timeout) {
            // Your health check logic here.
            // Return true if the client is healthy. If you return false,
            // the client must call CarWatchdogManager.tellClientAlive
            // after the health check completes.
            return true;
        }
    
        @Override
        public void onPrepareProcessTermination() {}
    };

Register the client

  1. Call CarWatchdogManager.registerClient():
    private void startClient() {
        CarWatchdogManager manager =
            (CarWatchdogManager) car.getCarManager(
            Car.CAR_WATCHDOG_SERVICE);
        // Choose a proper executor according to your health check method
        ExecutorService executor = Executors.newFixedThreadPool(1);
        manager.registerClient(executor, mClientCallback,
            CarWatchdogManager.TIMEOUT_NORMAL);
    }

Unregister the client

  1. Call CarWatchdogManager.unregisterClient() when the service is finished:
    private void finishClient() {
        CarWatchdogManager manager =
            (CarWatchdogManager) car.getCarManager(
            Car.CAR_WATCHDOG_SERVICE);
        manager.unregisterClient(mClientCallback);
    }

VHAL health monitoring

Unlike vendor service health monitoring, Watchdog monitors the VHAL service health by subscribing to the VHAL_HEARTBEAT vehicle property. Watchdog expects the value of this property to be updated once every N seconds. When the heartbeat is not updated within this timeout, Watchdog terminates the VHAL service.

Note: Watchdog monitors the VHAL service health only when the VHAL_HEARTBEAT vehicle property is supported by the VHAL service.
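As a rough illustration of the heartbeat rule, the following self-contained sketch tracks the time of the last heartbeat update and reports when it is overdue. HeartbeatMonitor is a hypothetical helper written for this example and is not the Watchdog's actual code; the three-second default is taken from the default health check frequency described later in this section.

```cpp
#include <chrono>

// Hypothetical helper that models the heartbeat timeout. It is not part of
// the Watchdog or VHAL APIs.
class HeartbeatMonitor {
public:
    using Clock = std::chrono::steady_clock;

    HeartbeatMonitor(Clock::time_point start,
                     std::chrono::seconds interval = std::chrono::seconds(3))
        : mLastUpdate(start), mInterval(interval) {}

    // Call whenever a new VHAL_HEARTBEAT value is received.
    void onHeartbeat(Clock::time_point now) { mLastUpdate = now; }

    // Returns true when no heartbeat has arrived within the interval.
    bool isStale(Clock::time_point now) const {
        return now - mLastUpdate > mInterval;
    }

private:
    Clock::time_point mLastUpdate;
    std::chrono::seconds mInterval;
};
```

A caller would invoke onHeartbeat from its property change callback and poll isStale on a timer. What happens once the heartbeat is stale is left out of the sketch.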

VHAL internal implementation can vary by vendor. Use the following code samples as references.

  1. Register the VHAL_HEARTBEAT vehicle property.

    When starting the VHAL service, register the VHAL_HEARTBEAT vehicle property. In the example below, an unordered_map that maps property IDs to configs holds all supported configs. The config for VHAL_HEARTBEAT is added to the map so that the corresponding config is returned when VHAL_HEARTBEAT is queried.

    void registerVhalHeartbeatProperty() {
        const VehiclePropConfig config = {
                .prop = toInt(VehicleProperty::VHAL_HEARTBEAT),
                .access = VehiclePropertyAccess::READ,
                .changeMode = VehiclePropertyChangeMode::ON_CHANGE,
        };
        // mConfigsById is declared as std::unordered_map<int32_t, VehiclePropConfig>.
        mConfigsById[config.prop] = config;
    }
  2. Update VHAL_HEARTBEAT vehicle property.

    Based on the VHAL health check frequency (explained in "Define the frequency of VHAL health check"), update the VHAL_HEARTBEAT vehicle property once every N seconds. One way to do this is to use RecurrentTimer to call an action that checks the VHAL health and updates the VHAL_HEARTBEAT vehicle property within the timeout.

    Shown below is a sample implementation using RecurrentTimer:

    int main(int argc, char** argv) {
            RecurrentTimer recurrentTimer(updateVhalHeartbeat);
            recurrentTimer.registerRecurrentEvent(kHeartBeatIntervalNs,
                    static_cast<int32_t>(VehicleProperty::VHAL_HEARTBEAT));
            // ... Run service ...
            recurrentTimer.unregisterRecurrentEvent(
                    static_cast<int32_t>(VehicleProperty::VHAL_HEARTBEAT));
    }
    
    void updateVhalHeartbeat(const std::vector<int32_t>& cookies) {
           for (int32_t property : cookies) {
                  if (property != static_cast<int32_t>(VehicleProperty::VHAL_HEARTBEAT)) {
                         continue;
                  }
    
                  // Perform internal health checking such as retrieving a vehicle property to ensure
                  // the service is responsive.
                  doHealthCheck();
    
                  // Construct the VHAL_HEARTBEAT property with system uptime.
                  VehiclePropValuePool valuePool;
                  VehicleHal::VehiclePropValuePtr propValuePtr = valuePool.obtainInt64(uptimeMillis());
                  propValuePtr->prop = static_cast<int32_t>(VehicleProperty::VHAL_HEARTBEAT);
                  propValuePtr->areaId = 0;
                  propValuePtr->status = VehiclePropertyStatus::AVAILABLE;
                  propValuePtr->timestamp = elapsedRealtimeNano();
    
                  // Propagate the HAL event.
                  onHalEvent(std::move(propValuePtr));
           }
    }
  3. (Optional) Define the frequency of VHAL health check.

    Watchdog's ro.carwatchdog.vhal_healthcheck.interval read-only product property defines the VHAL health check frequency. When this property isn't defined, the default health check frequency is three seconds. If three seconds isn't sufficient for the VHAL service to update the VHAL_HEARTBEAT vehicle property, define the VHAL health check frequency according to the service's responsiveness.

Debug unhealthy processes terminated by the Watchdog

Watchdog dumps the process state and terminates unhealthy processes. When terminating an unhealthy process, Watchdog logs the text carwatchdog killed <process name> (pid: <process id>) to logcat. This log line identifies the terminated process by its name and process ID.

  1. Search logcat for this text by running:
    $ adb logcat -s CarServiceHelper | fgrep "carwatchdog killed"

    For example, when the KitchenSink app is a registered Watchdog client and becomes unresponsive to Watchdog pings, Watchdog logs a line such as the following when terminating the registered KitchenSink process.

    05-01 09:50:19.683   578  5777 W CarServiceHelper: carwatchdog killed com.google.android.car.kitchensink (pid: 5574)
  2. To identify the root cause of the unresponsiveness, use the process dump stored at /data/anr just as you would for activity ANR cases. To retrieve the dump file for the terminated process, use the following commands.
    $ adb root
    $ adb shell grep -Hn "pid process_pid" /data/anr/*

    The following sample output is specific to the KitchenSink app:

    $ adb shell su root grep -Hn "pid 5574" /data/anr/*
    /data/anr/anr_2020-05-01-09-50-18-290:3:----- pid 5574 at 2020-05-01 09:50:18 -----
    /data/anr/anr_2020-05-01-09-50-18-290:285:----- Waiting Channels: pid 5574 at 2020-05-01 09:50:18 -----

    The dump file for the terminated KitchenSink process is located at /data/anr/anr_2020-05-01-09-50-18-290. Start your analysis using the terminated process's ANR dump file.
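When triaging many log files, the two lookups above can be scripted. The following is a minimal sketch, assuming the log line and the grep -Hn output have the same shape as the samples shown in this section. parseKilledLine and dumpFileFromGrepLine are hypothetical helper names and are not part of adb or Watchdog.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers for log triage. They are not part of adb or Watchdog.

// Extracts the process name and PID from a "carwatchdog killed" logcat line.
// Returns false when the line doesn't contain the expected text.
bool parseKilledLine(const std::string& line, std::string* name, int* pid) {
    static const std::regex kPattern(
            R"re(carwatchdog killed (\S+) \(pid:\s*(\d+)\))re");
    std::smatch match;
    if (!std::regex_search(line, match, kPattern)) {
        return false;
    }
    *name = match[1].str();
    *pid = std::stoi(match[2].str());
    return true;
}

// Returns the file path from one line of `grep -Hn` output, which has the
// form <path>:<line number>:<matched text>. Assumes the path contains no
// colon. If the line has no colon at all, the whole line is returned.
std::string dumpFileFromGrepLine(const std::string& grepLine) {
    return grepLine.substr(0, grepLine.find(':'));
}
```

Feeding the sample KitchenSink log line to parseKilledLine yields the package name and PID to search for, and passing a matching grep output line to dumpFileFromGrepLine yields the path of the ANR dump file to open.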