/var/log/andrey

NFS root file system on Fedora

Here are a few notes on making your Fedora workstation serve a root file system for some embedded Linux target. There are a couple of additional steps when compared to Ubuntu, namely telling the firewall to allow NFS traffic and re-enabling NFSv2 support.

First, install the NFS and rpcbind packages:

sudo dnf install -y nfs-utils rpcbind

Enable and then activate services:

sudo systemctl enable rpcbind
sudo systemctl enable nfs-server
sudo systemctl start rpcbind
sudo systemctl start nfs-server
sudo systemctl start rpc-statd
sudo systemctl start nfs-idmapd

Tell the firewall to allow NFS connections:

sudo firewall-cmd --permanent --add-service nfs
sudo firewall-cmd --permanent --add-service rpc-bind
sudo firewall-cmd --permanent --add-service mountd
sudo firewall-cmd --reload

Enable NFSv2 support. This step is important: the Linux kernel explicitly uses the old NFS version 2 protocol when NFS booting, but NFSv2 is disabled by default starting with Fedora 22. Edit /etc/sysconfig/nfs and set RPCNFSDARGS="-V 2" to enable NFSv2.

Restart services for this to take effect:

sudo systemctl restart nfs-config
sudo systemctl restart nfs-server
sudo systemctl restart rpcbind

Verify that NFSv2 support is enabled by checking /proc/fs/nfsd/versions:

sudo cat /proc/fs/nfsd/versions
+2 +3 +4 +4.1 +4.2

The +2 indicates NFSv2 support. You should now be able to add your exported file systems to /etc/exports.d/ and then run:

sudo exportfs -ra

to reload and add the new entries.
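For an NFS root, the export needs to allow root access from the target. A sketch of what a file in /etc/exports.d/ might contain (the path and subnet here are illustrative; no_root_squash is typically required so that the target's root user can own files):

```
/srv/nfsroot 192.168.1.0/24(rw,no_root_squash,async,no_subtree_check)
```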

openocd support for Atmel SAMC 5.0V family

Atmel have released a 5.0V MCU family called SAMC which is equivalent to the normal SAMD family (there are SAMC20 and SAMC21 sets of parts). The Flash controller and other pertinent components are nearly identical so I submitted this patch for openocd which should add SAMC support to the existing SAMD driver.

This also adds config files for the SAMC Xplained Pro kits, atmel_samc20_xplained_pro.cfg and atmel_samc21_xplained_pro.cfg. These can be used directly or adapted as needed for your application. Speaking of which, what is the application for a 5.0V I/O family these days? Is it something legacy-compatible or perhaps industrial? I would love to know what folks are using these for outside of the hobby market.

Please give this patch a try if SAMC is interesting to you (I do not have the hardware to test with) and let me know if things work. This openocd Flash driver should support all of the M0+ parts including SAMD, SAMR, SAML, and SAMC now.

Simple lockfree IPC using shared memory and C11

I have an interesting Linux application that requires a process with one or more real-time threads to communicate with other processes on the system. The use case is single producer with single consumer per IPC channel.

The real-time requirements of the application (it runs on a kernel with the PREEMPT RT patch set applied) require that the IPC channel be lock-free and present minimal overhead. That is, a real-time producer thread or process must be able to quickly write something and keep executing without locking or context switches.

I decided to prototype a simple IPC mechanism while looking at a few techniques like userspace RCU as well. This is very much a work in progress and I am not sure how useful it really is but I am posting it here in case it is interesting or even leads to some discussion.

Shared Memory

There are several ways of implementing IPC in Linux and I chose the simple POSIX shared memory approach to represent IPC “channels”:

  1. One process (the producer) creates a POSIX shared memory object and sizes it according to the IPC API contract.
  2. Another process (the consumer) opens this shared memory object.

Both processes use mmap() to map a window into this shared memory and can now communicate:

shared memory

Stack Structure

We then need a data structure, such as a stack or list, that both processes can use to move data around in this window. This structure cannot be dynamically allocated (it must live inside the mapped window), and we cannot pass around pointers, since each process may map the same window at a different virtual address. In addition, the data structure must support concurrent access.

The approach I took involves placing the data structure itself (or the “management” portion) into the start of the shared window, followed by a buffer that the data structure manages. This works just like an implementation of a stack or list that has all of its nodes allocated up front except here we also place everything in a set location in memory (or rather cast a particular memory location).

I chose a simple stack data structure and based my design on the C11 lock-free stack written by Chris Wellons. This uses the C11 stdatomic.h primitives, so GCC 4.9 or newer is required and GCC should be put in C11 mode (for example with -std=c11). GCC 5 and later default to a C11-based dialect, so no -std= flag is needed there.

This stack maintains two pointers, a head and a free pointer, which point into the pool of stack nodes. Data is added by pushing onto the head (the new node comes from the free stack) and is removed by popping off the head (thereby returning the node to the free stack). There are, however, no real pointers, since each process maps the window at its own virtual address. In place of pointers I am using indexes or offsets, which are resolved to a memory location when needed (that is, we manually implement base-plus-offset addressing). The memory window is represented as follows:

memory window

There is a one to one mapping of stack nodes to buffer pool locations (offsets) and, by knowing the starting address of the data structure, we can always calculate a pointer to some offset in the buffer pool.

Machine capabilities and limitations

This approach assumes that the machine in question is capable of compare and swap (CAS) and most of them (modern ARM, PowerPC, x86, etc) are. The C11 generic function atomic_compare_exchange_weak will generate suitable CAS instructions.

Machines do vary in the CAS size they support (among other things), that is how many bytes can be atomically compared and swapped. There is a set of Atomic lock-free constants that should be checked to determine what the machine is capable of.

The CPU I am targeting for example does not support “double CAS” and therefore cannot compare and swap a pointer and ABA counter in one shot. On the other hand, I am using base plus offset addressing rather than pointers and I have a finite number of “slots” that I need to address so the size of the “pointer” (offset) can be adjusted such that it, plus an ABA counter, can “fit” into the CAS capability of the machine.

Data Structures

The machine I am targeting can CAS four bytes so this can represent a stack “head” pointer:

    struct lfs_head {
        uint16_t inode; /* offset into the node pool */
        uint16_t count; /* ABA counter */
    };

The “nodes” themselves are nothing more than a next “pointer” which in turn is just a 16-bit value representing another offset into the node pool,

    struct lfs_node {
        uint16_t inext; /* offset into the node pool */
    };

An offset value of 0 corresponds to the 0th element, so some kind of special value is needed to describe a “null pointer”. I chose:

    #define NULL_INDEX  0xFFFF

since I do not intend to support very deep stacks in this design. An index can then be checked against NULL_INDEX just like a check for NULL.

The stack “descriptor” itself contains a number of elements, including the stack head and free “pointers”. It is set up at initialization and then modified as needed by push and pop operations. We would normally keep pointers to the stack node pool and buffer pool here, but with the shared memory window design this is not possible. Given the fixed layout in memory, we can instead calculate those locations as offsets from the descriptor itself (which sits at the start of the window).

    typedef struct {
        size_t depth; /* Number of nodes (and corresponding data buffers) */
        size_t data_nb; /* Size of each data buffer, in bytes */
        _Atomic struct lfs_head shead, sfree; /* stack and free stack */
        _Atomic size_t size; /* Number of nodes on the stack */
    } __attribute__((packed)) lfs_t;

This descriptor implies that

  • The node pool is located at the address of this descriptor plus the size of the descriptor.
  • The Nth node is located at the address of the node pool plus the size of a node multiplied by N.
  • The buffer pool is located at the address of the node pool plus the size of the node pool (which in turn is the size of a node multiplied by the depth).
  • The Nth buffer is located at the address of the buffer pool plus the size of one buffer, data_nb, multiplied by N.

API

The lock-free stack (LFS) itself must be initialized and this in turn needs to know where the descriptor is, the desired stack depth, and how big the buffers in the buffer pool are (the latter is used to calculate data pointers from the internally managed offsets).

    void lfs_init(lfs_t *lfs, size_t depth, size_t data_nb);

From there, arbitrary data can be pushed to the stack as long as there are one or more free nodes. The data is copied into the corresponding buffer and the size is known from initialization time.

    bool lfs_push(lfs_t *lfs, void *value);

Data can be popped off the stack as well:

    void *lfs_pop(lfs_t *lfs);

And the number of nodes on the stack is tracked and available:

    size_t lfs_size(lfs_t *lfs);

Implementation

The LFS internally must implement push and pop which in turn are called by the lfs_push and lfs_pop routines.

lfs_pop will:

  • pop a node from the stack
  • push that node to the free stack
  • adjust the size bookkeeping
  • return the buffer pool pointer corresponding to that node

      void *lfs_pop(lfs_t *lfs)
      {
          /* Pull a node off the stack */
          uint16_t inode = pop(lfs, &lfs->shead);
          if (inode == NULL_INDEX) {
              return NULL; /* There was nothing on the stack */
          } else {
              /* We pulled a node off so decrement the size */
              atomic_fetch_sub(&lfs->size, 1);
              /* Place this node onto the free stack */
              push(lfs, &lfs->sfree, inode);
              /* Resolve the corresponding buffer pointer */
              return get_node_value(lfs, inode);
          }
      }
    

lfs_push will:

  • pull a node from the free stack
  • copy data into the buffer pool location corresponding to that node
  • push that node into the stack
  • adjust the size bookkeeping

      bool lfs_push(lfs_t *lfs, void *value)
      {
          /* Pull a node off the free stack */
          uint16_t inode = pop(lfs, &lfs->sfree);
          if (inode == NULL_INDEX) {
              return false; /* There was no room. */
          } else {
              /* Copy the data into the buffer corresponding to this new node */
              memcpy(get_node_value(lfs, inode), value, lfs->data_nb);
              /* Place the node into the stack and increment the stack size */
              push(lfs, &lfs->shead, inode);
              atomic_fetch_add(&lfs->size, 1);
              return true;
          }
      }
    

lfs_size simply loads and returns the size field contents:

    size_t lfs_size(lfs_t *lfs)
    {
        return atomic_load(&lfs->size);
    }

Initialization

The initialization routine sets up the data_nb and depth fields that we use to calculate pointers at run time and also sets initial values for the stack heads and nodes.

    void lfs_init(lfs_t *lfs, size_t depth, size_t data_nb)
    {
        lfs->data_nb = data_nb;
        lfs->depth = depth;

The initial state for the stack head is to point to NULL (empty stack) and we set the size accordingly:

        /* ATOMIC_VAR_INIT is only valid in initializers, and members
           of an _Atomic struct cannot be assigned directly, so store
           whole values atomically instead. */
        struct lfs_head empty = { .inode = NULL_INDEX, .count = 0 };
        atomic_store(&lfs->shead, empty);
        atomic_store(&lfs->size, 0);

Each node in the node pool points to the next node while the last one points to NULL.

        struct lfs_node *nodes = get_node_pool(lfs);
        /* Initialize the node pool like a linked list. */
        for (size_t i = 0; i < depth - 1; i++) {
            nodes[i].inext = i + 1;
        }
        nodes[depth - 1].inext = NULL_INDEX; /* last node */

The entire stack is free. The free head therefore points to the 0th node in the node pool:

        /* The free pool "points" to the first node in the node pool */
        struct lfs_head free_head = { .inode = 0, .count = 0 };
        atomic_store(&lfs->sfree, free_head);
    }

Internal Implementation

The location of the node pool can be calculated at any time by knowing the location of the descriptor and the size of the descriptor:

    static struct lfs_node *get_node_pool(lfs_t *lfs)
    {
        return (struct lfs_node *)((char *)lfs + sizeof(*lfs));
    }

We can calculate the data pointer (into the buffer pool) for a given node by using an offset from the node pool location:

    /* Resolve a node by its index and then find the corresponding pointer into the
     * buffer pool. */
    static void *get_node_value(lfs_t *lfs, uint16_t inode)
    {
        if (inode < lfs->depth) {
            return (char *)lfs + sizeof(*lfs) +
                lfs->depth * sizeof(struct lfs_node) +
                inode * lfs->data_nb;
        } else {
            return NULL;
        }
    }

The internal implementation consists of push and pop methods that make use of the stack represented by the LFS descriptor.

Lock-free push

The lock-free push implementation needs a pointer to the head it is working with (so that we can use it for both the normal and free stack while also knowing about the underlying descriptor) as well as the node we are pushing (passed by an index into the node pool). This routine must:

  • load the current head
  • build the new head, which in turn:
      • has an ABA counter that is one greater than that of the original head (this partially addresses the ABA problem)
      • points (by index) to the node we are pushing onto the stack
  • set up the new node with a next pointer corresponding to the original head

      static void push(lfs_t *lfs, _Atomic struct lfs_head *head, uint16_t inode)
      {
          struct lfs_head next, orig = atomic_load(head);
          struct lfs_node *nodes = get_node_pool(lfs);
    
          do {
              nodes[inode].inext = orig.inode;
              next.count = orig.count + 1;
              next.inode = inode;
          } while (!atomic_compare_exchange_weak(head, &orig, next));
      }
    

Lock-free pop

The lock-free pop implementation also must know the head it is working with (again so that we can use it for both the normal and free stack) and it in turn returns the node that is removed, by index. This method must:

  • load the current head
  • build a new head that:
      • has its next pointer set to that of the original head
      • has its ABA counter set to one greater than that of the original head (again to partially address the ABA problem)
  • return the removed original head

      static uint16_t pop(lfs_t *lfs, _Atomic struct lfs_head *head)
      {
          struct lfs_head next, orig = atomic_load(head);
          struct lfs_node *nodes = get_node_pool(lfs);
    
          /* Re-check for an empty stack on every iteration: a failed
             CAS reloads orig, which may by then be NULL_INDEX. */
          while (orig.inode != NULL_INDEX) {
              next.count = orig.count + 1;
              next.inode = nodes[orig.inode].inext;
              if (atomic_compare_exchange_weak(head, &orig, next))
                  break;
          }
    
          return orig.inode;
      }
    

Dell XPS 13 WiFi and Linux

I recently bought a Dell XPS 13 “developer edition” (basically a 2015 XPS 13 that comes with Ubuntu pre-loaded). The laptop works quite well with the exception of WiFi: for some reason Dell chose to include a Broadcom BCM4352 adapter, which is not supported by the mainline Linux kernel drivers. This decision to include a Broadcom WiFi device when Intel ones are available really confuses me.

The Problem

Broadcom does provide their own out-of-tree driver, but this means that you always need a build of that driver matching your kernel. As a result I had broken WiFi right after updating Ubuntu, and when I switched to Fedora 22 there was not yet a way to build the Broadcom driver for 4.0.1 and later kernels; I did not want to patch their code to track internal API changes.

The Fix

The solution is quite simple:

  1. Purchase an Intel-based PCIE adapter. These laptops use a PCIE mini card and any Intel 802.11AC dual-band WiFi/Bluetooth adapter will plug in. There are two antennas on the laptop side.
  2. Purchase or borrow a T5 screwdriver, a small (size 0 or so) Phillips screwdriver, and a guitar pick (or similar plastic tool), and read the first part of this excellent iFixit teardown.
  3. Replace the Broadcom card with the Intel card. I found it helpful to use tweezers to plug the antennas back in.

…and then everything (including Bluetooth) will work fine. The replacement process took about ten minutes and this laptop is not difficult to take apart. I now have an Intel PCIE card in place of the Broadcom USB card (the PCIE mini card connector brings out both PCIE and USB per the standard).

Replacement card installed

For reference, the card I purchased on eBay (sub-$30) was advertised as a Dell part and described as Intel 7260NGW Wireless-AC Dual band 7260 802.11ac/a/b/g/n WiFi + BT 4.0 Card with Dell part numbers KTTYN 0KTTYN M7X42 0M7X42 however pretty much anything that matches the pictures of the card and number of antennas should work.

lspci output for the card I used:

02:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)

Low power FreeRTOS tickless mode with Atmel SAMD20 family

This post discusses a proposed tickless idle implementation for the Atmel SAMD20 Cortex M0+ MCU family. This same code should work on the SAMD21 and other related SAMD parts and it should also apply to the new power-optimized SAML21 parts. Please read this post about FreeRTOS tick suppression and Cortex M0 first – I describe a minor patch that you will need to apply there.

This is by no means the only way to implement tickless idle on these MCUs and your application may require slightly different implementation choices.

Timer setup

In this example we will use Channel 0 on the timer-counter 4 (TC4) block of the MCU. I believe that all SAMD2x and related parts have this piece of hardware and our example code will therefore be generic enough. Adjust the TC block and channel as needed (for instance you may have routed some of the counters to actual IO pins).

We must implement the FreeRTOS port method vPortSetupTimerInterrupt, which needs to set up our timer for use as a reasonably slow OS tick (replacing the standard SysTick that FreeRTOS would use otherwise). To keep things simple, this example uses the internal ultra low power 32kHz oscillator as the clock source (the code below assumes that GCLK generator 5 has been configured to use it). We then call the ASF timer-counter API to set things up:

static struct tc_module tc;

void vPortSetupTimerInterrupt(void)
{
    struct tc_config config;

    tc_get_config_defaults(&config);
    config.clock_source     = GCLK_GENERATOR_5;
    config.counter_size     = TC_COUNTER_SIZE_32BIT;
    config.run_in_standby   = true;
    config.clock_prescaler  = TC_CLOCK_PRESCALER_DIV1;
    config.wave_generation  = TC_WAVE_GENERATION_MATCH_FREQ;

    configASSERT(tc_init(&tc, TC4, &config) == STATUS_OK);

    /* Connect to FreeRTOS tick handler */
    tc_register_callback(&tc, (tc_callback_t)xPortSysTickHandler,
        TC_CALLBACK_CC_CHANNEL0);
    tc_enable_callback(&tc, TC_CALLBACK_CC_CHANNEL0);

    /* Set up the counter */
    tc_set_count_value(&tc, 0);
    tc_set_top_value(&tc, TIMER_RELOAD_VALUE_ONE_TICK);

    /* Start */
    tc_enable(&tc);
}

At this point we should have a working FreeRTOS tick. Make sure that you consistently use milliseconds throughout your code rather than raw ticks: when calling vTaskDelay, for example, pass in the number of ticks that corresponds to the desired number of milliseconds.

Tick Suppression

We need to supply FreeRTOS with a vPortSuppressTicksAndSleep method. This is called when the system is going to sleep and takes as an argument an expected idle (sleep) time in number of ticks. This method must:

  1. pause the system tick counter
  2. make sure we can really go to sleep (if we cannot, resume the tick counter and give up)
  3. arm the sleep timer by changing its settings and connecting a dummy callback (not the FreeRTOS tick callback).
  4. go to sleep
  5. on wakeup, determine how long we actually slept (it may be less than or equal to the requested idle time since we may have woken up due to some interrupt) and inform FreeRTOS by adjusting the internal tick count with vTaskStepTick.

We should perform these steps under a (partial) critical section – we want to suspend interrupts but not the scheduler itself. For convenience, we can handle resuming the “system tick” timer in a helper method, for example:

static inline void resume_system_tick(void)
{
    /* Disconnect callback */
    tc_disable_callback(&tc, TC_CALLBACK_CC_CHANNEL0);
    tc_unregister_callback(&tc, TC_CALLBACK_CC_CHANNEL0);

    /* Connect the FreeRTOS system tick callback */
    tc_register_callback(&tc, (tc_callback_t)xPortSysTickHandler,
        TC_CALLBACK_CC_CHANNEL0);
    tc_enable_callback(&tc, TC_CALLBACK_CC_CHANNEL0);

    /* Resume system tick, the starting count is taken from whatever
       is in the counter right now.  This lets us literally resume
       or otherwise pre-load the counter. */
    tc_set_top_value(&tc, TIMER_RELOAD_VALUE_ONE_TICK);
    tc_start_counter(&tc);
}

Due to the ASF TC design, we also need a dummy callback that takes the place of xPortSysTickHandler when we are in the tick suppression (sleep) state:

static void empty_cb(struct tc_module *const module_inst) { }

We also need to account for a few things:

/* Number of timer counts that make up one RTOS tick. */
#define TIMER_COUNTS_ONE_TICK   ((system_gclk_gen_get_hz(GCLK_GENERATOR_5)) / configTICK_RATE_HZ)

/* The maximum number of ticks we can suppress: that is, the number
   of ticks that fit into our 32-bit counter. */
#define MAX_SUPPRESSED_TICKS     (0xFFFFFFFFUL / (unsigned long)TIMER_COUNTS_ONE_TICK)

We now implement vPortSuppressTicksAndSleep:

void vPortSuppressTicksAndSleep(TickType_t xExpectedIdleTime)
{
    /* Make sure the expected idle time is in range */
    if (xExpectedIdleTime > MAX_SUPPRESSED_TICKS)
        xExpectedIdleTime = MAX_SUPPRESSED_TICKS;

    /* Pause the system tick timer while we reconfigure it */
    tc_stop_counter(&tc);
    /* Save the counter value at this time so that we can use it
       for tick accounting on wakeup. */
    uint32_t last_count = tc_get_count_value(&tc);
    /* Make sure that the overflow interrupt flag is cleared, we
       will look for an overflow on wakeup so we can't start
       with one. */
    tc.hw->COUNT32.INTFLAG.bit.OVF = 1;

    /* Enter critical section */
    portDISABLE_INTERRUPTS();
    __asm volatile("dsb");
    __asm volatile("isb");

    switch (eTaskConfirmSleepModeStatus()) {
        case eAbortSleep: /* Never mind, back to system tick */
            resume_system_tick();
            break;

        /* We are going to sleep indefinitely, an interrupt will
           wake us up.  In this implementation we do not have a
           way to count how long we slept if we were to actually
           do that, so we will treat this like eStandardSleep with
           maximum sleep time instead. */
        case eNoTasksWaitingTimeout:
            xExpectedIdleTime = MAX_SUPPRESSED_TICKS;
            /* fall through... */
        /* We are going to sleep for the specified amount of time
           (via wakeup alarm) or until another interrupt wakes us
           up. */
        case eStandardSleep:
            /* Configure desired low-power state.  This is up to
               you and you may want to pick a state based on resume
               time versus the expected sleep time or other factors.
               See the ASF documentation for details about
               SYSTEM_SLEEPMODE_IDLE_2 for instance */
            system_set_sleepmode(SYSTEM_SLEEPMODE_STANDBY);

            /* Disconnect the system tick */
            tc_disable_callback(&tc, TC_CALLBACK_CC_CHANNEL0);
            tc_unregister_callback(&tc, TC_CALLBACK_CC_CHANNEL0);

            /* Connect the dummy callback */
            tc_register_callback(&tc, empty_cb,
                TC_CALLBACK_CC_CHANNEL0);
            tc_enable_callback(&tc, TC_CALLBACK_CC_CHANNEL0);

            /* Set our wakeup alarm */
            tc_set_count_value(&tc, 0);
            tc_set_top_value(&tc,
                (xExpectedIdleTime * TIMER_COUNTS_ONE_TICK) - 1);
            
            /* Start timer */
            tc_start_counter(&tc);

            /* Enter low power sleep mode (ie: wfi) */
            system_sleep();

            /* We just woke up.  How long did we sleep for? */

            /* A counter overflow means that we slept for at
               least the expected time, so we can go ahead
               and adjust the RTOS tick count by that amount
               right now. */
            if (tc.hw->COUNT32.INTFLAG.bit.OVF)
                vTaskStepTick(xExpectedIdleTime);

            /* We must also adjust for time left over (if any) */
            uint32_t cur_count = (last_count +
                tc_get_count_value(&tc));
            vTaskStepTick(cur_count / TIMER_COUNTS_ONE_TICK);
            
            /* Resume the system tick, starting at approximately the
               remainder of time until the next tick (that is, what
               we could not account for above) */
            tc_set_count_value(&tc,
                cur_count % TIMER_COUNTS_ONE_TICK);
            resume_system_tick();
            break;
    }

    /* Leave critical section */
    portENABLE_INTERRUPTS();
    __asm volatile("dsb");
    __asm volatile("isb");
}

Please note that I am handling eNoTasksWaitingTimeout as if it is eStandardSleep (with maximum sleep time) in the above implementation. This ensures that we account for sleep time with just our simple timer-counter block. A different MCU (or wakeup source) could allow us to implement eNoTasksWaitingTimeout as an indefinite sleep like you see in the example in the FreeRTOS documentation. With a slow enough tick I feel that this is not a bad design decision – the MCU will sleep for quite a long time before waking up and returning to sleep in the eNoTasksWaitingTimeout case.
