Increase the number of open files limit for the sandboxed process to the
maximum allowed in the system. The maximum is obtained by reading the
/proc/sys/fs/nr_open sysctl file, and the setting is done using the setrlimit
syscall. Failure to read or parse the nr_open file, or to set the rlimit
results in a panic.
Signed-off-by: Ricardo Koller <ricarkol@gmail.com>
The binary is still built in the same location but the source code and
the dependencies for it come from the vhost_user_fs crate itself.
The binary will be built with:
`cargo build --all --bin vhost_user_fs` or just `cargo build --all`
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Split the generic virtio code (queues and device type) from the
VirtioDevice trait, transport and device implementations.
This also simplifies the feature handling in vhost_user_backend as the
vm-virtio crate is no longer has any features.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This corresponds to QEMU's 63659fe74e76f5c52854 commit.
the setattr code uses both fchmod and fchmodat in different cases,
however we only had fchmodat in the whitelist.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
This is a preparing commit to build and test CH on AArch64. All building
issues were fixed, but no functionality was introduced.
For X86, the logic of code was not changed at all.
For ARM, the architecture specific part is still empty. And we applied
some tricks to workaround lint warnings. But such code will be replaced
later by other commits with real functionality.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Add missing syscall used by the musl build.
TEST=scripts/dev_cli.sh tests --libc musl --integration -- vhost_user_fs_daemon
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Implement seccomp; we use one filter for all threads.
The syscall list comes from the C daemon with syscalls added
as I hit them.
The default behaviour is to kill the process, this normally gets
audit logged.
--seccomp none disables seccomp
log Just logs violations but doesn't stop it
trap causes a signal to be be sent that can be trapped.
If you suspect you're hitting a seccomp action then you can
check the audit log; you could also switch to running with 'log'
to collect a bunch of calls to report.
To see where the syscalls are coming from use 'trap' with a debugger
or coredump to backtrace it.
This can be improved for some syscalls to restrict the parameters
to some syscalls to make them more restrictive.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Implement support for setting up a sandbox for running the
service. The technique for this has been borrowed from virtiofsd, and
consists on switching to new PID, mount and network namespaces, and
then switching root to the directory to be shared.
Future patches will implement additional hardening features like
dropping capabilities and seccomp filters.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Allow callers to provide a file descriptor for /proc/self/fd. This is
useful for sandboxing, as we may be running in a namespace that
doesn't have access to /proc.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Open a file descriptor to /proc/self/fd instead of /proc. We aren't
using any other entries from that directory, and doing this allows us
to keep working even if /proc is no longer present in our
namespace (useful for sandboxing).
Signed-off-by: Sergio Lopez <slp@redhat.com>
Add the WRITE_KILL_PRIV write flag, corresponding to
FUSE_WRITE_KILL_PRIV introduced in 7.31, and use to only remove the
setuid and setgid bits (by switching credentials) conditionally.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Add support for MAX_PAGES, corresonding to FUSE_MAX_PAGES introduced
in FUSE 7.28.
This allows us to negotiate with the kernel the maximum number of
pages that we support to transfer in a single request.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Add support for FOPEN_CACHE_DIR, a flag that allows us to tell the
guest that it's safe to cache a directory, introduced in FUSE 7.28.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Virtio-fs daemon expects fs_slave_io() returns the number of bytes
read/written on success, but we always return 0 and make userspace think
nothing has been read/written.
Fix it by returning the actual bytes read/written. Note that This
depends on the corresponding fix in vhost crate.
Fixes: #949
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Implement missing support for FUSE_LSEEK, which basically implies
calling to libc::lseek on the file handle. As this operation alters
the file offset, we take a write lock on the File's RwLock.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Extended attributes (xattr) support has a huge impact on write
performance. The reason for this is that, if enabled, FUSE sends a
setxattr request after each write operation, and due to the inode
locking inside the kernel during said request, the ability to execute
the operations in parallel becomes heavily limited.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Replace HandleData's File Mutex with a RwLock to have more granularity
on the lock. This allows operations on the same File that are safe to
be run in parallel (at this moment, read and write), to acquire a read
lock to avoid waiting on each other.
Signed-off-by: Sergio Lopez <slp@redhat.com>
As cloud-hypervisor/vhost crate (dragonball branch) is ready to be used,
switch vhost_rs from internal crate to the external one.
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
This introduces setupmapping and removemapping methods to server.rs,
passthrough.rs and filesystem.rs in order to support virtiofs dax mode
inside guest.
Since we don't really want the server.rs to know that it is dealing with
vhost-user specifically, this is making it more generic by adding a new
trait which has three functions map()/unmap()/sync() corresponding to
fs_slave_{map, unmap, sync}, server.rs will take anything that implements
the trait.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Because the vhost_user_backend crate needs some changes to support
moving the process to a different mount namespace and perform a pivot
root, it is not possible to change '/' to the given shared directory.
This commit, as a temporary measure, let the code point at the given
shared directory.
The long term solution is to perform the mount namespace change and the
pivot root as this will provide greater security.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Add a Server type that links the FUSE protocol with the virtio
transport. It parses messages sent on the virtio queue and then
calls the appropriate method of the Filesystem trait.
This code has been ported over from crosvm commit
961461350c0b6824e5f20655031bf6c6bf6b7c30.
One small modification has been applied to the original code. Because
cloud-hypervisor didn't have the macro used by crosvm, the match
statement in the function handle_message() has been updated.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Introduce helpers to split a virtio descriptor into its readable part on
one side, and into its writable part on the other side. This is useful
to separate the request from the reply.
This code has been ported over from crosvm commit
961461350c0b6824e5f20655031bf6c6bf6b7c30.
Two important modifications have been applied to the original code:
- GuestMemory is replaced by GuestMemoryMmap from the vm-memory crate,
which comes with different ways of accessing the memory regions.
- VolatileSlice has different methods, which means the code has been
updated accordingly.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>